Web Scraping a Company Press Release + (Beginner) Text Analysis with Python

print(pr_string)
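
For context, pr_string holds the plain text of the press release scraped earlier in the article. As a rough reminder of how such a string can be built, here is a minimal sketch using requests and BeautifulSoup; the URL and the choice of <p> tags are placeholders, not necessarily what was used above.

import requests
from bs4 import BeautifulSoup

# hypothetical URL -- substitute the press release page you are scraping
url = 'https://www.example.com/press-release'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# join the paragraph text into one string, mirroring pr_string above
pr_string = ' '.join(p.get_text(strip=True) for p in soup.find_all('p'))
print(pr_string)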

NLTK, or the Natural Language Toolkit, is an enormous library that covers the lion's share of text-processing capabilities in Python. While text and linguistic processing powers many exciting applications today, we will use it only for the simplest of tasks: determining word frequencies in a press release.
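
One setup note: NLTK ships its corpora separately from the library itself, so if you are following along you may need a one-time download of the stopword list used later in this walkthrough (install the package with pip install nltk first if you don't have it).

import nltk

# one-time download of the stopword corpus used further below
nltk.download('stopwords')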

import nltk

# create tokenizer that splits the text on word characters
text_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# now get tokens
pr_tokens = text_tokenizer.tokenize(pr_string)
print(pr_tokens[:50])
# get Frequency Distribution of tokens
pr_fdist = nltk.FreqDist(pr_tokens)

# show 10 most common words in press release
pr_fdist.most_common(10)
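
FreqDist behaves like a dictionary of token counts (it is built on Python's Counter), so besides most_common you can look up individual words directly; the word chosen here is just an illustration.

# index a FreqDist by a token to get its count (0 if the word is absent)
print(pr_fdist['the'])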
# import stopwords 
from nltk.corpus import stopwords

# create list of stopwords in English
stopwords_list = set(stopwords.words('english'))

# tokens with stopwords removed
tokens_new = [w for w in pr_tokens if w not in stopwords_list]
print(tokens_new)
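One thing to note: the stopword list is all lowercase, so the comparison above is case-sensitive and capitalized stopwords such as "The" will slip through. If that matters for your text, a small variant that lowercases each token before comparing looks like this (it will also change the token count reported below).

# case-insensitive stopword removal (changes the resulting token count)
tokens_lower = [w for w in pr_tokens if w.lower() not in stopwords_list]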
# number of words in text without stopwords
print(len(tokens_new))  # ---> should return 298

# create Frequency Distribution with new tokens
pr_fdist_new = nltk.FreqDist(tokens_new)

# get 10 most common words
print(pr_fdist_new.most_common(10))
# plot Frequency Distribution of top 30 words
pr_fdist_new.plot(30);
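
Under the hood, FreqDist.plot draws with matplotlib, so you need matplotlib installed; depending on your environment you may also want to save the chart to a file. One way to do that is to draw the same top-30 plot yourself from most_common, which gives full control over figure size, saving, and showing (the filename here is just an example).

import matplotlib.pyplot as plt

# draw the top-30 frequency chart manually so we control saving/showing
words, counts = zip(*pr_fdist_new.most_common(30))
plt.figure(figsize=(12, 4))
plt.plot(words, counts)
plt.xticks(rotation=90)
plt.ylabel('Counts')
plt.tight_layout()
plt.savefig('pr_freq_top30.png')
plt.show()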

I hope this article helps you fulfil your simple web-scraping needs, or at the very least makes you curious about the topics mentioned. I am often gripped by self-doubt about whether I should write at all, since there are already so many talented writers on Medium. But reading other people's work has helped me improve and grow in my coding journey, and I hope mine does the same for you. Happy Learning :)
