Day 097
DNB and Typing
d4b 33% Sun 07 Apr 2019 04:24:36 PM CEST
d4b 33% Sun 07 Apr 2019 04:26:35 PM CEST
d4b 56% Sun 07 Apr 2019 04:28:28 PM CEST
d4b 61% Sun 07 Apr 2019 04:30:24 PM CEST
d4b 28% Sun 07 Apr 2019 04:32:21 PM CEST
d4b 44% Sun 07 Apr 2019 04:34:27 PM CEST
d4b 22% Sun 07 Apr 2019 04:36:19 PM CEST
d4b 39% Sun 07 Apr 2019 04:38:14 PM CEST
Quotes
“Wherever you are, make sure you’re there.” — Dan Sullivan
Diploma
Classifying by parts of speech
nltk.download() downloads everything needed. nltk.word_tokenize('aoethnsu') returns the tokens. From [this article](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b). For parts of speech it's nltk.pos_tag(tokens).
The tokenizer for Twitter works better for URLs (of course). Interestingly, the POS tagger tags URLs as NN. And - this is actually fascinating - smileys get tokenized differently!
('morning', 'NN'),
('✋', 'NN'),
('🏻', 'NNP'),
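A likely reason the hand shows up as two tokens: the emoji with a skin tone is not one character but two code points, the hand itself plus a skin tone modifier. A small sketch to check this with the standard library (the helper name describe_code_points is mine, not from nltk):

```python
import unicodedata

# "\u270b" is the raised hand, "\U0001f3fb" is the light skin tone modifier.
hand_emoji = "\u270b\U0001f3fb"


def describe_code_points(text):
    """Return (hex code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# One visible emoji, but Python sees two separate code points.
for code_point, name in describe_code_points(hand_emoji):
    print(code_point, name)
```

So a tokenizer that splits on code points will hand the tagger the hand and the modifier separately, and each gets its own tag.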
EDIT: nltk.tokenize.casual might be just like the above, but better!
EDIT: I have a column with the POS of the tweets! How do I classify it with its varying length? How can I use the particular emojis as another feature?
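One possible answer to the varying length question, as a rough sketch and not something I have tested on the real data: throw away the order and count the tags, so every tweet becomes a vector of the same size. The emojis can be counted the same way. TAG_VOCAB, EMOJI_VOCAB and tags_to_vector are made-up names, and both vocabularies are placeholders; real ones would come from the dataset.

```python
from collections import Counter

# Hypothetical fixed tag vocabulary; a real one would come from the training data.
TAG_VOCAB = ["NN", "NNP", "VB", "JJ", "RB", "PRP"]
# Hypothetical emojis to track as separate features (raised hand, tears of joy).
EMOJI_VOCAB = ["\u270b", "\U0001f602"]


def tags_to_vector(tagged_tokens, tag_vocab=TAG_VOCAB, emoji_vocab=EMOJI_VOCAB):
    """Map a variable-length list of (token, tag) pairs to a fixed-length vector.

    The first len(tag_vocab) entries are relative tag frequencies,
    the remaining entries are raw counts of each tracked emoji.
    """
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    token_counts = Counter(token for token, _ in tagged_tokens)
    total = max(len(tagged_tokens), 1)  # avoid division by zero on empty tweets
    tag_part = [tag_counts[tag] / total for tag in tag_vocab]
    emoji_part = [token_counts[emoji] for emoji in emoji_vocab]
    return tag_part + emoji_part


# Same shape as the output of nltk.pos_tag(tokens).
tweet = [("morning", "NN"), ("\u270b", "NN"), ("\U0001f3fb", "NNP")]
print(tags_to_vector(tweet))
```

This loses word order completely, which may or may not matter; counting tag bigrams instead of single tags would keep a bit of it at the cost of a longer vector.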
Ideas
POS + individual smileys might be enough for it to generalize! TODO: test.
TODO: Maybe first do some much more basic feature engineering with capitalization and other features mentioned here:
- Word Count of the documents – total number of words in the documents
- Character Count of the documents – total number of characters in the documents
- Average Word Density of the documents – average length of the words used in the documents
- Punctuation Count in the Complete Essay – total number of punctuation marks in the documents
- Upper Case Count in the Complete Essay – total number of upper case words in the documents
- Title Word Count in the Complete Essay – total number of proper case (title) words in the documents
- Frequency distribution of Part of Speech Tags:
  - Noun Count
  - Verb Count
  - Adjective Count
  - Adverb Count
  - Pronoun Count
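The simple counting features from that list (everything except the POS counts) need nothing beyond the standard library. A minimal sketch, assuming whitespace splitting is good enough as a tokenizer here; basic_text_features is my own name, and I read "average word density" as the mean word length, which may differ from how the original article computes it:

```python
import string


def basic_text_features(text):
    """Compute simple surface features for one document or tweet."""
    words = text.split()  # naive whitespace tokenization (an assumption)
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(text),
        # Mean word length; 0.0 for empty input to avoid dividing by zero.
        "avg_word_length": (
            sum(len(w) for w in words) / word_count if word_count else 0.0
        ),
        # Counts ASCII punctuation only, so emojis are not included.
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "upper_case_word_count": sum(w.isupper() for w in words),
        "title_word_count": sum(w.istitle() for w in words),
    }


print(basic_text_features("Good MORNING World, hello!"))
```

Punctuation stays attached to the words with this kind of splitting ("World," counts as one word of length 6), so the averages are slightly off compared to a real tokenizer.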
Resources
textminingonline.com has nice resources on the topic which would be very interesting to skim through! Additionally, flair is a very interesting library that would save me from reinventing the wheel, even though reinventing the wheel would be the entire point of a bachelor's thesis.
This could work as a general high-level intro into NLP? Also this.