Feature Engineering Methods for Text Data
Text data comes in the form of words, phrases, sentences, and documents. A collection of documents is called a corpus.
1. Pre-processing
To clean up text data, common steps include:
1.1. Removing tags, such as HTML tags
1.2. Removing accented characters, for example converting é to e
1.3. Removing special characters, such as punctuation marks
1.4. Stemming and lemmatization
1.5. Removing stopwords
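The steps 1.1 to 1.5 above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the function names, the tiny stopword set, and the suffix-stripping "stemmer" are all assumptions made for the example. A real project would more likely use a proper HTML parser, a full stopword list, and an established stemmer or lemmatizer from an NLP library.

```python
import re
import unicodedata

# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "on"}


def strip_tags(text):
    """1.1: drop anything that looks like a tag (regex heuristic, not a parser)."""
    return re.sub(r"<[^>]+>", " ", text)


def strip_accents(text):
    """1.2: decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_special_chars(text):
    """1.3: replace everything except ASCII letters, digits, and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def naive_stem(token):
    """1.4: toy suffix stripping, a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply steps 1.1 to 1.5 in the order listed and return a token list."""
    text = strip_tags(text)
    text = strip_accents(text)
    text = strip_special_chars(text)
    tokens = [naive_stem(t) for t in text.lower().split()]
    # 1.5: drop stopwords
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("<p>The café is serving cookies!</p>")` returns `['cafe', 'serv', 'cookie']`. The crude stem "serv" shows why a real stemmer or lemmatizer is usually preferred.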
2. Processing
To convert text to numbers, common approaches include:
2.1. Count based: Bag-of-Words (N-gram)
2.2. Word frequency based: TF-IDF
2.3. Word embedding: word2vec
2.4. Topic Modelling: LDA
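The two simplest approaches in this list, count-based n-grams (2.1) and TF-IDF (2.2), can be sketched in plain Python. The function names are hypothetical. The TF-IDF formula used here, term count divided by document length and multiplied by log(N / df), is one common textbook variant; library implementations often smooth or rescale the IDF term, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(docs, n=1):
    """2.1: one Counter of n-gram counts per tokenized document."""
    return [Counter(ngrams(doc, n)) for doc in docs]


def tf_idf(docs):
    """2.2: per-document weights with tf = count / len(doc), idf = log(N / df)."""
    counts = bag_of_ngrams(docs, n=1)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    return [
        {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in c.items()
        }
        for doc, c in zip(docs, counts)
    ]


# Two already-tokenized toy documents (made up for the illustration).
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"]]
bigram_counts = bag_of_ngrams(docs, n=2)
weights = tf_idf(docs)
```

Here "sat" appears in both documents, so its weight is 0, while "cat" appears in only one and gets a weight of log(2) / 3. Word embeddings (2.3) and topic models (2.4) need a trained model and are usually built with a library such as gensim, so they are not sketched here.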