Feature Engineering Methods for Text Data
Text data comes in the form of words, phrases, sentences, and documents. A set of documents is called a corpus.
- Pre-processing: steps for cleaning up text data
    1.1. Removing tags, such as HTML tags
    1.2. Removing accented characters, e.g. converting é to e
    1.3. Removing special characters, such as punctuation
    1.4. Stemming and lemmatization
    1.5. Removing stopwords
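The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the function names and the tiny `STOPWORDS` set are assumptions made for the example, the tag removal is a crude regex, and stemming/lemmatization (1.4) is left out because it normally relies on a dedicated NLP library.

```python
import html
import re
import unicodedata

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def strip_tags(text):
    # 1.1 Crude regex-based tag removal, then decode entities such as &amp;.
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def strip_accents(text):
    # 1.2 Decompose characters, then drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def strip_special(text):
    # 1.3 Keep ASCII letters, digits and whitespace; replace the rest with spaces.
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)

def preprocess(text):
    # Apply the steps in order, lowercase, tokenize on whitespace,
    # and finally drop stopwords (1.5).
    text = strip_tags(text)
    text = strip_accents(text)
    text = strip_special(text).lower()
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess("<p>The café is open!</p>")
```

Here `tokens` holds the cleaned, lowercased tokens of the sample sentence with the tags, accent, punctuation, and stopwords removed.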
- Processing: methods for converting text to numbers
    2.1. Count-based: Bag-of-Words (N-grams)
    2.2. Word-frequency-based: TF-IDF
    2.3. Word embeddings: word2vec
    2.4. Topic modelling: LDA
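The two simplest methods in this list, Bag-of-Words with N-grams and TF-IDF, can be sketched in plain Python as follows. The function names are hypothetical, the input is assumed to be already tokenized, and the TF-IDF here is the unsmoothed textbook form (term frequency times log(N / document frequency)); libraries often apply smoothing or normalization, so their numbers may differ.

```python
import math
from collections import Counter

def bag_of_words(docs, n=1):
    # Count n-grams per document; each document is a list of tokens.
    counts = []
    for tokens in docs:
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        counts.append(Counter(grams))
    return counts

def tf_idf(docs):
    # tf = term count / document length; idf = log(N / document frequency).
    counts = bag_of_words(docs)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document,
    # so this tallies how many documents contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for tokens, c in zip(docs, counts):
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (k / total) * math.log(n_docs / doc_freq[term])
            for term, k in c.items()
        })
    return weights

corpus = [["cat", "sat"], ["cat", "ran"]]
unigram_counts = bag_of_words(corpus)
bigram_counts = bag_of_words(corpus, n=2)
weights = tf_idf(corpus)
```

In this toy corpus, "cat" appears in every document, so its TF-IDF weight is zero, while "sat" and "ran" get positive weights. Word embeddings and LDA topic models are usually trained with a dedicated library and are not sketched here.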