This blog post was originally posted on MonkeyLearn by Rodrigo Stecanella.
Since the beginning of the brief history of Natural Language Processing (NLP), there has been a need to transform text into something a machine can understand, that is, into a meaningful vector (or array) of numbers. The de facto standard way of doing this in the pre-deep-learning era was to use a bag-of-words approach.
Bag of words
The idea behind this method is very simple, though very powerful. First, we define a fixed-length vector where each entry corresponds to a word in our predefined dictionary of words. The size of the vector equals the size of the dictionary. Then, to represent a text using this vector, we just count how many times each word of our dictionary appears in the text and put that count in the corresponding vector entry.
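The counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the five-word dictionary, the sample sentence, and the `bag_of_words` helper are all made up for the example, and splitting on whitespace after lowercasing is an assumed, deliberately naive tokenization.

```python
from collections import Counter

def bag_of_words(text, dictionary):
    """Return a count vector with one entry per dictionary word."""
    # Assumption: lowercase the text and split on whitespace to get words.
    counts = Counter(text.lower().split())
    # One entry per dictionary word, in dictionary order; absent words count 0.
    return [counts[word] for word in dictionary]

# Hypothetical predefined dictionary; its size fixes the vector length.
dictionary = ["the", "cat", "dog", "sat", "mat"]

vector = bag_of_words("The cat sat on the mat", dictionary)
print(vector)  # [2, 1, 0, 1, 1]
```

Note that "on" is not in the dictionary, so it is simply ignored, and "dog" is in the dictionary but not in the text, so its entry is 0. The vector always has the same length as the dictionary, whatever the length of the text.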