The Beginner’s Guide to Text Vectorization

Posted by Divya Susarla on 9/27/17

This blog post was originally published on MonkeyLearn by Rodrigo Stecanella.

Since the beginning of the brief history of Natural Language Processing (NLP), there has been a need to transform text into something a machine can understand, that is, into a meaningful vector (or array) of numbers. The de facto standard way of doing this in the pre-deep-learning era was the bag-of-words approach.

Bag of words

The idea behind this method is simple but powerful. First, we define a fixed-length vector in which each entry corresponds to a word in our predefined dictionary. The size of the vector equals the size of the dictionary. Then, to represent a text with this vector, we count how many times each dictionary word appears in the text and put that number in the corresponding vector entry.
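To make the counting step concrete, here is a minimal Python sketch of the idea. The function name `bag_of_words`, the five-word dictionary, and the sample sentence are all made up for illustration, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption rather than anything the original post prescribes.

```python
from collections import Counter


def bag_of_words(text, dictionary):
    """Return a count vector with one entry per dictionary word."""
    # Naive tokenization (an assumption): lowercase, then split on whitespace.
    counts = Counter(text.lower().split())
    # Words outside the dictionary are ignored; dictionary words that do
    # not appear in the text get a count of 0.
    return [counts[word] for word in dictionary]


# Hypothetical dictionary and text, chosen only for this example.
dictionary = ["the", "cat", "dog", "sat", "mat"]
vector = bag_of_words("The cat sat on the mat", dictionary)
print(vector)  # [2, 1, 0, 1, 1]
```

The vector has the same length as the dictionary no matter how long the text is, and the word "on" is dropped because it is not in the dictionary.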

Read more from this blog post here.

Topics: Machine Learning, natural language processing, text vectorization