

In Natural Language Processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages. Sounds great! But there's a challenge that jumps out: we, humans, communicate with words and sentences; meanwhile, computers only understand numbers. For this reason, we have to map those words (and sometimes even whole sentences) to vectors: just a bunch of numbers. That's called text vectorization, and you can read more about it in this beginner's guide.

But wait, don't celebrate so fast: it's not as easy as assigning a number to each word; it's much better if that vector of numbers represents the word and the information it carries. What does it mean to represent a word? And, more importantly, how do we do it? If you are asking yourself those questions, then I'm glad you're reading this post. The most straightforward way to encode a word (or pretty much anything in this world) is called one-hot encoding: you assume you will be encoding a word from a pre-defined and finite set of possible words. In machine learning, this set is usually defined as all the words that appear in your training data. You count how many words there are in the vocabulary, say 1500, and establish an order for them, assigning each word an index from 0 up to the vocabulary size (here, 0 to 1499). Then, you define the vector of the i-th word as all zeros except for a 1 at position i. Imagine our entire vocabulary is 3 words: Monkey, Ape and Banana. Then Monkey would be encoded as [1, 0, 0], Ape as [0, 1, 0] and, yes, you guessed right: the one for Banana would be [0, 0, 1].
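
To make the procedure above concrete, here is a minimal Python sketch. The `one_hot` helper is a made-up name for illustration, not from any particular library:

```python
def one_hot(word, vocabulary):
    """Return the one-hot vector for `word`: all zeros except
    for a 1 at the word's index in the ordered vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Our toy vocabulary of 3 words, in a fixed order.
vocabulary = ["Monkey", "Ape", "Banana"]

print(one_hot("Monkey", vocabulary))  # [1, 0, 0]
print(one_hot("Ape", vocabulary))     # [0, 1, 0]
print(one_hot("Banana", vocabulary))  # [0, 0, 1]
```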

So simple, and yet it works! Machine learning algorithms are so powerful that they can generate lots of amazing results and applications. However, imagine that we're trying to understand what an animal eats by analyzing text from the internet, and we find the sentence "monkeys eat bananas". Our algorithm should be able to understand that the information in that sentence is very similar to the information in "apes consume fruits"; our intuition tells us that they are basically the same. But if you compare the vectors that one-hot encoders generate from these sentences, the only thing you would find is that there is no word match between the two phrases.
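
To see this failure in code, here is a small sketch under the same assumptions as before (made-up helper names, every word already in the vocabulary): it builds a simple bag-of-words vector for each sentence by summing one-hot vectors, then measures their overlap with a dot product.

```python
# Build a shared vocabulary from both sentences.
sentence_a = "monkeys eat bananas".split()
sentence_b = "apes consume fruits".split()
vocabulary = sorted(set(sentence_a) | set(sentence_b))

def sentence_vector(words, vocabulary):
    """Sum of the one-hot vectors of the words: a simple
    bag-of-words representation of the sentence."""
    vector = [0] * len(vocabulary)
    for word in words:
        vector[vocabulary.index(word)] += 1
    return vector

vec_a = sentence_vector(sentence_a, vocabulary)
vec_b = sentence_vector(sentence_b, vocabulary)

# The dot product counts shared words; here it is 0, so this
# encoding sees the two sentences as completely unrelated.
overlap = sum(x * y for x, y in zip(vec_a, vec_b))
print(overlap)  # 0
```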
