One of our goals with the Feedzie Blog is to share how we accomplish things at Feedzie.com and where we are going from there. Think of it as the story coming out of our kitchen. You eat what we cook, but don't you wonder how we cook it? Today we will start with how we categorize feeds and their published content.
Mapping Unknown to Known
Can we teach machines how to understand and interpret what any given story is about? Not exactly ... but we can ... to a certain extent. We can map the content to predefined concepts that a machine can interpret, and then let it work with those concepts rather than the actual words. It's a clever trick: if you don't know something, think of it in terms of something you already know.
In the case of Feedzie.com, we programmed our modules to learn which words represent which concepts and then use that data to categorize feeds and their published content. Mapping unknown to known cannot be done without losing some information, and what gets lost depends on how we select the concepts onto which we map the actual content.
Selecting concepts, that is, categories
For a categorization task, the concepts are categories, and depending on how you want to use the outcome, you can select a different set of categories. Since our job is to categorize blog posts, we think of posts as news segments and came up with 8 categories that are also commonly used to categorize news on the Internet. Those are business, entertainment, life, politics, science, sports, technology and misc. The plan is to map the words of a given story to these categories and then try to determine which category is dominant for that story. The first 7 categories seem ok, but what about the last one? The miscellaneous?
Every word has something to say, but not every word means something on its own. In other words, the existence of some words in a story doesn't imply anything about what that story is about. They merely connect the other words that give the story its actual meaning. Think of prepositions. Sounds like miscellaneous, doesn't it? So this 8th category helps the other 7 categories cover the conceptual space without leaving any word outside.
If you want to define your own categories, you may not want to use that "miscellaneous category." That is perfectly normal. This is the way we map the real world into categories. Yours can be different. In fact, leaving the 8th one aside, we used 7 major categories, but in real life there are believed to be around 50-60 conceptual categories.
Task to accomplish: Learning
So where are we? Ok, we are trying to teach our machine to categorize given content and we just determined our categories. Now it is time to show our machine what is what and let it learn the relations between words and categories. To do that, we provided training data that consists of more than a million stories along with the data telling which story is in which category.
Since we selected a small set of categories, the real problem behind learning is not how to design our learning strategy; that can be as simple as counting occurrences. The real problem is finding training data large enough that our training module sees every word often enough.
When we feed our training module this training data, it looks at which words occur in which categories and extracts statistical relations from that. For each word, it simply counts how many times the word occurs in each category, then normalizes by the word's total number of occurrences to get a percentage. After training, it should be able to tell which words are most likely related to which categories. For example, if the word 'Google' occurred in the 'Technology' category 70% of the time, it is most likely related to the 'Technology' category. Pretty simple.
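To make the counting step concrete, here is a minimal sketch in Python. It is not our production code; the function name `train`, the whitespace tokenization, and the tiny training set are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

def train(labeled_stories):
    """Count how often each word appears in each category, then normalize.

    labeled_stories: iterable of (text, category) pairs (hypothetical format).
    Returns a dict: word -> {category: share of that word's occurrences}.
    """
    counts = defaultdict(Counter)  # word -> Counter of category -> occurrences
    for text, category in labeled_stories:
        for word in text.lower().split():  # naive tokenization, an assumption
            counts[word][category] += 1

    # Normalize each word's counts by its total occurrences.
    shares = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        shares[word] = {cat: n / total for cat, n in per_category.items()}
    return shares

# Tiny made-up training set, for illustration only.
training_data = [
    ("google releases new phone", "technology"),
    ("google updates search", "technology"),
    ("google stock rises", "business"),
    ("team wins final", "sports"),
]
word_shares = train(training_data)
print(word_shares["google"])
```

In this toy data, 'google' appears twice under technology and once under business, so two thirds of its occurrences fall under technology.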
After long hours of training and a couple of naps (remember, we are talking about millions of stories), our trained module knows how to map words to categories and find the dominant category by considering all the words of a given story.
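Finding the dominant category could look roughly like the sketch below. How the per-word percentages are combined is an assumption here: this version simply adds them up per category and picks the largest sum. The `WORD_SHARES` table and the function name `dominant_category` are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical word-to-category shares, as a training step might produce.
WORD_SHARES = {
    "google": {"technology": 0.7, "business": 0.3},
    "phone": {"technology": 0.9, "business": 0.1},
    "match": {"sports": 0.8, "entertainment": 0.2},
    "the": {"misc": 1.0},
}

def dominant_category(text, word_shares):
    """Sum each known word's category shares and return the top category.

    Returns None when no word of the story is known.
    """
    scores = Counter()
    for word in text.lower().split():  # naive tokenization, an assumption
        for category, share in word_shares.get(word, {}).items():
            scores[category] += share
    if not scores:
        return None
    return scores.most_common(1)[0][0]

result = dominant_category("Google unveils the phone", WORD_SHARES)
print(result)  # technology
```

Words the module has never seen (here 'unveils') contribute nothing, which is one reason large training data matters.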
Where are we at categorization?
So far, what we have told you sounds like a mathematical calculation, and it is one, as simple as counting jumping sheep. But the outcome is not that abstract because, after all, the assigned categories describe the content fairly well, and you can easily and effectively use them to optimize search results and to understand the content of a feed at a glance, like the way shown below.
So, we started with statistical models and ended up with a content-oriented tool that you can use to better define what you are looking for. That is our job: to develop tools that let you define what you have in mind.
Failing at categorization
Various factors can make us fail at the categorization task. The trouble can start at the very beginning, with not selecting the right set of categories, or it can result from too small a set of training data. Even if your learning strategy is good and your training data is fat, the story you have to categorize may not have enough words to determine its category. Or, more dramatically, it can be in another language. Not fair, huh? Welcome to the real world, my statistical friend.
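One way to guard against the last two failures is to check a story before categorizing it. The sketch below is only an illustration of that idea, not something we have described doing above: the function names, the sample `vocabulary`, and the thresholds `min_words` and `min_ratio` are all assumptions.

```python
def known_word_ratio(text, vocabulary):
    """Return the fraction of the story's words that were seen in training."""
    words = text.lower().split()  # naive tokenization, an assumption
    if not words:
        return 0.0
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words)

def can_categorize(text, vocabulary, min_words=5, min_ratio=0.5):
    """Reject stories that are too short or mostly made of unknown words.

    The thresholds are arbitrary example values, not tuned numbers.
    """
    words = text.lower().split()
    return (len(words) >= min_words
            and known_word_ratio(text, vocabulary) >= min_ratio)

# Made-up vocabulary standing in for the words seen during training.
vocabulary = {"google", "releases", "a", "new", "phone", "today"}

print(can_categorize("Google releases a new phone today", vocabulary))
print(can_categorize("Google phone", vocabulary))
print(can_categorize("Bugün hava çok güzel ve sıcak", vocabulary))
```

The second story is too short and the third is in a language the vocabulary does not cover, so only the first one passes the check.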
Being better
Recall that we only defined 7 major categories. It is a good start, but it can surely be improved. How? We can either increase that number or define subcategories for each category, which helps us focus on more specific content and lets our training module learn more specific relations.
As we increase the number of dimensions onto which we map the actual words, we should be able to preserve more information during that conversion, which leads to a better understanding of the relations between words and concepts. This also gives you more options for optimizing your search by allowing you to specify what you have in mind in more detail.
In fact, if you think further, you can easily see that tagging is another form of categorization where each tag corresponds to a category, which makes those calculations a real mess. But that is a subject for another post. Let's give you a break.