Processing Words Is Easy, Processing Content Is Hard
Have you ever received an email with an advertisement for something unsavory followed by a paragraph of seemingly nonsense text? The extra text is there because the spammer is trying to get past one of the more common email spam filters, known as Bayesian spam filtering. Adding that text is called “poisoning” the filter, and it’s yet another tactic in the ongoing war between legitimate content providers and spammers. I was asked at Londonvasion 2008 whether I felt that there would ever be an effective way of dealing with human spammers. My comment at the time was that the best defense against spammer posts (human or otherwise) is an active and effective moderator team. Could this sort of algorithm be adopted as an anti-spam technique for board posts? Yes, I believe it could. To the best of my knowledge nobody has yet tried to do that for phpBB2 (my Google-fu may have failed me, but I did look). I would be very interested to hear of such a project if it exists.
The problem with this and other anti-spam techniques is that it’s based on words rather than content. This may seem like splitting hairs… after all, isn’t my content made up of words? Yes, yes it is. And that’s the problem. Confused yet? I hope so, because it gets worse from here.
Words are Words, Content is Combinations of Words
Simply put, to catch and prevent spam posts in any sort of programmatic fashion, the code has to be smart enough to understand content and not just words. Examine for a moment the following two phrases:
Fruit flies like a banana.
Time flies like an arrow.
I didn’t make these up; the combination of these two phrases appears quite often in discussions about language or pattern recognition. The two phrases are nearly identical. Each has five words. In fact, the middle three words in each phrase are essentially identical. Yet the phrases mean completely different things. In the first phrase the word “flies” is a noun (an object) and in the second it’s a verb. The word “like” is used in two different ways: as a verb in the first phrase and as a comparison in the second. If I were to examine these two sentences word by word I would probably conclude that there is a high degree of correlation between the two. In fact, there is very little.
Here’s another example that I saw recently. I could not remember it exactly, but it was something like this:
Bank of New Zealand floods customer inboxes.
New Zealand river floods, overflows bank.
Again, if I were to compare these two sentences word by word, they would look very similar. They each contain the words (or forms of the words) “new”, “zealand”, “flood”, and “bank”. When I first saw the example (which I cannot find at the moment) there were some other similar words as well. In order to properly differentiate these two sentences I have to go beyond word analysis and do a context or content analysis.
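A rough sketch of that kind of word-by-word comparison, in Python, shows how misleading it can be. This is only an illustration: the `word_set` and `jaccard` helpers and the tiny stopword list are assumptions made for the example, not code from any existing MOD.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"of", "a", "an", "the"}

def word_set(sentence):
    """Lowercase the sentence, split on non-letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a, b):
    """Shared distinct words divided by all distinct words in both sentences."""
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

s1 = "Bank of New Zealand floods customer inboxes."
s2 = "New Zealand river floods, overflows bank."

print(sorted(word_set(s1) & word_set(s2)))  # the words both sentences share
print(jaccard(s1, s2))                      # overlap score between 0 and 1
```

Half of the distinct words are shared, so a purely word-based measure rates the two headlines as closely related even though one is about email and the other is about weather.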
And what about this headline:
Hacker penetrates Paris Hilton
Is that an article about a security flaw in a hotel network? Or a pornography video?
Unstructured Data
Unstructured data analysis is becoming more and more interesting to corporations for a wide variety of reasons. None of them are related to fighting spam. Other than hiring an army of readers, how is a company to know what is being said about it on the web? There are sites like epinions.com and resellerratings.com that allow people to log on and post reviews about various products. There are newsgroups hosted by Yahoo! and Google where people can log on and post compliments or complaints. There are blogs, discussion boards, and “sucks” sites. There are legitimate news articles or press releases. There are social networking sites. In short, there is a flood (heh) of information on the web, and very little of it is structured. If programmers at billion-dollar companies are struggling with how to manage that information, what are we as phpBB MOD authors supposed to do?
I have often talked about my “big board” on this site. The board is an independent discussion board related to the products from a company named Business Objects (which recently was acquired by SAP). One of the companies that Business Objects bought in 2007 was Inxight, which grew out of yet another Xerox PARC research project. The Inxight product is designed to process unstructured data and perform content recognition. They have a fairly high-level demo online; I have included a link at the end of this post. The demo is light on specifics but it does show how the product can scan unstructured data like a press release and extract the important concepts and data points.
Anti-spam Application
And now I am finally getting back to the idea presented in the first paragraph: can we use word analysis to combat spammers on our boards? I think that the answer is “not yet” because we don’t have algorithms that are sophisticated enough to manage context. There are a number of anti-spam MODs in various stages that look at words, but to my knowledge there aren’t any that analyze the context of the words. A collection of words taken separately might indicate spam, but when reviewed in context they might be a perfectly valid post.
In other words, it’s not enough to identify words; I also need to identify how those words are used.
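To make the limitation concrete, here is a minimal Python sketch of a word-based score in the naive Bayes style. Everything in it is an assumption for illustration: the `SPAM_PROB` table is invented rather than learned from real posts, and `spam_score` is a hypothetical helper, not part of phpBB or any anti-spam MOD.

```python
import math

# Hypothetical per-word spam probabilities, as if learned from past posts.
# The numbers are invented for illustration only.
SPAM_PROB = {
    "viagra": 0.99, "cheap": 0.90, "pills": 0.95,
    "doctor": 0.30, "prescribed": 0.20,
    "meeting": 0.05, "schedule": 0.05, "report": 0.05, "database": 0.05,
}
UNKNOWN = 0.5  # neutral value for words the filter has never seen

def spam_score(text):
    """Combine per-word probabilities by summing their log-odds."""
    log_odds = 0.0
    for word in text.lower().split():
        p = SPAM_PROB.get(word, UNKNOWN)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Convert the summed log-odds back into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))

spam = "cheap viagra pills"
poisoned = spam + " meeting schedule report database"
legit = "my doctor prescribed viagra"

for label, text in [("spam", spam), ("poisoned", poisoned), ("legit", legit)]:
    print(label, round(spam_score(text), 3))
```

With these made-up numbers the plain spam scores close to 1, the same spam padded with innocent words drops below 0.5, and the legitimate post about a prescription scores above 0.5. The score only ever sees which words are present, never how they are used.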
Related Posts
Another application for content analysis is a “related posts” MOD. There are a number of these for phpBB2. One I read used the phpBB2 search tables to identify common words by frequency across topics. Another used a special database index on the topic title only. To be honest, if posts are judged to be related only because of common word usage, then in my opinion the approach is flawed. If the posts are related because of relevance… that’s something I would be interested in. I did some experiments with a related posts MOD of my own and ultimately never completed the project due to my lack of satisfaction with the algorithms I could find or come up with on my own. I would like to revisit this idea in the future.
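The word-frequency approach can be sketched in a few lines of Python. This is not the code from either MOD mentioned above; the topic titles are made up, and `word_counts`, `cosine`, and `related` are hypothetical helpers that stand in for whatever the search tables would provide.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def related(topic_id, topics, limit=3):
    """Rank the other topics by word overlap with the given topic."""
    target = word_counts(topics[topic_id])
    scored = [(cosine(target, word_counts(text)), other)
              for other, text in topics.items() if other != topic_id]
    scored.sort(reverse=True)
    # Keep only topics that share at least one word with the target.
    return [other for score, other in scored[:limit] if score > 0]

# Made-up topic titles keyed by a hypothetical topic id.
topics = {
    1: "River floods bank in New Zealand",
    2: "Bank of New Zealand floods customer inboxes",
    3: "How do I install a phpBB MOD",
}

print(related(1, topics))
```

The flood topic gets the bank email topic as its one “related” result purely because of shared words, which is exactly the flaw described above.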
Conclusion
The bottom line is that blogs and boards have one very important thing in common: the data is nearly completely unstructured. I say “nearly” because with blogs we have categories, and a board has a specific category -> forum -> topic hierarchy in place. But outside of that, the content provided by a board post may have nothing to do with anything else on the board. Does that make it spam? It’s hard to say. That’s why we still need good moderators for our boards.
Time flies like an arrow. Fruit flies like a banana. My board can’t tell the difference, can yours?
Related Links

