What's So Good About Bayesian Filters?
- By Administrator
- Published 04/27/2006
- Anti Spam
You may or may not have heard about Bayesian filtering before now. I must admit that I only found out about it not so long ago, and I was initially sceptical about what people were saying about it.
"What was this advanced spam filtering method that people seem to be talking about and what's so 'advanced' about it?"
Up until that point, I had been content to use popular anti-spam solutions, both server-side and client-side, to handle my spam (not that I receive much in the way of spam any more; I now tend to collect it for research!). Having many email addresses publicly accessible over the Web meant they were subject to harvesting by spammers. It became essential for me to seek out software that could perform the laborious task of logging into each account, then identifying and removing any spam.
Although identifying a spam email isn't a particularly difficult or lengthy task for a person to carry out (consider how long you have to look at an email before deciding whether or not it's spam), when you start thinking about the total amount of time you spend wading through spam you may begin to wonder why we don't have an
Traditional Filtering
More traditional filtering techniques identify spam by scoring each email according to specific characteristics, such as the domain from which it was sent, the IP address of the mail server, the subject line of the email and the number of 'spam-like' keywords it contains. These methods usually rely on a maintained reference source, whether that is a blacklist of known spammers, a centralised database of known spam emails or regular program updates. Even if these reference sources are updated regularly, the door is still open for spam that doesn't exhibit enough of the known characteristics to be classified as spam.
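To make the scoring idea concrete, here is a minimal sketch in Python. It is not taken from any particular product: the blacklisted domain and IP address, the keyword list, the weights and the threshold are all made up for illustration, and real filters rely on far larger, maintained lists.

```python
import re

# Hypothetical reference data; a real filter would load these from a
# maintained blacklist or a centralised database.
BLACKLISTED_DOMAINS = {"spammer.example"}
BLACKLISTED_IPS = {"203.0.113.7"}
SPAM_KEYWORDS = {"lottery", "winner", "viagra"}
THRESHOLD = 5  # illustrative cut-off: scores at or above this count as spam


def words(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score_email(sender_domain, server_ip, subject, body):
    """Add up points for each spam-like characteristic found."""
    score = 0
    if sender_domain.lower() in BLACKLISTED_DOMAINS:
        score += 5  # sent from a known spammer domain
    if server_ip in BLACKLISTED_IPS:
        score += 5  # relayed by a known spam mail server
    if set(words(subject)) & SPAM_KEYWORDS:
        score += 2  # subject line contains a spam-like keyword
    # One point for every spam-like keyword in the body.
    score += sum(1 for w in words(body) if w in SPAM_KEYWORDS)
    return score


def is_spam(sender_domain, server_ip, subject, body):
    return score_email(sender_domain, server_ip, subject, body) >= THRESHOLD
```

With these made-up lists, a message from `spammer.example` is caught easily, but a message from a brand-new domain that avoids the listed keywords scores zero and passes straight through, which is exactly the gap described above.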
Keyword filtering seeks to bridge the gap between static (albeit maintained) reference sources and independent, dynamic filtering that can adapt to new types of spam. However, keywords, like the centralised sources, can't always be used as an accurate means of identifying spam for individual users. By that I mean there's no way for the people who maintain the centralised sources to distinguish between a spam email and a legitimate email that just happens to contain enough spam-like characteristics to be classified as spam. This kind of false identification (a false positive) is usually considered worse than the spam itself. It's pointless to use spam filters to save time if they force you to spend time checking that they haven't misidentified legitimate email as spam.
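The false-positive problem is easy to reproduce. The short Python sketch below uses a hypothetical shared keyword list and threshold (both invented for this example) and shows a perfectly legitimate message being flagged simply because it happens to use the "wrong" words.

```python
import re

# Hypothetical keyword list and threshold, the same for every user.
SPAM_KEYWORDS = {"free", "offer", "prize", "winner"}
KEYWORD_THRESHOLD = 3  # illustrative: three or more hits means "spam"


def keyword_hits(text):
    """Count how many words in the text are on the keyword list."""
    found = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in found if w in SPAM_KEYWORDS)


def flagged_as_spam(text):
    return keyword_hits(text) >= KEYWORD_THRESHOLD


# A legitimate email that a parent might receive from a school.
legit = ("Congratulations to the winner of the school raffle! "
         "The prize is a free book token.")

print(keyword_hits(legit))     # number of keyword matches in the message
print(flagged_as_spam(legit))  # whether the fixed list flags it
```

Because the list knows nothing about the individual recipient, the school newsletter is treated exactly like a lottery scam, and someone has to dig it out of the spam folder by hand.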
