Avoid Spam - http://www.avoidspam.co.uk
What's So Good About Bayesian Filters?
http://www.avoidspam.co.uk/articles/8/1/Whats-So-Good-About-Bayesian-Filters/Page1.html
 
By Administrator
Published on 04/26/2006
 
The latest buzzword in the world of anti-spam is 'Bayesian', but do you know what it's all about?

You may or may not have heard of Bayesian filtering before now. I must admit that I only found out about it fairly recently, and I was initially sceptical about what people were saying about it.

"What was this advanced spam filtering method that people seem to be talking about and what's so 'advanced' about it?"

Up until that point, I had been content to use popular server-side and client-side anti-spam solutions to handle my spam (not that I receive much in the way of spam any more - I now tend to collect it for research!). Having many email addresses publicly accessible on the Web meant they were subject to harvesting by spammers. It became essential for me to find software that could perform the laborious task of logging into each account, then identifying and removing any spam.

Identifying a single spam email isn't a particularly difficult or lengthy task for a person (consider how long you have to look at an email before deciding whether or not it's spam). But when you start thinking about the total amount of time you spend wading through spam, you may begin to wonder why we don't have anything that can perform the same checking for us, i.e. in the same manner in which we would do it. We tend to identify spam by a number of different factors. The common factor shared by almost all spam is its commercial nature. After all, most spam is sales pitch, and sales pitch is often very different from the everyday language used by the average person (unless he/she works in sales!).

Traditional Filtering

More traditional filtering techniques identify spam by scoring each email according to specific characteristics such as the domain from which it was sent, the IP address of the mail server, the subject line of the email and the number of 'spam-like' keywords it contains. These methods usually rely on a maintained reference source, whether that is a blacklist of known spammers, a centralised database of known spam emails or regular program updates. Even if these reference sources are updated regularly, the door is still open for spam that doesn't exhibit enough of the known characteristics to be classified as spam.
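As a rough illustration of the scoring approach just described, the Python sketch below adds up fixed points for each characteristic and compares the total with a threshold. The blacklist entries, keyword list, point values and threshold are all invented for this example; a real product would take them from its own maintained reference source.

```python
# Hypothetical reference data, invented for illustration only.
BLACKLISTED_DOMAINS = {"bulk-offers.example"}
BLACKLISTED_IPS = {"203.0.113.7"}
SPAM_KEYWORDS = {"free", "winner", "diploma", "guarantee"}
THRESHOLD = 5.0  # assumed cut-off score


def rule_score(sender_domain, server_ip, subject, body):
    """Add up fixed points for each spam-like characteristic."""
    score = 0.0
    if sender_domain.lower() in BLACKLISTED_DOMAINS:
        score += 4.0
    if server_ip in BLACKLISTED_IPS:
        score += 4.0
    if subject.isupper():  # subject line written entirely in capitals
        score += 1.0
    # One point for every occurrence of a listed keyword in the body.
    for word in body.lower().split():
        if word.strip(".,!?") in SPAM_KEYWORDS:
            score += 1.0
    return score


def is_spam(sender_domain, server_ip, subject, body):
    return rule_score(sender_domain, server_ip, subject, body) >= THRESHOLD
```

A message from an unlisted sender that avoids every listed keyword scores zero here, however spam-like it looks to a human reader, which is the gap described above.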

Keyword filtering seeks to bridge the gap between static (albeit maintained) reference sources and independent, dynamic filtering that can adapt to new types of spam. However, the keywords, like the centralised sources, can't always be used accurately as a means of spam identification for individual users. By that I mean there's no way for the people responsible for maintaining the centralised sources to distinguish between a spam email and a legitimate email that just happens to contain enough spam-like characteristics to be classified as spam. This false identification of spam is usually considered worse than letting the occasional spam through. It's pointless to use spam filters to save time if they force you to spend time checking that they haven't falsely identified legitimate email as spam.


Non-Bayesian filtering may be good for known spam, but it falls short when:

  1. Updates can't be accessed, meaning that your filter reference data isn't current.
  2. New spam is encountered that hasn't yet reached the centralised reference - will your filter be able to identify it as spam?
  3. Spam contains random content, making each occurrence different from any other - this may get around filters that rely on unique 'fingerprints' of spam.
  4. Good email is falsely identified as spam because the filters can't be personalised for every individual using them - this means that the reference dataset needs to be as large and encompassing as possible.
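
To illustrate point 3, here is a small Python sketch of a fingerprint check. It assumes, purely for the example, that a fingerprint is a SHA-256 hash of the exact message body; real fingerprinting schemes may be more tolerant of small changes, but the weakness is the same in kind.

```python
import hashlib


def fingerprint(body):
    """Hash the exact message body (an assumed, deliberately simple scheme)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Hypothetical reference set holding one previously reported spam message.
known_spam = {fingerprint("Get your diploma today!")}


def matches_known_spam(body):
    return fingerprint(body) in known_spam


print(matches_known_spam("Get your diploma today!"))         # exact copy
print(matches_known_spam("Get your diploma today! xqzt81"))  # random suffix added
```

The exact copy is caught, but the copy with a few random characters appended produces a different hash and slips past the check.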

Bayesian filtering steps in to tackle these weaknesses by:

  1. Removing the need for program updates except for bug fixes and extra functionality - the filters automatically update themselves.
  2. Removing the need for a centralised reference source - filters are tuned to filter the spam that individual users receive.
  3. Scoring email by statistical analysis of email received by individual users.
  4. Being able to recognise the characteristics that make up 'good email' as well as spam.

Adaptable

Bayesian filtering works by adapting its filters to an individual user. It does this by looking at the email you have already received and classed as either good email or spam. By analysing these two groups of emails, a Bayesian filter can assign scores to content and use this scoring data to assess any new email that is put through the filter. Because the filter depends on this scoring data, an 'untrained' filter can make some incorrect decisions when it is first put into use, i.e. good email identified as spam (false positives) and spam identified as good email (false negatives).
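
The analysis of the two groups can be pictured as simple bookkeeping. The Python sketch below is one possible way to hold the data, not how any particular product does it: the class name, the tokenizer and the choice to count each word once per email are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the distinct lower-case words in a message (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class TrainingData:
    """Counts how many good and spam emails each word has appeared in."""

    def __init__(self):
        self.good_counts = Counter()
        self.spam_counts = Counter()
        self.good_total = 0  # number of good emails seen
        self.spam_total = 0  # number of spam emails seen

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.good_counts.update(tokens)
            self.good_total += 1


data = TrainingData()
data.train("Get your diploma now", is_spam=True)
data.train("See you this afternoon", is_spam=False)
```

With only two emails recorded, the counts say very little, which is why an untrained filter makes mistakes until it has seen a reasonable amount of your own mail.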

Training a Bayesian filter is often just a case of clicking a button to move an email from the group in which it has been placed to the other one. Some Bayesian filters are supplied pre-trained and so will be effective out of the box.


How it works

As an example of how Bayesian filters work, let's say that we have two well-established groups of several hundred emails; one group is all of our emails that have been identified as good (either by the filter or by us in the training process), the other contains all of our emails that have been identified as spam (again either by the filter or by us in the training process).

Let's assume that the word 'diploma' appears only in the bad email group and never in the good email group. Let's also assume that the word 'afternoon' occurs only in the good email group. The Bayesian filter could then surmise that any email containing the word 'diploma' is very likely to be spam and assign the word a high score of almost 1 (close to a 100% probability of being spam). Likewise, the probability of an email containing the word 'afternoon' being spam would be quite low, and so an appropriately low score of almost 0 (close to a 0% probability of being spam) would be assigned.
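
The 'diploma' and 'afternoon' example can be put into numbers with a short Python sketch. The formula here compares how often a word turns up in each group, and the 0.01 and 0.99 limits are assumed values chosen to keep the scores at "almost 0" and "almost 1" rather than exactly 0 or 1. The function name and the email counts are invented for the example.

```python
def word_spam_probability(spam_hits, good_hits, spam_total, good_total,
                          floor=0.01, ceiling=0.99):
    """Estimate how strongly a word points to spam.

    spam_hits / good_hits: number of spam / good emails containing the word.
    spam_total / good_total: size of each group (both must be above zero).
    """
    spam_freq = spam_hits / spam_total
    good_freq = good_hits / good_total
    if spam_freq + good_freq == 0:
        return 0.5  # word never seen before: no evidence either way
    probability = spam_freq / (spam_freq + good_freq)
    # Clamp so that no single word is ever treated as absolute proof.
    return min(ceiling, max(floor, probability))


# Two groups of 300 emails each (made-up figures).
diploma = word_spam_probability(spam_hits=40, good_hits=0,
                                spam_total=300, good_total=300)
afternoon = word_spam_probability(spam_hits=0, good_hits=55,
                                  spam_total=300, good_total=300)
print(diploma, afternoon)  # 0.99 0.01
```

A word that appears equally often in both groups comes out at 0.5 and so tells the filter nothing.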

This is a very simple example to illustrate a point. Filtering will actually be carried out taking many other characteristics into account before coming to a final decision e.g. good content scores as well as spam content scores, information contained in the email headers, etc.
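
To give a feel for how good and spam content scores might be weighed against each other, here is a Python sketch of one common way of combining per-word scores into a single figure. It assumes the words are independent of one another (the 'naive' assumption), which the article does not specify; the function name and the sample scores are made up.

```python
def combined_spam_probability(word_scores):
    """Combine per-word spam scores (each strictly between 0 and 1).

    Computes p1*p2*...*pn / (p1*...*pn + (1-p1)*...*(1-pn)).
    Intended for a handful of words; very long lists could underflow.
    """
    spam_product = 1.0
    good_product = 1.0
    for score in word_scores:
        spam_product *= score
        good_product *= 1.0 - score
    return spam_product / (spam_product + good_product)


# Two strongly spam-like words outweigh one mildly good-looking word.
mostly_spam = combined_spam_probability([0.99, 0.99, 0.2])
# One spam-like word and one good word cancel each other out.
balanced = combined_spam_probability([0.99, 0.01])
```

In this sketch `mostly_spam` comes out well above 0.9, while `balanced` sits at roughly 0.5, so a single suspicious word is not enough to condemn an otherwise ordinary email.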

Any time the filter arrives at an incorrect decision, the user can make a manual correction and the filter automatically adjusts itself accordingly. Therein lies one of the biggest benefits of Bayesian filtering: being able to easily adapt the filter according to your own experience.
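
A manual correction of this kind amounts to moving an email's words from one group's counts to the other's. The Python sketch below shows the idea under simplifying assumptions: the class and method names are hypothetical, and words are split on whitespace only.

```python
from collections import Counter


class Corpus:
    """Word counts for the 'good' and 'spam' groups (illustrative only)."""

    def __init__(self):
        self.counts = {"good": Counter(), "spam": Counter()}
        self.totals = {"good": 0, "spam": 0}

    def add(self, text, group):
        # Count each distinct word once per email.
        self.counts[group].update(set(text.lower().split()))
        self.totals[group] += 1

    def correct(self, text, wrong_group):
        """Move an email out of the group it was wrongly placed in."""
        right_group = "spam" if wrong_group == "good" else "good"
        self.counts[wrong_group].subtract(set(text.lower().split()))
        self.totals[wrong_group] -= 1
        self.add(text, right_group)


corpus = Corpus()
corpus.add("cheap diploma offer", "good")      # the filter let this spam through
corpus.correct("cheap diploma offer", "good")  # the user moves it to spam
```

After the correction, 'diploma' counts towards the spam group only, so the next email containing it will score higher.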


Initially, a relatively untrained Bayesian filter may perform worse than other filtering techniques due to a lack of reference data. But once a useful number of good and bad emails have been processed through the filter and any required corrections carried out, a Bayesian filter can provide over 99% accuracy in spam filtering, as has been demonstrated by the likes of Paul Graham.

As spammers find new ways to circumvent the filters that we use, it's important that we stay ahead of them by ensuring that our efforts to avoid and block their communications are adaptable. The Bayesian approach is simple, cost-effective, auto-adaptive, can be very effective and is just the sort of thing that spammers' nightmares are made of!