So What Makes a Good Spam Filter Anyway?
by Alan Hearnshaw
Spam Filters. Most of us know we need one. Some of us know we need a better one, but how many stop to think what actually makes a good spam filter in the first place?
This is not just a rhetorical question. It is a question that many users – and many developers – do not ask, and consequently, it largely remains unanswered.
Maybe this could be better answered by defining here the qualities of the perfect spam filter. We’ll call our perfect spam filter the “SpamSplatter 3000”. Here are some of the defining qualities of “SpamSplatter 3000”:
- It requires zero interaction from the user.
- It produces zero false positives (good messages identified as bad) and zero false negatives (bad messages identified as good).
- It is transparent – that is, you only ever see good messages and never need even be aware that spam exists.
That’s it. Not much of a shopping list, is it?
Of course, “SpamSplatter 3000” hasn’t been invented yet (and if it ever is, I want a piece of the action), but it does give us a frame of reference when looking for the best filter we can find.
Let’s take each point in turn:
It requires zero interaction from the user
There are two kinds of filters that currently come near to this ideal: Bayesian Filters and Community Filters.
Bayesian filters strip messages down to small “word bites”, or tokens, and maintain a database containing lists of good and bad tokens. When a new message is encountered, the filter strips this message down to tokens, compares them to the database, and applies a formula based on the probability theorem of the British mathematician Thomas Bayes.
Over time, the Bayesian filter “learns” the characteristics of spam messages.
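To make the token idea concrete, here is a minimal sketch of a Bayesian-style scorer in Python. It is not how any particular product works: the class name `TinyBayesFilter`, the tokenizing rule and the add-one smoothing are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy token-based spam scorer (illustrative, not a real product's method)."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.good_tokens = Counter()  # token -> number of good messages containing it
        self.spam_messages = 0
        self.good_messages = 0

    @staticmethod
    def tokenize(text):
        # Assumed tokenizing rule: lowercase runs of letters, digits, $ and '.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, is_spam):
        # "Learning" is just counting which tokens appear in which kind of mail.
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.good_tokens.update(tokens)
            self.good_messages += 1

    def spam_probability(self, text):
        # Start from the smoothed prior odds, then add each token's evidence
        # in log space. An untrained filter returns exactly 0.5.
        log_odds = math.log((self.spam_messages + 1) / (self.good_messages + 1))
        for token in self.tokenize(text):
            p_in_spam = (self.spam_tokens[token] + 1) / (self.spam_messages + 2)
            p_in_good = (self.good_tokens[token] + 1) / (self.good_messages + 2)
            log_odds += math.log(p_in_spam / p_in_good)
        return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    f = TinyBayesFilter()
    f.train("Buy cheap pills now", is_spam=True)
    f.train("Cheap pills, cheap offer", is_spam=True)
    f.train("Meeting at noon tomorrow", is_spam=False)
    f.train("Lunch tomorrow with the team", is_spam=False)
    print(f.spam_probability("cheap pills offer"))      # well above 0.5
    print(f.spam_probability("team meeting tomorrow"))  # well below 0.5
```

Every time the user corrects a wrongly identified message, a filter like this would call `train` again, which is where the “learning” comes from.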
Community Filters simply work on a voting system whereby every user that receives a spam message “votes” it as spam. This information is stored on a central server, and when enough votes are received the message is blocked for all users in the community.
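The voting idea can be sketched in a few lines of Python. Everything here is hypothetical: the `CommunityServer` class, the threshold of three votes, and the use of a SHA-256 hash of the normalised text as the message fingerprint are assumptions for illustration, not a description of any real community filter.

```python
import hashlib
from collections import defaultdict

VOTE_THRESHOLD = 3  # assumed number of distinct voters needed to block a message


class CommunityServer:
    """Toy central server that tallies spam votes per message fingerprint."""

    def __init__(self, threshold=VOTE_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(set)  # fingerprint -> ids of users who voted

    @staticmethod
    def fingerprint(message):
        # Normalise case and whitespace so trivially different copies match.
        normalised = " ".join(message.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def vote_spam(self, user_id, message):
        # A set means each user counts once, however often they vote.
        self.votes[self.fingerprint(message)].add(user_id)

    def is_blocked(self, message):
        voters = self.votes.get(self.fingerprint(message), ())
        return len(voters) >= self.threshold


if __name__ == "__main__":
    server = CommunityServer()
    spam = "You have WON a prize!"
    for user in ("ann", "bob", "cy"):
        server.vote_spam(user, spam)
    print(server.is_blocked(spam))  # blocked once three distinct users have voted
```

Note that the first few recipients still see the message before the threshold is reached.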
As can be seen, the user interaction with these types of filters is mainly limited to a two-button operation – correcting wrongly identified messages – and the more accurate the filter, the less those buttons are used.
OK, so that’s pretty good. Not exactly zero interaction, but if the filter is accurate enough, then it should be pretty near. That brings us to point two:
It produces zero false positives or negatives
This is the area in which most spam filter development is concentrated, and things are getting pretty good nowadays. It is not at all unusual to see an efficient modern filter achieve accuracy of 96% or better. It is, of course, far better to have a false negative than a false positive if you are ever going to tear yourself away from the killed mail folder!
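A single accuracy figure hides how the errors are split between the two kinds. The short Python sketch below separates them; the function name `filter_rates` and the message counts in the usage example are invented purely for illustration and are not measurements of any real filter.

```python
def filter_rates(true_pos, false_pos, true_neg, false_neg):
    """Split a filter's results into accuracy and the two error rates.

    "Positive" means "flagged as spam", so a false positive is a good
    message identified as bad, and a false negative is spam let through.
    """
    total = true_pos + false_pos + true_neg + false_neg
    return {
        "accuracy": (true_pos + true_neg) / total,
        "false_positive_rate": false_pos / (false_pos + true_neg),
        "false_negative_rate": false_neg / (false_neg + true_pos),
    }


if __name__ == "__main__":
    # Invented counts: 600 spam messages (30 missed), 400 good ones (10 lost).
    rates = filter_rates(true_pos=570, false_pos=10, true_neg=390, false_neg=30)
    for name, value in rates.items():
        print(f"{name}: {value:.1%}")
```

With these made-up counts the filter is 96% accurate overall, yet it still sends 2.5% of good mail to the killed mail folder, which is the figure that decides whether you can stop checking that folder.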
Of course, by definition, community filters cannot reach 100% accuracy, as someone has to be getting the spam in order to vote it as such!
Theoretically, a Bayesian filter may eventually be able to get quite close to 100% accuracy, so at least there is hope there.
Content-based filters (those that look for certain words, phrases or other indicators in a message to identify it as spam) will almost certainly not get much higher accuracy figures than the best of them can achieve today. Adapting to changing spam requires new filters to be created on an ongoing basis.
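A content-based filter can be as simple as a list of patterns, which also shows why it needs constant upkeep. The rules below are hypothetical examples written for this sketch, not taken from any real filter.

```python
import re

# Hypothetical rule list. Someone has to extend it by hand as spam changes.
RULES = [
    re.compile(r"\bfree money\b", re.IGNORECASE),
    re.compile(r"\bact now\b", re.IGNORECASE),
    re.compile(r"\bv[i1]agra\b", re.IGNORECASE),
]


def matches_content_rules(message, rules=RULES):
    """Return True if any rule pattern appears anywhere in the message."""
    return any(rule.search(message) for rule in rules)


if __name__ == "__main__":
    print(matches_content_rules("Act now for FREE MONEY"))  # caught by two rules
    print(matches_content_rules("V1agra on sale"))          # caught by the third
    print(matches_content_rules("v.i.a.g.r.a on sale"))     # slips through
```

The last message gets past every rule until somebody writes a new one for it, which is exactly the ongoing maintenance described above.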
And finally, we come to the holy grail of spam filtering:
It is transparent
Strangely enough, not enough work seems to be done in trying to achieve this goal. Some of the best filters on the market today identify spam with impressive accuracy and then simply place it in a “killed mail” folder for your later perusal.
Now, forgive me if I’m missing something here, but isn’t the point to save you having to wade through the junk mail? Isn’t that what you bought the filter for? With the “SpamSplatter 3000”, you don’t need to do that.
As we haven’t achieved 100% accuracy yet (and probably never will), the only way to free us from checking the killed mail folder is a challenge/response system. This is where a message is automatically sent back to the sender requiring them to take some action for their message to actually be delivered.
Some systems tend to go overboard with the challenge/response system. These systems – often called “Whitelist” systems – block messages from anyone that isn’t in the user’s friends list. Guaranteed 100% effective, but too drastic a measure for most users.
Now, it seems that the most intelligent use of this system would be to send challenges only in response to messages that were flagged as “questionable”. Good messages can be delivered, definite spam can be deleted, and questionable ones would earn themselves a challenge message.
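That three-way split might look something like the Python sketch below. The two threshold values, the `triage` function and the idea of feeding it a spam score between 0 and 1 are all assumptions made for illustration; a real filter would choose its own cut-offs.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    CHALLENGE = "challenge"
    DELETE = "delete"


GOOD_BELOW = 0.2  # assumed: scores at or below this are treated as good mail
SPAM_ABOVE = 0.9  # assumed: scores at or above this are treated as definite spam


def triage(sender, spam_score, friends):
    """Decide what to do with a message given its spam score (0.0 to 1.0)."""
    if sender in friends:
        return Action.DELIVER      # known senders are never challenged
    if spam_score >= SPAM_ABOVE:
        return Action.DELETE       # definite spam
    if spam_score <= GOOD_BELOW:
        return Action.DELIVER      # clearly good
    return Action.CHALLENGE        # questionable: ask the sender to confirm


if __name__ == "__main__":
    friends = {"mum@example.com"}
    print(triage("mum@example.com", 0.95, friends))
    print(triage("stranger@example.com", 0.95, friends))
    print(triage("stranger@example.com", 0.50, friends))
```

Only the middle band ever generates a challenge, so most legitimate senders never see one.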
So, to sum up, let’s rewrite the qualities of our perfect filter and get a shopping list of what to look for while we wait for the “SpamSplatter 3000” to arrive:
- Simple, minimal setup and maintenance.
- Extremely low rate of false positives and as few false negatives as possible.
- A transparent “fail-safe” mechanism whereby the victims of those false positives can force the message through to you.
It’s simple really. Now, who’s going to build me this “SpamSplatter 3000”…?
About The Author
Alan Hearnshaw is the owner of http://www.WhichSpamFilter.com, a site which provides weekly in-depth spam filter reviews, anti-spam help and guidance, user ratings and a community forum.