Excellent points. There is definitely R&D to be done in sharing Bayes info. Just as antivirus software is able to LiveUpdate the brain-dead, easy-to-define viruses, so should spam software. Regrettably, one of the key points of Bayes is that it is individualized. A common 'Bayes' DB is somewhat more difficult.

Bayesian filters have had some amazing successes. The problem we (the company I work for) continue to have, and the reason we continue to choose SA, is that training a thousand users on how to use a Bayes system is pretty much impossible (and we're small compared to many!). Assuming that I give you (I do not believe it, but will grant it for the sake of argument) that Bayes is the best theoretical solution, the Bayes folks have a problem in implementation. Training users is not easy; think about training your mother or grandmother, but multiply by 1000.
This is why two features exist, both of which I think are components of any good Bayesian solution:
1. User groups. ... 2. A merge tool. ...
Indeed, many are working on such a solution. We have a similar system in production for our users, and have commented on similar ideas for the SA system. Regrettably, the number of IP addresses that would need to be tracked for spam status is fairly large. And the variety of ways that spam can be transmitted complicates matters. Nevertheless, a bug has been opened at SA to attack the IP addresses that spammers use. Also note that a number of high-profile anti-spam DNS services have been DoS'ed into oblivion (a couple in the last 2 months). So whatever solution is chosen needs to be resilient (either by having a holy ton of bandwidth, or by being peer to peer).

Global tools are also an invaluable asset in fighting spam. We're working on a magical blacklisting tool that will capture source IPs from incoming spam. When a threshold is exceeded, all incoming messages from that source IP are marked/learned as spam for all users (system wide) for whatever time period we specify.
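For illustration only, here is a rough Python sketch of the threshold-based source-IP blacklisting idea described above. It is not the actual tool: the class name, the method names, and the default threshold, counting window, and block period are all assumptions made up for this example.

```python
import time
from collections import defaultdict, deque


class SourceIPBlacklist:
    """Hypothetical sketch: block a source IP system-wide once it has
    sent more than `threshold` spam messages within `window_seconds`."""

    def __init__(self, threshold=5, window_seconds=3600, block_seconds=86400):
        # All three defaults are illustrative assumptions, not real settings.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of spam reports
        self._blocked_until = {}          # ip -> time at which the block expires

    def report_spam(self, ip, now=None):
        """Record one spam message seen from `ip`."""
        now = time.time() if now is None else now
        hits = self._hits[ip]
        hits.append(now)
        # Forget reports that have fallen out of the counting window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        # Threshold exceeded: block the IP for the configured period.
        if len(hits) > self.threshold:
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()

    def is_blocked(self, ip, now=None):
        """True if mail from `ip` should currently be treated as spam."""
        now = time.time() if now is None else now
        until = self._blocked_until.get(ip)
        if until is None:
            return False
        if now >= until:
            # The block has expired, so drop it.
            del self._blocked_until[ip]
            return False
        return True
```

A delivery agent could call `is_blocked()` on the connecting IP before running the per-user filter, and call `report_spam()` whenever a message is classified or reported as spam. Expiring the block after a fixed period keeps a reassigned or cleaned-up address from being penalized forever.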
You've done better than us. How have you managed to train your users to forward the email as the full email, including all headers, etc.? We've found that most forwarded messages do not include all headers, and therefore forwarded messages train the spam database with semi-legit emails (i.e. the headers are legit because they come from the forwarding user).

Note, however, that the learning process does not require users to be tech-savvy. For example, we specifically sculpted our tool to be brain-dead easy for grandma. You get your mail like normal, and if you get a spam you forward it to grandma-spam@yourdomain.com. There are even tools such as SpamSource (for Outlook) that can make this process a simple click of a button. The signature mechanism we use stores the original token set in binary format in a temporary database on the server (or in the form of message attachments), which our tool will then use to relearn the message as spam.
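To make the signature idea concrete, here is a minimal Python sketch of how such a mechanism might work; it is a guess at the general shape, not the real tool's code. The `Spam-Signature:` line format, the `SignatureStore` class, and the use of `pickle` and an in-memory dict as the "temporary database" are all assumptions for illustration. The point it shows is that the forwarded copy only has to carry the signature, so missing or rewritten headers do not matter.

```python
import pickle
import re
import uuid

# Hypothetical marker line that would be added to each delivered message.
SIGNATURE_RE = re.compile(r"Spam-Signature: ([0-9a-f]{32})")


class SignatureStore:
    """Hypothetical sketch: keep each delivered message's token set under
    a signature so a forwarded copy can be relearned from the original."""

    def __init__(self):
        self._db = {}          # signature -> pickled token list (stand-in for the temp DB)
        self.spam_counts = {}  # token -> number of spam messages it appeared in

    def remember(self, tokens):
        """Store the token set at delivery time; return the signature
        that would be embedded in the delivered message."""
        signature = uuid.uuid4().hex
        self._db[signature] = pickle.dumps(sorted(set(tokens)))
        return signature

    def relearn_as_spam(self, forwarded_text):
        """Find the signature in a forwarded message and learn the
        ORIGINAL tokens as spam. Returns False if nothing was learned."""
        match = SIGNATURE_RE.search(forwarded_text)
        if match is None:
            return False
        blob = self._db.pop(match.group(1), None)
        if blob is None:
            return False  # unknown, expired, or already-used signature
        for token in pickle.loads(blob):
            self.spam_counts[token] = self.spam_counts.get(token, 0) + 1
        return True
```

Because the stored entry is removed once it has been used, forwarding the same message twice does not count its tokens twice.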
I love it; increasing spam protection is great. My perspective is that filtering 90% of spam for 1000 users (via SA, or whatever) is better than filtering 99% of spam for 1 user. Yes, the individual number is better in terms of percentage; however, by covering the whole group of users, we block several hundred to a few thousand spam messages a day. It remains a difficult problem.

Anyhow, my point is, we're trying to improve the ease-of-use factor, which is a big reason tools like SA are still useful (out-of-the-box functionality). However, that doesn't necessarily mean heuristics are not obsolete from a scientific perspective. I think we're getting to a point where enough tools exist to make a deployment just as easy, and hopefully, if things continue at the rate they're going, companies like yours that require this level of ease will be able to use Bayesian solutions.