Tuesday, October 6, 2009

Content filtering based upon Statistics

Once set up, statistical filtering needs no ongoing maintenance as such; it relies on the user's responses to incoming messages. The user marks each message as spam or non-spam, and the filtering software learns from these judgments. A statistical filter therefore reflects not the biases of the software author or the administrator but the user's own biases about content. A biochemist who researches Viagra will not have messages containing the word “Viagra” flagged as spam, because “Viagra” may well appear in his or her legitimate messages, whereas an ordinary filter might treat any message containing that word as spam. Statistical filters should consider not only the content of a message but also the transport mechanism through which it was sent. Typically, statistical filtering techniques use single words when calculating whether a message should be flagged as spam, but a more powerful calculation can be made by taking two or more words into consideration together.
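To make the learning step concrete, here is a minimal sketch in Python of a filter that learns from a user's spam and non-spam judgments. It is an illustration under stated assumptions, not the method used by any particular product: the class name `StatisticalFilter`, the `tokenize` helper, the add-one smoothing, and the `use_pairs` switch (which adds adjacent word pairs to the single words) are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text, use_pairs=False):
    """Split text into lowercase words; optionally add adjacent word pairs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if use_pairs:
        words = words + [a + " " + b for a, b in zip(words, words[1:])]
    return words


class StatisticalFilter:
    """Hypothetical naive Bayes style filter trained by one user's judgments."""

    def __init__(self, use_pairs=False):
        self.use_pairs = use_pairs
        # For each label, how many messages contained a given token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one user judgment; label is 'spam' or 'ham' (non-spam)."""
        self.messages[label] += 1
        self.counts[label].update(set(tokenize(text, self.use_pairs)))

    def spam_probability(self, text):
        """Estimate the probability that text is spam for this user."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        # Prior odds from the mix of messages seen, with add-one smoothing.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in set(tokenize(text, self.use_pairs)):
            # Smoothed share of spam / ham messages that contain this token.
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


# The biochemist's own judgments teach the filter that "Viagra" is legitimate.
biochemist = StatisticalFilter()
biochemist.train("Viagra trial results for the lab meeting", "ham")
biochemist.train("Updated Viagra assay data attached", "ham")
biochemist.train("Buy cheap pills online now", "spam")
biochemist.train("Cheap pills, buy now, limited offer", "spam")

print(biochemist.spam_probability("New Viagra assay results"))  # well below 0.5
print(biochemist.spam_probability("Buy cheap pills now"))       # well above 0.5
```

A different user who marks Viagra advertisements as spam would train the same code to the opposite conclusion for that word, which is the sense in which the filter carries the user's biases. Passing `use_pairs=True` lets a phrase such as "free money" count as evidence in its own right, even when "free" alone appears in both kinds of mail.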

Software that implements statistical filtering includes SpamBayes, Bogofilter, and DSPAM, e-mail programs such as Mozilla Thunderbird and Mailwasher, and later revisions of SpamAssassin. Another interesting project is CRM114, which hashes phrases and performs Bayesian classification on them. POPFile, a free mail filter, is also available; it uses Bayesian filtering to sort e-mail into as many categories as you want (family, coworkers, friends, spam, and so on).
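Sorting into several categories rather than just spam and non-spam uses the same idea with more labels. The sketch below is a generic multinomial naive Bayes classifier written for illustration only; it is not taken from POPFile or any other program named above, and the class name `CategorySorter` and its methods are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


class CategorySorter:
    """Hypothetical Bayesian sorter for any number of user-defined categories."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.message_counts = Counter()          # category -> messages seen
        self.vocabulary = set()

    @staticmethod
    def _words(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, category):
        """Record that the user filed this message under the given category."""
        words = self._words(text)
        self.word_counts[category].update(words)
        self.message_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most likely category, or None if nothing was trained."""
        if not self.message_counts:
            return None
        total_messages = sum(self.message_counts.values())
        # One extra slot stands in for words never seen during training.
        vocab_size = len(self.vocabulary) + 1
        words = self._words(text)
        best, best_score = None, -math.inf
        for category, n in self.message_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each word.
            score = math.log(n / total_messages)
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + vocab_size))
            if score > best_score:
                best, best_score = category, score
        return best


sorter = CategorySorter()
sorter.train("dinner at grandma's house on sunday", "family")
sorter.train("quarterly report due before the staff meeting", "coworker")
sorter.train("win a free prize click now", "spam")
print(sorter.classify("free prize inside"))
```

Because the categories are just dictionary keys, adding a new one takes nothing more than training a message under a new label.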