Daniel Shiffman wrote a bit of code for Bayesian filtering. Bayesian filtering is used to classify messages as spam or ham based on their content. Recently, Twitter has been getting a lot of spam, and I was wondering how well Bayesian filtering would work on Twitter spam. Some Twitter data is available at http://www.public.asu.edu/~mdechoud/datasets.html under a Creative Commons license.
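
To make the idea concrete, here is a minimal sketch of a naive Bayes spam filter in Python. It is not Daniel Shiffman's code; the class name `NaiveBayesFilter`, the `tokenize` helper, the add-one smoothing, and the toy training messages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word-like tokens are good enough for a sketch.
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical two-class (spam/ham) naive Bayes filter."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count every token of the message under its label.
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        vocab_size = len(vocab) + 1  # one extra slot for unseen words
        total_messages = sum(self.message_counts.values())
        tokens = tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed log prior for the label.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for word in tokens:
                # Add-one smoothing so unseen words never give log(0).
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + vocab_size))
            log_scores[label] = score
        # Normalize the two log scores into a probability of spam.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        ham = math.exp(log_scores["ham"] - top)
        return spam / (spam + ham)


# Toy usage with made-up messages (not taken from the dataset above).
bayes = NaiveBayesFilter()
bayes.train("win a free prize now", "spam")
bayes.train("free followers click here", "spam")
bayes.train("lunch at noon tomorrow", "ham")
bayes.train("see you at the meeting", "ham")
print(bayes.spam_probability("free prize"))       # should land above 0.5
print(bayes.spam_probability("meeting at noon"))  # should land below 0.5
```

The filter only knows the words it was trained on, so a word list built from one kind of message may say little about another kind.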

I modified the code to run on Hadoop and let the system run. The results were not very encouraging, perhaps because the datasets for email spam and Twitter spam are different. Hopefully, more spam and ham words will become available for classifying Twitter spam. Till then, this idea has to wait.