Assigned: November 15, 2017
Due: December 6, 2017
In this assignment, you will create a Naive Bayes classifier for detecting e-mail spam, and you will test your classifier on a publicly available spam dataset using 5-fold cross-validation.
The Spambase data set consists of 4,601 e-mails, of which 1,813 are spam (39.4%). The data set archive contains a processed version of the e-mails wherein 57 real-valued features have been extracted and the spam/non-spam label has been assigned; you should work with this processed version of the data. The archive also contains a description of the extracted features as well as some simple statistics over those features.
To estimate the generalization (testing) error of your classifier, you will perform cross-validation. In k-fold cross-validation, one would ordinarily partition the data set randomly into k groups of roughly equal size and perform k experiments (the "folds") wherein a model is trained on k-1 of the groups and tested on the remaining group, where each group is used for testing exactly once. The generalization error of the classifier is estimated by the average of the performance across all k folds.
While one should ordinarily perform cross-validation with random partitions, for consistency and comparability of your results, you should partition the data into 5 groups as follows: Consider the 4,601 data points in the order they appear in the processed data file. Group 1 will consist of points {1-920}, Group 2 will consist of points {921-1840}, Group 3 will consist of points {1841-2760}, Group 4 will consist of points {2761-3680}, and Group 5 will consist of points {3681-4601}. Finally, Fold k will consist of testing on Group k a model obtained by training on the remaining four groups.
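The fixed partition above can be sketched in code. The function below is a minimal illustration (the name `make_folds` and the 0-based index convention are my own choices, not part of the assignment): it returns, for each fold, the test indices and the training indices over the 4,601 points taken in file order.

```python
def make_folds(n=4601, k=5):
    """Return a list of (test_indices, train_indices) pairs, one per fold.

    Points are taken in the order they appear in the processed data file;
    indices are 0-based, so Group 1 (points 1-920) is indices 0-919.
    """
    size = n // k  # 920 points per group
    folds = []
    for g in range(k):
        start = g * size
        # The last group absorbs the remainder, giving it 921 points.
        end = n if g == k - 1 else start + size
        test = list(range(start, end))
        train = list(range(0, start)) + list(range(end, n))
        folds.append((test, train))
    return folds
```

Each index appears in a test set exactly once, so averaging the per-fold test error gives the cross-validation estimate described above.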
The 57 features are real-valued, and one can model the feature distributions in simple or complex ways. For our assignment, model the features as simple Boolean random variables. Threshold each feature at its overall mean value (available in the Spambase documentation), and simply compute the fraction of the time that the feature value is above or below that overall mean for each class. In other words, for feature fi with overall mean value mui, estimate

    Pr[fi <= mui | spam],  Pr[fi > mui | spam],
    Pr[fi <= mui | non-spam],  Pr[fi > mui | non-spam],

and use these estimated values in your Naive Bayes predictor, as appropriate.
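The estimation and prediction steps above might be sketched as follows. This is one possible implementation, not the required one; the function names (`train_nb`, `predict_nb`), the use of log-probabilities to avoid underflow, and the `floor` parameter (the 0.0014 replacement value from the note below) are all my own choices.

```python
import math

def train_nb(X, y, means):
    """Estimate the class priors and Pr(f_i > mu_i | class).

    X: list of feature vectors; y: list of 0/1 labels (1 = spam);
    means: the overall mean mu_i of each feature (from the Spambase docs).
    Since the features are modeled as Booleans, it suffices to store
    the fraction of each class's points that lie above each mean.
    """
    priors, p_above = {}, {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        p_above[c] = [sum(1 for x in rows if x[i] > means[i]) / len(rows)
                      for i in range(len(means))]
    return priors, p_above

def predict_nb(x, means, priors, p_above, floor=0.0014):
    """Classify one point; any zero probability is replaced by `floor`."""
    scores = {}
    for c in (0, 1):
        # Sum log-probabilities rather than multiplying raw probabilities.
        s = math.log(priors[c])
        for i, mu in enumerate(means):
            p = p_above[c][i] if x[i] > mu else 1.0 - p_above[c][i]
            s += math.log(max(p, floor))
        scores[c] = s
    return max(scores, key=scores.get)
```

Working in log-space is equivalent to the product form of the Naive Bayes rule but is numerically safer with 57 factors.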
To avoid any issues with zero probabilities, if any of the estimated probability values is 0, simply replace it with the small value 0.0014 so that you never multiply by 0.
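One way to apply this fix is to clamp the estimated table once after training. The sketch below assumes the `p_above` dictionary shape from a training step that stores Pr(f_i > mu_i | class) per class; clamping to the upper end as well is my own addition, since a stored value of exactly 1 would make the complementary below-mean probability 0.

```python
def floor_probs(p_above, floor=0.0014):
    """Clamp each estimate into [floor, 1 - floor].

    p_above maps class -> list of Pr(f_i > mu_i | class). Clamping both
    ends keeps both the above-mean probability p and the below-mean
    probability 1 - p strictly positive, so no factor in the Naive
    Bayes product is ever 0. The value 0.0014 is the one given above.
    """
    return {c: [min(max(p, floor), 1.0 - floor) for p in ps]
            for c, ps in p_above.items()}
```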