
Using the Naive Bayes Classifier for Statistical Analysis

The Naive Bayes classifier examines a set of data and produces an estimate of how to classify it. The paper looks at using this classification technique to decide whether the content of an email meets certain characteristics that mark it as spam. The classifier itself is based on Bayes' theorem from statistics. On the statistics side, the theorem combines a set of indicators to estimate the probability of an outcome for a given person. I have seen it used mostly in medicine, where descriptors such as family history, height, weight, and age are combined to estimate a person's chance of having a certain disease.
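For intuition, here is a tiny worked example of Bayes' theorem in that medical setting. This is only a sketch; all of the numbers are invented for illustration.

# Bayes' theorem: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive)
# All numbers below are invented for illustration.
p_disease = 0.01            # prior: 1% of people have the disease
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive

print(round(p_disease_given_pos, 3))  # 0.161: a positive test still leaves the disease fairly unlikely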

This is the same type of prediction that occurs with spam filtering. Certain parts of an email are present every time: the subject line and the to, cc, and bcc addresses that accompany the message. There is a body, which contains the actual message, and there may be an attachment that can also be checked. Some mail clients scan attachments for viruses before they are delivered. Gmail is one client that scans attachments before they are received; if an attachment appears harmful, the user is notified and given the option to accept it. Microsoft clients like Outlook automatically block suspicious attachments as part of their system, and they have to be explicitly allowed through. This includes images that appear in the body of the email.

The problem with an approach like this is that all elements of the email need to be combined before they can be analyzed. This calls for a sanitization process that cleans up each part of the email so it can be analyzed correctly. Sometimes the subject alone is a dead giveaway that an email is spam, so sanitization can also apply extra weight to the parts of an email, including attachments, where spam usually shows up. The first thing the analyzer needs to do is walk through each of these elements.
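A minimal sketch of what such a sanitization step might look like; the field names and per-part weights are assumptions for illustration, not from the paper.

import re

# Per-part weights are made up: the subject is often a giveaway,
# so it is weighted more heavily than the body.
PART_WEIGHTS = {"subject": 2.0, "body": 1.0, "attachment_names": 1.5}

def sanitize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def sanitize_email(email):
    """Clean each part of the email and pair it with its weight."""
    return {part: (sanitize(email.get(part, "")), weight)
            for part, weight in PART_WEIGHTS.items()}

email = {"subject": "WIRE Money NOW!!!", "body": "Please wire funds today."}
print(sanitize_email(email))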

Each element is checked against a list of words commonly used in spam emails. Everyone has received some type of spam email asking for money, which is usually a dead giveaway. Another pattern that has cropped up over the years is the email asking people to wire money. Wiring money is a quick way for attackers to receive funds untraced, since there is no information tying them to the wired account. Using this well-known example, one could say that any email asking for wired funds is spam. "Wire" by itself cannot be added to the list, though; it would have to be more specific, or a batch of emails about wiring a house would be marked as spam. It would have to be analyzed as the phrase "wire money" or "wire funds."
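A minimal sketch of why the phrase has to be matched as a unit rather than word by word; the phrase list is illustrative.

# Matching the phrase, not the lone word, keeps "wiring a house" safe.
SPAM_PHRASES = ["wire money", "wire funds"]

def contains_spam_phrase(body):
    body = body.lower()
    return any(phrase in body for phrase in SPAM_PHRASES)

print(contains_spam_phrase("Please wire money to this account"))   # True
print(contains_spam_phrase("The electrician will wire the house"))  # False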

A phrase like this can now be added to the sanitization list, which can be built up from phrases as well as individual words. If an email contains a high number of entries from either list, it can be scored as spam. The Naive Bayes classifier does not output a flat yes or no; it works toward a score of the chance that the email is spam. This takes some training, where the system analyzes emails against their actual scores. Feedback from users can also help adjust the scores so that good emails are not filtered out: the user can flag emails they find in the spam folder, and the system can then revisit the Naive Bayes score, adjusting it if the email was borderline. If an email is far outside the range of a typical spam score, it can be thrown out as an outlier.
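A minimal sketch of that feedback loop, with invented numbers: a borderline score near the threshold is nudged toward the user's label, while a far-out score is ignored as an outlier.

THRESHOLD = 0.5    # flag as spam above this score (illustrative)
BORDERLINE = 0.15  # how close to the threshold counts as borderline

def apply_feedback(score, user_says_spam):
    """Nudge a borderline score toward the user's label; ignore outliers."""
    if abs(score - THRESHOLD) > BORDERLINE:
        return score  # far from the threshold: treat the feedback as an outlier
    return score + 0.05 if user_says_spam else score - 0.05

print(apply_feedback(0.55, user_says_spam=False))  # pulled back toward the threshold
print(apply_feedback(0.95, user_says_spam=False))  # 0.95: outlier, unchanged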

The wire-money example shows why. If someone receives a personal email from a friend asking them to wire money, that particular message might not be spam, but in most cases it should still be listed as spam. One person who receives a legitimate wiring email does not outweigh the hundreds who receive fake ones.

The classifier becomes much more powerful if it sits on a server in the middle, analyzing all emails passing through to each account. The paper calls this an indiscriminate attack because the email is not focused on a particular user, so it has to be analyzed through traffic patterns. A common example is an alert telling a user that their password has expired, sent from a phishing site pretending to be a legitimate one. Its score can be compared against the status quo of other large bulk emails, and the sender account has to be analyzed too so that legitimate mass emails are not filtered out.

An example would be emails sent from an .edu, .gov, or .mil address. These come from verified schools or government agencies that may send large email blasts at one time. If the Naive Bayes classifier only looks at volume, it will place a large number of good emails into the spam filter. Learning can help here as well, so that the classifier takes well-known organizational senders into account when deciding whether to filter them. This could also be done by listing certain accounts as verified, much as Twitter does to show that an account is run by a real person.

With a real verified account, these emails would not be placed on the spam list. Verification would need more than just the email name, because another party could spoof the address, get listed as a real account, and then use it to send spam, rendering the classifier useless. The way around this is to collect more information from the verified account, including the server address and name sending the emails along with packet information. These could then be run through a deeper scan to confirm they are genuine. This is a much better implementation because it is far harder for spammers to spoof.

The last attack is the hardest to protect against, because it pretends to be a legitimate company in the email itself. This was mentioned briefly above and takes extra analysis. Like the keyword list, this defense would take well-known accounts and check certain properties of a message against a known legitimate email. This is where the verified-account system can be adapted again: the email would be compared against this additional information before a score is granted.

When calculating the score, the words on the list add to the score, and the vector built from the list is run against the baseline for both lists. This is where the training set matters: the baseline score sets the standard the spam filter works from. Each listed word is awarded a point value, so each occurrence raises the score, while safe words do not raise the score at all. A threshold on the final score then flags the email as spam or not. The paper uses a Boolean system to count points, meaning each word counts on a yes-or-no basis.

Under a system like this, all words are treated as equal. Since we are using scores, the exact phrase does not have to match when calculating points. For example, I previously mentioned the key phrase "wire money" and concluded that it would immediately flag an email as spam, because the phrase is very specific to money scams. Our system, however, looks at both words independently. Since it is a yes-or-no system, each listed word that appears in the mail is assigned a score of one, so these two words together contribute two points.

This suggests the score should be higher when an email contains both words, and that is what the algorithm captures. It compares the points accumulated by the email against the total number of words used: there is a count of flagged words and a count of all words, which together give a high or low score for the whole email. That weight determines whether the email is flagged as spam. The lower the score, the lower the chance the email is spam; if the score is high, there is a high chance the email is spam.
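As a rough illustration, here is a minimal sketch of the Boolean point counting described above; the word list and the 0.2 threshold are made up for the example, not taken from the paper.

# Illustrative spam word list.
SPAM_WORDS = {"wire", "money", "funds", "urgent"}

def boolean_spam_score(body):
    """Score an email as the fraction of its words that are flagged."""
    words = body.lower().split()
    if not words:
        return 0.0
    # Boolean system: each word scores 1 if it is on the spam list, else 0.
    points = sum(1 for w in words if w in SPAM_WORDS)
    return points / len(words)

email = "please wire money to this account"
print(boolean_spam_score(email))        # 0.333...: flagged words over total words
print(boolean_spam_score(email) > 0.2)  # True: over the threshold, flag as spam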

Now looking at the math of the classifier, we work with a pair of variables. For the spam filter, we have the set of spam keywords and a set of safe words; the user should be able to add safe words to the list in case a name or a topic they commonly message about shows up as a spam keyword. We look at the distribution of the pair through the conditional distribution of the second value: the more frequently words from a list show up, the higher that conditional distribution will be.

From a graph standpoint, think of a very dense plot: the denser a region of the graph, the more likely an observation is to fall in that region. If the spam words have a very dense distribution in an email, there is a high chance it is spam. From this we use a pair of the form:

X | Y = value ~ Probability(X | Y = value).

Looking at the distribution of X for each value of Y gives a probability for each feature. From here we can split the classifier on the good keywords and the spam keywords, which gives us the posterior:

Probability(Y = value | X = x).

In the program, each value has a weight and a given set of numbers. If the density of good keywords is high and the density of spam keywords very low, there is a very low chance of spam; if the opposite is true, the email will probably count as spam. This can also be user-sensitive: the user can be given the option to add words to the safe and spam lists. If the user always gets email about a topic they have no interest in, they can add it to the list, which increases the accuracy of the training list.
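A minimal sketch of this split, assuming invented per-class word counts rather than a trained corpus: it computes Probability(spam | words) from per-class word frequencies under the usual independence assumption.

import math

# Hypothetical per-class word counts, standing in for a trained corpus.
spam_counts = {"wire": 40, "money": 50, "urgent": 30}
safe_counts = {"meeting": 45, "lunch": 25, "money": 5}
vocab = len(set(spam_counts) | set(safe_counts))

def log_likelihood(words, counts):
    """Sum of log P(word | class) with add-one smoothing for unseen words."""
    total = sum(counts.values())
    return sum(math.log((counts.get(w, 0) + 1) / (total + vocab)) for w in words)

def prob_spam(body, prior_spam=0.5):
    """Posterior probability that the email is spam, by Bayes' theorem."""
    words = body.lower().split()
    log_spam = math.log(prior_spam) + log_likelihood(words, spam_counts)
    log_safe = math.log(1 - prior_spam) + log_likelihood(words, safe_counts)
    return 1 / (1 + math.exp(log_safe - log_spam))

print(round(prob_spam("please wire money"), 3))  # close to 1: dense in spam words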

The purpose of adding more keywords is to lower the weight of any single word in the calculation. This lowers the significance of individual values and creates fewer false flags. The training set determines the baseline for both classes, and the added keywords help reduce miscalculations. There will be some crossover where words appear on both lists. "Wire" could appear on the spam list for its wiring-money sense, while a user who is an electrician adds it to the safe list. The problem is that the word then assigns a value of true to both lists.

Initially this does no harm, but as the lists grow, the chance of duplicated words also grows, and the weight of the classifier increases. When the weight increases, the potential value of both the good and bad keywords increases with it. To combat this, the training set has to be recalculated to reset the baseline values for the calculation.

The test in the paper was carried out on thousands of emails. These were preprocessed beforehand and focused only on the content of the email, which means the classifier could be improved further by the header checks mentioned above. There was an equal number of good and bad emails in the training set, and the results showed that the Naive Bayes classifier was able to label the emails correctly.

Increasing the strength of the attack does not noticeably degrade the classifier, especially because the weights can be adjusted after training. Standard Naive Bayes with no training comes close on ordinary spam but does not perform as well on the targeted attacks. With the training set in place, the success rate is much higher, thanks to the active keyword base being improved by user feedback and input.

The conclusion is that even if the attacker knows the word base used to determine spam, weight adjustments can be made almost immediately on the user's end to improve the classifier. When the attack is more specific and less random, the classifier performs quite well; it is the unknown attack thrown into the mix where the results may not turn out as well as expected.


The Naive Bayes classifier works as a spam predictor because it gives an unbiased score. The same property is useful in projections involving human statistics, because the baseline just looks at the facts. This is mostly used for medical diagnosis but can also have a more novel use. Since graduation is coming up, one might want to look at all the aspects of a location when deciding where to move.

The United States is such a large country that each region has different attributes, which makes it tough for a job seeker to also research whether they will fit into the area they are moving to. Just like the spam filter, one can build a list of attributes and assign points for each region. Each region then receives a score based on whether those activities or attributes can be found in the area. This can be broken down further by state or by city.

These attributes can be assigned to states and to the regions within them, which can lead to a very complex system. When spam keywords are scored, they are only looked at from one perspective: the yes-or-no system that decides whether an email is spam is very straightforward, and that same straightforward approach can be used to tackle this more complex problem.

Since the United States is so large, it does not matter if multiple areas receive high scores from the user. The idea behind the program is that the user inputs a list of items that are important to them, and this list of keywords is run against different regions of the country. For example, if a person really likes the ocean, they put ocean on their list. Only a limited number of states actually touch the ocean and offer that as an attraction, so those states get a true value on the Boolean scale, which results in a score of one. States that do not touch the ocean receive a score of zero from their assignment of false.
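A minimal sketch of that Boolean assignment, with a hypothetical attribute table standing in for the real region data.

# Hypothetical attribute sets for a few regions.
REGION_ATTRIBUTES = {
    "nc-coastal":  {"ocean", "beach", "warm"},
    "nc-mountain": {"ski", "hiking", "snow"},
    "nc-piedmont": {"business", "nightlife", "colleges"},
}

def score_regions(wishlist):
    """Each region scores 1 point per wished-for attribute it offers."""
    return {
        region: sum(1 for item in wishlist if item in attrs)
        for region, attrs in REGION_ATTRIBUTES.items()
    }

print(score_regions(["ocean", "warm", "ski"]))
# {'nc-coastal': 2, 'nc-mountain': 1, 'nc-piedmont': 0}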

Since regions in certain parts of the country resemble one another, similar regions will also receive similar scores. This is expected, and it gives the user more options to look through. It helps a job search because it checks off all of the attributes the user is looking for: someone who wants a warm location near the ocean can find it in the Southeast as well as on the West Coast, which also eliminates states in the North and the Midwest.

Now some regions can be removed from the states that received high scores. A southern state like South Carolina, which is warm and has an ocean, also has a mountain region a few hours from the coast that would need to be eliminated from the score. Otherwise the user would be deterred from the system by receiving low scores for areas that actually match.

The accuracy comes from the attribute lists for the regions of each state, which is something that has to be developed over time. Attributes need to be added to each region of each state, so for this to work we need to build out a set of data for every state region.

The easiest way to do this in Python is to import .txt files. Python has built-in libraries that make text and string manipulation easy. What we are trying to do is build a word-count base to compare the input list against.
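As a quick illustration of that setup, here is a minimal sketch using only the standard library; the file path is hypothetical.

from pathlib import Path
from collections import Counter

# Hypothetical layout: data/nc-coastal/attributes.txt and so on.
def load_region_counts(path):
    """Read a region's .txt file and count each lowercase word."""
    words = Path(path).read_text().lower().split()
    return Counter(words)

counts = load_region_counts("data/nc-coastal/attributes.txt")
print(counts.most_common(5))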

To test the program, I decided to focus on one state. Since I currently live in North Carolina, I built my test around it. North Carolina splits into three distinct regions: the coastal region, which touches the ocean, is flat, and offers most coastal options; the Piedmont, which hosts fertile ground with much of the agriculture and large business; and the mountains, which offer outdoor options like skiing and hiking.

The one downside with this approach, from what I could find, is that every variant of a word has to be included in the list: if a user types skiing instead of ski, it will not be counted (a simple stemming step, sketched after the lists below, could relax this). I built out my lists as follows.

Mountain (NC Mountains):

kayaking rivers mountain cabin spa whitewater rapids seasons vacation mountains fall colors color snow snowy skiing ski resort spring crisp air outdoors nature kayaking hiking trails camping zip lining country snowboard peaks rocks fly fishing trail beauty

Coastal (NC Coast and Beaches):

beautiful nature beaches beach island ocean seashores small towns pristine coast relaxation lighthouses beaches rivers sounds tee golf courses course seafood exquisite beauty wild horses native wildlife picturesque history civil war sand

Piedmont (NC Piedmont):

music greats cities cosmopolitan feel charm arts nightlife dining wine beer exotic animals natural habitats zoo handcrafted pottery upwind spas outdoors forest suburbs corporations business technology highway colleges young
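
As mentioned above, exact matching misses variants like skiing for ski. A crude suffix-stripping step is one way to relax this; the sketch below and its suffix list are made up for illustration and are not part of my program.

def crude_stem(word):
    """Strip a few common suffixes so 'skiing' and 'ski' compare equal."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("skiing"))   # "ski"
print(crude_stem("beaches"))  # "beach"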

Since this is only a test, the lists do not need a lot of extra detail for the demonstration. For a real application they would need to be more thorough, as they are the basis for a complete breakdown of all regions. Now the Naive Bayes classifier completes the calculations to give the probability points for each region.

The program takes input from the user after a quick introduction and creates data structures for each region and for the user list. These are built with the basic Python data structures recommended in the documentation. Python also has its own Bayes package, which I used as a guide to build out my calculations (NaiveBayes). The values come out at a decimal level, where a higher value signifies a better chance of a match.

These datasets are calculated against one another using the Bayes formula found in the Python package, isolated on its own and combined with the basic Python word counting from the documentation. The data is stored in a data folder, with each region kept in its own folder, so the basic structure is: data > nc-coastal, nc-mountain, nc-piedmont. Since every United States state has a two-letter code, I set that as the prefix, which orders the regions alphabetically. This keeps the program easily scalable: more features can be added simply by adding more folders.

Once new folders are added, the math needs to be run again for each section. I did not have time to make it more robust, but the calculations could be improved to loop over every folder automatically, so that no extra lines would have to be added to the program; all that would need to happen is for folders to be dropped into the data folder.
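A minimal sketch of that improvement, assuming the data > region folder layout described above: it discovers regions from the folder names instead of hard-coding them.

import os

def discover_regions(data_dir="data"):
    """List region names by reading the subfolders of the data folder."""
    return sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )

regions = discover_regions()
word_counts = {region: {} for region in regions}
list_search = {region: 0.0 for region in regions}
print(regions)  # e.g. ['nc-coastal', 'nc-mountain', 'nc-piedmont']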

In conclusion, the classifier has a lot of room for improvement, especially in the code and its reusability. It is scalable and accommodates more features. The classifier does identify the keywords correctly, and it might work even better with more filters and string manipulation beyond the regular-expression word split used here. The all-lowercase filter may also need to be removed once more specific words show up, but for this example I think it works quite well.

With a larger word list, some of the weights will come down; a specific attack on the coastal region reached a score as high as 15, when keeping the weight on a 0–1 scale would be much better for the user. The decimal scores could also be converted into percentages, giving the user a matched percentage of compatibility for each region in the search.
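A minimal sketch of that conversion, assuming the three raw region scores computed by the program; the numbers are illustrative.

# Hypothetical raw scores from the three region calculations.
raw_scores = {"nc-coastal": 15.0, "nc-mountain": 4.0, "nc-piedmont": 1.0}

# Normalize so the scores sum to 1, then present them as percentages.
total = sum(raw_scores.values())
percentages = {region: 100 * score / total for region, score in raw_scores.items()}

for region, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{region}: {pct:.1f}% match")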

[Screenshot: the program selecting the coastal region]
[Screenshot: the program selecting the piedmont region]
[Screenshot: the program selecting the mountain region]

Python Code Built from NaiveBayes and Python Docs:

import re
import math
import os

def list_setup(attrib):
    """Lowercase the text and split it into words on non-word characters."""
    attrib = attrib.lower()
    return re.split(r"\W+", attrib)

def wordcount(text):
    """Count how many times each word appears in a list of words."""
    totalcount = {}
    for word in text:
        totalcount[word] = totalcount.get(word, 0.0) + 1.0
    return totalcount

# Word totals across all regions, per-region word counts,
# and a count of training files seen per region (used as the priors).
words_from_text = {}

word_counts = {
    "nc-coastal": {},
    "nc-mountain": {},
    "nc-piedmont": {}
}

list_search = {
    "nc-coastal": 0.0,
    "nc-mountain": 0.0,
    "nc-piedmont": 0.0
}

# Walk the data folder and tally words from each region's .txt files.
for root, _, filenames in os.walk("data"):
    for filename in filenames:
        path = os.path.join(root, filename)
        if not filename.endswith(".txt"):
            continue
        if "nc-coastal" in path:
            region = "nc-coastal"
        elif "nc-mountain" in path:
            region = "nc-mountain"
        elif "nc-piedmont" in path:
            region = "nc-piedmont"
        else:
            continue  # skip files that belong to no known region
        list_search[region] += 1
        with open(path) as f:
            attrib = f.read()
        text = list_setup(attrib)
        counts = wordcount(text)
        for word, count in counts.items():
            words_from_text[word] = words_from_text.get(word, 0.0) + count
            word_counts[region][word] = word_counts[region].get(word, 0.0) + count

print("Welcome to the United States Region Finder")
print("Please enter a list of your favorite things separated by a space:")

user_attributes = input("")
text = list_setup(user_attributes)
counts = wordcount(text)

# Prior probability of each region: its share of the training files.
piedmont_count = list_search["nc-piedmont"] / sum(list_search.values())
mountain_count = list_search["nc-mountain"] / sum(list_search.values())
coastal_count = list_search["nc-coastal"] / sum(list_search.values())

coastal_bayes = 0.0
mountain_bayes = 0.0
piedmont_bayes = 0.0

# Accumulate the log of each word's per-region frequency relative to its
# overall frequency, following the Bayes formula.
for word, count in counts.items():
    if word not in words_from_text:
        continue

    selected_word = words_from_text[word] / sum(words_from_text.values())
    selected_word_piedmont = word_counts["nc-piedmont"].get(word, 0.0) / sum(word_counts["nc-piedmont"].values())
    selected_word_mountain = word_counts["nc-mountain"].get(word, 0.0) / sum(word_counts["nc-mountain"].values())
    selected_word_coastal = word_counts["nc-coastal"].get(word, 0.0) / sum(word_counts["nc-coastal"].values())

    if selected_word_piedmont > 0:
        piedmont_bayes += math.log(count * selected_word_piedmont / selected_word)
    if selected_word_mountain > 0:
        mountain_bayes += math.log(count * selected_word_mountain / selected_word)
    if selected_word_coastal > 0:
        coastal_bayes += math.log(count * selected_word_coastal / selected_word)

print("North Carolina - Coastal  Region :", math.exp(coastal_bayes + math.log(coastal_count)))
print("North Carolina - Mountain Region :", math.exp(mountain_bayes + math.log(mountain_count)))
print("North Carolina - Piedmont Region :", math.exp(piedmont_bayes + math.log(piedmont_count)))

Resources

  • NC Coast and Beaches — North Carolina Travel & Tourism. (n.d.). Retrieved December 14, 2015, from http://www.visitnc.com/coast
  • NC Mountains — North Carolina Travel & Tourism. (n.d.). Retrieved December 14, 2015, from http://www.visitnc.com/mountains
  • NC Piedmont — North Carolina Travel & Tourism. (n.d.). Retrieved December 14, 2015, from http://www.visitnc.com/piedmont
  • NaiveBayes 1.0.0 : Python Package Index. (n.d.). Retrieved December 14, 2015, from https://pypi.python.org/pypi/NaiveBayes
  • Peng, J., & Chan, P. (n.d.). Revised Naive Bayes classifier for combating the focus attack in spam filtering. 2013 International Conference on Machine Learning and Cybernetics.
  • Python 3.5.1 documentation. (n.d.). Retrieved December 14, 2015, from https://docs.python.org/3/
