Efficient Detection of SPAM messages and SPAM zombies in the Internet using Naïve-Bayesian and Sequential Probability Ratio Test (SPRT)

The Internet is a global system of interconnected computer networks that serves billions of users worldwide. Compromised machines on the Internet allow attackers to launch security attacks such as DDoS, spamming, and identity theft, and they are one of the major security threats on the Internet. In this paper we address this issue by using Naïve-Bayesian classification and SPRT to automatically identify compromised machines in a network. Spamming gives attackers an incentive to recruit large numbers of compromised machines, commonly known as spam zombies, which generate SPAM messages while hiding the attacker's identity. We used Naïve-Bayesian and manual methods to detect the SPAM messages and the SPRT technique to identify the spam zombies from the SPAM messages. We show that the Naïve-Bayesian approach reduces the error rate, false positives, and false negatives compared to the manual approach when detecting SPAM messages. Our evaluation, based on a one-day email trace collected in our organization's network, shows that Naïve-Bayesian classification and SPRT are effective and efficient in automatically detecting SPAM messages and compromised machines in a network.



Network level technique: Detecting spam messages at the network level is very difficult. The approaches used at this level are domain verification and challenge-response systems. Domain verification uses the sender name, domain, and route information obtained through SMTP to filter messages. For example, a message cannot be sent along the incoming route of a network path. The challenge-response technique probes the sender by asking questions to ensure that the sending side is not an automated bot. This technique reduces the huge number of messages sent by automated bots.
User level technique: The two popular methods used at this level are content based and parameter based (whitelists and blacklists). The content based approach discriminates genuine messages (ham) from spam based on the number of spam words in the mail content and the weight of those words. The parameter based approach uses the basic parameters of the mail for discrimination.
Policy based technique: This is not a technical approach. Pricing e-mail and prosecuting spammers under the law can reduce some spamming.
Efficient detection of spam messages improves the performance of e-mail applications and providers by minimizing false positives and false negatives. But spam message detection alone does not provide an efficient solution to this problem. Detecting spam zombies along with the spam messages improves performance further, because it prevents the zombies from sending spam messages in the future. In this paper we define an efficient approach to detect both spam messages and spam zombies. We divide the process into two phases: in the first phase we detect spam messages using a manual method and a Naïve-Bayesian spam filter, and in the second phase SPRT is used to detect the spam zombies.

II. RELATED WORK
In spam classification, the existing spam categorization technologies are very complicated and give poor results compared to the Naïve-Bayesian spam filter.
Choi et al. proposed a technique to detect bots based on the DNS queries they generate; bots are detected from the similarity in the group activity of their DNS traffic. In [6], botnets are detected by passive analysis of flow data. Xie et al. developed an effective tool named DBSpam to detect proxy-based spamming activities in a network, relying on the packet symmetry property of such activities [8]. We intend to identify all types of compromised machines involved in spamming, not only the spam proxies that translate and forward upstream non-SMTP packets (for example, HTTP) into SMTP commands to downstream mail servers as in [5].
May 15, 2013
BotHunter uses IDS traces [3] to detect bots by comparing inbound intrusion alarms with outbound communication patterns. Unlike BotHunter, which depends on the specifics of the malware infection process, the SPRT algorithm focuses on any spamming activity.
An anomaly-based detection system named BotSniffer [4] identifies botnets by exploring the spatial-temporal behavioral similarity commonly observed in botnets. It focuses on IRC-based and HTTP-based botnets. In BotSniffer, flows are classified into groups based on the common server that they connect to. If the flows within a group exhibit behavioral similarity, the corresponding hosts are detected as being compromised. BotMiner [4] is one of the first botnet detection systems that is both protocol and structure independent. In BotMiner, flows are classified into groups based on similar communication patterns and similar malicious activity patterns, respectively. The intersection of the two groups is considered to be the set of compromised machines. Compared to general botnet detection systems such as BotHunter, BotSniffer, and BotMiner, SPOT is a lightweight compromised machine detection scheme that exploits the economic incentives for attackers to recruit large numbers of compromised machines. The Sequential Probability Ratio Test is a powerful statistical method that has been successfully applied in many areas of network security, such as detecting portscan activities [9], proxy-based spamming activities [10], and MAC protocol misbehavior in wireless networks.

Zombies or Bots
The term bot or zombie [11] [12] originates from the word robot. A bot is a compromised system that is controlled by a botmaster or attacker. The attacker identifies compromised systems on the Internet and uses them to generate huge volumes of traffic or spam messages towards the target machine. A compromised machine here refers to a machine running no antivirus software or an outdated version of it. A bot may send traffic to the target machine on its own, or it may be part of a substantial traffic flow attacking the target machine. Generally the botmaster gains control over an individual bot if its security level is very low. Likewise, bot groups search for poorly secured entry points (hosts or victims) from which to attack other hosts.

What Bots do
- Propagate malware
- Consume network bandwidth
- Harvest usernames and passwords
- Use P2P networks to propagate malware
- Use spam messages to earn incentives

Types of Bots
AgoBot: It is the most commonly used bot, with more than 500 versions available. It was developed in C++ and its source code is distributed under the GPL. Its advantage is that commands and scanners are easy to add: users extend the CCommandHandler and CScanner classes and add their own methods to them. It is very hard to reverse engineer. It can use protocols other than IRC and is also used for packet sniffing and sorting traffic.
SDBot: It was developed in C. Because its code is simple and poorly written, many attackers are attracted to it: it is very easy to understand and makes it easy to create new bots.

3.2 Botnet or zombie armies
A botnet [11] [12] [13] is a collection of compromised computers that are controlled by a botmaster through commands to forward malicious code, viruses, or spam. The botmaster sends commands to the bots through IRC channels or other private tools, and the bots execute them automatically. The owners of the compromised machines do not know that another computer is controlling their system. Normally home-based, less secure systems are used for these purposes. The botmaster gets complete control over the botnet, so he can consume the entire bandwidth of the network to generate and flood spam messages towards the target machine and earn incentives. Simple commands can scan for other machines to add to the botnet or can pose threats to the victim machines.

Spam Messages
Spam messages are special-purpose messages [2] intended to attract users and make them act on fraudulent claims. Like a normal message, a spam message consists of a header and a body: the header contains the details of the message and the body contains the content of the message, including the subject. Spam messages have some properties that distinguish them from other messages, including an empty To: field, a missing To: field, a large number of recipients in the Cc: field, a missing or suspicious Message-ID, the presence of a Bcc: field, and the same (but fake) address in the From: and To: fields.
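The header properties listed above can be checked mechanically. The following is a minimal sketch using Python's standard `email` package; the function name `suspicious_header_signs` and the `max_cc` cut-off of 10 recipients are illustrative assumptions, not values taken from this paper.

```python
from email import message_from_string
from email.utils import getaddresses


def suspicious_header_signs(raw_message, max_cc=10):
    """Return the list of spam-like header properties found in a raw message.

    max_cc is a hypothetical cut-off for "a large number of Cc recipients".
    """
    msg = message_from_string(raw_message)
    signs = []

    to_value = msg.get("To")
    if to_value is None:
        signs.append("missing To")
    elif not to_value.strip():
        signs.append("empty To")

    cc_addresses = getaddresses(msg.get_all("Cc", []))
    if len(cc_addresses) > max_cc:
        signs.append("many Cc recipients")

    if not msg.get("Message-ID"):
        signs.append("no Message-ID")

    if msg.get("Bcc") is not None:
        signs.append("Bcc present")

    # Same address used as both sender and recipient.
    from_addrs = {a.lower() for _, a in getaddresses(msg.get_all("From", [])) if a}
    to_addrs = {a.lower() for _, a in getaddresses(msg.get_all("To", [])) if a}
    if from_addrs & to_addrs:
        signs.append("same From and To")

    return signs
```

Such checks only flag candidates; a forged header can avoid every one of these signs, which is why the paper relies on content based filtering for the actual classification.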

Spam Filtering
The process of discriminating genuine messages from the spam messages is known as Spam filtering [14]. Spam Filtering methods are broadly classified into two types namely content based methods and parameter based methods.

Content based filtering:
This method classifies genuine messages and spam messages by considering the body of the message. Nowadays spammers are sophisticated and use fake identities instead of their original identity to send spam messages. Content based algorithms classify messages efficiently even when the spammer uses a fake identity. Some of the content based spam filtering algorithms are Bayesian, SVM, KNN, and Naïve-Bayes. The advantage of this method is that it gives good results even when the spammer uses IP spoofing. Its drawback is that it takes more time than parameter based methods.

Parameter based filtering:
This method discriminates genuine messages from spam messages by considering the parameters of the message. The parameters of a message include From, To, Received by, Subject, IP address, port addresses, etc. The advantage of this method is that it is very easy to implement and there is no need to open a message to obtain its parameters. It takes much less processing time than content based methods. It fails when the spammer uses a fake identity or IP spoofing. Some of the parameter based filtering techniques are blacklists, whitelists, and challenge/response methods. Blacklists contain the IP addresses, domain names, or e-mail addresses of known spammers. Many blacklists are available in the real world; these are called Real-time Black Lists (RBL). By checking against an RBL one can filter out spammer messages reliably. In the same way, whitelists contain the IP addresses, domain names, or e-mail addresses of trusted parties. Maintaining these two lists is a tough job for an individual user, while most organizations maintain them.
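A parameter based check of the kind described above can be sketched in a few lines. In this illustration the function name `parameter_filter`, the "UNKNOWN" outcome, and the choice to consult the whitelist before the blacklist are assumptions made for the example; the paper does not specify an order.

```python
def parameter_filter(sender_ip, sender_domain, blacklist, whitelist):
    """Classify a message from its parameters only; the body is never opened.

    blacklist and whitelist are sets of IP addresses and/or domain names.
    """
    # Assumption: trusted parties take precedence over blacklist entries.
    if sender_ip in whitelist or sender_domain in whitelist:
        return "HAM"
    if sender_ip in blacklist or sender_domain in blacklist:
        return "SPAM"
    # Neither list knows the sender; a content based filter would decide.
    return "UNKNOWN"
```

Because only set lookups are involved, the cost per message is small, which matches the speed advantage noted above; the weakness is equally visible, since a spoofed address simply lands in the "UNKNOWN" branch.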

Content Based Spam filtering Methods
Content based spam filtering [15] uses the body of the message to discriminate genuine messages from spam messages. This method extracts the spam words from the body of the message and discriminates messages based on a threshold calculated for the spam words. Words are defined as spam words by training the system with spam messages, and a weight is calculated for each word.
Naïve-Bayesian Algorithm: It is a probability based algorithm for discriminating genuine messages from spam messages. The Naïve Bayes [16] approach is a slight modification of the Bayesian algorithm. It is the most effective content based spam filtering algorithm. A more detailed description is given in 3.6.
K-Nearest Neighbors: This method of classification is based on a distance measure among the messages. The distance is computed from the features of the messages, for example as a Euclidean distance. This method does not need a training phase, so incoming messages are compared directly with the available sample messages.
Bayesian Classifier: One of the most popular spam filters available today is the Bayes classifier, which is based on a probabilistic method of classification. The discrimination is carried out based on features extracted from a collection of previous spam messages of the same kind. That is, words from previously classified spam messages occur more frequently in a new spam message than in a normal message. In this method the probability of each word is calculated in the training phase. After the word probabilities are calculated, a message is classified based on the combination of words that occur both in the training data and in the given message.
We define two classes of messages, namely genuine messages (HAM) and spam messages (SPAM). The classification of these messages is defined by the probability
\[ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \]
where \(P(x \mid c)\) is the probability of a message with feature x given the class c, \(P(c)\) is the a-priori probability of class c, and \(P(x)\) is the a-priori probability of message x.
The probability of each message is calculated using the above equation. Normally a SPAM message has a higher probability (> 0.8) than a normal message.
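As a small worked example of Bayes' rule with the quantities defined above (the class-conditional probability P(x|c), the prior P(c), and the evidence P(x)), the sketch below uses made-up numbers: the feature x is assumed to appear in 30% of SPAM and 2% of HAM messages, with equal priors. None of these values come from the paper's data.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


# Hypothetical inputs chosen only for illustration.
p_x_given_spam = 0.30   # P(x | SPAM)
p_x_given_ham = 0.02    # P(x | HAM)
p_spam = 0.5            # P(SPAM)
p_ham = 0.5             # P(HAM)

# P(x) by the law of total probability over the two classes.
p_x = p_x_given_spam * p_spam + p_x_given_ham * p_ham

p_spam_given_x = posterior(p_x_given_spam, p_spam, p_x)
p_ham_given_x = posterior(p_x_given_ham, p_ham, p_x)
```

With these assumed inputs the SPAM posterior is 0.15 / 0.16 = 0.9375, which is above the 0.8 level mentioned in the text, and the two posteriors sum to 1 because there are only two classes.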

Naïve Bayesian spam filter
This spam filter [16] uses the principles of the Bayesian spam filter to discriminate HAM from SPAM. A Bayesian filter that assumes independent features when calculating the conditional probability is called a Naïve Bayesian spam filter. Here the features in the text are considered independent of each other even though there is interdependency among them. For example, in the phrase "HURRY UP" the two words HURRY and UP are interdependent: after the word HURRY occurs, the word UP is more likely to follow than any other word. Fortunately, practical results show that this dependency does not play a major role in the SPAM classification decision.

SPRT (Sequential Probability Ratio Test)
This is a statistical approach [17] for testing between two hypotheses, the null hypothesis \(H_0\) and the alternative hypothesis \(H_1\). Based on the observations, a one-dimensional random walk moves between two boundaries (A, B). Whenever the variable touches either boundary, the test stops and the result corresponding to that boundary is accepted. This can be illustrated as follows:
\[ \Lambda_n \le A \longrightarrow \text{accept } H_0 \text{ and stop the test} \]
\[ \Lambda_n \ge B \longrightarrow \text{accept } H_1 \text{ and stop the test} \]
\[ A < \Lambda_n < B \longrightarrow \text{take an additional observation and continue the test} \]
The boundaries are calculated from the user-desired false positive rate (α) and false negative rate (β). A false positive is the case where the algorithm accepts \(H_1\) when \(H_0\) is true. A false negative is the case where the algorithm accepts \(H_0\) when \(H_1\) is true. \(\Lambda_n\) is the value of the walk after the nth observation, calculated as
\[ \Lambda_n = \ln \frac{\Pr(X_1, X_2, X_3, \ldots, X_n \mid H_1)}{\Pr(X_1, X_2, X_3, \ldots, X_n \mid H_0)} \]
Here the \(X_i\) are Bernoulli variables that are independent and identically distributed. We denote the probability of an observation under \(H_0\) as \(\theta_0\) and under \(H_1\) as \(\theta_1\):
\[ \Pr(X_i = 1 \mid H_0) = \theta_0, \qquad \Pr(X_i = 1 \mid H_1) = \theta_1 \]
This means that if the observation is \(X_i = 1\), the walk moves by \(\ln(\theta_1/\theta_0)\); otherwise it moves by \(\ln((1-\theta_1)/(1-\theta_0))\). The true false positive and false negative rates are almost the same as the user-desired ones. Another important quantity that can be obtained from the SPRT is the number of observations needed to decide whether a machine is compromised or normal. The average number of observations depends on the user-defined values of the four parameters \(\theta_0\), \(\theta_1\), α, β. System administrators have to be very careful when providing these four parameter values.
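The quantities in this section can be computed directly. The sketch below is not taken from the paper: it assumes Wald's standard approximations for the boundaries, A = ln(β/(1-α)) and B = ln((1-β)/α), assumes that the upper boundary B corresponds to accepting H1, and uses Wald's approximation for the average number of observations when H1 is true. The function names are illustrative.

```python
import math


def sprt_boundaries(alpha, beta):
    """Wald's approximate lower and upper boundaries (A, B) for Lambda_n."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_steps(theta0, theta1):
    """Increments of the random walk for X_i = 1 and for X_i = 0."""
    step_one = math.log(theta1 / theta0)
    step_zero = math.log((1.0 - theta1) / (1.0 - theta0))
    return step_one, step_zero


def expected_observations_h1(alpha, beta, theta0, theta1):
    """Approximate average number of observations to terminate when H1 is true."""
    lower, upper = sprt_boundaries(alpha, beta)
    step_one, step_zero = sprt_steps(theta0, theta1)
    # Expected stopping value of the walk divided by the expected step under H1.
    numerator = beta * lower + (1.0 - beta) * upper
    denominator = theta1 * step_one + (1.0 - theta1) * step_zero
    return numerator / denominator
```

For example, `sprt_boundaries(0.01, 0.01)` gives boundaries that are symmetric around zero, and with θ0 = 0.1 and θ1 = 0.9 a spam observation moves the walk up by ln 9 while a non-spam observation moves it down by the same amount. Tightening α and β widens the boundaries and therefore increases the number of observations needed.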

IV. SPAM MESSAGES & SPAM ZOMBIE DETECTION
We have divided the entire detection process into two phases. The first phase covers the detection of SPAM messages and the second phase covers the detection of the attack source, the SPAM zombies.

Phase I: SPAM Message Detection
We propose two methods for detecting SPAM messages namely Manual method and Naïve-Bayesian method.

Manual Method:
In the manual method the system is first trained with SPAM messages. In the training phase the system extracts the tokens or words from each SPAM message and calculates a weight for each word. Frequently occurring words usually receive the highest weights compared to normal words, and these words are considered SPAM words. The SPAM count or weight of each word is stored in the database. In the detection phase, the manual method extracts the tokens or words from the incoming message and looks up the SPAM weight of each word in the database. If the average weight of all the tokens or words in the incoming message exceeds the defined threshold value, the message is classified as a SPAM message; otherwise it is a HAM or genuine message.

Algorithm
Training phase:
- Take the collection of SPAM messages.
- Extract the words or tokenize each and every message.
- Calculate the weight for each word or token.
- Store the results in the database.

Detection phase:
- Extract the words or tokenize each and every message.
- Extract the weights of the words or tokens from the database.
- Calculate the average weight of the words extracted from the message.
- Define the threshold value.
- Check the average weight against the threshold value.
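The training and detection steps of the manual method can be sketched as follows. The paper does not give the weight formula, so this illustration assumes that a word's weight is its count in the SPAM training set divided by the count of the most frequent word; the whitespace tokenizer, the in-memory dictionary standing in for the database, and the default threshold of 0.5 are likewise assumptions.

```python
from collections import Counter


def tokenize(message):
    """Split a message into lowercase word tokens (simplified tokenizer)."""
    return message.lower().split()


def train_weights(spam_messages):
    """Training phase: weight each token by its relative frequency in SPAM."""
    counts = Counter()
    for message in spam_messages:
        counts.update(tokenize(message))
    highest = max(counts.values()) if counts else 1
    # Assumed weighting: most frequent token gets weight 1.0.
    return {token: count / highest for token, count in counts.items()}


def is_spam(message, weights, threshold=0.5):
    """Detection phase: compare the average token weight with the threshold."""
    tokens = tokenize(message)
    if not tokens:
        return False
    # Tokens never seen in SPAM training data contribute a weight of 0.
    average = sum(weights.get(token, 0.0) for token in tokens) / len(tokens)
    return average > threshold
```

Because the method is trained on SPAM only, any word that is merely common (for example "the") would also receive a high weight; this is one reason a method that also learns from HAM messages can be expected to produce fewer false positives.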

Naïve-Bayesian classifier
Naïve Bayes filtering consists of two phases of processing: the training phase and the classification phase. In the training phase, the system is trained on an equal number of SPAM and normal messages to obtain the probability of every word occurring in the messages, and these probabilities are stored as a reference for later retrieval. In the classification phase, the words are extracted from each message to calculate its spammicity and hammicity based on the previously calculated word probabilities. Spammicity is the probability of SPAM and hammicity is the probability of HAM or a genuine message. In the training phase, the probability of a word or token is calculated by the formula
\[ S_{token} = \frac{N_{spam}}{N_{spam} + N_{ham}} \]
where \(S_{token}\) is the spammicity of a token, \(N_{spam}\) is the number of spam messages that contain this token, and \(N_{ham}\) is the number of ham messages that contain this token.
In the classification phase, the total spammicity and hammicity of the message are calculated by the formulas
\[ S_{message} = \prod_{i} S_{token_i} \]
where \(S_{message}\) is the spammicity of a message. The hammicity is calculated as the product of the hammicities of each and every token:
\[ H_{message} = \prod_{i} \left(1 - S_{token_i}\right) \]
Here \(H_{message}\) is the hammicity of a message. Since a message belongs to one of only two classes, spammicity and hammicity are opposite in nature and the total probability is 1, so the hammicity can also be obtained directly from the spammicity of a message.

Algorithm
Training phase:
- Take an equal number of SPAM and HAM or legitimate messages.
- Extract the words or tokenize each and every message.
- Calculate the spammicity (\(S_{token}\)) and hammicity (\(1 - S_{token}\)) of each token.
- Store the results in the database.

Classification phase:
- Extract the words or tokenize each and every message.
- Retrieve the probabilities of the words or tokens.
- Calculate the total spammicity (\(S_{message}\)) of the message.
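The two phases above can be sketched in Python. This is an illustration under stated assumptions rather than the filter used in the experiments: token spammicity is taken as the fraction of messages containing the token that are SPAM, probabilities are clamped to [0.01, 0.99] so that no product collapses to zero, products are evaluated as sums of logarithms to avoid underflow, and unseen tokens receive the neutral value 0.5. The function names and the whitespace tokenizer are hypothetical.

```python
import math


def tokenize(message):
    """Return the set of lowercase tokens in a message (simplified tokenizer)."""
    return set(message.lower().split())


def train(spam_messages, ham_messages):
    """Training phase: map each token to its spammicity."""
    spam_counts, ham_counts = {}, {}
    for message in spam_messages:
        for token in tokenize(message):
            spam_counts[token] = spam_counts.get(token, 0) + 1
    for message in ham_messages:
        for token in tokenize(message):
            ham_counts[token] = ham_counts.get(token, 0) + 1
    table = {}
    for token in set(spam_counts) | set(ham_counts):
        n_spam = spam_counts.get(token, 0)
        n_ham = ham_counts.get(token, 0)
        spammicity = n_spam / (n_spam + n_ham)
        # Assumed clamp so that log(0) never occurs during classification.
        table[token] = min(max(spammicity, 0.01), 0.99)
    return table


def classify(message, table, unknown=0.5):
    """Classification phase: compare message spammicity with hammicity."""
    log_spam = 0.0
    log_ham = 0.0
    for token in tokenize(message):
        p = table.get(token, unknown)
        log_spam += math.log(p)         # log of the product of token spammicities
        log_ham += math.log(1.0 - p)    # log of the product of token hammicities
    return "SPAM" if log_spam > log_ham else "HAM"
```

Working in log space leaves the comparison unchanged, since the logarithm is monotonic, while keeping long messages numerically stable. A message made up only of unknown tokens yields equal scores and is treated as HAM in this sketch.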

Phase-II (ZOMBIE DETECTION)
In the first phase of detection, we use the Manual and Naïve-Bayesian methods to detect SPAM messages. Detecting SPAM messages improves the performance of the system, but it is not a final solution to the problem because the attacker can use the same machine again to transmit SPAM messages. Instead of detecting the SPAM messages alone, it is better to also detect their source, so that messages from that machine can be blocked in the future. Detecting SPAM zombies prevents the compromised systems from transmitting SPAM messages. In this section we propose an approach for detecting zombies based on IP addresses. The first phase of detection discriminates SPAM from HAM messages. HAM messages are accepted directly and SPAM messages are stored in the spam folder. The detection of SPAM zombies is then based on the source IP addresses of the identified SPAM messages. The threshold for identifying SPAM zombies is defined based on the type of network: if the network is busy, a large threshold is used; otherwise the threshold is set according to the traffic volume.
In spam zombie detection, the SPRT algorithm explained above is used. Here the hypotheses are that the machine is normal (\(H_0\)) or compromised (\(H_1\)). The observations are the messages generated by the machines. The random walk of the variable (Λ) is driven by the messages generated by each machine (M).
In a network, a compromised machine generates more SPAM messages than a normal machine. That is, the probability that a message coming from a compromised machine is SPAM is higher than the probability that a message coming from a normal machine is SPAM.

[...]
    Λ_n = 0
else
    Test continues with new observations.
end if

The explanation for the above algorithm is as follows. First the IP address of the sending machine is recorded; then the system administrator sets the four parameters according to the network conditions. Each message coming from the machine is an observation in the random walk between the two boundaries A and B.
If the observation is a spam message (\(X_i = 1\)), \(\Lambda_n\) increases by \(\ln(\theta_1/\theta_0)\).
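The per-machine test described in this section can be sketched as a small class that keeps one random walk per source IP address. This is an illustration under assumptions, not the paper's algorithm listing: the boundaries use Wald's approximations, the upper boundary is taken to mean "compromised", a walk that reaches the lower boundary is reset to zero so that the machine keeps being monitored, and the defaults θ0 = 0.1 and θ1 = 0.9 are example values. The class and method names are hypothetical.

```python
import math


class SpamZombieDetector:
    """Runs one SPRT random walk per sending IP address."""

    def __init__(self, alpha=0.01, beta=0.01, theta0=0.1, theta1=0.9):
        self.lower = math.log(beta / (1.0 - alpha))    # boundary A
        self.upper = math.log((1.0 - beta) / alpha)    # boundary B
        self.step_spam = math.log(theta1 / theta0)
        self.step_ham = math.log((1.0 - theta1) / (1.0 - theta0))
        self.walk = {}            # source IP -> current value of Lambda_n
        self.compromised = set()  # IPs already declared compromised

    def observe(self, ip, is_spam):
        """Feed one classified message; return the current verdict for ip."""
        if ip in self.compromised:
            return "compromised"
        step = self.step_spam if is_spam else self.step_ham
        value = self.walk.get(ip, 0.0) + step
        if value >= self.upper:
            self.compromised.add(ip)
            self.walk.pop(ip, None)
            return "compromised"
        if value <= self.lower:
            # Assumed behaviour: declare normal, reset, and keep monitoring.
            self.walk[ip] = 0.0
            return "normal"
        self.walk[ip] = value
        return "pending"
```

In use, each message leaving the spam filter is passed to `observe` together with its source address, for example `detector.observe("10.0.0.5", True)` for a message classified as SPAM. With the default parameters a machine is flagged after three consecutive spam messages, while a run of non-spam messages drives its walk back to the lower boundary.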

V. EXPERIMENTAL RESULTS
In our organization, we maintain three mail servers, and 25,000 users use the services of those servers. Every day 200,000 to 300,000 mails are transmitted in the network. Because SPAM messages decrease the performance of the network, a content based Naïve-Bayesian spam filter is deployed on each and every machine to overcome this limitation. We tested the accuracy of the spam filter on messages passed through the server and obtained around 99% accuracy for false positives and 95% accuracy for false negatives.
In this process, every message is first sent through the spam filter to categorize it as SPAM or HAM; the results are then passed to the SPRT algorithm for zombie detection. This algorithm classifies the machines as compromised or normal based on the given false positive and false negative parameters. We fixed the false positive and false negative rates at 0.01 and 0.01 for our network, and the threshold for detecting zombies is three. That is, a machine that sends three spam messages is identified as a compromised machine. First we tested the accuracy of the spam filter based on the naïve Bayesian algorithm for 300,000 messages. More than 295,000 were successfully categorized by the filter. Of these, more than 55,000 messages were SPAM messages.
For the naïve Bayesian algorithm, we created a database of 200,000 tokens. To create the token database we trained the system with 500,000 SPAM and HAM messages. Each and every token is stemmed, and its spammicity and hammicity are calculated. In the classification phase the message is tokenized and its tokens are looked up in the database for their spammicity and hammicity probabilities. For example, the token "free" has a spammicity probability of 0.996 and a hammicity probability of 0.003 (approximate values). Tokens that are not available in the database are assigned a probability of 0.5 (neutral). After identifying a message as SPAM or HAM, we update the database for later retrievals. Finally, by calculating the average spammicity and hammicity probabilities of a message, we classify the message as SPAM or HAM. The classification results are shown in Figure 5.1.

Fig 1: Naïve-Bayesian classification results
Here classified type 0 means a HAM or normal message and classified type 1 means a SPAM message. The figure shows the statistics of the classified messages. To test the messages we used the LingSpam corpus. We obtained 98% accuracy, with false positives of 0.01% and false negatives of 0.02%, for our approach.
To detect a compromised machine we (the system administrator) have defined four parameters: the probability that a normal machine sends a SPAM message (\(\theta_0\)), the probability that a compromised machine sends a SPAM message (\(\theta_1\)), and the false positive and false negative rates. Normally the ranges of the parameters are as follows: \(\theta_0\) is 0.1 to 0.2, \(\theta_1\) is 0.8 to 0.9, and false positives are around 0.05 to 0.001. A \(\theta_0\) value of 0.1 to 0.2 means that the chance of getting a SPAM message from a normal machine is 10 to 20 percent. We applied the Manual and Naïve-Bayesian approaches separately to 50,000 messages, 100,000 messages, and 300,000 messages from our organization's network. The messages were passed through the SPRT algorithm to record the IP address of each sending machine. We calculated the error rate for each input message group, and also the false positives and false negatives for the same.
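To see how the four parameters interact, the following sketch computes the smallest number of consecutive SPAM observations that carries the random walk from zero to the upper boundary. It assumes Wald's approximation B = ln((1-β)/α) for that boundary and a step of ln(θ1/θ0) per SPAM message; the function name is hypothetical and the parameter values are picked from the ranges given above.

```python
import math


def spam_messages_to_flag(alpha, beta, theta0, theta1):
    """Smallest run of consecutive SPAM observations that reaches boundary B."""
    upper = math.log((1.0 - beta) / alpha)   # assumed Wald boundary B
    step = math.log(theta1 / theta0)         # walk increment per SPAM message
    return math.ceil(upper / step)


# Example parameter choices from the ranges discussed in the text.
strict = spam_messages_to_flag(0.01, 0.01, 0.1, 0.9)
loose = spam_messages_to_flag(0.01, 0.01, 0.2, 0.8)
```

With α = β = 0.01, the choice θ0 = 0.1, θ1 = 0.9 gives a run of 3 messages, which agrees with the threshold of three used in the experiments, while θ0 = 0.2, θ1 = 0.8 gives 4. The closer θ0 and θ1 are to each other, the less each observation tells the test, so more messages are needed before a machine is flagged.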

VI. CONCLUSION & FUTURE WORK
SPAM messages are a major problem for Internet users. In this paper we proposed a naïve-Bayesian approach to detect SPAM messages on the Internet, and we extended our research to detect the sources of the SPAM messages, called SPAM zombies, using the sequential probability ratio test (SPRT). We showed that the combination of naïve-Bayesian classification and SPRT improves the efficiency of the system by minimizing the error rate, false positives, and false negatives in the detection of SPAM messages, and that the SPRT improves the accuracy of detecting SPAM zombies from the SPAM messages identified by the naïve-Bayesian approach. We proposed a content based approach for detecting SPAM messages; it provides efficient and accurate detection results, but its processing time is proportional to the number of messages. Parameter based approaches take relatively less processing time. In the future we plan to implement the same system with a parameter based approach.