Malware Classification using API System Calls
Malware cases are increasing in both number and severity. Attackers design malware to compromise system security, chiefly confidentiality, integrity, and availability. Malware elimination techniques exist, but the malware must be detected first. Current detection techniques still suffer from high false positive and false negative rates, and the emergence of polymorphic malware has made the situation worse. Recent studies have shown data mining to be promising in identifying malware by analysing API calls. However, this approach only labels a file as malicious or benign; it does not identify the malware class to which the file belongs. This makes elimination harder, because elimination schemes are mostly class based. Classification as a post-detection step is therefore important if the malware is to be removed from the system. We present an experimental study on the use of a data mining approach to classify malware using 4-grams of API system calls. We use a dataset of 552 Windows Portable Executables (PEs) with their corresponding API calls. The PEs were executed in a Windows 7 virtual environment using the Cuckoo sandbox. Relevant 4-gram API call features are extracted using Term Frequency-Inverse Document Frequency (TF-IDF). Gaussian Naive Bayes, SVM, Random Forest, and Decision Tree classifiers were trained and tested on the data. We show that the technique is successful, with accuracy between 92% and 96.4%. Accuracy varies across classifiers, with SVM and Decision Trees performing best and Gaussian Naive Bayes performing worst.
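
The feature extraction step described above can be illustrated with a minimal sketch. It assumes that each executable's sandbox report has already been reduced to an ordered list of API call names, and it uses the plain weighting tf × log(N/df); the abstract does not state which TF-IDF variant was used, so this formula is an assumption. The function names `api_ngrams` and `tfidf` and the two short traces are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def api_ngrams(calls, n=4):
    """Return the contiguous n-grams (as tuples) of an API call sequence."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def tfidf(traces, n=4):
    """Compute a TF-IDF weight for every n-gram of every trace.

    Assumed weighting (not confirmed by the source):
      tf  = occurrences of the n-gram in the trace / n-grams in the trace
      idf = log(number of traces / number of traces containing the n-gram)
    """
    counts = [Counter(api_ngrams(trace, n)) for trace in traces]

    # Document frequency: in how many traces each n-gram appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(traces)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {gram: (k / total) * math.log(n_docs / df[gram]) for gram, k in c.items()}
        )
    return weights


# Hypothetical API call traces, one per executable.
traces = [
    ["NtCreateFile", "NtWriteFile", "NtClose", "RegOpenKeyExW", "RegSetValueExW"],
    ["NtCreateFile", "NtWriteFile", "NtClose", "RegOpenKeyExW", "NtClose"],
]
weights = tfidf(traces)
```

Under this weighting, a 4-gram that occurs in every trace receives weight zero, while 4-grams confined to a few traces are weighted up. The resulting per-file weight vectors are the kind of input that classifiers such as those named above would then be trained on.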