Anti-virus software is a piece of software that attempt to identify, neutralize or eliminate malicious software. In the past, “anti-virus” software is only focus on computer virus detection and elimination. However, nowadays, anti-virus software is going to protect computers from many threats, besides virus, the software also try to identify worms, phishing attacks, trojan horses and other malware. In general, there are two different kinds of approaches to accomplish this, file content examination and suspicious behavior identification. In following series of entries, I would like to focus on one of the algorithms that is used for the file content examination.
In other words, file content examination is a virus pattern matching. Before going into detail what pattern matching is, first we need to know what is a computer virus. A general definition, virus is a program that recursively and explicitly copies a possibly evolved version of itself, in order to get permitted to execute code and write to memory. A virus is self-automated programs that, against the user’s wishes, making copies of themselves to spread themselves to new targets. One of the way that virus uses to replicate itself, is to add a JUMP code at the beginning of a victim and the core part of the virus at the end of the victim. Once a user tries to start a infected file, the program will first execute the JUMP code, this code will skip the original procedures in the file, and execute the virus in this executable.
As mentioned above, you may discover that most of the virus using this naive way will have the same infected executable. As a result, the most popular and naive virus detection scheme is to perform pattern matching. You can take the infected file as a search text, and now we have some patterns of known viruses, let us call this virus dictionary. This virus detection scheme is trying to pick an executable file each time, and going to check whether the patterns in virus dictionary is contained in the executable. If so, the executable is said to be infected.
Let us define the problem in a formal way.
Given a search text S, and a set of patterns PS = {P0, P1, …, Pl}, determine whether any pattern occurred in the search text.
If you are familiar with algorithms, the first algorithm that you may use to tackle this problem is Knuth–Morris–Pratt algorithm (KMP). The KMP algorithm runs in O(n+m) time, where n and m is the length of the search text and length of a pattern respectively. Let z be the number of patterns in the set PS, then the total time it takes to check all patterns is O(z(n+m)), because each time you need to preprocess the pattern before the matching procedure can be started and do the pattern matching based on the result of the preprocess. If z is very large, it is not acceptable. It does, of course, there are many new viruses release everyday, the size of virus dictionary keeps increasing. So how can we make this detection possible? The algorithm designed by Alfred V. Aho and Margaret J. Corasick may help to solve this problem (In short, the algorithm is called AC algorithm in the following paragraphs).
The aim of AC algorithm is going to solve the exact problem described above. The algorithm first construct a trie for the given patterns, then construct an automaton from this trie. The trie of the patterns can easily be built by inserting patterns one by one. The algorithm then adds some backward edges to the trie, the aim of backward edges are to handle the case when mismatch character occur. After this preprocessing procedure, the matching procedure become trivial and strict forward. In the next post, we will discuss on what a trie is and how to construct a trie, by providing a set of patterns.
Should you have any questions or comments, or you discover a mistake I have made, please do not hesitate to tell me.
Reference:
- The Art of Computer Virus Research and Defense, Peter Szor
- wiki – Computer Virus
- wiki – Antivirus Software
- wiki – Aho-Corasick algorithm
- Alfred V. Aho
Filed under: algorithm, virus , aho-corasick, antivirus, pattern matching, virus



