hac.’s Weblog

Icon

Just a place to share my life

Aho-Corasick – Trie

From wikipedia, a trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings, or a sequence of values. It is quite different from the general tree data structure, instead of storing the key associated with the node, the position of a node in the trie shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with an empty string. Values are normally not associated with every node, only with leaves and some inner nodes that happen to correspond to keys of interest.

Keep in mind that we are going to solve the multi-pattern matching that introduced in previous blog entry. The example below will be used to explain the idea of each procedure in the following description.

Given string S = “USHERS” and
set of patterns PS = {“HE”, “HIS”, “SHE”, “HERS”}, determine if any pattern in the set PS can be located in string S.

The AC algorithm is going to construct a trie for the set of patterns, and modify the trie to be an automaton. Before introducing how to construct this trie, first, by the definition above, we can easily draw the trie of the set of patterns PS as shown below.

Trie of the set of patterns PS

Trie of the set of patterns PS

The double circle above is representing a terminal node, which means by traversing from root to this node, the labels on the path is a pattern in the set. In other words, the nodes in single circle is some internal node. The arrow in the graph without precedent node is pointing to the root node. For convenient, let’s define one more notation, L(u) is the label of a node u that is concatenation of edge labels on the path from root to u.

According to above introduction, the construction of a trie is trivial. For each pattern Pi you are going to insert into the trie. Starting the process from root, and follow the path labeled by characters of the pattern Pi. However, there are some scenarios during the insertion that needs to take consideration. In general, there are two possible cases when the pattern reach a node,

1. The node has an edge with label matching the target character of the pattern. In this scenario, the algorithm obviously can follow the label to next node.
2. The node does not has an outgoing edge with the correct label. In this case, a new branch is added to the current node, and move to this new node. Basically, the algorithm will insert the remaining characters of Pi as a subtree of the current node, without check these two cases again, ie. After inserting “HE” to the trie, the algorithm is then going to insert “HERS”. When it reached the terminal node of “HE”, there is no outgoing edge from the node with label “R”. Then the algorithm will add a subtree with two nodes and two edges with label “R”, “S” respectively, but not add edge “R” and going to check the cases again.

Before going deep into the implementation of the construction procedure, the structure of nodes should be considered. When implementing a trie, each node obviously needs to store its outgoing edges. On the other hand, there are two kinds of nodes in the trie, a terminal node and internal node. Terminal node is the node that representing itself is the end of a pattern, internal node is other nodes in the trie. At this moment, assume that both string and patterns in the problem contains only ASCII values. Then the structure of a node can be designed as follow.

struct Node {
  Node *go[256];
  bool out;
};

To minimize the useless memory, we can use a hash table to store the edges instead of array. The following is the pseudo code for constructing the trie.

void insert(Node *root, char *pattern, int n) {
  Node *node = root;
  for (int i = 0; i < n; ++i) {
    *(pattern+i) -= 'A';
    if (node->go[*(pattern+i)]==NULL) {
      node->go[*(pattern+i)] = new Node();
    }
    node = node->go[*(pattern+i)];
  }
  node->out = true;
}

Node *root = new Node();
for (int i = 0; i < k; ++i) {
  scanf("%s", pattern[i]);
  insert(root, pattern[i], strlen(pattern[i]));
}

The procedure runs in O(n) time, where n is the total length of all patterns.

In the next entry, we will discuss how to construct the automaton for AC algorithm by given a trie.

Reference:

Related:

Filed under: algorithm , ,

Aho-Corasick – Intro

Anti-virus software is a piece of software that attempt to identify, neutralize or eliminate malicious software. In the past, “anti-virus” software is only focus on computer virus detection and elimination. However, nowadays, anti-virus software is going to protect computers from many threats, besides virus, the software also try to identify worms, phishing attacks, trojan horses and other malware. In general, there are two different kinds of approaches to accomplish this, file content examination and suspicious behavior identification. In following series of entries, I would like to focus on one of the algorithms that is used for the file content examination.

In other words, file content examination is a virus pattern matching. Before going into detail what pattern matching is, first we need to know what is a computer virus. A general definition, virus is a program that recursively and explicitly copies a possibly evolved version of itself, in order to get permitted to execute code and write to memory. A virus is self-automated programs that, against the user’s wishes, making copies of themselves to spread themselves to new targets. One of the way that virus uses to replicate itself, is to add a JUMP code at the beginning of a victim and the core part of the virus at the end of the victim. Once a user tries to start a infected file, the program will first execute the JUMP code, this code will skip the original procedures in the file, and execute the virus in this executable.

As mentioned above, you may discover that most of the virus using this naive way will have the same infected executable. As a result, the most popular and naive virus detection scheme is to perform pattern matching. You can take the infected file as a search text, and now we have some patterns of known viruses, let us call this virus dictionary. This virus detection scheme is trying to pick an executable file each time, and going to check whether the patterns in virus dictionary is contained in the executable. If so, the executable is said to be infected.

Let us define the problem in a formal way.

Given a search text S, and a set of patterns PS = {P0, P1, …, Pl}, determine whether any pattern occurred in the search text.

If you are familiar with algorithms, the first algorithm that you may use to tackle this problem is Knuth–Morris–Pratt algorithm (KMP). The KMP algorithm runs in O(n+m) time, where n and m is the length of the search text and length of a pattern respectively. Let z be the number of patterns in the set PS, then the total time it takes to check all patterns is O(z(n+m)), because each time you need to preprocess the pattern before the matching procedure can be started and do the pattern matching based on the result of the preprocess. If z is very large, it is not acceptable. It does, of course, there are many new viruses release everyday, the size of virus dictionary keeps increasing. So how can we make this detection possible? The algorithm designed by Alfred V. Aho and Margaret J. Corasick may help to solve this problem (In short, the algorithm is called AC algorithm in the following paragraphs).

The aim of AC algorithm is going to solve the exact problem described above. The algorithm first construct a trie for the given patterns, then construct an automaton from this trie. The trie of the patterns can easily be built by inserting patterns one by one. The algorithm then adds some backward edges to the trie, the aim of backward edges are to handle the case when mismatch character occur. After this preprocessing procedure, the matching procedure become trivial and strict forward. In the next post, we will discuss on what a trie is and how to construct a trie, by providing a set of patterns.

Should you have any questions or comments, or you discover a mistake I have made, please do not hesitate to tell me.

Reference:

Filed under: algorithm, virus , , , ,