hac.’s Weblog

Icon

Just a place to share my life

Aho-Corasick – Trie

From wikipedia, a trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings, or a sequence of values. It is quite different from the general tree data structure, instead of storing the key associated with the node, the position of a node in the trie shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with an empty string. Values are normally not associated with every node, only with leaves and some inner nodes that happen to correspond to keys of interest.

Keep in mind that we are going to solve the multi-pattern matching that introduced in previous blog entry. The example below will be used to explain the idea of each procedure in the following description.

Given string S = “USHERS” and
set of patterns PS = {“HE”, “HIS”, “SHE”, “HERS”}, determine if any pattern in the set PS can be located in string S.

The AC algorithm is going to construct a trie for the set of patterns, and modify the trie to be an automaton. Before introducing how to construct this trie, first, by the definition above, we can easily draw the trie of the set of patterns PS as shown below.

Trie of the set of patterns PS

Trie of the set of patterns PS

The double circle above is representing a terminal node, which means by traversing from root to this node, the labels on the path is a pattern in the set. In other words, the nodes in single circle is some internal node. The arrow in the graph without precedent node is pointing to the root node. For convenient, let’s define one more notation, L(u) is the label of a node u that is concatenation of edge labels on the path from root to u.

According to above introduction, the construction of a trie is trivial. For each pattern Pi you are going to insert into the trie. Starting the process from root, and follow the path labeled by characters of the pattern Pi. However, there are some scenarios during the insertion that needs to take consideration. In general, there are two possible cases when the pattern reach a node,

1. The node has an edge with label matching the target character of the pattern. In this scenario, the algorithm obviously can follow the label to next node.
2. The node does not has an outgoing edge with the correct label. In this case, a new branch is added to the current node, and move to this new node. Basically, the algorithm will insert the remaining characters of Pi as a subtree of the current node, without check these two cases again, ie. After inserting “HE” to the trie, the algorithm is then going to insert “HERS”. When it reached the terminal node of “HE”, there is no outgoing edge from the node with label “R”. Then the algorithm will add a subtree with two nodes and two edges with label “R”, “S” respectively, but not add edge “R” and going to check the cases again.

Before going deep into the implementation of the construction procedure, the structure of nodes should be considered. When implementing a trie, each node obviously needs to store its outgoing edges. On the other hand, there are two kinds of nodes in the trie, a terminal node and internal node. Terminal node is the node that representing itself is the end of a pattern, internal node is other nodes in the trie. At this moment, assume that both string and patterns in the problem contains only ASCII values. Then the structure of a node can be designed as follow.

struct Node {
  Node *go[256];
  bool out;
};

To minimize the useless memory, we can use a hash table to store the edges instead of array. The following is the pseudo code for constructing the trie.

void insert(Node *root, char *pattern, int n) {
  Node *node = root;
  for (int i = 0; i < n; ++i) {
    *(pattern+i) -= 'A';
    if (node->go[*(pattern+i)]==NULL) {
      node->go[*(pattern+i)] = new Node();
    }
    node = node->go[*(pattern+i)];
  }
  node->out = true;
}

Node *root = new Node();
for (int i = 0; i < k; ++i) {
  scanf("%s", pattern[i]);
  insert(root, pattern[i], strlen(pattern[i]));
}

The procedure runs in O(n) time, where n is the total length of all patterns.

In the next entry, we will discuss how to construct the automaton for AC algorithm by given a trie.

Reference:

Related:

Filed under: algorithm , ,