Monday, February 23, 2015

TRIE data structure

TRIE, another Search Tree!

Trie is an infix of the word reTRIEval, and pronounced as TREE but to distinguish it from the tree some people also pronounce it as TRY. As the name suggests it is an efficient retrieval or searching data structure. 

The root node represents an empty string (""), if you go down one level it represents all single character words. Similarly, at level 10, trie represents words or prefixes of 10 letters/characters. It is a search tree where each vertex or node represents a single word or a prefix (but each node actually stores a single letter or character). 

Let's take a sample dictionary of 6 words.
Dictionary, D = {a, an, out, the, them, their}

You can use Hashing or Binary search tree to implement a dictionary,  but here we will use trie to implement it.

Below are trie representations of the above dictionary, D. Both left as well as right trie represents the same dictionary (two different representations). If you want to understand the trie the right one is more suitable and left one represents trie from the implementation point of view. Leaf nodes (colored in blue) represents the last node of a word so 'e' of the word 'the' will be the leaf node. The leaf node signals end of a word, but it can get extended further (like 'a' and 'an'). So every node which is the last character of the word will be the leaf node. The implementation can be achieved by a boolean flag.

Two TRIE representation of the same dictionary

Possible operations:
addWord(String): adds a single word to the trie
search(String): search if the given word or prefix exists in the trie
countWords(String): number of words that match in trie with the given word
countPrefixes(String): number of words that have given the prefix


To make it simple, let's consider only English alphabets (a-z); and also make case insensitive. This means each node can have maximum of 26 children. So it's just a generalization of Binary Search Tree.

Java implementation this post, here

Trie implementation provides operations like add a word and find a word. If you understand the fundamentals, implementing other methods is not going to be difficult. 

Analyzing Performance

Cost of looking up a word or prefix in a vocabulary/dictionary implemented using trie depends only on the number of characters in the word/prefix. It doesn't depend on the size of the dictionary. This is an important distinction from the HashMap/HashTable based dictionary implementations.  As shown in the above diagram the cost of finding the word 'the' requires just 3 steps. 

Runtime complexity of lookup = O(length of word)
String comparison is considered as a single operation, so above can be generalized as O(1).

Trie implementation requires more memory than an array or tree. This is because at each level, the node takes space for 26 characters but not all of them will be used. But the space overhead gets compensated with the constant time search performance. 

keep coding !!!

No comments:

Post a Comment