Finding new words with trie data structure

Like the unworldly linguist that I am, I spent the election of 2018 making a bot that announces to the world everytime a new word occurs in Dagens Nyheter. You can find it on Twitter under the handle @nya_ord_i_dn. In addition to the word (which is tweeted as it occurs in the article), the bot provides a short concordance and a URL to the article in which it occurs.

The bot is heavily inspired by the New New York Times @NYT_first_said. However, mine has a fancier underlying data structure. Behind this modest app, there is a trie (also known as a radix tree or a prefix tree). A trie is an ordered tree-like data structure that is used in applications such as auto-completion to store a dynamic set or associative array where the keys are usually strings. The name “trie” comes from its use in retrieval because search complexities can be brought to optimal limit. With a trie, we can perform lookup in O(k) time where k is the size of the key. Given a lexicon of [“el”, “elegans”, “elegant”, “elegi”, “elektricitet”, “elektro” and “elektrod”], this is what the trie representation looks like:

Visual representation of a trie with a limited vocabulary.

All entries originate from the root node (marked by three circles in red and black). Leaf nodes represent word endings and are marked in red. Given a target, we can search for an exact match by traversing down the tree from the root node, like this:

def search(self, word):
     if not word or word == "":
          return False
     node = self.root
     for char in word:
          if char in node.next:
               node = node.next[char]
          else:
               return False
     return node.isleaf

The lookup time is dependent only on the length of the target string. In my Python 3 implementation, nodes are connected via defaultdicts.

The trie is not strictly necessary to perform the task at hand. It could have been done nearly as efficiently with a hash map for a while, because most complex data structures are optional when n is limited, and n is almost always limited. But language is not. Language is infinite and I intend for this trie to keep growing. Because of this, a highly specialized data structure becomes critical. Hence, the trie.

The words, so far, are content words, of course: nouns (“garnskulptur”), verbs (“över-används”) and adjectives (“saudikritisk”) dominate. The vast majority of the words are compounds (“sop-app”) or neologisms (“swifties”). Also frequently present are misspellings (“abortklinker”) and unusual inflections (“överläppar”).

Leave a Reply

Your email address will not be published. Required fields are marked *