Like the unworldly linguist that I am, I spent the election of 2018 making a bot that announces to the world everytime a new word occurs in Dagens Nyheter. You can find it on Twitter under the handle @nya_ord_i_dn. In addition to the word (which is tweeted as it occurs in the article), the bot provides a short concordance and a URL to the article in which it occurs.
teknikpopulist
— Nya ord i DN (@nya_ord_i_dn) October 5, 2018
The bot is heavily inspired by the New New York Times @NYT_first_said. However, mine has a fancier underlying data structure. Behind this modest app, there is a trie (also known as a radix tree or a prefix tree). A trie is an ordered tree-like data structure that is used in applications such as auto-completion to store a dynamic set or associative array where the keys are usually strings. The name “trie” comes from its use in retrieval because search complexities can be brought to optimal limit. With a trie, we can perform lookup in O(k) time where k is the size of the key. Given a lexicon of [“el”, “elegans”, “elegant”, “elegi”, “elektricitet”, “elektro” and “elektrod”], this is what the trie representation looks like:
All entries originate from the root node (marked by three circles in red and black). Leaf nodes represent word endings and are marked in red. Given a target, we can search for an exact match by traversing down the tree from the root node, like this:
def search(self, word):
if not word or word == "":
return False
node = self.root
for char in word:
if char in node.next:
node = node.next[char]
else:
return False
return node.isleaf
The lookup time is dependent only on the length of the target string. In my Python 3 implementation, nodes are connected via defaultdicts.
The trie is not strictly necessary to perform the task at hand. It could have been done nearly as efficiently with a hash map for a while, because most complex data structures are optional when n is limited, and n is almost always limited. But language is not. Language is infinite and I intend for this trie to keep growing. Because of this, a highly specialized data structure becomes critical. Hence, the trie.
The words, so far, are content words, of course: nouns (“garnskulptur”), verbs (“över-används”) and adjectives (“saudikritisk”) dominate. The vast majority of the words are compounds (“sop-app”) or neologisms (“swifties”). Also frequently present are misspellings (“abortklinker”) and unusual inflections (“överläppar”).