The Written Work Corpus

The written work corpus is a manually created data set of named occurrences of written works (books) in Swedish news texts. Its intended use is named-entity recognition (NER) tasks. The data set consists of 175 articles from the culture section of Dagens Nyheter (DN Kultur).

All included articles contain at least one mention of a written work. The articles were randomly selected from a pool of 11007 articles, i.e. all articles published in the culture section during the five-year period from 2013-08-01 to 2018-08-01. The data set includes ~250 000 words and ~1500 occurrences of written works (a definition of written work can be found below). The total token count of written work entities is 7126. In my application, it will serve as a gold standard in an evaluation task, but it is probably large enough to be used as a training set.

It is annotated using the BILOU annotation scheme, where B is ‘beginning’, I is ‘inside’, L is ‘last’, O is ‘outside’ and U is ‘unit’ (a single-token entity). The corpus is tab-delimited and tokenized using the NLTK tokenizer with a Swedish language model. See example:

Men	O
hans	O
största	O
roman	O
,	O
”	B-BOOK
Anna	I-BOOK
Karenina	I-BOOK
”	L-BOOK
,	O
handlar	O
trots	O
detta	O
om	O
en	O
kvinna	O
som	O
tar	O
sitt	O
öde	O
i	O
egna	O
händer	O
.	O

Translation: But despite this, his greatest novel, ‘Anna Karenina’, is a story about a woman who controls her own destiny.
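
To make the format concrete, here is a minimal sketch of how such a file could be read and turned into entity spans. It is an illustration only, not code from the repository: the function names `read_rows` and `bilou_spans`, the `SAMPLE` string, and the choice to silently drop ill-formed tag sequences are all assumptions.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int, str]  # (first token index, last token index, label)

# A few rows copied from the example above, one "token<TAB>tag" pair per line.
SAMPLE = "roman\tO\n,\tO\n”\tB-BOOK\nAnna\tI-BOOK\nKarenina\tI-BOOK\n”\tL-BOOK\n,\tO\n"


def read_rows(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each non-empty line into (token, tag), tolerating repeated tabs."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            rows.append((parts[0], parts[-1]))
    return rows


def bilou_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from BILOU tags (assumption: ill-formed runs are dropped)."""
    spans: List[Span] = []
    start = None   # index of the currently open B- tag, if any
    label = None   # label of the currently open entity
    for i, tag in enumerate(tags):
        prefix, _, lab = tag.partition("-")
        if prefix == "U":
            spans.append((i, i, lab))
            start = None
        elif prefix == "B":
            start, label = i, lab
        elif prefix == "I" and start is not None and lab == label:
            continue  # still inside the open entity
        elif prefix == "L" and start is not None and lab == label:
            spans.append((start, i, lab))
            start = None
        else:
            start = None  # "O", or a tag that does not fit the open entity
    return spans


if __name__ == "__main__":
    rows = read_rows(SAMPLE.splitlines())
    tokens = [token for token, _ in rows]
    tags = [tag for _, tag in rows]
    for first, last, lab in bilou_spans(tags):
        print(lab, " ".join(tokens[first:last + 1]))
```

Splitting on any whitespace rather than on a single tab is a deliberate choice here: tokens never contain whitespace after tokenization, so the sketch keeps working if a line uses more than one tab.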

A link to the GitHub repository can be found here.

Some notes on the annotation

So what is a written work? I define it as any named occurrence of a written work. Some examples: fiction, non-fiction, dissertations, essays, comic books, poems and articles. Subtitles of written works are always annotated as written works. See:

Och	O
den	O
ovanciterade	O
Främling	I-BOOK
på	I-BOOK
tåg	I-BOOK
från	O
2012	O
,	O
vars	O
undertitel	O
sammanfattar	O
projektet	O
tämligen	O
utmärkt	O
:	O
jag	I-BOOK
med	I-BOOK
vissa	I-BOOK
avbrott	I-BOOK
dagdrömde	I-BOOK
och	I-BOOK
rökte	I-BOOK
mig	I-BOOK
runt	I-BOOK
Amerika	I-BOOK
.	O

Translation: And the above quoted ‘Stranger on a Train’ from 2012, whose subtitle sums up the project quite brilliantly: ‘Daydreaming and Smoking Around America with Interruptions’.

Titles in foreign languages are also annotated as written works. To the best of my recollection, the corpus contains titles in English, French, German, Italian, Yiddish and Russian. An example:

Chris	O
Kraus	O
sitter	O
i	O
sin	O
lånade	O
lägenhet	O
och	O
studerar	O
den	O
svenska	O
utgåvan	O
av	O
sin	O
kultklassade	O
roman	O
”	B-BOOK
I	I-BOOK
love	I-BOOK
Dick	I-BOOK
”	L-BOOK
.	O

Translation: Chris Kraus is sitting in her borrowed apartment, studying the Swedish edition of her cult novel ‘I love Dick’.

Lexicalization of titles

Since I have chosen an inclusive definition of written work, it might be more fruitful to define what is not a written work in this context. An archive is not a book. A magazine or a newspaper is not a book. An occurrence of the word “Hamlet”, for example, is not automatically a written work. If the context implies that the reference is to the character Hamlet as opposed to the written work “Hamlet”, the occurrence is tagged as ‘O’. This is most often indicated by the presence or absence of quotation marks. However, some works are famous to the point of lexicalization. Hamlet needs no introduction and neither does Harry Potter. In these cases, individual judgments answering the question “does this refer to the (fictional) character or the book?” have been made. To the best of my ability, I have tried to stay true to the intention of the writer. See example:

Trots	O
att	O
ryktena	O
om	O
mer	O
Harry	B-BOOK
Potter	L-BOOK
snabbt	O
dementerades	O
tycks	O
inte	O
fansen	O
vara	O
särskilt	O
besvikna	O
.	O

Translation: Despite the fact that the rumours of more Harry Potter were quickly denied, the fans do not seem especially disappointed.

This may or may not be a sound assumption. Since I am the only contributor to this project, no inter-annotator agreement can be computed to validate these principles. Feel free to contribute here for further development of the corpus.

Naming scheme overlap

The Harry Potter example relates to another challenge in this domain, namely the naming overlap of cultural artefacts. A book, for example, can also be a movie or a play (or both). Take a look at this example:

“Redan nu finns ‘Game of Thrones’ som målarbok och i januari nästa år släpps målarboken baserad på 1990-talsserien ‘Buffy och vampyrerna’”.

Translation: There’s already a ‘Game of Thrones’ coloring book, and in January next year a coloring book based on the 1990s TV show ‘Buffy the Vampire Slayer’ will be released.

The semantic concept of “Game of Thrones” spans several modalities: it is a series of books by George R. R. Martin, but it’s also a television show. This, in turn, has been adapted into a coloring book. Because of these fuzzy boundaries, penalizing an algorithm for a false positive on “Game of Thrones” in the sentence above seems unfair (depending on the application). To be able to take this into account, I’ve enriched the corpus with an additional set of cultural artefacts. The tag set includes: ART, GAME, MOVIE, MUSIC, PLAY, RADIO and TV. ART refers to visual works of art; GAME to board games, computer games and video games; MOVIE to feature films and documentaries; MUSIC to any musical work; PLAY to dramatic plays and performance pieces on stage; RADIO to radio shows or podcasts; TV to television shows. Keep in mind, though, that the main focus of the corpus is written works. The occurrences of other cultural artefacts are therefore sparse.
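
One way the extra tags could be used in an evaluation is sketched below. This is a hypothetical scoring rule, not the evaluation I actually run: the function name `evaluate_books`, the span representation, and the decision to simply ignore a predicted BOOK that exactly matches a gold span of another artefact type (counting it as neither a hit nor an error) are all assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (first token index, last token index, label)

# The additional artefact tags listed above.
OTHER_ARTEFACTS = {"ART", "GAME", "MOVIE", "MUSIC", "PLAY", "RADIO", "TV"}


def evaluate_books(gold: List[Span], predicted: List[Span],
                   lenient: bool = True) -> Tuple[float, float, float]:
    """Span-level precision, recall and F1 for BOOK entities.

    Assumption: in lenient mode, a predicted BOOK whose boundaries exactly
    match a gold span of another artefact type is dropped from the false
    positives instead of being penalized.
    """
    gold_books = {(s, e) for s, e, lab in gold if lab == "BOOK"}
    gold_other = {(s, e) for s, e, lab in gold if lab in OTHER_ARTEFACTS}
    pred_books = {(s, e) for s, e, lab in predicted if lab == "BOOK"}

    true_pos = len(pred_books & gold_books)
    false_pos_spans = pred_books - gold_books
    if lenient:
        false_pos_spans = false_pos_spans - gold_other
    false_pos = len(false_pos_spans)
    false_neg = len(gold_books - pred_books)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up token indices: one real book, plus a TV show the tagger called a book.
    gold = [(0, 3, "BOOK"), (10, 14, "TV")]
    predicted = [(0, 3, "BOOK"), (10, 14, "BOOK")]
    print("strict :", evaluate_books(gold, predicted, lenient=False))
    print("lenient:", evaluate_books(gold, predicted, lenient=True))
```

Whether the lenient rule is appropriate depends on the application; a system that must distinguish books from TV shows should be scored with `lenient=False`.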
