Semantic transcription: XML-TEI and Beyond

With the digitization of documents, encoding, transcription and annotation have become everyday problems for Digital Humanists. Several solutions have been proposed, and some of them bring more semantics to the data, which opens the way to better use of the collected data (better text mining, topic modeling and information retrieval, for example).

For textual documents, the Text Encoding Initiative consortium has offered the Digital Humanities community a great standardized framework for text encoding, with guidelines since 1994 and even XML schemas. Although this is a great framework, as the growing adoption of XML-TEI in recent Digital Humanities projects shows, it cannot fit the needs of every project, especially when the project is not just about text encoding but requires more interpretation. Yet XML is flexible enough to let scholars adapt it cleverly, or draw inspiration from it. In this blog post, we will look at some recent ideas tackling challenges such as customizing TEI for classical Japanese documents, integrating the notion of time in text corpora, and annotating speech, thought and writing representation.

Adapting TEI for classical Japanese textual documents

Kawase, Ichimura and Ogiso presented a short paper ([1]) this year about their project of encoding early modern Japanese texts, like the Sharebon (dating back to 1798). They had previously investigated the use of a pure XML-TEI system to help annotate the text. However, standard XML-TEI cannot represent the required information well at the character level, nor store enough information in the annotations.

In Sharebon, many dakuten (voicing marks on characters) are missing, many hiragana and katakana are misused by the standards of present-day Japanese writing (hiragana and katakana form two distinct “alphabets” in Japanese), and repetition symbols (odoriji) are omitted.

Because of those problems, the system also needed to be extended to provide more information on scholars' interpretations of the text. It was then decided, for example, to extend the ruby annotation system to support annotations on both sides of the text. Such an annotation system is especially crucial for old Japanese documents, where characters may have a meaning quite different from today's.

The following figure (from [1]) shows examples of disambiguation using ruby annotations: restoring a missing dakuten, substituting a katakana for a hiragana, or specifying a missing odoriji.

Example of disambiguation through ruby annotations (from [1])

The following figure (from [1]) shows the use of ruby annotations to display the reading of characters and annotations.

Ruby annotations on each side to display the reading of kanji characters (from [1])
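
To make the idea of two-sided ruby annotations more concrete, here is a minimal Python sketch. It is not the encoding used in [1]: the element and attribute names (ruby, rb, rt, place) and the helper function annotate are assumptions chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def annotate(base, left=None, right=None):
    """Wrap a base character with optional annotations on either side.

    The element and attribute names (ruby, rb, rt, place) are hypothetical;
    they are not taken from [1].
    """
    ruby = ET.Element("ruby")
    rb = ET.SubElement(ruby, "rb")
    rb.text = base
    for place, text in (("left", left), ("right", right)):
        if text is not None:
            rt = ET.SubElement(ruby, "rt", {"place": place})
            rt.text = text
    return ruby


# A character printed without its dakuten: the original glyph stays as the
# base text, and the scholar's voiced reading is recorded on the right side.
node = annotate("か", right="が")
xml_string = ET.tostring(node, encoding="unicode")
print(xml_string)
```

Keeping the original glyph as the base and the interpretation as a separate annotation means the transcription stays faithful to the source while the scholarly reading remains searchable.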

The XML-TEI standard seemed to be a very good starting point for this project, as only a limited set of modifications was needed to annotate an old Japanese document adequately.

Taking more than one dimension into account: a project on European integration treaties

Armaselu and Allemand presented a short paper ([2]) this year about their project to better tackle the complexity of the European treaties from a Digital Humanities perspective.
Besides being quite hard to understand for someone who is not a lawyer, the European treaties are extremely complex because every text has to be translated into every official language of the European Union. The treaties are also difficult to analyze because some treaties may modify other treaties. Matters become even more difficult when a treaty is never actually applied.
The following figure (from [2]) presents the typical structure of a treaty (alinea within paragraph, embedded into an article from a sub-section…) that a suitable annotation system must represent.

Representation of the structure of a treaty with TEI (from [2])
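
As a rough illustration of such a nested structure, the following Python sketch walks a small hierarchical fragment and derives an identifier for each structural unit. The sample markup (nested div elements with type and n attributes), the treaty label T1 and the function fragment_ids are assumptions made for this example, not the schema described in [2].

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: nested <div> elements typed by structural level.
SAMPLE = """
<div type="treaty" n="T1">
  <div type="article" n="1">
    <div type="paragraph" n="1">
      <div type="alinea" n="a">First alinea.</div>
      <div type="alinea" n="b">Second alinea.</div>
    </div>
  </div>
</div>
"""


def fragment_ids(elem, prefix=""):
    """Yield (dotted identifier, level) for every structural unit, depth first."""
    ident = f"{prefix}.{elem.get('n')}" if prefix else elem.get("n")
    yield ident, elem.get("type")
    for child in elem.findall("div"):
        yield from fragment_ids(child, ident)


root = ET.fromstring(SAMPLE.strip())
ids = list(fragment_ids(root))
for ident, level in ids:
    print(ident, level)
```

Giving every unit a stable identifier is what later makes it possible to point at a single alinea rather than at a whole treaty.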

Because of those problems, a good annotation system for European treaties should make the relations between the different texts explicit: at the very least, it should be able to store information about subsequent modifications by other texts, citations, amendments and linguistic revisions.

Thanks to additions to the XML-TEI standard, Armaselu and Allemand were able to capture the structure of a treaty in their system, and also to model the relations between treaties at the fragment level (more precisely than at the level of whole treaties). Their system can represent complex relations and implications such as those in the following figure (from [2]), where several modifications are applied to different linguistic versions of different treaties, and where some treaties imply modifications to others.

Example of modeled relations between treaties (from [2])
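
Fragment-level relations of this kind can be sketched as a small data model in Python. Everything here is hypothetical: the Relation class, the relation kinds, the treaty labels and the dotted fragment identifiers such as T1.1.1.a (treaty, article, paragraph, alinea) are illustrative assumptions, not the model from [2].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str  # fragment that acts, e.g. "T2.5.1"
    target: str  # fragment that is acted upon
    kind: str    # e.g. "amends" or "cites" (illustrative values)
    lang: str    # linguistic version concerned


# Invented relations between fragments of three hypothetical treaties.
relations = [
    Relation("T2.5.1", "T1.1.1.a", "amends", "fr"),
    Relation("T2.5.1", "T1.1.1.a", "amends", "en"),
    Relation("T3.2", "T1.1.1.b", "cites", "en"),
]


def affecting(fragment, relations, lang=None):
    """Return relations that target the fragment or any of its sub-units."""
    return [
        r for r in relations
        if (r.target == fragment or r.target.startswith(fragment + "."))
        and (lang is None or r.lang == lang)
    ]


# All relations touching article 1 of treaty T1, in any language.
for r in affecting("T1.1", relations):
    print(r.source, r.kind, r.target, r.lang)
```

Because targets are fragments rather than whole treaties, the same query works at any level of the hierarchy, from a single alinea up to the full text.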

They are now working on making more precise references possible, and on specifying the nature of the modifications.

Speech, thought and writing representation system based on XML-TEI

Brunner presented a long paper ([3]) this year at DH Archive 2014 about building an XML annotation system for Speech, thought and writing representation (ST&WR). ST&WR aims to capture more detail than a conventional analysis would: it is about transcribing feelings and thoughts, and disambiguating the information about a character as perceived during reading.
ST&WR is very close to literary analysis. An annotation system for ST&WR must therefore be precise enough to depict the force of a statement or the ambiguities in a sentence: it must be considerably more precise than standard XML-TEI.

Brunner's system is built on the GATE natural language processing framework and uses the XML markup language. The annotation system distinguishes direct speech from indirect speech, direct thought and so on, giving 12 main categories, each associated with a specific XML tag (direct_speech, direct_thought, etc.).
The system also lets the transcriber specify the degree of ambiguity of a text chunk, or whether it can be considered metaphoric, pragmatic or non-factual.
The following figure (from [3]) displays some of the attributes that can be added to the XML tags for more precision.

Some attributes that can be embedded in XML tags (from [3])
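
To give a feel for how such annotations could be exploited once a text is tagged, here is a small Python sketch. The tag names direct_speech and direct_thought come from the description above; the attribute names (ambiguous, non_factual), their values and the sample sentences are assumptions invented for this example and may differ from the actual scheme in [3].

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample text; the attribute names and values are hypothetical.
SAMPLE = """<text>
  <direct_speech ambiguous="no">I will go, she said.</direct_speech>
  <direct_thought ambiguous="yes" non_factual="yes">What if he never came back?</direct_thought>
  <direct_speech ambiguous="no">Stay, he answered.</direct_speech>
</text>"""

root = ET.fromstring(SAMPLE)

# Count how often each ST&WR category occurs in the sample.
counts = Counter(child.tag for child in root)

# Collect the categories of the chunks the transcriber flagged as ambiguous.
ambiguous = [child.tag for child in root if child.get("ambiguous") == "yes"]

print(counts)
print(ambiguous)
```

Even this simple count shows why attributes matter: the category alone says what kind of representation a chunk is, while the attributes record how confident the transcriber was about it.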

After designing this GATE-based annotation system, Brunner is now working to make it comply with the TEI guidelines, which would make it easier to adopt and let it benefit from the tools already developed for TEI.

As we have seen in this blog post, the XML-TEI standard remains very fertile: it is a versatile framework that is easy to adapt to special needs thanks to its use of the XML markup language.
XML-TEI can be the starting point for a Digital Humanities project, and can easily be adapted by adding XML tags or attributes.
Not detailed in this blog post is the fact that TEI has also inspired other initiatives: for example, MEI ([6], [7]) brings to music annotation the same ideas that TEI brought to the annotation of textual documents.
Beyond the examples shown in this blog post, it is likely that we will see more and more Digital Humanities projects adopting such semantic annotation systems, in more and more versatile ways. If you are interested in XML-TEI, it is likely that, just as in 2014, the DHArchive 2015 conference will host some workshops about it (there were several this year, see [4], [5]). We hope to see you there.

References:

[1] Kawase, Ichimura, Ogiso. Problems in encoding documents of early modern Japanese, in Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-934.xml

[2] Armaselu, Allemand. Building a multi-dimensional space for the analysis of European integration Treaties. An XML-TEI scenario, in Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-338.xml

[3] Brunner. An XML annotation system for Speech, Thought and Writing Representation, in Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Paper-374.xml

[4] Ciula, Czmiel, Mylonas, Rahtz, Cummings, Syd. Hacking with the TEI, in Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Workshops-913.xml

[5] Bodard, Franzini, Stoyanova, Tupman. Introducing the Epidoc collaborative: TEI XML and tools for encoding classical source texts, in Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Workshops-811.xml

[6] The Music Encoding Initiative provides guidelines for music encoding and annotation. Accessible at http://www.music-encoding.org/, retrieved October 21, 2014.

[7] Beer, Bohl, Seuffert. Annotations in digital music edition – concepts, processes and visualisation of annotations, in Digital Humanities Lausanne ’14 Conference Archive, retrieved October 21, 2014, from http://dharchive.org/paper/DH2014/Panel-90.xml