NIF-1.0

This page provides the specification of NIF 1.0. For general information, use cases and the rationale behind NIF, see the About page.

NIF 2.0 Draft

NIF 2.0 supersedes this document and can be implemented. Please look here:
http://persistence.uni-leipzig.org/nlp2rdf/
This document – the specification of NIF 1.0 – will remain here for historical reasons exactly as it is.

NIF 1.0

The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

The core of NIF consists of a vocabulary, which can represent Strings as RDF resources. A special URI Design is used to pinpoint annotations to a part of a document. These URIs can then be used to attach arbitrary annotations to the respective character sequence. Employing these URIs, annotations can be published on the Web as Linked Data and interchanged between different NLP tools and applications.

NIF 1.0 in a nutshell

NIF consists of the following three components:

  1. Structural Interoperability: URI recipes are used to anchor annotations in documents with the help of fragment identifiers. The URI recipes are complemented by two ontologies (the String Ontology and the Structured Sentence Ontology), which are used to describe the basic types of these URIs (i.e. String, Document, Word, Sentence) as well as the relations between them (subString, superString, nextWord, previousWord, etc.).
  2. Conceptual Interoperability: The Structured Sentence Ontology (SSO) was developed especially to connect existing ontologies with the String Ontology and thus attach common annotations to the text fragment URIs. The NIF ontology can easily be extended and integrates several NLP ontologies, such as OLiA for the morphosyntactic NLP domains, the SCMS vocabulary and DBpedia for entity linking, as well as the NERD ontology (see below for details on the ontologies).
  3. Access Interoperability: A REST interface description for NIF components and web services allows NLP tools to interact on a programmatic level.

NIF-1.0 is stable and can be implemented. The experience and feedback collected during implementation will flow into the NIF-2.0 draft. This specification is complemented by the information in this WordPress CMS, which contains documentation on how to integrate NLP tools and adapt them to NIF. Reference implementations are also available as blog posts in the Implementations category, and a demo allows testing the web services.
An overview of the architecture can be found in the next section.

Architecture Overview

Structural Interoperability

Structural Interoperability is concerned with how the RDF is structured to represent annotations.

NIF-1.0 URI Recipes

NIF-1.0 currently supports two recipes: one offset-based and one hash-based. The principal task of the presented URI recipes is to address a part of a document and assign a URI to it that can serve as an annotation anchor. A URI assigned in this way must be unique, and it should be possible to identify the substring in the document with the information contained in the URI.

NIF specifies how to create an identifier for uniquely locating arbitrary substrings in a document.

Running Example. In July 2011, the website http://www.w3.org/DesignIssues/LinkedData.html was an HTML document consisting of 25482 characters. The term “Semantic Web” occurs 9 times in it at different positions. If we want to annotate the substring “Semantic Web” in the section “The four rules”, in the sentence “If it doesn’t use the universal URI set of symbols, we don’t call it Semantic Web.”, a fragment identifier such as #Semantic%20Web would be insufficient: it lacks the information needed to distinguish the different occurrences and would not be unique.

General

A NIF-1.0 URI is made up of a prefix and the actual identifier. To annotate web resources, it is straightforward to use the existing URI as the basis for the prefix. NIF-1.0 does not dictate the use of ‘/’ or ‘#’ and considers this character part of the prefix. The following guidelines should be met for NIF-1.0 URI recipes:

  1. As NIF is designed to have a client-server architecture, the client must be able to dictate the prefix to the server. All components must therefore have a parameter prefix, which determines how the produced NIF URIs will start.
  2. The NIF URIs should be produced by a concatenation of the prefix and the identifier.
  3. The NIF component must use the prefix of the client transparently, without any corrections.
  4. (non-normative) For practical reasons, it is recommended that the client uses a prefix that ends in ‘#’ or ‘/’.

Running Example. A recommended prefix for http://www.w3.org/DesignIssues/LinkedData.html is “http://www.w3.org/DesignIssues/LinkedData.html#”.

NIF Recipe: Offset-based URIs

NOTE: This recipe is compatible with the position and range definition of RFC 5147 (especially Section 2.1.1) and builds upon it in terms of encoding and counting character positions. The syntax and everything else, however, is different. See the Design Choices section for a discussion.

The Offset-based URIs are constructed from 4 parts separated by an underscore “_”:

  1. an identifier, in this case the string “offset”
  2. the begin index
  3. the end index (indexes count the gaps between characters, starting at 0)
  4. a human readable part consisting of the first 20 characters of the addressed string (or fewer, if the string is shorter), URL-encoded according to RFC 3986 (using ‘%20’ instead of ‘+’ for white space).

Running Example. “http://www.w3.org/DesignIssues/LinkedData.html#” serves as the prefix. The rest is “offset_14406_14418_Semantic%20Web”. The human readable part should match the output produced by the following Java code:

// requires java.net.URLEncoder and java.net.URI
String first20Chars = (anchoredPart.length() > 20) ? anchoredPart.substring(0, 20) : anchoredPart;
String prefix = "http://www.w3.org/DesignIssues/LinkedData.html#";
String uri;
// variant A: percent-encode the human readable part directly
uri = prefix + URLEncoder.encode(first20Chars, "UTF-8").replaceAll("\\+", "%20");
// variant B: let java.net.URI percent-encode the fragment
URI tmp = new URI("http", "www.w3.org", "/DesignIssues/LinkedData.html", first20Chars);
uri = tmp.toASCIIString();

or by this PHP method:

$prefix = "http://www.w3.org/DesignIssues/LinkedData.html#";
$uri = $prefix . rawurlencode(substr($anchoredPart, 0, 20));

and together the final URI will look like this:

http://www.w3.org/DesignIssues/LinkedData.html#offset_14406_14418_Semantic%20Web
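The following sketch assembles a complete offset-based URI from the document text and the begin and end indexes, combining the prefix guidelines with the four URI parts described above. It is illustrative only; buildOffsetUri is a hypothetical helper, not part of the specification:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class OffsetUriExample {

    // hypothetical helper, shown for illustration only
    static String buildOffsetUri(String prefix, String documentText, int begin, int end)
            throws UnsupportedEncodingException {
        String anchoredPart = documentText.substring(begin, end);
        // human readable part: at most the first 20 characters of the string
        String first20Chars = (anchoredPart.length() > 20)
                ? anchoredPart.substring(0, 20) : anchoredPart;
        String encoded = URLEncoder.encode(first20Chars, "UTF-8").replaceAll("\\+", "%20");
        return prefix + "offset_" + begin + "_" + end + "_" + encoded;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String prefix = "http://www.w3.org/DesignIssues/LinkedData.html#";
        // toy text standing in for the 25482-character document,
        // where begin=14406 and end=14418 address "Semantic Web"
        String documentText = "... we don't call it Semantic Web.";
        // prints ...#offset_21_33_Semantic%20Web
        System.out.println(buildOffsetUri(prefix, documentText, 21, 33));
    }
}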

NIF Recipe: Context-Hash-based URIs

The greatest disadvantage of the offset-based recipe is that it is not stable w.r.t. changes in the document.
In case of a change to the document (insertion or deletion), all offset-based NIF URIs after the position where the change occurred become invalid. The hash-based recipe is designed to remain more robust against insertion and deletion. Some additional implications of the context-hash-based URIs can be found in the Design Choices section below. The hash-based URIs are constructed from 5 parts separated by an underscore “_”:

  1. an identifier, in this case the string “hash”
  2. the context length (the number of characters to the left and right that are used in the message for the hash digest)
  3. the overall length of the addressed string
  4. the message digest, a 32-character hexadecimal MD5 hash computed over the string and its context. The message M consists of a certain number C of characters (see 2., context length, above) to the left of the string, a bracket ‘(’, the string itself, another bracket ‘)’ and C characters to the right of the string: “leftContext(String)rightContext”
  5. a human readable part consisting of the first 20 characters of the addressed string (or fewer, if the string is shorter), URL-encoded according to RFC 3986 (using ‘%20’ instead of ‘+’ for white space).

Running Example. This example uses a context length of 4; the digest is therefore computed over the message “ it (Semantic Web)” followed by the 4 characters to the right of the string:

md5(" it (Semantic Web)" + right context of 4 characters)

The resulting URI is this:
http://www.w3.org/DesignIssues/LinkedData.html#hash_4_12_79edde636fac847c006605f82d4c5c4d_Semantic%20Web
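A minimal sketch of the hash computation; buildHashUri is a hypothetical helper, and java.security.MessageDigest provides the MD5 implementation. Note that the digest of the running example depends on the actual right context in the full document:

import java.math.BigInteger;
import java.net.URLEncoder;
import java.security.MessageDigest;

public class ContextHashExample {

    // hypothetical helper, shown for illustration only
    static String buildHashUri(String prefix, String text, int begin, int end,
                               int contextLength) throws Exception {
        String anchoredPart = text.substring(begin, end);
        String left = text.substring(Math.max(0, begin - contextLength), begin);
        String right = text.substring(end, Math.min(text.length(), end + contextLength));
        // message: leftContext(String)rightContext
        String message = left + "(" + anchoredPart + ")" + right;
        byte[] digest = MessageDigest.getInstance("MD5").digest(message.getBytes("UTF-8"));
        // zero-padded 32 character hexadecimal representation
        String md5 = String.format("%032x", new BigInteger(1, digest));
        String first20Chars = (anchoredPart.length() > 20)
                ? anchoredPart.substring(0, 20) : anchoredPart;
        String encoded = URLEncoder.encode(first20Chars, "UTF-8").replaceAll("\\+", "%20");
        return prefix + "hash_" + contextLength + "_" + anchoredPart.length()
                + "_" + md5 + "_" + encoded;
    }

    public static void main(String[] args) throws Exception {
        // toy text; in the real document the right context continues after the
        // full stop, so the digest differs from the one computed here
        String text = "we don't call it Semantic Web.";
        System.out.println(buildHashUri(
                "http://www.w3.org/DesignIssues/LinkedData.html#", text, 17, 29, 4));
    }
}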

NIF-1.0 URI Recipe Overview

String Ontology

The String Ontology is a vocabulary to describe strings and builds the foundation of the NLP Interchange Format. It has a class String and a property anchorOf to anchor URIs in a given text and to describe the relations between these string URIs. The ontology is published under the namespace http://nlp2rdf.lod2.eu/schema/string/.

The available properties are descriptive. Only some properties are actually required for NIF; all others are optional, and their use is even discouraged by default, as they cause a large number of logical assertions (there are many transitive properties). Overall, only these terms are of primary importance (a short usage sketch follows the list):

  • http://nlp2rdf.lod2.eu/schema/string/OffsetBasedString
  • http://nlp2rdf.lod2.eu/schema/string/ContextHashBasedString
  • http://nlp2rdf.lod2.eu/schema/string/Document
  • http://nlp2rdf.lod2.eu/schema/string/anchorOf
  • http://nlp2rdf.lod2.eu/schema/string/subString
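As an illustration of these terms, the following sketch types an offset-based URI and anchors it in the text via str:anchorOf. It assumes Apache Jena, which NIF does not mandate:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class StringOntologyExample {
    static final String STR = "http://nlp2rdf.lod2.eu/schema/string/";
    static final String LD = "http://www.w3.org/DesignIssues/LinkedData.html#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("str", STR);
        model.setNsPrefix("ld", LD);
        // the URI of the running example, typed and anchored in the text
        Resource string = model.createResource(LD + "offset_14406_14418_Semantic%20Web");
        string.addProperty(RDF.type, model.createResource(STR + "OffsetBasedString"));
        string.addProperty(model.createProperty(STR, "anchorOf"), "Semantic Web");
        model.write(System.out, "TURTLE");
    }
}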

Structured Sentence Ontology

The Structured Sentence Ontology (SSO) is built upon the String Ontology and additionally provides classes for three basic units: Sentences, Phrases and Words. Properties such as sso:nextWord and sso:previousWord can be used to express the relations between these units. Furthermore, properties are provided for the most common annotations, such as data type properties for stem, lemma, statistics, etc. The ontology is published under the namespace http://nlp2rdf.lod2.eu/schema/sso/.

Normative requirements

The URI recipes of NIF are designed to make it possible to have zero overhead and only use one triple per annotation, such as in the following example:

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/> .
@prefix rev: <http://purl.org/stuff/rev#> .
ld:offset_14406_14418_Semantic%20Web rev:hasComment "Hey Tim, good idea that Semantic Web!" .

Some additional triples are added, however, to ease programmatic access.

1. All URIs created by the above-mentioned URI recipes should be typed with the respective OWL Class for the recipe (str:OffsetBasedString or str:ContextHashBasedString)

This produces one additional triple per generated URI:

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/> .
@prefix rev: <http://purl.org/stuff/rev#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
ld:offset_14406_14418_Semantic%20Web rev:hasComment "Hey Tim, good idea that Semantic Web!" .
ld:offset_14406_14418_Semantic%20Web rdf:type str:OffsetBasedString .

2. In each returned NIF model, there should be at least one URI that relates to the document as a whole and that either references the page with the property str:sourceUrl or includes the whole text of the document with str:sourceString.

The definition of Document in NIF is closely tied to the request issued to annotate it. So each piece of text that is sent to a service is treated as a document. This produces three additional triples per Document (or request).

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
ld:offset_0_25482_%3Chtml%20xmlns%3D%22http%3A%2F%2F rdf:type str:OffsetBasedString .
ld:offset_0_25482_%3Chtml%20xmlns%3D%22http%3A%2F%2F rdf:type str:Document .
ld:offset_0_25482_%3Chtml%20xmlns%3D%22http%3A%2F%2F str:sourceUrl <http://www.w3.org/DesignIssues/LinkedData.html> .

3. For each document in a NIF model, all other strings that use NIF URIs should be reachable via the str:subStringTrans property by inference. Either str:subString must be used for this or a subproperty thereof.

Programmatically, it might be required to iterate over the strings contained in the document. This means that they have to be connected via a property. To meet this requirement, either str:subString must be added between the URIs (where appropriate) or another property that is an rdfs:subPropertyOf of str:subString. Examples of such subproperties are sso:word, sso:firstWord, sso:lastWord and sso:child.

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
ld:offset_0_25482_%3Chtml%20xmlns%3D%22http%3A%2F%2F rdf:type str:Document .
ld:offset_0_25482_%3Chtml%20xmlns%3D%22http%3A%2F%2F str:subString ld:offset_14406_14418_Semantic%20Web .

Design Choices and Future Work

This section explains the rationale behind the design choices of this specification. Discussion and proposals are vital and possible via the comment section below or as described on the Get Involved page.

  • The additional brackets “(” and “)” in the context-hash-based URIs were introduced to make the hash more distinguishable. If there is a sentence “The dog barks.” and the context size is too big, e.g. 10, then “The”, “dog” and “barks” would otherwise all have the same hash, because the message would be the whole sentence in each case. Compare md5("The dog barks.") vs. md5("The (dog) barks.") vs. md5("The dog (barks).")
  • Note that context-hash-based URIs are unique identifiers of a specific string only if the context size is chosen sufficiently large. If, for example, a complete sentence is repeated in the document, parts of the preceding and/or subsequent sentences have to be included to make the reference to the string unique. In many NLP applications, however, a unique reference to a specific string is not necessary; rather, all word forms within the same minimal context (e.g. one preceding and one following word) are required to be analysed in the same way. In that case, a context-hash-based URI refers uniquely to a word class, not to one specific string. Using a small context, a programmer can therefore refer to a whole class of words rather than an individual word, e.g. she can assign every occurring string “the” (with one preceding and one following white space as context) the part-of-speech tag “DT”: md5(" (the) "). The resulting URI is http://www.w3.org/DesignIssues/LinkedData.html#hash_md5_1_5_8dc0d6c8afa469c52ac4981011b3f582_the
  • The two ontologies should not be included in the NIF output by default (via owl:imports). Some axioms will produce a lot of entailed triples, especially the transitive properties. It is generally better to write some additional client code to add the required annotations, such as sso:nextWordTrans or str:subString and str:superString, than to infer them with a reasoner such as Pellet. SPARQL CONSTRUCT is the way to go in this case.
    CONSTRUCT {?word1 sso:nextWordTrans ?word2}
    WHERE {?word1 sso:nextWord ?word2};

    executed once and

    CONSTRUCT {?word1 sso:nextWordTrans ?word3}
    WHERE {?word1 sso:nextWordTrans ?word2 . ?word2 sso:nextWord ?word3}

    executed N-1 times (N is the number of words of the longest sentence), with the constructed triples added back to the model after each pass, might just serve the same purpose and will be a lot faster (a client-side sketch follows this list).

  • At the moment, only two URI recipes are included, which operate on text in general. Text is defined as either a sequence of characters or simply everything that can be opened in a text editor. This entails that HTML, XML, source code, CSS and everything else (except e.g. binary formats) can be addressed with NIF URIs. This is the most general use case; in the future, more content-specific URI recipes such as XPath/XPointer for XML might be included.
  • The explicit typing of the URIs with the class OffsetBasedString is useful for determining the type of a URI before parsing it. The explicit typing is, of course, redundant, because the information is already included in the URI. This redundancy might be removed in the future.
  • The compatibility problem with RFC 5147 originates in the dilemma that fragment ids of URIs are normally media-type specific. RFC 5147 is valid for plain text (i.e. text without markup), which is disjoint with HTML. Achieving full compatibility comes with additional (unnecessary) ballast, while providing hardly any advantages for NLP tools. See the full discussion and the final email on the mailing list.
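The CONSTRUCT strategy sketched in the list above can be executed from client code. Here is a sketch assuming Apache Jena/ARQ (the class and method names are illustrative, not part of the specification); it copies sso:nextWord into sso:nextWordTrans once and then extends the chains until a pass adds no new triples:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;

public class ClosureExample {
    static final String PREFIX = "PREFIX sso: <http://nlp2rdf.lod2.eu/schema/sso/> ";

    static void materializeNextWordTrans(Model model) {
        // executed once: seed nextWordTrans with nextWord
        runConstruct(model, PREFIX
                + "CONSTRUCT {?w1 sso:nextWordTrans ?w2} WHERE {?w1 sso:nextWord ?w2}");
        long before;
        do {
            before = model.size();
            // each pass extends the chains by one hop (at most N-1 passes)
            runConstruct(model, PREFIX
                    + "CONSTRUCT {?w1 sso:nextWordTrans ?w3} "
                    + "WHERE {?w1 sso:nextWordTrans ?w2 . ?w2 sso:nextWord ?w3}");
        } while (model.size() > before); // fixpoint: nothing new was added
    }

    static void runConstruct(Model model, String query) {
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            model.add(qe.execConstruct());
        }
    }
}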

Conceptual Interoperability

Conceptual Interoperability is concerned with which ontologies and vocabularies are used to represent the actual annotations in RDF. We divided the different outputs of tools into NLP domains. For each domain, a vocabulary was or will be chosen that serves the most common use cases and facilitates interoperability. In simple cases, a property has been created in the Structured Sentence Ontology; in more complex cases, fully developed linguistic ontologies already existed and were reused.
This part of the specification is also extended on the fly, i.e. everything in this section is stable (backward-compatible), but additional NLP domains might be appended at any time. New NLP domains can be requested or proposed here.

Golden rule of conceptual interoperability

Here is the golden rule of interoperability with respect to a reference ontology:

Alongside the local annotations used by the tool, minimal information must be added so that the local annotations can be disambiguated with the help of a reference ontology, enabling data integration and interoperability.

NLP Domains

Begin/end index and context

Adding extra triples for indexes and context produces redundant information, as the begin and end index as well as the context can be calculated from the URI. Properties for describing the context and the begin and end index can be found in the String Ontology.

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix str: <http://nlp2rdf.lod2.eu/schema/string/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ld:offset_14406_14418_Semantic%20Web str:beginIndex "14406"^^xsd:int .
ld:offset_14406_14418_Semantic%20Web str:endIndex "14418"^^xsd:int .
ld:offset_14406_14418_Semantic%20Web str:leftContext " it " .
ld:offset_14406_14418_Semantic%20Web str:leftContext4 " it " .
ld:offset_14406_14418_Semantic%20Web str:rightContext4 ". …" .

Lemmata, stems, stop words, etc.

Lemma and stem annotations are realized as simple data type properties in the Structured Sentence Ontology.
A class (sso:StopWord) is used for stop words.

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix sso: <http://nlp2rdf.lod2.eu/schema/sso/> .
ld:offset_14406_14414_Semantic sso:stem "Semant" .
ld:offset_14406_14414_Semantic sso:lemma "Semantic" .
ld:offset_14406_14414_Semantic a sso:StopWord .

Part of speech tags

Annotations for part-of-speech tags must make use of OLiA (Ontologies of Linguistic Annotation).
OLiA connects local annotation tag sets with a global reference ontology. It therefore allows keeping the specific part-of-speech tag at a fine granularity, while at the same time having a coarse-grained reference model. The RDF output must contain the original tag as a plain literal using the property sso:posTag, as well as a link to the respective OLiA individual in the annotation model of the used tag set using the property sso:oliaLink.

Here is an example using the Penn tag set:

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix sso: <http://nlp2rdf.lod2.eu/schema/sso/> .
@prefix penn: <http://purl.org/olia/penn.owl#> .
ld:offset_14406_14414_Semantic sso:posTag "JJ" .
ld:offset_14406_14414_Semantic sso:oliaLink penn:JJ .

The extended RDF output can additionally contain the following:

  1. all classes of the OLiA reference ontology (especially from olia.owl and olia-top.owl) that belong to this individual can be copied and added to the str:String via rdf:type
  2. the rdfs:subClassOf axioms between these classes must be copied as well.

All necessary information has to be copied for conciseness and self-containedness. If the information is copied in this way, it is unnecessary for the client to include the three OLiA reference ontologies: olia-top.owl, olia.owl, system.owl. In total, these three models amount to almost 1000 OWL classes. An overview of OLiA is available online.

Here is an example of the extended output:

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix sso: <http://nlp2rdf.lod2.eu/schema/sso/> .
@prefix penn: <http://purl.org/olia/penn.owl#> .
@prefix olia-top: <http://purl.org/olia/olia-top.owl#> .
@prefix olia: <http://purl.org/olia/olia.owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ld:offset_14406_14414_Semantic sso:posTag "JJ" .
ld:offset_14406_14414_Semantic sso:oliaLink penn:JJ .
ld:offset_14406_14414_Semantic rdf:type olia:Adjective .
ld:offset_14406_14414_Semantic rdf:type olia-top:MorphosyntacticCategory .
ld:offset_14406_14414_Semantic rdf:type olia-top:LinguisticConcept .
olia:Adjective rdfs:subClassOf olia-top:MorphosyntacticCategory .
olia-top:MorphosyntacticCategory rdfs:subClassOf olia-top:LinguisticConcept .

Currently, there are some small disadvantages, which might be fixed in the next NIF version:

  • The ontologies might still change a little now and then and are unversioned. In the next NIF version, we will provide stable releases.
  • The semantics are not modelled exactly, as instances of str:String are intuitively not instances of olia-top:LinguisticConcept. Normally, these classes would be modelled as disjoint, but the current approach reduces the number of RDF triples. Note that this saves an extra pattern in each SPARQL query.
  • Also, for the sake of conciseness, subClassOf axioms from OLiA are repeated. This might formally be considered ontology hijacking. Again, it is more concise than importing 1000 OWL classes.

Syntax Analysis

in development

Topic Models

An ontology draft for Topic Models is available online.

Named Entity Recognition and Entity Linking

For describing Named Entity Recognition (NER), NIF uses one property from the SCMS vocabulary.
For entity linking, the property http://ns.aksw.org/scms/means must be used.

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix scms: <http://ns.aksw.org/scms/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
ld:offset_22849_22852_W3C scms:means dbpedia:World_Wide_Web_Consortium .

For entity typing the Named Entity Recognition and Disambiguation Ontology (http://nerd.eurecom.fr/ontology/) must be used alongside the local annotation types.

 

Here are three examples, for Spotlight, Zemanta and Extractiv, which showcase how important a reference ontology is, even for such a trivial annotation as “organization”. Note the three different spellings: “Organisation”, “organization”, “ORGANIZATION”.

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix nerd: <http://nerd.eurecom.fr/ontology#> .
# Spotlight (DBpedia Ontology)
ld:offset_22849_22852_W3C rdf:type dbo:Organisation .
ld:offset_22849_22852_W3C rdf:type nerd:Organization .
# Zemanta (zemanta: denotes the tool-specific Zemanta namespace)
ld:offset_22849_22852_W3C rdf:type zemanta:organization .
ld:offset_22849_22852_W3C rdf:type nerd:Organization .
# Extractiv (extractiv: denotes the tool-specific Extractiv namespace)
ld:offset_22849_22852_W3C rdf:type extractiv:ORGANIZATION .
ld:offset_22849_22852_W3C rdf:type nerd:Organization .

Access Interoperability

Most aspects of access interoperability are already tackled by using the RDF standard. This section contains several smaller additions, which further improve the interoperability and accessibility of NIF components. They are normative, but of minor importance compared to Structural and Conceptual Interoperability. While the first two enable interoperability in general, Access Interoperability provides easier integration and off-the-shelf solutions.

Workflow

NIF itself is a format, which can be used for importing and exporting data from and to NLP tools. NIF therefore enables the creation of ad-hoc workflows following a client-server model or the SOA principle. In this approach, the client is responsible for implementing the workflow. The diagram below shows the communication model. The client sends requests to the different tools, either as text or RDF, and then receives an answer in RDF. This RDF can be aggregated into a local RDF model. External RDF data can likewise be requested and added transparently, without any additional formalism. For acquiring and merging external data from knowledge bases, the wealth of existing RDF techniques (such as Linked Data or SPARQL) can be used.
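A minimal client-side sketch of this workflow, assuming Apache Jena; the service URLs and parameter values are placeholders, not real endpoints:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class WorkflowExample {
    public static void main(String[] args) {
        Model aggregate = ModelFactory.createDefaultModel();
        // each NIF web service answers in RDF; both responses go into one local model
        aggregate.read("http://example.org/tokenizer?input-type=text&input=...", "RDF/XML");
        aggregate.read("http://example.org/tagger?input-type=text&input=...", "RDF/XML");
        // external Linked Data can be added with the same mechanism
        aggregate.write(System.out, "TURTLE");
    }
}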

Interfaces

Currently, the main interface is a wrapper that provides the NIF web service. Other interfaces, such as a CLI or a Java interface (using Jena), are easily possible and can be provided in addition to a web service. The web service must:

  1. be stateless
  2. treat POST and GET the same
  3. accept the parameters in the next section

The component may have any number of additional parameters to configure and fine tune results.

Parameters

Note that parameters must be URL-encoded before they are sent to a web service; a minimal request sketch is shown after the parameter lists.

These parameters must be used for each request (required):

  • input-type = "text" | "nif-owl"
    Determines the content required for the next parameter input: either plain text or RDF/XML in NIF
  • input = text | rdf/xml
    Either the text or the RDF in NIF, serialised as RDF/XML

These parameters can be used for each request (optional):

  • nif = "true" | "nif-1.0"
    If the application already has a web service, use this parameter to enable NIF output (this allows developing NIF in parallel with existing output, i.e. legacy support). It is recommended that a client always sends this parameter with each request.
  • format = "rdfxml" | "ntriples" | "turtle" | "n3" | "json"
    The RDF serialisation format. Standard RDF frameworks generally support all the different formats.
  • prefix = uriprefix
    A prefix which is used to create new URIs. The client should therefore ensure that the prefix is valid when used at the start of a URI, e.g. http://test.de/test#. If input-type="nif-owl" is used, the server must not change existing URIs and must use the prefix only when creating new URIs. It is recommended that the client matches the prefix to the previously used prefix. If the parameter is missing, a sensible default should be substituted (e.g. the web service URI).
  • urirecipe = "offset" | "context-hash"
    The URI recipe that should be used (the default is "offset")
  • context-length = integer
    If the chosen URI recipe is context-hash, the client can determine the length of the context with this parameter. The default must be 10.
  • debug = "true" | "false"
    This option determines whether additional debug messages and errors should be written within the RDF. See also the next section.
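A minimal sketch of such a request (the service URL is a placeholder); all parameter values are URL-encoded as required above:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class NifRequestExample {
    public static void main(String[] args) throws Exception {
        // only parameters defined in this section are used
        String body = "input-type=" + URLEncoder.encode("text", "UTF-8")
                + "&input=" + URLEncoder.encode("We don't call it Semantic Web.", "UTF-8")
                + "&nif=" + URLEncoder.encode("true", "UTF-8")
                + "&format=" + URLEncoder.encode("turtle", "UTF-8")
                + "&prefix=" + URLEncoder.encode("http://test.de/test#", "UTF-8");

        // placeholder URL; the service must treat POST and GET the same
        HttpURLConnection con = (HttpURLConnection)
                new URL("http://example.org/nif-service").openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.getOutputStream().write(body.getBytes(StandardCharsets.UTF_8));
        try (InputStream in = con.getInputStream()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}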

Errors

For easier debugging, errors and messages have to be written into the RDF output. If a fatal error occurs (i.e. the component was unable to produce any output), the RDF output should contain a message using the underspecified error ontology provided by NLP2RDF: http://nlp2rdf.lod2.eu/schema/error/
Errors can be given any URI or blank node; they must be typed as error:Error, must specify whether they are fatal, and must contain an error message. Here is an example:

@prefix error: <http://nlp2rdf.lod2.eu/schema/error/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
ld:error_1 rdf:type error:Error .
ld:error_1 error:fatal "1"^^xsd:boolean .
ld:error_1 error:hasMessage "Wrong input parameter ..., could not parse ontology" .

If the debug parameter is set, the component is allowed to mix error messages with the RDF that contains the annotations:

@prefix error: <http://nlp2rdf.lod2.eu/schema/error/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
ld:error_1 rdf:type error:Error .
ld:error_1 error:fatal "0"^^xsd:boolean .
ld:error_1 error:hasMessage "Sentence 5 and 6 had a low confidence value for sentence splitting." .
ld:error_2 rdf:type error:Error .
ld:error_2 error:fatal "0"^^xsd:boolean .
ld:error_2 error:hasMessage "Could not add extended output for OLiA." .
