About

NLP2RDF is a community project bootstrapped by LOD2 that is developing the NLP Interchange Format (NIF). NIF aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The output of NLP tools can be converted into RDF and used in the LOD2 Stack or other graph databases. NIF 2.0 is currently being developed; for an overview, click on Documentation – NIF 2.0 in the menu.

This project is open for anyone to join and there are several ways to get involved. For slides and further reading please have a look at the publications page.

Rationale

The original motivation for creating NIF was quite simple. To be integrated into the LOD2 stack, NLP tools had to produce RDF. Instead of writing an individual RDF wrapper for each tool, it made perfect sense to create a common format expressive enough to potentially cover all NLP tools. Furthermore, instead of creating a conceptual mapping between the outputs of the tools, several existing linguistic ontologies could be reused to unify tag sets and other NLP dialects. Although NIF is being generalized to provide additional benefits, the main rationale remains the integration of NLP tools into the LOD2 stack.

NIF Use Cases

NIF aims to address certain disadvantages commonly found in NLP frameworks, processes and tools. Its basic advantages lie in how annotations are represented and in which kinds of annotations are used. Combined, these two aspects provide structural as well as conceptual interoperability.

Here are some claims that we believe can be made about NIF (i.e. we have not yet found evidence that contradicts them or their technical feasibility):

  1. NIF provides global interoperability. If an NLP tool incorporates a NIF parser and a NIF serializer, it is compatible with all other tools that implement NIF.
  2. NIF achieves this interoperability by defining and using a common denominator for annotations. This means that some standard annotations are required. On the other hand, NIF is flexible and allows NLP tools to add any extra annotations at will.
  3. NIF makes it possible to create tool chains without a large amount of up-front development work. As the output of each tool is compatible, you can quickly test whether the tools you selected actually produce what you need to solve a certain task.
  4. As NIF is based on RDF/OWL, you can choose from a broad range of tools and technologies to work with it:
    • RDF makes data integration easy: URIs, LinkedData
    • OWL is based on Description Logics (Types, Type inheritance)
    • Availability of open data sets (access and licence)
    • Reusability of Vocabularies and Ontologies
    • Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
    • Scalable tool support (Databases, Reasoning)
    • Data is flexible and can be queried / transformed in many ways
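
The interoperability claims in points 1 to 3 rest on tools addressing the same piece of text with the same URI, so that their outputs can be combined as plain sets of triples. The following is a minimal sketch in Python using only the standard library. The two tool functions, the document URI and the ex: vocabulary are hypothetical, and the offset-based "#char=begin,end" URI pattern and the nif:anchorOf property are assumptions based on the NIF 2.0 core ontology, which this page does not spell out.

```python
# Sketch: two hypothetical tools annotate the same text; their outputs merge by set union.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"  # assumed NIF 2.0 namespace
EX = "http://example.org/vocab#"  # hypothetical vocabulary for tool-specific annotations
DOC = "http://example.org/doc1"   # hypothetical document URI

TEXT = "Leipzig is a city in Germany."


def span_uri(begin, end):
    """Offset-based URI for a substring (assumed NIF-style '#char=begin,end' scheme)."""
    return f"{DOC}#char={begin},{end}"


def tokenizer_tool(text):
    """Hypothetical tool 1: marks whitespace-separated tokens."""
    triples = set()
    pos = 0
    for token in text.split():
        begin = text.index(token, pos)
        end = begin + len(token)
        pos = end
        triples.add((span_uri(begin, end), NIF + "anchorOf", token))
    return triples


def entity_tool(text):
    """Hypothetical tool 2: tags known place names from a tiny lookup list."""
    triples = set()
    for name in ("Leipzig", "Germany"):
        begin = text.find(name)
        if begin != -1:
            triples.add((span_uri(begin, begin + len(name)), EX + "entityType", EX + "Place"))
    return triples


# An RDF graph is a set of triples, so combining tool outputs is a set union.
merged = tokenizer_tool(TEXT) | entity_tool(TEXT)

# Both tools describe the substring "Leipzig" under the same URI.
leipzig = span_uri(0, 7)
about_leipzig = sorted(t for t in merged if t[0] == leipzig)
for triple in about_leipzig:
    print(triple)
```

Because neither tool needs to know about the other, adding a third tool to the chain only means adding another set to the union. A real setup would hand the triples to an RDF library or a triple store instead of keeping them as Python tuples.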

Comparison to UIMA and Gate

NIF is almost completely orthogonal to frameworks such as Gate and UIMA. By definition, it is a format that represents NLP output, while Gate and UIMA are software frameworks for NLP. Here is a rough guideline on when to use NIF and when not to:

Use UIMA and Gate, when:

  • You need to annotate a very large volume of text on a daily basis.
  • You already know which tools and annotations you need, and adapters and plugins for UIMA or Gate already exist.
  • You want to solve a few specialized tasks, such as identifying keywords or finding certain facts. For this you are planning one custom application and you do not have any additional requirements for RDF or interoperability.

Use NIF, when:

  • You are using the LOD2 Stack
  • The rest of your data is already in RDF
  • You want to query your text documents with SPARQL
  • You are not sure which tools to use and want to first try them and test the results.
  • You have a fixed text collection (or a low daily throughput) and want to unlock the implicit meaning. The text can be processed once, saved as RDF and then transformed easily or queried in a triple store.
  • You need annotations for several languages (multilingualism) in a uniform way

Definitely refrain from trying to build a scalable application that uses RDF/OWL as its internal data format. RDF and OWL are great for flexibility, reasoning and data integration, but NOT for performance.

Instead, consider using UIMA or Gate and then serialising the output as NIF.
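
As a rough illustration of that last step, the sketch below turns standoff annotations (begin offset, end offset, label), the kind of result a framework pipeline typically yields, into NIF-style Turtle. It is not the NLP2RDF reference implementation: the function name to_nif_turtle, the document URI and the ex:label property are hypothetical, and the nif: terms and the "#char=begin,end" URI pattern are assumptions based on the NIF 2.0 core ontology.

```python
# Sketch: serialise standoff annotations as NIF-style Turtle (standard library only).
NIF_NS = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"  # assumed NIF 2.0 namespace
XSD_NS = "http://www.w3.org/2001/XMLSchema#"
EX_NS = "http://example.org/vocab#"  # hypothetical vocabulary


def turtle_string(value):
    """Escape a Python string as a Turtle string literal."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def to_nif_turtle(doc_uri, text, annotations):
    """Build a Turtle document from (begin, end, label) standoff annotations."""
    lines = [
        f"@prefix nif: <{NIF_NS}> .",
        f"@prefix xsd: <{XSD_NS}> .",
        f"@prefix ex: <{EX_NS}> .",
        "",
    ]
    context = f"<{doc_uri}#char=0,{len(text)}>"
    # The context resource carries the full text that all offsets refer to.
    lines += [
        f"{context} a nif:Context ;",
        f"    nif:isString {turtle_string(text)} .",
        "",
    ]
    for begin, end, label in annotations:
        if not 0 <= begin < end <= len(text):
            raise ValueError(f"offsets out of range: {begin},{end}")
        lines += [
            f"<{doc_uri}#char={begin},{end}> a nif:String ;",
            f"    nif:referenceContext {context} ;",
            f'    nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;',
            f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;',
            f"    nif:anchorOf {turtle_string(text[begin:end])} ;",
            f"    ex:label {turtle_string(label)} .",
            "",
        ]
    return "\n".join(lines)


text = "Leipzig is a city in Germany."
annotations = [(0, 7, "Place"), (21, 28, "Place")]  # made-up framework output
turtle = to_nif_turtle("http://example.org/doc1", text, annotations)
print(turtle)
```

The heavy processing stays inside the framework, and RDF appears only at the boundary, where the result can be loaded into a triple store and queried.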

Furthermore, NLP2RDF provides:

  • documentation
  • reference implementations of NIF
  • a collaboration platform
  • tutorials / example source code
  • a mailing list for questions and support
  • the possibility to join at https://nlp2rdf.org
