Apache Jena TDB CRUD operations

In this tutorial we explain Apache Jena TDB CRUD operations with simple examples. The CRUD operations are implemented with the Jena programming API instead of SPARQL. We also give a deeper look at the internal operations of the TDB triple store and show some tips and tricks to avoid common programming errors.

The example code is available on GitHub.

1. What are Apache Jena TDB CRUD operations?

Apache Jena is an open source Java framework for Semantic Web and Linked Data applications. It offers RDF and SPARQL support, an Ontology API and reasoning support, as well as a triple store (TDB) and a SPARQL server (Fuseki).

CRUD is an abbreviation for create, read, update and delete, the most basic database operations. The same operations are available for triple stores and are shown in this tutorial for TDB.

2. Install Apache Jena and TDB

  1. You can download the required libraries manually and add them to your Java Build Path. I recommend downloading the full Apache Jena framework so that you can use the rest of the Jena API later on. It is available from the Apache Jena download page.
  2. If you use Maven, add the Apache Jena libraries to the dependencies in your pom.xml.
  3. We use the latest stable release, which is 2.13.0 at the time of writing. Do not forget to update your Maven project afterwards.
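
For the Maven route, the dependency entry might look like the fragment below. This is a sketch based on an assumption: it uses the `apache-jena-libs` aggregate artifact (type `pom`), which pulls in the Jena modules including TDB. The dependency used in the original example project may differ.

```xml
<!-- Assumption: the aggregate artifact is used; it brings in jena-core, jena-arq and jena-tdb. -->
<dependency>
  <groupId>org.apache.jena</groupId>
  <artifactId>apache-jena-libs</artifactId>
  <type>pom</type>
  <version>2.13.0</version>
</dependency>
```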

3. Writing a Java class for TDB access

  1. We create a class called TDBConnection. In the constructor we initialize the TDB triple store with a path pointing to the folder where the data is stored. We need a Dataset, which is a collection of named graphs and one unnamed default graph.
  2. If you have an ontology you want to store and manipulate, you can load it into the store with a dedicated load function. The begin and end calls mark a transaction, and we strongly recommend using transactions throughout your application. They speed up read operations and protect the data against corruption caused by process termination or system crashes. You basically store multiple named models (named graphs) in the dataset. In addition, you can store one default graph (no name).
  3. If we do not want to load an ontology or model, we can build it from scratch using an add method.
  4. Next we read stored triples. The results are returned as a List of Statements.
  5. To remove triples we use a corresponding remove function.
  6. An update can be realized by removing the old triple and adding the new one.
  7. Finally, we close the triple store once we have finished our transactions.
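
The steps above could be combined into a class along the following lines. This is a minimal sketch, not the original example code: the method names (`loadModel`, `addStatement`, `getStatements`, `removeStatement`, `close`) and their string-based signatures are assumptions made for illustration. It uses the `com.hp.hpl.jena` packages of the Jena 2.x line and, for simplicity, treats every object as a resource rather than a literal.

```java
import java.util.List;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.util.FileManager;

/** Sketch of a small wrapper around a TDB dataset (names are hypothetical). */
class TDBConnection {
    private final Dataset ds;

    /** Opens (or creates) the TDB store located in the given folder. */
    TDBConnection(String path) {
        ds = TDBFactory.createDataset(path);
    }

    /** Loads an RDF/OWL file into the named model inside a write transaction. */
    void loadModel(String modelName, String filePath) {
        ds.begin(ReadWrite.WRITE);
        try {
            Model model = ds.getNamedModel(modelName);
            FileManager.get().readModel(model, filePath);
            ds.commit();
        } finally {
            ds.end();
        }
    }

    /** Adds one triple; the object is assumed to be a resource URI. */
    void addStatement(String modelName, String subject, String property, String object) {
        ds.begin(ReadWrite.WRITE);
        try {
            Model model = ds.getNamedModel(modelName);
            model.add(model.createStatement(model.createResource(subject),
                    model.createProperty(property), model.createResource(object)));
            ds.commit();
        } finally {
            ds.end();
        }
    }

    /** Reads matching triples; a null argument acts as a wildcard. */
    List<Statement> getStatements(String modelName, String subject, String property, String object) {
        ds.begin(ReadWrite.READ);
        try {
            Model model = ds.getNamedModel(modelName);
            Resource s = (subject == null) ? null : model.createResource(subject);
            Property p = (property == null) ? null : model.createProperty(property);
            RDFNode o = (object == null) ? null : model.createResource(object);
            // Materialize the iterator before the transaction ends.
            return model.listStatements(s, p, o).toList();
        } finally {
            ds.end();
        }
    }

    /** Removes one triple inside a write transaction. */
    void removeStatement(String modelName, String subject, String property, String object) {
        ds.begin(ReadWrite.WRITE);
        try {
            Model model = ds.getNamedModel(modelName);
            model.remove(model.createStatement(model.createResource(subject),
                    model.createProperty(property), model.createResource(object)));
            ds.commit();
        } finally {
            ds.end();
        }
    }

    /** Closes the dataset once all transactions are finished. */
    void close() {
        ds.close();
    }
}
```

With this sketch, an update is a call to `removeStatement` followed by `addStatement`. Wrapping every call in `try`/`finally` ensures that `end()` is reached even if an operation throws, so a failed write is not committed.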

Now we can move on to write a small test application.

4. Write a test application for the TDB Connection

  1. If you are familiar with JUnit tests in Java, you can write the test as a JUnit test case. We add some triples to two named graphs (named models), check the size of the result and remove some triples.
  2. If you do not want to use JUnit, you can simply put the same code into a main function.

5. Tips for developing with Jena TDB

  1. After initializing the TDB store, you will find a file called nodes.dat in your TDB storage folder. There you can check whether your triples were inserted. Of course this gets complicated with a bigger graph, but the file is kept mostly in plain text, so make use of the search function of your editor.
  2. If you delete triples and wonder why they are still kept in nodes.dat but do not show up when reading via the API, this is a consequence of the TDB architecture.

6. TDB architecture

TDB uses a node table which maps RDF nodes to 64-bit integer ids and the other way around. The 64-bit integer ids are used to create indexes. The indexes allow database scans, which are required to process SPARQL queries.

When new data is added, the TDB store adds entries to the node table and to the indexes. Removing data only affects the indexes. Therefore the node table grows continuously, even if data is removed.

You might think that is a terrible way to store data, but there are good reasons to do so:

  1. The integer ids contain file offsets. In order to accelerate inserts, the node table is a sequential file. Since the id encodes the offset, the id-to-node lookup is a fast read at that position. If data were deleted from the node table, all file offsets would have to be recalculated and rewritten.
  2. When data is deleted, we do not know how often a node is still used without scanning the complete database. Consequently, we do not know which node table entries could be deleted. A workaround would add complexity and slow down update and delete operations.

Anyway, in our experience the majority of operations on a triple store are inserts and reads. If you ever run out of disk space, you can read the whole affected graph, store it from scratch and delete the original one. Of course, depending on the size, this may slow down the triple store as well.
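
The behaviour described in this section can be illustrated with a toy model in plain Java. This is only a sketch under simplifying assumptions and not TDB's actual implementation: the class and method names are hypothetical, the position in an in-memory list stands in for the file offset, and a single set of id triples stands in for the indexes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model: append-only node table plus one index of id triples. */
class NodeTableSketch {
    // The list position stands in for the file offset encoded in a node id.
    private final List<String> nodeTable = new ArrayList<>();
    private final Map<String, Long> nodeToId = new HashMap<>();
    // One (subject, predicate, object) index of ids; a real store keeps several orderings.
    private final Set<List<Long>> index = new HashSet<>();

    /** Returns the id of a node, appending it to the node table if it is new. */
    private long idFor(String node) {
        Long id = nodeToId.get(node);
        if (id == null) {
            id = (long) nodeTable.size();
            nodeTable.add(node);
            nodeToId.put(node, id);
        }
        return id;
    }

    private String node(long id) {
        return nodeTable.get((int) id);
    }

    /** Adding a triple touches the node table and the index. */
    void add(String s, String p, String o) {
        index.add(Arrays.asList(idFor(s), idFor(p), idFor(o)));
    }

    /** Removing a triple touches only the index; node table entries stay. */
    void remove(String s, String p, String o) {
        Long si = nodeToId.get(s);
        Long pi = nodeToId.get(p);
        Long oi = nodeToId.get(o);
        if (si != null && pi != null && oi != null) {
            index.remove(Arrays.asList(si, pi, oi));
        }
    }

    int tripleCount() {
        return index.size();
    }

    int nodeTableSize() {
        return nodeTable.size();
    }

    /** Rebuilds the store from the live triples, dropping unused nodes. */
    NodeTableSketch compact() {
        NodeTableSketch fresh = new NodeTableSketch();
        for (List<Long> t : index) {
            fresh.add(node(t.get(0)), node(t.get(1)), node(t.get(2)));
        }
        return fresh;
    }
}
```

Adding the triples (alice, knows, bob) and (alice, knows, carol) creates four node table entries. Removing the second triple leaves one triple in the index but still four nodes in the table. Only `compact()`, which mirrors the idea of reading the graph and storing it from scratch, brings the node table down to the three nodes that are still in use.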

If you have questions or problems, feel free to comment.
