Full text search in SPARQL using Apache Lucene

In this tutorial we explain how to perform a full-text search in SPARQL using Apache Lucene and Apache Jena's text module (jena-text). Lucene is a fast text search engine and can be used to index the full text stored in RDF triples.

The example code is available on GitHub.

1. Create Maven project

We recommend using Maven to resolve JAR dependencies automatically. Otherwise you have to download and integrate many JARs manually. Create a Maven project and add the following dependencies to your pom.xml:

We use Apache Jena 2.11.0. With Jena 2.13.0 the code in this tutorial fails with an unknown class exception; we will resolve this in another post.

2. Example data

We use a small Turtle file called data.ttl to create a Lucene index and perform a simple text search. We store it in the root folder of the project.

3. Code example explained

We start by creating an indexed dataset:

If you set the parameter tdbPath or lucenePath to null, the corresponding store will be non-persistent and kept only in memory. The indexedProperty is the property pointing to the full text you want to index and query. With respect to our data we use:

Remember to use the full URI rather than a prefixed name such as ta:hasLongText. Prefixed names belong to a different syntax, used for example in Turtle files.
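
To illustrate the difference between a prefixed name and a full URI, here is a minimal sketch that expands one into the other. The class PrefixExpander and the namespace http://example.org/ta# are hypothetical; they do not come from the original example code, so use the namespace that your own data.ttl declares for the ta prefix.

```java
import java.util.Map;

/** Expands a prefixed name such as "ta:hasLongText" into a full URI string. */
final class PrefixExpander {

    /**
     * @param name     a prefixed name ("prefix:local") or an already absolute URI
     * @param prefixes prefix-to-namespace mapping, as declared in a Turtle file
     */
    static String expand(String name, Map<String, String> prefixes) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Not a prefixed name: " + name);
        }
        String namespace = prefixes.get(name.substring(0, colon));
        // Unknown prefix (for example "http"): assume the name already is a full URI.
        return namespace == null ? name : namespace + name.substring(colon + 1);
    }

    public static void main(String[] args) {
        // Hypothetical namespace; replace it with the @prefix declaration of your data.
        Map<String, String> prefixes = Map.of("ta", "http://example.org/ta#");
        System.out.println(expand("ta:hasLongText", prefixes));
    }
}
```

The expanded string is what you would pass as the indexed property; the abbreviated form is only meaningful to a parser that knows the prefix declarations.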

Now we load the data.ttl into the created dataset:

If you used persistent storage, you should see the specified folders being created and filled with data.

Finally we can query the loaded data with the following code:

As you can see, we want to search for the term “wonderful” in our dataset.
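
A jena-text search is expressed in SPARQL through the text:query property function from the namespace http://jena.apache.org/text#. As a rough sketch of how such a query string could be assembled in Java (the class TextQueryBuilder, the variable names ?s and ?text, and the example.org property URI are assumptions for illustration, not the original code of this post):

```java
/** Builds SPARQL SELECT queries that use the jena-text property function text:query. */
final class TextQueryBuilder {

    private static final String TEXT_NS = "http://jena.apache.org/text#";

    /** Query for all subjects whose indexed property matches the search term. */
    static String build(String propertyUri, String term) {
        return build(propertyUri, term, -1);
    }

    /** Same as above; a positive limit is appended as an extra element of the argument list. */
    static String build(String propertyUri, String term, int limit) {
        // Escape backslashes and single quotes so the term stays one SPARQL string literal.
        String escaped = term.replace("\\", "\\\\").replace("'", "\\'");
        String args = "<" + propertyUri + "> '" + escaped + "'" + (limit > 0 ? " " + limit : "");
        return "PREFIX text: <" + TEXT_NS + ">\n"
             + "SELECT ?s ?text WHERE {\n"
             + "  ?s text:query (" + args + ") .\n"
             + "  ?s <" + propertyUri + "> ?text .\n"
             + "}";
    }

    public static void main(String[] args) {
        // Hypothetical property URI; use the full URI of your indexed property.
        System.out.println(build("http://example.org/ta#hasLongText", "wonderful"));
    }
}
```

The first triple pattern asks the Lucene index for matching subjects, and the second one fetches the indexed text for each hit. The three-argument variant relies on the optional result limit that the jena-text documentation describes for text:query.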

4. Complete example

We put the functions above into a working code example:

After running this example you should see the following console output:

The result consists only of triples containing the search term “wonderful”.

5. Additional information

  1. You can limit the returned results with:
  2. You can use wildcards in the search term: “?” matches exactly one character and “*” matches zero or more characters. Note that “+” is not a wildcard in Lucene's query syntax; it marks a term as required.
  3. Why not just use REGEX filters? Lucene is a fast search engine, and an index lookup is much faster than applying a REGEX filter to every triple.
  4. You can analyze the Lucene index (if stored persistently) using luke-with-depth.jar. This can give you additional information if you have problems with your index. Download the JAR file and start it from the command line:

    There you can see how many terms are indexed, how frequently they occur, etc.

    Lucene Index View
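
To make the points about wildcards and index lookups more concrete, the following toy inverted index shows the general idea: every literal is tokenized once, and a search only touches the list of distinct terms instead of scanning each literal with a regular expression. This is a simplified sketch under our own assumptions (the class TinyTextIndex, its tokenizer and its wildcard translation are hypothetical), not how Lucene is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Toy inverted index: maps each lower-cased term to the ids of the literals containing it. */
final class TinyTextIndex {

    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> literals = new ArrayList<>();

    /** Tokenizes the literal once at indexing time and records every term. */
    void add(String literal) {
        int id = literals.size();
        literals.add(literal);
        for (String term : literal.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Translates a wildcard term ("?" = one character, "*" = zero or more) into a regex. */
    static Pattern toRegex(String wildcardTerm) {
        StringBuilder regex = new StringBuilder();
        StringBuilder plain = new StringBuilder();
        for (char c : wildcardTerm.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '?' || c == '*') {
                if (plain.length() > 0) {
                    regex.append(Pattern.quote(plain.toString()));
                    plain.setLength(0);
                }
                regex.append(c == '?' ? "." : ".*");
            } else {
                plain.append(c);
            }
        }
        if (plain.length() > 0) {
            regex.append(Pattern.quote(plain.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the ids of matching literals; only distinct terms are tested, never full literals. */
    Set<Integer> search(String wildcardTerm) {
        Set<Integer> hits = new TreeSet<>();
        if (wildcardTerm.indexOf('?') < 0 && wildcardTerm.indexOf('*') < 0) {
            // Plain term: a single map lookup.
            hits.addAll(postings.getOrDefault(wildcardTerm.toLowerCase(Locale.ROOT), Set.of()));
            return hits;
        }
        Pattern pattern = toRegex(wildcardTerm);
        for (Map.Entry<String, Set<Integer>> entry : postings.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                hits.addAll(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyTextIndex index = new TinyTextIndex();
        index.add("What a wonderful day");   // made-up sample literals
        index.add("Nothing special here");
        System.out.println(index.search("wonder*"));
    }
}
```

A REGEX filter has to run against every literal on every query, whereas the index pays the tokenization cost once and answers a plain term with a single lookup.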

If you have questions or problems, feel free to comment and ask.
