Full text search in SPARQL using Apache Lucene

In this tutorial we explain how to perform a full text search in SPARQL using Apache Lucene and Apache jena-text. Lucene is a high-performance text search engine that can be used to index the full text stored in RDF triples.

The example code is available on GitHub.

1. Create Maven project

We recommend using Maven to resolve JAR dependencies automatically; otherwise you have to download and integrate many JARs manually. Create a Maven project and add the following dependencies to your pom.xml:

<dependency>
	<groupId>org.apache.jena</groupId>
	<artifactId>apache-jena-libs</artifactId>
	<type>pom</type>
	<version>2.11.0</version>
</dependency>

<dependency>
	<groupId>org.apache.jena</groupId>
	<artifactId>jena-text</artifactId>
	<version>1.0.0</version>
</dependency>

We use Apache Jena 2.11.0. Using Jena 2.13.0 with the following code results in an unknown-class exception; we will resolve this in another post.

2. Example data

We use a small Turtle file called data.ttl to create a Lucene index and perform a simple text search. We store it in the root folder of the project.

@prefix ta: <http://www.tutorialacademy.com/jenatext#> .

ta:subject1 ta:hasLongText "The Tutorial Academy is a wonderful place for tutorials!" .
ta:subject2 ta:hasLongText "The Tutorial Academy offers wonderful tips and tricks for programming!" .
ta:subject3 ta:hasLongText "The Tutorial Academy is great!" .

3. Code example explained

We start off by creating an indexed dataset:

public static Dataset createIndexedDataset( String tdbPath, String lucenePath, String indexedProperty ) 
{
	Dataset graphDS = null;
	
	if( tdbPath == null )
	{
		System.out.println( "Construct an in-memory dataset" );
		graphDS = DatasetFactory.createMem();
	}
	else
	{
		System.out.println( "Construct a persistent TDB based dataset to: " + tdbPath );
		graphDS = TDBFactory.createDataset( tdbPath );
	}

    // Define the index mapping 
    EntityDefinition entDef = new EntityDefinition( "uri", "text", ResourceFactory.createProperty( URI, indexedProperty ) );
    Directory luceneDir = null;
    
    // check for in-memory or file-based (persistent) index
    if( lucenePath == null )
    {
    	System.out.println( "Construct an in-memory lucene index" );
    	luceneDir =  new RAMDirectory();
    }
    else
    {
    	try 
    	{
    		System.out.println( "Construct a persistent lucene index to: " + lucenePath );
			luceneDir = new SimpleFSDirectory( new File( lucenePath ) );
		} 
    	catch (IOException e) 
    	{
			e.printStackTrace();
		}
    }
    
    // Create new indexed dataset: Insert operations are automatically indexed with lucene
    Dataset ds = TextDatasetFactory.createLucene( graphDS, luceneDir, entDef ) ;
    
    return ds ;
}

If you set the parameter tdbPath or lucenePath to null, the dataset or the index is kept in memory only and is not persisted. The indexedProperty is the property pointing to the full text you want to index and query. With respect to our data we use:

<http://www.tutorialacademy.com/jenatext#hasLongText>

Remember to use the full URI here and do not abbreviate it with a prefix like ta:hasLongText; prefixed names are a shorthand syntax used e.g. in Turtle files.
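As a small usage sketch of the method above (assuming createIndexedDataset as defined in this section), both variants look like this:

```java
// Fully in-memory variant: passing null for both paths keeps the
// dataset and the Lucene index in RAM only.
Dataset memDS = createIndexedDataset( null, null, "hasLongText" );

// Persistent variant: the folders "tdb" and "luceneIndex" are created
// in the project root on the first run.
Dataset fileDS = createIndexedDataset( "tdb", "luceneIndex", "hasLongText" );
```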

Now we load the data.ttl into the created dataset:

public static void loadData( Dataset dataset, String file )
{
	System.out.println( "Load data ..." );
    long startTime = System.currentTimeMillis();
    dataset.begin( ReadWrite.WRITE );
    try 
    {
        Model m = dataset.getDefaultModel();
        RDFDataMgr.read(m, file);
        dataset.commit();
    }
    finally
    { 	
    	dataset.end(); 
    }
    
    long finishTime = System.currentTimeMillis() ;
    long time = finishTime - startTime;
    System.out.println( "Loading finished after " + time + "ms" );
}

If you used persistent storage, you should see the specified folders being created and filled with data.

Finally we can query the loaded data with the following code:

    public static void queryData( Dataset dataset )
    {
    	System.out.println("Query data...") ;
        
        String prefix = "PREFIX ta: <" + URI + "> " + 
        				"PREFIX text: <http://jena.apache.org/text#> ";
        
        String query = "SELECT * WHERE " +
        			   "{ ?s text:query (ta:hasLongText 'wonderful') ." + 
        			   "  ?s ta:hasLongText ?text . " +  
        			   " }"; 

        long startTime = System.currentTimeMillis() ;
        
        dataset.begin( ReadWrite.READ ) ;
        try 
        {
            Query q = QueryFactory.create( prefix + query );
            QueryExecution qexec = QueryExecutionFactory.create( q , dataset );
            QueryExecUtils.executeQuery( q, qexec );
        }
        finally 
        {
        	dataset.end() ; 
        }
        
        long finishTime = System.currentTimeMillis();
        
        long time = finishTime - startTime;
        System.out.println( "Query finished after " + time + "ms" );

    }

As you can see, we search for the term “wonderful” in our dataset.
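If you want to process the results yourself instead of printing them with QueryExecUtils, the executeQuery call can be replaced by a manual iteration over the result set; a sketch against the Jena 2.11 API:

```java
// Sketch: iterate over the solutions manually instead of using QueryExecUtils.
ResultSet results = qexec.execSelect();
while( results.hasNext() )
{
    QuerySolution solution = results.next();
    System.out.println( solution.getResource( "s" ) + " -> " + solution.getLiteral( "text" ) );
}
qexec.close();
```

ResultSet and QuerySolution live in com.hp.hpl.jena.query in this Jena version.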

4. Complete example

We put the functions above into a working code example:

package com.tutorialacademy.jena.text;

import java.io.File;
import java.io.IOException;

import org.apache.jena.query.text.EntityDefinition;
import org.apache.jena.query.text.TextDatasetFactory;
import org.apache.jena.query.text.TextQuery;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.SimpleFSDirectory;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.DatasetFactory;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.sparql.util.QueryExecUtils;
import com.hp.hpl.jena.tdb.TDBFactory;

public class JenaTextExample
{
    static String URI = "http://www.tutorialacademy.com/jenatext#";
    
    public static void main(String ... argv)
    {
        TextQuery.init();
        Dataset ds = createIndexedDataset( "tdb", "luceneIndex", "hasLongText" );
        loadData( ds , "data.ttl" );
        queryData( ds );
    }
    
	public static Dataset createIndexedDataset( String tdbPath, String lucenePath, String indexedProperty ) 
	{
		Dataset graphDS = null;
		
		if( tdbPath == null )
		{
			System.out.println( "Construct an in-memory dataset" );
			graphDS = DatasetFactory.createMem();
		}
		else
		{
			System.out.println( "Construct a persistent TDB based dataset to: " + tdbPath );
			graphDS = TDBFactory.createDataset( tdbPath );
		}
	
	    // Define the index mapping 
	    EntityDefinition entDef = new EntityDefinition( "uri", "text", ResourceFactory.createProperty( URI, indexedProperty ) );
	    Directory luceneDir = null;
	    
	    // check for in-memory or file-based (persistent) index
	    if( lucenePath == null )
	    {
	    	System.out.println( "Construct an in-memory lucene index" );
	    	luceneDir =  new RAMDirectory();
	    }
	    else
	    {
	    	try 
	    	{
	    		System.out.println( "Construct a persistent lucene index to: " + lucenePath );
				luceneDir = new SimpleFSDirectory( new File( lucenePath ) );
			} 
	    	catch (IOException e) 
	    	{
				e.printStackTrace();
			}
	    }
	    
	    // Create new indexed dataset: Insert operations are automatically indexed with lucene
	    Dataset ds = TextDatasetFactory.createLucene( graphDS, luceneDir, entDef ) ;
	    
	    return ds ;
	}
    
	public static void loadData( Dataset dataset, String file )
	{
		System.out.println( "Load data ..." );
	    long startTime = System.currentTimeMillis();
	    dataset.begin( ReadWrite.WRITE );
	    try 
	    {
	        Model m = dataset.getDefaultModel();
	        RDFDataMgr.read(m, file);
	        dataset.commit();
	    }
	    finally
	    { 	
	    	dataset.end(); 
	    }
	    
	    long finishTime = System.currentTimeMillis() ;
	    long time = finishTime - startTime;
	    System.out.println( "Loading finished after " + time + "ms" );
	}

    public static void queryData( Dataset dataset )
    {
    	System.out.println("Query data...") ;
        
        String prefix = "PREFIX ta: <" + URI + "> " + 
        				"PREFIX text: <http://jena.apache.org/text#> ";
        
        String query = "SELECT * WHERE " +
        			   "{ ?s text:query (ta:hasLongText 'wonderful') ." + 
        			   "  ?s ta:hasLongText ?text . " +  
        			   " }"; 

        long startTime = System.currentTimeMillis() ;
        
        dataset.begin( ReadWrite.READ ) ;
        try 
        {
            Query q = QueryFactory.create( prefix + query );
            QueryExecution qexec = QueryExecutionFactory.create( q , dataset );
            QueryExecUtils.executeQuery( q, qexec );
        }
        finally 
        {
        	dataset.end() ; 
        }
        
        long finishTime = System.currentTimeMillis();
        
        long time = finishTime - startTime;
        System.out.println( "Query finished after " + time + "ms" );

    }

}

After running this example you should see the following console output:

Construct a persistent TDB based dataset to: tdb
Construct a persistent lucene index to: luceneIndex
Load data ...
Loading finished after 123ms
Query data...
------------------------------------------------------------------------------------------
| s           | text                                                                     |
==========================================================================================
| ta:subject1 | "The Tutorial Academy is a wonderful place for tutorials!"               |
| ta:subject2 | "The Tutorial Academy offers wonderful tips and tricks for programming!" |
------------------------------------------------------------------------------------------
Query finished after 124ms

The result contains only the triples whose text includes the search term “wonderful”.

5. Additional information

  1. You can limit the returned results with:
    ?s text:query (ta:hasLongText 'wonderful' 10) .
  2. You can use wildcard search with “?” (matches exactly one character) and “*” (matches zero or more characters):
    ?s text:query (ta:hasLongText 'won?erful' 10) .
  3. Why not just use REGEX filters? Lucene is a very fast search engine; the index lookup is a lot faster than applying a REGEX filter to every triple.
  4. You can analyze the Lucene index (if stored persistently) using luke-with-depths.jar. This can give you additional information if you have problems with your index. Download the JAR file and start it from the command line:
    java -jar luke-with-depths.jar

    There you can see how many terms are indexed, with what frequency they occur, etc.

    Lucene Index View
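For comparison with point 3, the same search without the text index would be a plain SPARQL FILTER with regex, which has to check every ta:hasLongText triple instead of doing a Lucene index lookup (a sketch using the prefixes defined in queryData):

```java
// Sketch: REGEX alternative to text:query; correct, but it applies the
// filter to every triple instead of using the Lucene index.
String regexQuery = "SELECT * WHERE " +
                    "{ ?s ta:hasLongText ?text . " +
                    "  FILTER regex( ?text, 'wonderful', 'i' ) " +
                    "}";
```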

If you have questions or problems, feel free to comment and ask.
