OrientDB Index Performance Evaluation

In this tutorial we evaluate the OrientDB index performance using transactional and non-transactional graphs. The evaluation tool uses OrientDB version 2.2.17 and TinkerPop 2.6. We create an index on a vertex property, insert multiple test vertices, commit the data and then read a specific element. You can download the project on GitHub.

Let us start off with the pom.xml file for the required OrientDB dependencies.

The OrientDB implementation offers an in-memory, a persistent local and a persistent remote database. The OrientDB server is only required for remote access. If you use OrientDB in-memory, you can remove the remote access dependencies from the pom.xml file (orientdb-client, orientdb-enterprise).

Furthermore, the OrientDB TinkerPop Blueprints implementation lets you obtain two kinds of graph instances from a graph factory: a transactional and a non-transactional graph. The factory either produces one connection at a time or keeps a connection pool in which each connection can be recycled and reused. We ran the performance evaluation with both transactional and non-transactional graphs, using the local persistent database stored on the hard disk.

We create the property index remotely via the Java driver, so no administrative interaction with OrientDB is required.

The methods above create a case-insensitive (CI) and a case-sensitive unique hash index. Both forbid duplicates and are therefore suited to indexing unique id or email properties. First we check whether the vertex class is already registered, then whether the index has already been created.
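
To make the difference between the two variants concrete, here is a minimal plain-Java sketch. It deliberately does not use the OrientDB API: the class UniqueIndexSketch and its methods are hypothetical, and it assumes that a case-insensitive collation behaves like folding every key to lower case before it is hashed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a unique hash index; not part of OrientDB. */
final class UniqueIndexSketch {
    private final Map<String, Long> entries = new HashMap<>();
    private final boolean caseInsensitive;

    UniqueIndexSketch(boolean caseInsensitive) {
        this.caseInsensitive = caseInsensitive;
    }

    // Assumption: a CI collation is modelled as lower-casing the key.
    private String normalize(String key) {
        return caseInsensitive ? key.toLowerCase(Locale.ROOT) : key;
    }

    /** Stores key -> vertexId and rejects duplicates, as a unique index does. */
    void put(String key, long vertexId) {
        String normalized = normalize(key);
        if (entries.containsKey(normalized)) {
            throw new IllegalStateException("Duplicate key: " + key);
        }
        entries.put(normalized, vertexId);
    }

    /** Returns the vertex id stored for the key, or null if there is none. */
    Long get(String key) {
        return entries.get(normalize(key));
    }
}
```

In the case-insensitive variant, "Alice@Example.com" and "alice@example.com" collide and the second insert is rejected; in the case-sensitive variant both are stored as distinct keys.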

Be careful with setMandatory(true). You have to pass the properties directly when creating the vertex rather than adding them via setProperty(…) afterwards, as shown in the main code below.

There are several other hash index options:

  • NOTUNIQUE_HASH_INDEX – duplicates are allowed here
  • FULLTEXT_HASH_INDEX – index based on any single word in the property; useful for searching in unstructured data
  • DICTIONARY_HASH_INDEX – similar to the UNIQUE_HASH_INDEX but duplicate keys are replaced with the latest entry

Check the OrientDB documentation to get an overview of the remaining index configurations.
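
The duplicate-handling rules of the unique, non-unique and dictionary variants can be illustrated with a small plain-Java model. This is only a sketch of the semantics described above, not OrientDB code: the class IndexSemanticsSketch, the Mode enum and the whitespace-based word splitting in putWords are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of duplicate handling; not the OrientDB implementation. */
final class IndexSemanticsSketch {
    enum Mode { UNIQUE, NOTUNIQUE, DICTIONARY }

    private final Mode mode;
    private final Map<String, List<Long>> entries = new HashMap<>();

    IndexSemanticsSketch(Mode mode) {
        this.mode = mode;
    }

    /** Adds key -> vertexId according to the duplicate rule of the mode. */
    void put(String key, long vertexId) {
        List<Long> ids = entries.computeIfAbsent(key, k -> new ArrayList<>());
        if (!ids.isEmpty()) {
            if (mode == Mode.UNIQUE) {
                throw new IllegalStateException("Duplicate key: " + key);
            }
            if (mode == Mode.DICTIONARY) {
                ids.clear(); // the latest entry replaces the previous one
            }
        }
        ids.add(vertexId);
    }

    /** Full-text flavour: registers the vertex under every word of the text. */
    void putWords(String text, long vertexId) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                put(word, vertexId);
            }
        }
    }

    /** Returns all vertex ids stored for the key (empty list if none). */
    List<Long> get(String key) {
        return entries.getOrDefault(key, Collections.emptyList());
    }
}
```

putWords only makes sense in NOTUNIQUE mode here, because several vertices usually share the same word.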

Building the index can take some time depending on your hardware and configuration. The graph instances are returned from the graph factory.

The setupPool method is only required if multiple connections are established. We use the non-transactional graph to set up the index. Otherwise you will get a warning telling you to avoid the transactional graph for this step.

The remaining code is basically a for loop with several iterations to average over multiple runs to increase the measurement quality.
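
Such an averaging loop could look roughly like the following sketch. The helper AveragedTimer and its method averageMillis are hypothetical names, not taken from the project, and the workload in main is only a placeholder for the actual insert and query code.

```java
import java.util.function.IntConsumer;

/** Hypothetical timing helper; names are illustrative, not from the project. */
final class AveragedTimer {

    /** Runs the task `runs` times and returns the mean wall-clock time in ms. */
    static double averageMillis(int runs, IntConsumer task) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0L;
        for (int run = 0; run < runs; run++) {
            long start = System.nanoTime();
            task.accept(run); // the run number lets the task vary its test data
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) runs / 1_000_000.0;
    }

    public static void main(String[] args) {
        double average = averageMillis(5, run -> {
            // Placeholder workload; a real benchmark would insert vertices here.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("average ms per run: " + average);
    }
}
```

System.nanoTime() is used instead of System.currentTimeMillis() because it is intended for measuring elapsed time. Note that the first run usually includes JVM warm-up, so discarding it may give steadier numbers.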

We tested on an i7 with 16 GB RAM, inserted one million elements in 5 runs and got the following results for the plocal setting:

[Figure: OrientDBIndexPerformance]

The results are quite interesting. I would have guessed that you use the non-transactional graph for bulk inserts like the one in this tutorial. Furthermore, the example code is single-threaded, so we do not have any concurrent read/write access. That is exactly what the OrientDB documentation proposes:

“In cases such as massive insertions, you may find the standard transactional graph OrientGraph is too slow for your needs. You can speed the operations up by using the non-transactional OrientGraphNoTx graph.

With the non-transactional graph, each operation is atomic and it updates data on each operation. When the method returns, the underlying storage updates. This works best for bulk inserts and massive operations, or for schema definitions.”

My understanding is that the non-transactional graph performs a commit (“updates data on each operation”) with each vertex. The data shows that no time is spent on committing after the insert. I would be glad if anyone could have a look at the setup to confirm that it is not completely wrong, because given these results I would use the transactional graph almost every time.

The only disadvantage I see (apart from concurrent writes) is the higher RAM usage, because the transactions are held in memory until the commit is forced. You can reduce the RAM usage by committing, e.g., every 1000 vertices.
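
The batching idea can be sketched as follows. FakeTxGraph is a hypothetical in-memory stand-in for a transactional graph, introduced only so that the example runs without a database; it is not the OrientDB API. It tracks how many uncommitted elements pile up, which is the quantity that drives the RAM usage.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchCommitSketch {

    /** Hypothetical stand-in for a transactional graph; not the OrientDB API. */
    static final class FakeTxGraph {
        private final List<Integer> pending = new ArrayList<>();
        int committed = 0;   // elements made durable so far
        int commits = 0;     // number of commit calls
        int maxPending = 0;  // peak number of uncommitted elements

        void addVertex(int id) {
            pending.add(id);
            maxPending = Math.max(maxPending, pending.size());
        }

        void commit() {
            committed += pending.size();
            pending.clear();
            commits++;
        }
    }

    /** Inserts `total` vertices and commits after every `batchSize` of them. */
    static void insert(FakeTxGraph graph, int total, int batchSize) {
        for (int i = 1; i <= total; i++) {
            graph.addVertex(i);
            if (i % batchSize == 0) {
                graph.commit();
            }
        }
        if (total % batchSize != 0) {
            graph.commit(); // flush the incomplete last batch
        }
    }
}
```

With a batch size of 1000, at most 1000 elements are pending at any time, regardless of the total. In the real benchmark the flush would presumably be the commit() call on the transactional Blueprints graph.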

We see the expected results for the query time, where the hash-indexed version outperforms the standard query by a factor of 500 to 2500. This difference will increase further as the amount of data grows. I am not sure why the read discrepancy between the non-indexed graphs is that high.
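
Why the gap widens with more data can be seen in a plain-Java analogy: a lookup without an index has to examine the elements one by one, while a hash lookup goes more or less directly to the entry. The class LookupSketch below is hypothetical and works purely in memory, so its timings are not comparable to the disk-based measurements above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LookupSketch {

    /** Linear scan: checks every element in turn, like a query without an index. */
    static int scan(List<String> keys, String wanted) {
        for (int i = 0; i < keys.size(); i++) {
            if (keys.get(i).equals(wanted)) {
                return i;
            }
        }
        return -1;
    }

    /** Builds a key -> position map that plays the role of a hash index. */
    static Map<String, Integer> buildIndex(List<String> keys) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            index.put(keys.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            keys.add("user" + i);
        }
        Map<String, Integer> index = buildIndex(keys);
        String wanted = "user999999"; // last element: worst case for the scan

        long t0 = System.nanoTime();
        int byScan = scan(keys, wanted);
        long t1 = System.nanoTime();
        int byIndex = index.getOrDefault(wanted, -1);
        long t2 = System.nanoTime();

        System.out.println("scan  found position " + byScan + " in " + (t1 - t0) + " ns");
        System.out.println("index found position " + byIndex + " in " + (t2 - t1) + " ns");
    }
}
```

The scan time grows roughly linearly with the number of elements, whereas the hash lookup stays roughly constant, so the ratio between the two keeps growing with the data set.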

This was just a test for the Java API performance. You should use the functionality for bulk uploads via CSV or JSON files for larger imports if possible.

If you run into errors, exceptions or other problems, feel free to comment and ask.
