A series of unfortunate thoughts.: Graph Databases

For a small work project I have been evaluating a variety of graph databases so that I can efficiently model entity resolution. SQL joins are so boring.

Why a Graph Database?

On the face of it, storing and modeling relationships between data is what relational databases are all about, so why would you need a database that specifically models graph structures?

One compelling argument lies in the way you query relational data versus the way you query graph data. SQL works well when querying data with well-defined, node-to-node relationships. For example, if you have a table of employees, it is very easy to create a JOIN that retrieves an employee and their manager, joined on a manager serial number field. This becomes much more difficult if you want to do something as simple as retrieving an employee's entire manager chain (especially in large companies), because the depth of the chain is not known in advance. In SQL this would require either a series of queries or, essentially, a recursive JOIN. It is possible to do, but messy and potentially slow.
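To make the "recursive JOIN" concrete, here is a minimal sketch using a recursive common table expression in SQLite through Python's sqlite3 module. The employees table, its serial and manager_serial columns, and the sample rows are assumptions made up for illustration, not the schema from my project.

```python
import sqlite3


def manager_chain(conn, serial):
    """Return the serials of an employee's managers, nearest manager first."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(serial, manager_serial, depth) AS (
            -- Anchor row: the employee we start from.
            SELECT serial, manager_serial, 0 FROM employees WHERE serial = ?
            UNION ALL
            -- Recursive step: join each row in the chain to its manager's row.
            SELECT e.serial, e.manager_serial, c.depth + 1
            FROM employees AS e
            JOIN chain AS c ON e.serial = c.manager_serial
        )
        SELECT serial FROM chain WHERE depth > 0 ORDER BY depth
        """,
        (serial,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data: dev -> lead -> vp -> ceo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (serial TEXT PRIMARY KEY, manager_serial TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("ceo", None), ("vp", "ceo"), ("lead", "vp"), ("dev", "lead")],
)

print(manager_chain(conn, "dev"))
```

It works, but the traversal logic is buried in the anchor/recursive-step boilerplate, and a cycle in the data would make this version recurse forever unless a depth limit is added.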

This becomes even more problematic as you model domains where each node has many relationships with many different types of nodes. Take, for example, an online store that has social data connected to it. You would have customers, their relationships with each other, their purchases, and relationships between those purchases (such as which items were bought together). Using this data you could provide personalized suggestions, such as "many of your friends have bought this item recently, and also bought it with this other item". In SQL you would need to union multiple joins against the customer's friends, their purchases, and the relationships between the purchases.

With a graph-type query this becomes much simpler: you start at the customer node, fan out to friend nodes up to a certain depth (friends or friends-of-friends), and count up the distinct sets of recent purchases.
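The fan-out described above can be sketched in a few lines of plain Python. This is only an illustration of the traversal idea, not how any of the databases below implement it; the friends and purchases dictionaries and all names in them are made up, and the "recent" filter is left out to keep it short.

```python
from collections import Counter, deque


def friends_within(friends, start, max_depth):
    """Breadth-first fan-out: everyone reachable from start in at most max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in friends.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    seen.discard(start)
    return seen


def suggest(friends, purchases, customer, max_depth=2):
    """Count, per item the customer does not own, how many people in their network bought it."""
    owned = set(purchases.get(customer, ()))
    counts = Counter()
    for person in friends_within(friends, customer, max_depth):
        counts.update(set(purchases.get(person, ())) - owned)
    return counts.most_common()


# Hypothetical sample data.
friends = {"ann": ["bob", "cat"], "bob": ["ann", "dan"], "cat": ["ann"], "dan": ["bob"]}
purchases = {
    "ann": ["kettle"],
    "bob": ["kettle", "teapot"],
    "cat": ["teapot"],
    "dan": ["teapot", "mug"],
}

print(suggest(friends, purchases, "ann"))
```

Changing max_depth from 2 to 1 restricts the suggestion to direct friends, which is exactly the kind of tweak that is painful to express as a stack of joins.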

It is certainly possible to fit a graph query language on top of a relational datastore (such as SPARQL/RDF on top of IBM DB2 or Oracle). Sometimes, however, picking a dedicated tool for the job works better.

Experiments

In typical developer fashion, I used a project as an excuse to play with new tools. In the process I ended up trying three different graph stores that can be used from Java: OrientDB with the TinkerPop Blueprints API, Neo4j using both the web and embedded Java APIs, and Apache Jena's implementation of RDF.

OrientDB with TinkerPop Blueprints

Links - OrientDB, Blueprints Plugin

OrientDB is a fantastic document store that is especially notable for its embedded Java implementation. If you ever need to store and query data in a semi-structured way for an application in-memory or with a local cache, OrientDB is a great way to go.

When it comes to graph storage, things get a little more complicated. It is entirely possible to use the native OrientDB API for graph storage, but this approach makes querying data a little more difficult. To get around this I tried the Blueprints plugin for OrientDB, which gave me a more standardized wrapper for creating and querying nodes in the graph. The idea was that I could switch data stores if OrientDB was not performant enough. Unfortunately the Blueprints plugin for OrientDB is not yet very stable, and this made it very difficult to do things such as indexing nodes based on an attribute. This led to some very ugly code and the use of an external index in Redis.

So I gave up using OrientDB.

Neo4j

Links - Neo4j, Cypher

Neo4j is a popular graph database that benefits from a relatively expansive query language called Cypher that allows for some very complex queries and update operations. Neo4j can be used as a standalone server or as an embedded database within a Java application.

When running as a standalone database server, Neo4j provides a very clean REST API for adding, querying and updating nodes. Unfortunately there are no wrapper APIs written to allow a Java application to access the database remotely, so you must write your own. This is especially problematic since there is no option to use a binary remote API, so communication is slower and more verbose than it needs to be.

Because of this limitation, I ended up using Neo4j as an embedded database within my application. In order to simplify querying, I used a version 2.0 preview release so that I could assign labels to nodes and query them by their label (or "class" essentially). Everything seemed to go very smoothly and writing queries in Cypher proved relatively easy.

Unfortunately I ran into some stability issues in the multi-threaded environment of my application. Eventually I was able to iron these out by upgrading to the next preview milestone and locking down all updates and queries to be single-threaded. This made high-volume updates very slow, but I was able to mitigate this by surrounding the graph storage with a Redis cache that lets me identify data changes and only perform updates when something has actually changed.
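The combination of serialized writes and update-on-change can be sketched roughly as follows. This is not my actual code: the GuardedGraphWriter class and its write_node callback are hypothetical names, and a plain dictionary stands in for the Redis cache so that the example is self-contained.

```python
import hashlib
import json
import threading


class GuardedGraphWriter:
    """Serializes writes behind one lock and skips writes whose data has not changed."""

    def __init__(self, write_node):
        self._write_node = write_node  # hypothetical callback that performs the real graph update
        self._lock = threading.Lock()  # only one thread touches the graph at a time
        self._digests = {}             # stand-in for the Redis cache: key -> last written digest

    def upsert(self, key, properties):
        """Write the node only if its properties differ from the last write; return True if written."""
        payload = json.dumps(properties, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        with self._lock:
            if self._digests.get(key) == digest:
                return False  # unchanged, so skip the expensive graph update
            self._write_node(key, properties)
            self._digests[key] = digest
            return True


# Usage with a fake writer that just records which keys were written.
written = []
writer = GuardedGraphWriter(lambda key, properties: written.append(key))
writer.upsert("customer:1", {"name": "Ann"})
writer.upsert("customer:1", {"name": "Ann"})         # skipped, nothing changed
writer.upsert("customer:1", {"name": "Ann Smith"})   # written again
print(written)
```

Hashing a canonical JSON form keeps the cache entries small; the trade-off is that properties must be JSON-serializable.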

Apache Jena RDF

Links - Jena, RDF, SPARQL

Apache Jena is a Java API that supports editing Resource Description Framework (RDF) data and running graph queries using SPARQL. RDF is a format for describing relationships between objects, rooted in Semantic Web data models. RDF itself is an abstract model of subject-predicate-object triples; its original standard serialization is RDF/XML, so Jena supports reading and writing XML RDF files. In addition to storing to XML files, Jena supports storing to a native triple store called TDB as well as to relational databases.

RDF makes heavy use of URIs/namespaces for identifying nodes. This provides good support for classifying nodes into different types and makes relationships between nodes very explicit and self-documenting. The typing system in RDF consequently feels much more mature than those in Blueprints or Cypher. Another benefit from the use of URIs is that it is very easy to create unique nodes, and providing additional data on the node is as simple as providing a REST service that uses the same URI structure.
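To show what URI-based identity and typing look like in practice, here is a toy triple store in plain Python. It is not Jena's API, only an illustration of the data model: the example.org URIs, the Employee and reportsTo names, and the match helper are all made up, while the rdf:type URI is the standard one.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for this sketch

# Every statement is a (subject, predicate, object) triple of URIs.
triples = {
    (EX + "person/alice", RDF_TYPE, EX + "schema/Employee"),
    (EX + "person/bob", RDF_TYPE, EX + "schema/Employee"),
    (EX + "person/alice", EX + "schema/reportsTo", EX + "person/bob"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples matching the pattern, where None acts as a wildcard."""
    return sorted(
        t
        for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# All nodes typed as Employee.
employees = [t[0] for t in match(triples, p=RDF_TYPE, o=EX + "schema/Employee")]
print(employees)

# Who does alice report to?
managers = [t[2] for t in match(triples, s=EX + "person/alice", p=EX + "schema/reportsTo")]
print(managers)
```

Because the node identifier is the URI itself, adding the same triple twice cannot create a duplicate node, which is the uniqueness property mentioned above.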

For my initial implementation I used Jena with TDB for storage, as it allowed me to quickly prototype my data model. TDB supports multi-threaded applications through the use of thread-local transactions, but because of the complexity of the thread model of my application I ended up synchronizing access through a singleton, so it's a bit slow.

The query language for Jena is SPARQL 1.1 (as of writing). From a pure query standpoint, SPARQL is relatively robust and very well defined (it is a W3C standard). Unfortunately, from a performance standpoint, the complex queries I tried to perform proved difficult to optimize because the syntax abstracts away a lot of the graph-walking process. Where an equivalent Cypher query would take seconds, a SPARQL query would take many minutes, or I would give up and kill the query before any result was returned.

Conclusions

After more testing and trying the latest Milestone 6 of Neo4j, it looks like Neo4j is the winner. I attempted some larger-scale testing in TDB with SPARQL, and it turns out that performance is much worse than I had initially thought. The same query formulated in SPARQL can sometimes run dramatically slower than its Cypher equivalent. Cypher is a bit less abstract in its query structure, so it allows you to avoid these pitfalls. In my tests this was the difference between a matter of seconds in Cypher and a matter of hours in SPARQL.

So, with Neo4j now more stable (thanks to avoiding parallel updates), it is the best option for me.

Wednesday, November 13, 2013
