Jackrabbit and its two SQL languages – some findings

  • Christian Stocker

PHPCR is an important technology for us at Liip. The most mature content storage implementation to be used with PHPCR is Jackrabbit, which we for example use on liip.ch.

Jackrabbit provides – among many other features – a pretty advanced search engine based on Lucene. It has different Query Languages, called SQL, SQL2 and XPath (from different generations of the JCR specs). That two of them are named SQL often leads to missunderstands how Jackrabbit works.

This article is about those and the differences between them. We had to learn the hard way in some of our projects that they're not handled the same deep inside of Jackrabbit. The documentation is unfortunetaly pretty sparse in that regard, I hope this blogpost helps a little bit for clearing that up. (We won't talk a lot about XPath, but it has the same behaviour as the older SQL implementation in Jackrabbit).

But first some background:

Jackrabbit isn't a traditional relational database like MySQL or PostgreSQL. In fact they don't have very much in common. It's more a NoSQL database (which happens to use a SQL dialect for searching) or a Key-Value Store or a Document-Database than a RDBMS. Jackrabbit can use a RDBMS for storing its data, usually we use MySQL or PostgreSQL. But it's only used to store the nodes in a quite flat fashion. It even does this compressed and somehow optimized, so that if you look at the content of those tables, it won't make much sense. Jackrabbit doesn't really need a RDBMS for that. It also can store to a file system or other “flat” storage layers. Adobe itself uses for example a Tar-file in their commercial Jackrabbit-based CRX. What this all means: it doesn't really matter, where Jackrabbit stores its data and it's not related to what features are available in the API facing to the developers using Jackrabbit.

This means that because the storage layer is completely hidden away and the data is stored quite flat and non-relational, it's not possible to search this data directly on the storage layer. That is where the Lucene index comes into play. When we save something in Jackrabbit, it gets stored in the storage layer, but also indexed by Lucene. Later, when we search for something, that Lucene layer is used to search all the nodes which met the search criteria and then the matched nodes are fetched from the storage layer (if not already in the cache).

Jackrabbit can't directly use the result from the Lucene results, it has for example to check if you're allowed to access that node via its ACL methods. That's also a reason why there's no fast, reliable COUNT() in Jackrabbit. What happens here is that Jackrabbit fetches all the node-ids from the Lucene index (this one is fast) and then checks for each node, if you can access it. This can take quite some time if you have a large result set. It's no problem for small resultsets, no matter how big your total data is, Lucene is quite good in that.

And what has this all to do with all that SQL1 and SQL2 talk in the beginning?

There are two major JCR (Java Content Repository) Specifications: 1.0 (JSR-170 from 2005) and 2.0 (JSR-283 from 2009). In JCR 1.0 there were 2 query languages defined. XPath and SQL. XPath was the main one (it made sense for a hierarchical DB structure which is quite similar to one big XML document) and an SQL dialect was retrieved from that. It was called SQL, because it had the well known “SELECT FROM WHERE *** ORDER BY” syntax, but it has nothing to do with relational databases. It also has no relation to the ANSI SQL1 from 1986/1989 (depending how you look at it, but it's indeed better known as SQL-89), it wasn't even called SQL1 in JSR-170, just SQL, but from now on I will refer to it as SQL1 (or maybe better JCR-SQL1 to make it clear that this is not the ANSI SQL1 from 1986, but if I ever talk of only SQL1, it's JCR-SQL1 not ANSI SQL1). JCR-SQL1 didn't have JOIN capabilities or anything else “fancy” (like LENGTH).

Those 2 query languages were somehow poorly defined in the specs and lacked features, so the JCR people went and built something compeletly new for JCR 2.0: An Abstract Query Model (AQM) which clearly defined the semantics of a search. For this AQM two concrete language bindings are specified: JCR-SQL2 and JCR-JQOM (JCR Java Query Object Model, a mapping of the AQM in Java objects and methods). Again, that SQL2 has nothing to do with 1992's ANSI SQL2 (aka SQL-92), they just share some of the same syntax.

That new query model in JCR 2.0 added much more features, mainly JOINs. But this also made it much more complex to implement and to map those searches efficiently and performant onto Lucene queries (more about that below).

The XPath syntax was ditched in JCR 2.0, because people understood the SQL syntax way better than the XPath syntax (it's still available in Jackrabbit)

JCR-SQL2 is supported in PHP's Jackalope since almost the begining. The QOM since a few months. We didn't use the QOM in our projects until recently, but SQL2 only

As said above, due to the possible more complex nature of JCR-SQL2 queries, Jackrabbit has to do more and can't do everything on the Lucene indexes. For example due to JOINs, doing a proper ordering is much more complicated. This is again mainly a problem if you have large resultsets, but then it makes them really slow (like for example: Give me the 10 latest articles out of 10'000's of articles). Unfortunately Jackrabbit does it the slow way also if you don't use JOINs at all.

What we found out while analyzing this is that the older query engine for JCR-SQL1 and JCR-XPath is much more mature and does indeed queries like “Give me the 10 latest articles” pretty fast, even if you have 100'000s of articles in Jackrabbit. Due to the simpler nature of those queries and more maturity of the code, those are tuned pretty good. For example sorting happens before fetching the nodes from the storage layer. This fact is unfortunately not well advertised by the Jackrabbit community.

We invested then quite some time to get SQL1 running in Jackalope and make the QOM smart to choose between SQL1 or SQL2 automatically, depending on the complexity of the query.

We also changed some of the handcrafted JCR-SQL2 queries to JCR-SQL1 queries, because we did have queries which could potentially have many results. For example in fulltext queries, if you use common words you easily get many nodes as results and then in JCR-SQL2 all nodes were loaded into memory and then sorted. This was slow and trashed all the internal jackrabbit caches. With JCR-SQL1, Jackrabbit sorts them right in Lucene and only gets the actually top 10 nodes from the storage layer. This is always fast.

That's why it's really important to know when to use SQL1 or SQL2. You can't go wrong with SQL1 and only switch to SQL2 if you need more complex queries. Or just use the QOM, then you don't even have to think about that (but you still should know, what the impact of complex queries is).

I hope JCR-SQL2 queries will be one day as fast as SQL1 queries in Jackrabbit, but this won't be an easy task after investigating a little bit into the code and corresponding Jira issues.

Even with that shortcoming, the good thing is that Jackrabbit does have a built-in powerfull search based on the industry standard Lucene. If you need more flexibility than this solution provides, you have to go the common path with many other setups, look into an external Solr or Elastic Search service (which use Lucene internally as well, but are much more configurable). With the upcoming “Jackrabbit 3” (working title “ Oak”) that should be easy to implement directly on the JCR layer.

Tell us what you think