Processing Large XML Documents with PHP 5

  • Christian Stocker

There are different ways to process XML Documents in PHP 5. One can process them with SimpleXML, SAX, XMLReader or DOM and they all have their pro and cons (See my “XML in PHP 5” workshop slides for more details about them). But when it comes to large XML documents, the choices look quite limited.

Therefore I did some benchmark testing with the different extensions. The XML document is approx. 10MB big and consists of a lot of blog-entries from Planet PHP. The to-be-solved excercise was to get the title of the entry with the ID 4365. Not that much of a complicated task and with more complicated questions,

the results may differ.

The results (as text file) were actually not that surprising. SAX and XMLReader were very low on memory usage, but slower than with DOM/XPath. Here's a chart of the initial

results (parsing the full document).

But if we assume, there's only one ID = 4365, then we don't have to process the full document and we can stop after the first one (aka FO or firstonly in the results) was found. As this entry is in the first 10% in our example, the results are quite different. To no

surprise. With this approach and some luck with the entries order, we can cut down the processing time considerably, which is not possible with the DOM approach. There, it's all-or-nothing.

In the result charts you maybe also recognized the option “Expand” and “Expand & SimpleXML”. I added a new method to XMLReader this weekend called “expand()” (it's in CVS now). With this method, you can convert a node catched with XMLReader to a DOMElement. See also the libxml2 page for more information. This can be very useful, if you want to do DOM Operations on only a little part of a huge XML document. With the “Expand” script, we expand the node matching ID = 4365 with XMLReader and then apply an XPath operation on it. As you can see, it needs some lines

of code (the expand() method only returns a node, but we need a document for XPath), but after that, we can use every XPath expression and DOM Method we want. Even convert it to SimpleXML, as we do in the “Expand & SimpleXML” script. It's maybe a little bit useless in this case, as we don't save a lot of coding or time, but if your subtrees

are more complex or you want to build a new XML document, this can be quite useful. The time and memory used is approx. the same as with the plain XMLReader script (no surprise, since most of the time is spent in traversing the XML document and not parsing the subtree).

I also did some benchmarks with XSLT ( the chart). First I did the traditional method with loading the whole XML document into memory and then transform it. Time and memory used is more or less the same as with plain DOM processing, which is no surprise, since the

task this script has to do is almost the same as we did with the XPath stuff. But it gets interesting with the expand() feature of XMLReader. As we just want to transform the one entry, we search for it with XMLReader, create a DOMElement, resp. a DOMDocument and feed only that to the XSLT processor. This saves a lot of memory and scales very well

on the memory side. It takes longer time-wise (if you parse the full document, but that's the worst case scenario anyway), but if your XML documents are really huge (more than your available RAM for example), then this (or other XMLReader approaches) is the only feasible solution, IMHO.

To sum up: XMLReader is a powerfull extension to parse large XML documents, it's usually much faster than SAX (twice as fast), while still scaling without problems on the memory side. With the expand() method, it's now also possible to mix the features of DOM/SimpleXML/XSLT with XMLReader, if you only have to process parts of an XML

document.

Here are the scripts for reference:

SAX

XMLReader

Expand

Expand & SimpleXML

DOM & XPath

XSLT

XSLT w/ XMLReader


Tell us what you think