2011-11-04

Splitting big XML files with Apache Camel

In the upcoming Apache Camel 2.9 we have improved the support for splitting big XML files using streaming, with a very low memory footprint.

Previous versions, and the examples provided on the Camel website, often used XPath to split XML files with the Splitter EIP.

Unfortunately, the underlying XPath framework does not support an iterator-based result, as it is limited to the types defined by the JDK in XPathConstants. That means NODESET has to be used as the result type, which causes the XPath framework to return a NodeList instance that holds the entire XML payload in memory. There is nothing you can do about this, even if you use StAXSource, SAXSource or other stream types as input to the XPathExpression. Regardless of the input, it returns a NodeList as the result.
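
To make the limitation concrete, here is a minimal sketch using only the JDK's javax.xml.xpath API. The class and method names are made up for this illustration and it is not how Camel wires up XPath internally; it only shows that NODESET hands back a NodeList that is already complete when you receive it.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only, not part of Camel.
final class XPathNodeSetSketch {

    // Evaluates /records/record and returns the fully materialized NodeList.
    static NodeList selectRecords(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        try {
            // NODESET is the only multi-node result type in XPathConstants;
            // there is no iterator or streaming variant to ask for.
            return (NodeList) xpath.evaluate("/records/record", source, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\"/><record id=\"2\"/><record id=\"3\"/></records>";
        NodeList records = selectRecords(xml);
        // The length is known immediately because every node is already in memory.
        System.out.println("records in memory: " + records.getLength());
    }
}
```

With three records this is harmless, but the same call on a very large file has to build every matching node before the first one can be processed.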

Tokenizer solution
So the Camel team has two solutions in the works. The first is already implemented in the upcoming 2.9 release. It's based on the tokenizer language, which supports an iterator-based, streaming result. This means we can split any big file piece by piece without loading the entire content into memory. So I enhanced the tokenizer to support two additional modes:

  • pair
  • xml

pair mode
The pair mode is to be used when you need to grab the content piece by piece and you have known start and end tokens that denote a record. For example, if you have [START] and [END] markers in the content.


from("file:inbox")
  .split(body().tokenizePair("[START]", "[END]")).streaming()
    .to("activemq:record");
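
To illustrate the idea behind pair mode, here is a rough sketch of an iterator that reads from a Reader and hands out one record at a time. It is not Camel's implementation: the class name PairSplitSketch is hypothetical, and returning the text between the markers without the markers themselves is a choice made for this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration of token-pair splitting, not Camel code.
final class PairSplitSketch implements Iterator<String> {
    private final Reader reader;
    private final String start;
    private final String end;
    private String next;

    PairSplitSketch(Reader reader, String start, String end) {
        this.reader = reader;
        this.start = start;
        this.end = end;
        this.next = readNext();
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        next = readNext(); // read ahead by exactly one record
        return current;
    }

    // Returns the text between the next start/end pair, or null at end of input.
    private String readNext() {
        try {
            StringBuilder buffer = new StringBuilder();
            // Skip everything up to and including the start marker.
            if (!readUntil(start, buffer, true)) {
                return null;
            }
            buffer.setLength(0);
            // Collect the record body up to and including the end marker.
            if (!readUntil(end, buffer, false)) {
                return null;
            }
            return buffer.substring(0, buffer.length() - end.length());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads until the buffer ends with the token; false means end of input.
    private boolean readUntil(String token, StringBuilder buffer, boolean discard) throws IOException {
        int c;
        while ((c = reader.read()) != -1) {
            buffer.append((char) c);
            if (discard && buffer.length() > token.length()) {
                buffer.deleteCharAt(0); // keep only a window as wide as the token
            }
            if (buffer.length() >= token.length()
                    && buffer.indexOf(token, buffer.length() - token.length()) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String content = "junk[START]a[END]\n[START]b[END]";
        Iterator<String> records = new PairSplitSketch(new StringReader(content), "[START]", "[END]");
        while (records.hasNext()) {
            System.out.println(records.next());
        }
    }
}
```

Only the record currently being read is held in memory, which is the property that matters for big files. For a real file you would wrap the input in a BufferedReader, since the sketch reads one character at a time.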


xml mode
The pair mode was used as the foundation for the xml mode as well, as the idea is similar. You define a child tag name as the record to grab. For example, to split by the <record> tag, do as follows:
from("file:inbox")
  .split(body().tokenizeXML("record")).streaming()
    .to("activemq:record");

The XML content may look like
<records>
  <record id="1">
    <!-- record stuff here -->
  </record>

  <record id="2">
    <!-- record stuff here -->
  </record>
  ...
  <record id="99999">
    <!-- record stuff here -->
  </record>
</records>
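
As a rough picture of what splitting by tag name involves, the sketch below pulls each <record> element out of a Reader with java.util.Scanner and a regular expression. This is an assumption-laden illustration, not Camel's tokenizer: the class name is hypothetical, and it does not handle self-closing tags or records nested inside records.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical illustration of splitting by tag name, not Camel code.
final class XmlTagSplitSketch {

    // Matches <tag ...> ... </tag>, non-greedy, across line breaks.
    static Pattern recordPattern(String tag) {
        String name = Pattern.quote(tag);
        return Pattern.compile("<" + name + "(\\s[^>]*)?>.*?</" + name + ">", Pattern.DOTALL);
    }

    // Hands each matching element to the handler as a String; returns the count.
    static int forEachRecord(Reader input, String tag, Consumer<String> handler) {
        Pattern pattern = recordPattern(tag);
        int count = 0;
        try (Scanner scanner = new Scanner(input)) {
            String record;
            // A horizon of 0 means "search forward without a limit".
            while ((record = scanner.findWithinHorizon(pattern, 0)) != null) {
                handler.accept(record);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<records>\n"
                + "  <record id=\"1\">first</record>\n"
                + "  <record id=\"2\">second</record>\n"
                + "</records>";
        int count = forEachRecord(new StringReader(xml), "record", System.out::println);
        System.out.println("split into " + count + " records");
    }
}
```

Each record arrives as plain String content, so nothing about the XML structure needs to be modelled in Java. Note that the pattern requires whitespace or ">" right after the tag name, so the <records> root tag is not mistaken for a record.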

Now what about namespaces? Suppose you have a common namespace in the parent/root tag as follows:
<records xmlns="http://acme.com/records">

  ...
</records>

Then you can instruct tokenizeXML to inherit namespaces from a parent/root tag by providing the name of that tag as the 2nd parameter, as shown:
from("file:inbox")
  .split(body().tokenizeXML("record", "records")).streaming()
    .to("activemq:record");


This means each split message will have the namespace included:
<record id="1" xmlns="http://acme.com/records">
  <!-- record stuff here -->
</record>
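
One way to picture the inheritance step is as a small string operation: collect the xmlns declarations from the parent's opening tag and add them to the opening tag of each record. The sketch below does that with a regular expression. The names are hypothetical and this is not taken from the Camel source; it also assumes no ">" character appears inside an attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of namespace inheritance, not Camel code.
final class NamespaceInheritSketch {

    // xmlns="..." or xmlns:prefix="...", with single or double quotes.
    private static final Pattern NS =
            Pattern.compile("xmlns(:[\\w.-]+)?\\s*=\\s*(\"[^\"]*\"|'[^']*')");

    // Copies namespace declarations from the parent's opening tag into the record.
    static String inherit(String parentOpenTag, String record) {
        int close = record.indexOf('>');
        if (close < 0) {
            return record; // no opening tag found, leave the text untouched
        }
        String childOpenTag = record.substring(0, close);
        StringBuilder declarations = new StringBuilder();
        Matcher m = NS.matcher(parentOpenTag);
        while (m.find()) {
            String name = "xmlns" + (m.group(1) == null ? "" : m.group(1));
            // Skip declarations that the record already makes for itself.
            if (!childOpenTag.contains(name + "=")) {
                declarations.append(' ').append(m.group());
            }
        }
        // For a self-closing tag, insert before the trailing slash.
        int insertAt = (close > 0 && record.charAt(close - 1) == '/') ? close - 1 : close;
        return record.substring(0, insertAt) + declarations + record.substring(insertAt);
    }

    public static void main(String[] args) {
        String parent = "<records xmlns=\"http://acme.com/records\">";
        String record = "<record id=\"1\">\n  <!-- record stuff here -->\n</record>";
        System.out.println(inherit(parent, record));
    }
}
```

Because the pattern matches one declaration at a time, it does not matter in this sketch whether the parent's declarations sit on one line or are spread over several.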



What I like about the tokenizer is that it is fully stream based and returns data as String content, which means there are no intermediate DOM or POJO objects or anything like that. This means it can split any kind of XML payload without having any model of it in the Java code. So if you just need, as in these examples, to split a big XML file and send each split message to a JMS queue, then that is fast, as there is no unnecessary to/from object marshalling.


A little test
I ran a little test on my laptop to process 40,000 records, and the memory usage delta was about 4 MB.
The test logs the time: Processed file with 40000 elements in: 53.676 seconds

Running the same test with XPath renders a memory usage delta of about 100 MB.
The test is in fact a little faster: Processed file with 40000 elements in: 49.941 seconds

The reason is that once all the content is loaded into memory, it is pure CPU processing, whereas the tokenizer loads the content from disk piece by piece.

That was just a small XML file with 40,000 records and a file size of about 7 MB.
Now imagine the XML file was 500 MB in size with a million records. XPath will be very slow and will most likely cause an OutOfMemoryError on your server.

The unit tests are in camel-core, and you can play with them in the src/test/org/apache/camel/language directory.

What about the other solution?
It's a community effort together with Romain, who has created a Camel StAX component with a stream-based iterator as well. However, it requires a JAXB-annotated POJO model to be used. We will continue working on this and have his work contributed to the Apache Camel distribution.
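
For readers who have not used StAX, the sketch below shows the general pull-parsing pattern with the JDK's javax.xml.stream API. It is not the Camel StAX component: the class name is hypothetical, and instead of unmarshalling each record into a JAXB-annotated POJO it only reads the id attribute, so that it runs without any extra libraries.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of StAX pull parsing, not the Camel StAX component.
final class StaxRecordSketch {

    // Walks the stream and passes the id attribute of each matching element on.
    static int forEachRecordId(Reader input, String tag, Consumer<String> handler) {
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(input);
            int count = 0;
            try {
                while (xml.hasNext()) {
                    // next() advances one event at a time; no tree is built.
                    if (xml.next() == XMLStreamConstants.START_ELEMENT
                            && tag.equals(xml.getLocalName())) {
                        // A JAXB unmarshal of the element could go here instead.
                        handler.accept(xml.getAttributeValue(null, "id"));
                        count++;
                    }
                }
            } finally {
                xml.close();
            }
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\"/><record id=\"2\"/></records>";
        int count = forEachRecordId(new StringReader(xml), "record",
                id -> System.out.println("record id " + id));
        System.out.println("visited " + count + " records");
    }
}
```

Unlike the tokenizer, a StAX parser understands the XML structure, which is what makes mapping each record to a typed object possible.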


10 comments:

steven said...

Great! Is this available in 2.9.0-RC1? I can't seem to find it.

Claus Ibsen said...

No, it was added after the RC1 release. We are actually cutting the 2.9.0 GA release today. Then Apache has a voting period, where you can download the RC of the GA and give it a spin. And if no issues are found during the vote period, it's graced as a GA release.

Otherwise keep an eye on the Apache site, as we will announce when 2.9 is GA. And I will most likely blog about that as well.

Claus Ibsen said...

Oh and I wrote a part-2 blog entry as well. As we have another way of splitting big XML files.

http://davsclaus.blogspot.com/2011/11/splitting-big-xml-files-with-apache_24.html

steven said...

Ok, thanks. Looking forward to the new release!

baran elis said...

hi claus,

when I was trying to split a big xml file using split().tokenizeXML("child", "parent").streaming() I encountered a strange problem by chance.

I want my child tag to inherit parent's namespaces. When the parent has multiple namespace declarations and when these declarations are separated by newlines child cannot inherit namespaces. When they are on the same line it works fine.

Claus Ibsen said...

Baran.

Ah, that sounds like a little bug. Do you mind reporting a ticket at Apache? There is a link to the issue tracker from this page: http://camel.apache.org/support
The issue tracker is where you report bugs.

baran elis said...

yep, I will do so. thanks for your help.

Anonymous said...

but the code will not compile. At least not in 2.9.2; there is no body().tokenizeXML()?

from("file:inbox")

.split(body().tokenizeXML("record", "records")).streaming()
.to("activemq:record");

Anonymous said...

Hi Claus,

There's a typo by the java DSL here, as it should read:

split().tokenizeXML("record", "records").streaming().to("activemq:record");

Instead of:

split(body().tokenizeXML("record", "records")).streaming().to("activemq:record");

Which baffled the previous reader as well as me :-)

Babak

Kavita Laddha said...

Hi Claus,

Thanks for this very useful post! I had a question: is there a way we can catch/handle a scenario where the tag given in the tokenizer is not in the file which Camel is trying to split? The file in the example can be an XML file.

I tried to search in the forums, but couldn't find anything. I will appreciate your input and advice on this. Thanks for looking at my post.