Splitting big XML files with Apache Camel - Part 2

In my previous blog about splitting big files with Apache Camel, I said we were working on another solution, which is a new camel-stax component. The work is now complete, and the component will be part of the next Apache Camel 2.9.0 release. Thanks to Romain for his contribution.

The stax component
The stax component also lets you split big XML files, but it requires JAXB and StAX. This means you need to define one or more POJO classes with JAXB annotations that bind to the XML schema.
The benefit, however, is that you then work with the POJO classes in Camel.

For example, the records example from the previous blog could be written with camel-stax in the Java DSL by splitting on stax(Record.class) in streaming mode.

Here Record is the POJO class which has the JAXB annotations, and stax is a static import from the class org.apache.camel.component.stax.StAXBuilder.
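
The streaming idea behind the component can be sketched with the JDK's StAX API alone. This is only an illustration under assumptions, not the camel-stax implementation: the Record class, its fields and the forEachRecord helper are hypothetical, and the JAXB binding is replaced by a manual mapping so the sketch runs on a plain JDK.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxRecordSketch {

    /** Hypothetical POJO; with camel-stax this class would carry JAXB annotations instead. */
    static final class Record {
        final String id;
        final String text;

        Record(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Pulls one record element at a time from the stream and hands it to the callback. */
    static void forEachRecord(Reader xml, Consumer<Record> callback) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && "record".equals(reader.getLocalName())) {
                    String id = reader.getAttributeValue(null, "id");
                    // Assumption: a record holds text only; getElementText fails on nested elements.
                    String text = reader.getElementText().trim();
                    callback.accept(new Record(id, text));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\">first</record><record id=\"2\">second</record></records>";
        forEachRecord(new StringReader(xml), r -> System.out.println(r.id + " -> " + r.text));
    }
}
```

Only the record currently being read is held as an object, which is why the memory usage stays flat as the file grows.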

If you are using XML DSL, then consult the camel-stax documentation which has such an example.

A little test
I ran the equivalent test from the previous blog with 40,000 elements, and the memory usage delta was about 8 MB.
The test logs the time: Processed file with 40000 elements in: 55.962 seconds.

Running the test with 200,000 elements results in a memory usage delta of about 9 MB and a test time of 250 seconds.

The unit test is in camel-stax component as the org.apache.camel.component.stax.StAXXPathSplitChoicePerformanceTest class.

Apache Camel 2.8.3 Released

This is just a quick blog entry to say that Apache Camel 2.8.3 has been released. The JIRA tracker has about 60 tickets resolved for this release.


Apache Camel 2.9.0-RC1 Released

The Camel team is working hard on the last pieces for the upcoming Apache Camel 2.9.0 release. In the meantime we decided to cut a release candidate because of some larger changes, such as core API refactorings, Spring dependency changes, and the rewritten simple expression language.

We would highly appreciate any feedback from the community in terms of any upgrade glitches, or other issues discovered in the release candidate.

The release is available for download from Apache, and from the Maven Central repository as well.

For the release notes, we suggest taking a look at the current in-progress release notes for the 2.9.0 release.


Coffee Machine and Camel in Action

Jonathan and I got published at the very end of last year, when Manning announced that the Camel in Action book was available in print. As authors we are entitled to royalties on the sales of the book. Before you ask, we only get pocket change compared to the amount of work we put into the book.

So what's the story with coffee then? Well, as I have a home office I have to make my own coffee, as there are no fancy coffee machines around that I can use. So for years I have been living off regular filter, instant or French press coffee.

I made a promise to myself that I would buy a coffee machine when I got my royalty cheque. Well, the cheque arrived a month ago.

There are a lot of different types and brands of coffee machines. So I spent a while reading the web and watching YouTube reviews of the machines. I also got advice from Johan Edstrom, who has a machine.

As I do not want to be my own barista, I was looking for a fully automatic machine. At first I had my eyes on a machine from Gaggia, but recently I spotted a new machine from Jura. I wanted a small machine, as the daily use would be at my home office for a single person; our dog doesn't drink coffee, and my wife only drinks coffee on weekends. The latest Jura ENA 9 Micro seemed like a great machine. It's small and compact, fully automatic, and has a nozzle to take in milk directly from the carton (or from a thermal bottle). So basically it's a one-push-button machine for making coffee, espresso, cappuccino or latte.

So last Wednesday I put in the order, and this morning the machine arrived at my doorstep.

My new coffee machine
The machine costs about $1300, which would have been out of my normal price range. Of course I could take the cash out of my regular pay cheque, but I wanted to stick to what was affordable from the royalties. That would also justify going for a higher priced model.

If I have to guess how many hours I have spent working on the book, then 1300 hours would be a good guess. So that's $1 per hour. And that's before tax.

Just wanted to share this with the readers, so you know how I spend my royalties.

Thank you, the readers, for making my coffee a pleasure to drink from this day forward.


Splitting big XML files with Apache Camel

In the upcoming Apache Camel 2.9 we have improved the support for splitting big XML files in streaming mode with a very low memory footprint.

Previous versions, and the examples provided on the Camel website, often used XPath to split XML files with the Splitter EIP.

Unfortunately the underlying XPath framework does not support an iterator-based result, as it is limited to the types defined by the JDK in XPathConstants. That means NODESET is used as the result type, which causes the XPath framework to return a NodeList instance that holds the entire XML payload in memory. There is nothing you can do about this, even if you use StAXSource, SAXSource or other stream types as input to the XPathExpression. Regardless of the input, it returns a NodeList as the result.
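
The limitation is visible with the plain JDK XPath API. The following is a minimal sketch, with a hypothetical class and helper name, of an evaluation that asks for XPathConstants.NODESET and gets back a NodeList with every matching node already materialized:

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathNodeSetSketch {

    /** Evaluates the expression and returns all matching nodes at once. */
    static NodeList selectRecords(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // NODESET is the only multi-node result type; there is no iterator variant.
            return (NodeList) xpath.evaluate(
                    "/records/record",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        NodeList nodes = selectRecords("<records><record id=\"1\"/><record id=\"2\"/></records>");
        System.out.println(nodes.getLength()); // prints 2
    }
}
```

The whole result is available only after the evaluation has finished, so a splitter built on top of it cannot hand out the first record before the last one has been read.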

Tokenizer solution
So the Camel team has two solutions in the works. The first is already implemented in the upcoming 2.9 release. It is based on the tokenizer language, which supports a stream-based iterator. This means we can split any big file piece by piece without loading the entire content into memory. So I enhanced the tokenizer to support two additional modes:

  • pair
  • xml

pair mode
The pair mode is to be used when you need to grab the content piece by piece and you have known start and end tokens that denote a record. For example, if you have [START] and [END] markers in the content:

  .split(body().tokenizePair("[START]", "[END]")).streaming()
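
To make the idea concrete, here is a rough sketch in plain Java of what splitting by a token pair means. It is not Camel's implementation: the class and method names are hypothetical, and the sketch drops the markers from each piece, which is an assumption rather than a statement about what tokenizePair returns.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class PairSplitSketch {

    /** Returns the text between the next start/end pair, or null when the input is exhausted. */
    static String nextPair(Reader in, String start, String end) {
        StringBuilder buf = new StringBuilder();
        boolean inside = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                buf.append((char) c);
                if (!inside) {
                    if (endsWith(buf, start)) {
                        buf.setLength(0);       // the record begins after the start token
                        inside = true;
                    } else if (buf.length() > start.length()) {
                        buf.deleteCharAt(0);    // keep only enough text to spot the start token
                    }
                } else if (endsWith(buf, end)) {
                    buf.setLength(buf.length() - end.length());
                    return buf.toString();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null;
    }

    private static boolean endsWith(StringBuilder buf, String token) {
        int from = buf.length() - token.length();
        return from >= 0 && buf.indexOf(token, from) == from;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("junk[START]one[END]\n[START]two[END]");
        List<String> parts = new ArrayList<>();
        for (String p = nextPair(in, "[START]", "[END]"); p != null; p = nextPair(in, "[START]", "[END]")) {
            parts.add(p);
        }
        System.out.println(parts); // prints [one, two]
    }
}
```

Each call holds at most one record in memory. For a real file you would wrap the input in a BufferedReader so the character-by-character reads stay cheap.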

xml mode
The pair mode was used as the foundation for the xml mode as well, as the idea is similar. Here you define a child tag name as the record to grab. For example, to split by the record tag, you pass record as the tag name to tokenizeXML.


The XML content may look like
<records>
  <record id="1">
    <!-- record stuff here -->
  </record>
  <record id="2">
    <!-- record stuff here -->
  </record>
  <record id="99999">
    <!-- record stuff here -->
  </record>
</records>

Now what about namespaces? Suppose you have a common namespace in the parent/root tag as follows:
<records xmlns="http://acme.com/records">


Then you can instruct the tokenizeXML to inherit namespaces from a parent/root tag by providing the name of the tag as the 2nd parameter as shown:

  .split(body().tokenizeXML("record", "records")).streaming()

This means each split message includes the namespace:
<record id="1" xmlns="http://acme.com/records">
  <!-- record stuff here -->
</record>
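
The effect of inheriting namespaces can be illustrated with a small in-memory sketch. This is only an illustration of the resulting output, not how Camel does it: the class and method names are hypothetical, it works on a String rather than a stream, and it assumes records are not written as self-closing tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InheritNamespaceSketch {

    /** Splits by tag and copies the xmlns declarations of the parent tag onto each piece. */
    static List<String> split(String xml, String tag, String parentTag) {
        // Collect the namespace declarations from the parent start tag.
        StringBuilder namespaces = new StringBuilder();
        Matcher parent = Pattern.compile("<" + Pattern.quote(parentTag) + "(\\s[^>]*)?>").matcher(xml);
        if (parent.find() && parent.group(1) != null) {
            Matcher decl = Pattern.compile("\\sxmlns(:\\w+)?=\"[^\"]*\"").matcher(parent.group(1));
            while (decl.find()) {
                namespaces.append(decl.group());
            }
        }

        // Grab each record and insert the declarations at the end of its start tag.
        Pattern record = Pattern.compile(
                "(<" + Pattern.quote(tag) + ")((?:\\s[^>]*)?)(>.*?</" + Pattern.quote(tag) + ">)",
                Pattern.DOTALL);
        List<String> pieces = new ArrayList<>();
        Matcher m = record.matcher(xml);
        while (m.find()) {
            pieces.add(m.group(1) + m.group(2) + namespaces + m.group(3));
        }
        return pieces;
    }

    public static void main(String[] args) {
        String xml = "<records xmlns=\"http://acme.com/records\">"
                + "<record id=\"1\">a</record><record id=\"2\">b</record></records>";
        for (String piece : split(xml, "record", "records")) {
            System.out.println(piece);
        }
    }
}
```

Each piece is then a standalone XML fragment that still resolves against the namespace declared on the root tag.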

What I like about the tokenizer is that it is fully stream based and returns data as String content, which means there are no intermediate DOM or POJO objects or anything like that. It can therefore split any kind of XML payload without having any model of it in the Java code. So if you just need to split the big XML file and send each split message to a JMS queue, as in these examples, then that is fast, as there is no unnecessary marshaling to and from objects.

A little test
I ran a little test on my laptop to process 40,000 records, and the memory usage delta was about 4 MB.
The test logs the time: Processed file with 40000 elements in: 53.676 seconds

Running the same test with XPath renders a memory usage delta of about 100 MB.
The test is in fact a little faster: Processed file with 40000 elements in: 49.941 seconds

The reason is that after all content is loaded into memory, it is pure CPU processing, whereas the tokenizer loads the content from disk piece by piece.

That was just a small XML file with 40,000 records and a file size of about 7 MB.
Now imagine if the XML file was 500 MB in size with a million records. XPath will be very slow and will most likely cause an OutOfMemoryError (OOME) on your server.

The unit tests are in camel-core, and you can play with them in the src/test/org/apache/camel/language directory.

What about the other solution?
It is a community effort together with Romain, who has created a Camel StAX component with a stream-based iterator as well. However, it requires a POJO model that has been annotated with JAXB. We will continue working on this and have his work contributed to the Apache Camel distribution.