Korelogic Limited

Written by:

James

January 9, 2024

One of our projects at Korelogic is a global data platform that has a lot of moving parts. One of them is ingesting XML data from a third party.

As a predominantly JavaScript engineer, this at first gave me cold sweats. You mean... it's not in JSON? I trembled.

The reason I was working on this was to modernise some of the code and add a new feature. This was exciting because it meant the new code would have the benefits of using TypeScript and Prisma.
The existing code parsed the XML using xml2js, which led to code like:

and, well, I didn't really like it. I am a very firm believer in 'Assume Positive Intent', so that's fine - this code has been running in production for a long time with minimal problems - so the things I don't like about it are not problems just because I don't like them. But I think there are a few issues I’d like to address.

The first is that this is running in an AWS Lambda - converting an XML file to JSON is not free in terms of GB-seconds. Also, it can be a bit of a pain to set up and test Lambdas, even with Infrastructure as Code. This one is actually solved by the way these changes make the system more Well-Architected - the Lambda now only writes a row to a database table, and the business logic runs regularly over any rows in this table and gets the relevant file from the S3 bucket itself. Then if it fails, the row will still be there.
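The 'row stays until the work succeeds' idea can be sketched roughly as below. This is an in-memory illustration only: `PendingFile`, `processPending` and the `handle` callback are hypothetical names, and the real job would presumably read its rows through Prisma and fetch the files from S3 asynchronously.

```typescript
// Hypothetical row shape: one row per file the Lambda has recorded.
interface PendingFile {
  id: number;
  s3Key: string;
  attempts: number;
}

// Run the handler for every pending row and return the rows that are still
// pending afterwards. A row is dropped only when its handler succeeds, so a
// failure leaves it in place for the next scheduled run.
function processPending(
  rows: PendingFile[],
  handle: (row: PendingFile) => void, // assumed to fetch and ingest the file
): PendingFile[] {
  const remaining: PendingFile[] = [];
  for (const row of rows) {
    try {
      handle(row);
    } catch {
      // Keep the row and record the failed attempt.
      remaining.push({ ...row, attempts: row.attempts + 1 });
    }
  }
  return remaining;
}
```

Keeping the handler as a parameter also makes this kind of loop easy to unit test without touching AWS at all.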

The biggest problem is all the legwork. There is a lot of data in these files, and there's a lot more code in this style, and most of it is concerned with avoiding 'Cannot read property of undefined' type errors. You can use the super handy get function from the venerable Lodash library to avoid these errors, but it can still get messy when mapping over arrays.
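To give a feel for that legwork, here is a made-up fragment in the shape xml2js produces with its default options: child elements become arrays, attributes live under `$` and text under `_`. The `EventData` and `Stat` names are borrowed from the XPath expression later in this post; the `Feed` root, the sample value and both functions are assumptions for illustration, not the original code.

```typescript
// Types for the assumed xml2js-style output.
interface ParsedStat {
  $?: { Type?: string };
  _?: string;
}
interface ParsedFeed {
  Feed?: { EventData?: Array<{ Stat?: ParsedStat[] }> };
}

// Unguarded access: throws a TypeError as soon as one level is missing.
function venueUnsafe(feed: any): string {
  return feed.Feed.EventData[0].Stat[0]._;
}

// Guarded access: every level may be absent, so every step needs a check.
function venueFromParsed(feed: ParsedFeed): string | null {
  const stats = feed.Feed?.EventData?.[0]?.Stat ?? [];
  const venue = stats.find((stat) => stat.$?.Type === "Venue");
  return venue?._ ?? null;
}

// Made-up sample data.
const sampleFeed: ParsedFeed = {
  Feed: {
    EventData: [{ Stat: [{ $: { Type: "Venue" }, _: "Example Stadium" }] }],
  },
};
```

Optional chaining stands in for Lodash's get here; either way, the guards have to be repeated for every field that is read.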

So, I decided to look at better XML parsing solutions. I found XPath. I became a developer after SOAP and XML were really popular, so I hadn't really used it before, but it's essentially an XML query language.

It's actually built into the browser, but this code is running on Node.js, which doesn't have it natively, so I needed a library - I settled on xpath, which was very popular on npm, and xmldom, which was recommended in the README of xpath. Other libraries are available!

The equivalent of the code above came out as:

Although, for completeness, I should add that the helper function, which is called many times, is mainly there to handle the TypeScript involved in making sure we get a string or a null value. I wrote it quickly and it can definitely be improved!
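A helper of that kind might look something like the minimal sketch below. It is not the helper from the original code: `toStringOrNull` and `TextLike` are hypothetical names, and the input is typed as `unknown` on the assumption that an XPath evaluation can hand back a string, a single node or a list of nodes depending on the expression.

```typescript
// Minimal structural type for the part of a DOM node we care about.
interface TextLike {
  textContent: string | null;
}

function isTextLike(value: unknown): value is TextLike {
  return typeof value === "object" && value !== null && "textContent" in value;
}

// Normalise whatever a selector returned into a trimmed string, or null
// when there is nothing usable (no match, empty text, unexpected type).
function toStringOrNull(value: unknown): string | null {
  // For a list of matches, only the first one is considered.
  const first: unknown = Array.isArray(value) ? value[0] : value;
  const text =
    typeof first === "string"
      ? first
      : isTextLike(first)
        ? first.textContent
        : null;
  const trimmed = text?.trim() ?? "";
  return trimmed === "" ? null : trimmed;
}
```

Funnelling every lookup through one function like this keeps the `string | null` contract in a single place instead of scattering type checks across the extraction code.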

When I wrote unit tests for this process (another thing I was able to add as part of these changes 😁) I had a lot of tests like 'will not throw an error if an XML node is not present' - and I couldn't make them fail! I realised this is because of how XPath and XML querying work - it's more akin to CSS selectors - if nothing matches, then nothing matches, and it doesn't throw an error.

In the end I had an extract function, which the code above is from, which is a pure function that returns either the data or null - and an ingest function which takes the object that extract outputs and then decides whether the nulls mean it is ingestible or not (some data can be null, some is essential). This meant everything was nicely typed, and also tested. It's also more readable - I mean, you need to understand what the XPath expression //EventData/Stat[@Type='Venue'] means, and also the structure of the XML data to some extent. But it's not too hard to see what it's doing - and knowing your data is kind of essential anyway.
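The extract/ingest split could be sketched along these lines. Only the Venue expression comes from this post; `Select`, `ExtractedEvent`, `IngestDecision`, the `EventId` stat and the rule that it is essential are all assumptions made up for the example. The selector is passed in so that the sketch runs without any XML library; in real code it would presumably wrap the xpath package and the string-or-null helper.

```typescript
// A selector takes an XPath expression and returns a string or null.
type Select = (expression: string) => string | null;

interface ExtractedEvent {
  eventId: string | null; // hypothetical essential field
  venue: string | null; // hypothetical optional field
}

// Pure: no I/O and no throwing; missing data simply becomes null.
function extract(select: Select): ExtractedEvent {
  return {
    eventId: select("//EventData/Stat[@Type='EventId']"),
    venue: select("//EventData/Stat[@Type='Venue']"),
  };
}

type IngestDecision =
  | { ok: true; event: { eventId: string; venue: string | null } }
  | { ok: false; reason: string };

// Decides which nulls are acceptable and which make the data unusable.
function ingest(extracted: ExtractedEvent): IngestDecision {
  if (extracted.eventId === null) {
    return { ok: false, reason: "eventId is required" };
  }
  return {
    ok: true,
    event: { eventId: extracted.eventId, venue: extracted.venue },
  };
}
```

Because extract never throws, the tests for it only need to check values, and the 'is this ingestible?' rules live in one function that can be tested with plain objects.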

My hope is that this code is much more maintainable - and easily extendable, because there's lots more events data to do interesting things with!
