Large Datasets – Stuff I’m Learning

I usually prioritize code maintainability over performance, but for the first time, I’m experiencing the other side of the coin.

I currently work on software that’s expected to handle at least 350k rows by 50 columns, sanitize the data, “join” several datasets together, and perform calculations on that data. The data is frequently transformed similarly to a spreadsheet. Each transformation is expected to finish in a reasonable amount of time (seconds).

Here are a few of the hurdles I’ve run into:

1.) Memory limitations. Importing a 100MB CSV file into a JSON object can easily balloon to gigabytes of memory because each “row” object repeats every header name as a key.

There’s no simple way around that, but you can at least alleviate some of the overhead during conversion by using read/write streams and processing each line as it comes in. (Otherwise you’d need to hold the whole original object in memory while building a modified copy on the side, so before garbage collection kicks in you might briefly have two giant objects in memory.)

I like pure functions and immutability. Unfortunately, because they can’t have side effects, they tend to require copying the object on every change.

2.) Loops. When sanitizing big data, you’ll run loops over every item, checking conditions and modifying objects. Computers are fast nowadays, but this is one area where it’s easy to destroy performance. When iterating over 17.5 million pieces of data (350k rows × 50 columns), we had to be careful not to stuff too much logic into the “hot loops”; one extra instruction per iteration can easily double the processing time. Giving the loop a cheap reason to skip an iteration early is a great way to improve this area, and caching, where applicable, can be a lifesaver too. We also opted for Maps over arrays for lookups, since a Map lookup is roughly constant time while scanning an array takes up to n steps.
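The Map-over-array idea looks roughly like this sketch: build an index once, then do each “join” lookup as a hash probe instead of an O(n) array scan. The field names (`id`, `region`, `total`) are made up for the example.

```javascript
// Index an array of rows by one of its fields, once up front.
function indexBy(rows, key) {
  const index = new Map();
  for (const row of rows) index.set(row[key], row);
  return index;
}

const left = [{ id: 1, region: 'east' }, { id: 2, region: 'west' }];
const right = [{ id: 1, total: 10 }, { id: 2, total: 20 }];

// One O(n) pass to build the index...
const rightById = indexBy(right, 'id');

// ...then each join lookup is a single Map.get, not a full array scan.
const joined = left.map((row) => ({ ...row, ...rightById.get(row.id) }));
```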

3.) Database inserts shouldn’t be huge. Using MongoDB, we tried to insert giant documents larger than the 16MB BSON document limit. Mongo didn’t like it (which is ironic, since the name “Mongo” is derived from “humongous”). So we had to do it the “right” way: break the document into numerous parts and insert them individually.
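One way to do that split, as a sketch: slice the oversized document’s rows into part-documents that each stay safely under the limit and carry a reference back to the parent. The `parentId`/`part` field names and the rows-per-part count are assumptions, not the post’s actual schema.

```javascript
// Split one oversized document's rows into several part-documents,
// each linked back to the original via parentId and a part index.
function splitIntoParts(doc, rowsPerPart) {
  const parts = [];
  for (let i = 0; i < doc.rows.length; i += rowsPerPart) {
    parts.push({
      parentId: doc.id,
      part: parts.length,
      rows: doc.rows.slice(i, i + rowsPerPart),
    });
  }
  return parts;
}
```

Each part can then be inserted as its own document, and the full dataset reassembled by querying on `parentId` and sorting by `part`.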

4.) Database inserts can be ludicrously slow… or blazingly fast. With MongoDB, the traditional way to insert a document is to instantiate the model and call save() on it. That’s fine for most occasions, but not when inserting hundreds of thousands of rows at a time. In that case, MyModel.collection.insert(arrayOfDocs) is the way to go: it sends everything over the wire at once and skips schema validation. BUT that array must be no more than 16MB in size (even though the documents are inserted individually) because… ♬ you can’t always get what you want ♬. So I needed to chop the huge array of docs into chunks of 10,000, on the assumption that a given 10k chunk won’t exceed 16MB, and insert those chunks. It’s still pretty fast, and so far it has worked out.
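The chunked insert described above can be sketched like this. The 10,000 chunk size matches the post; `MyModel` stands in for whatever Mongoose model you’re inserting into, and the per-chunk size assumption still has to hold for your data.

```javascript
// Break an array into fixed-size chunks.
function chunk(docs, size) {
  const chunks = [];
  for (let i = 0; i < docs.length; i += size) {
    chunks.push(docs.slice(i, i + size));
  }
  return chunks;
}

// Insert one chunk at a time so no single payload approaches 16MB.
async function bulkInsert(MyModel, docs, size = 10000) {
  for (const batch of chunk(docs, size)) {
    // Goes straight to the driver: one round trip per batch,
    // bypassing Mongoose schema validation.
    await MyModel.collection.insert(batch);
  }
}
```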

It feels like I’ve traveled back to the ’90s, when you actually had to pay attention to memory consumption and find clever ways to work around software and hardware limitations.
