JS Optimizations: Non-blocking Code

In the world of JS, everything we do should ideally be non-blocking. That matters because your app runs on a single thread, and if that thread gets blocked… the entire app gets blocked.

In my work project, we have some number crunching logic in which a single function may run for several minutes. These functions don’t do any form of I/O, just data manipulation using loops. The problem is, standard loops are blocking. Whether it’s a for loop, map, filter or reduce, it’s going to block.

In order to unblock the code, we’d need to 1.) use recursion and 2.) use the setImmediate function to give the event loop time to breathe and process other pending callbacks.

Here’s an example of some blocking code:
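
(A minimal sketch of the situation; longFunc, the loop bound, and the timer are all illustrative.)

```javascript
// an illustrative sketch: a long-running synchronous loop starves the timer
const longFunc = () => {
  let total = 0
  let i = 0
  while (i < 5e9) { // simulate minutes of pure number crunching
    total += i
    i++
  }
  return total
}

const timer = setTimeout(() => {
  console.log('this never gets a chance to run')
}, 0)

longFunc()          // blocks the one and only thread
clearTimeout(timer) // the timer is cleared before its callback could ever fire
```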

The timer callback will never run because longFunc() blocks the app, then immediately clears the timer.

A quick fix would be to make the longFunc an async function and add setImmediate inside the while loop like so:
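
(Again a sketch; wrapping setImmediate in a Promise is one way to await it.)

```javascript
// yield to the event loop inside the while loop
// (in practice you'd probably only yield every N iterations)
const longFunc = async () => {
  let total = 0
  let i = 0
  while (i < 5e9) {
    total += i
    i++
    // postpone the next iteration so pending callbacks can run
    await new Promise((resolve) => setImmediate(resolve))
  }
  return total
}
```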

That would unblock the loop by postponing the next iteration until the next tick, allowing the app to do other stuff. But that only works for a while loop, not for forEach, map, filter, etc.

A more versatile solution is this:
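
(A sketch of the pattern; processRow stands in for whatever small unit of work you have.)

```javascript
// an inner `loop` function recurses via setImmediate
// processRow is a placeholder for your own small, atomic unit of work
const processAllRows = (rows) => {
  return new Promise((resolve) => {
    let i = 0
    const loop = () => {
      if (i < rows.length) {
        processRow(rows[i]) // the blocking logic goes here, one small piece at a time
        i++
        setImmediate(loop)  // schedule the next iteration on the next tick
      } else {
        resolve()
      }
    }
    loop()
  })
}
```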

The non-blocking function has a recursive inner function called loop that invokes itself while a condition is true. Your blocking logic goes inside the conditional statement, before setImmediate. In our case, we process a single row of a table inside the loop. Any small, atomic operation would be suitable. The performance penalty seems negligible in practice.

Another benefit of using setImmediate is that it prevents exceeding the maximum call stack size. Now if we could figure out how to parallelize the logic across multiple threads without copying the data, that would be amazing.

Jenkins CI: Pains and Gains

Update 09/24/18: Jenkins sucks for our needs. Authentication is cumbersome and it has become a mountain of hacks in order to handle multi-container docker configurations. We’re better off creating our own custom CI solution.

---

Original post:

This year has been packed with some genuine full-stack goodness.

Setting up continuous integration is frustrating at first, but incredibly rewarding once a working pipeline is in place. A year ago, I set up a pipeline to automate Android builds with pretty version numbers, signed for release, and easily downloadable by colleagues for distribution. All it took was a push to the master branch.

For this current project, we use Jenkins. We run it in Docker, so there’s a tad bit of dockerception going on. The Jenkins container runs with the host’s docker socket mounted as a volume so that the docker client inside the container can send commands to the host’s docker daemon.

Now my task was to get some tests running for a Node server and a containerized MongoDB instance. The Node part was easy. But adding an extra container meant that 1.) the ports would need to be configured for inter-container communication, 2.) the lifetime of the container(s) would need to be managed (read: cleaned up) between jobs, and 3.) there would need to be caching for node_modules and test datasets.

The port issue was rather simple: I just needed to apply the --network host argument to the Node container so that it would have access to all ports on the host, notably the MongoDB port. Though there is a way to explicitly bind containers by name, that’s an optimization for later.

The next issue was container lifetime. The Node container was spawned by Jenkins and thus destroyed by Jenkins whenever the job was stopped. That’s good. But the MongoDB container was spawned manually, so I’d need to take ownership of it and remove it at some point. To do so, I grepped the output of docker ps -a for containers whose names contained the substring “mongo”, used awk to extract the ID column, then removed them.

Things get really funky when you’re trying to run certain commands in the Jenkins config because quotes and some special characters need to be escaped, double escaped, and probably escaped some more. But if you somehow miraculously get all your commands running, it’s a sweet situation.

The last issue was caching. At first I thought I’d need to cache by a key which would be the checksum/md5 hash of the package.json or something. Fortunately there was an even simpler solution: docker volumes. I just needed to mount a host cache directory to certain project directories. I did it for our test data as well as the yarn cache directory. I tried doing the same for the node_modules directories, but there was unexpected behavior. Caching just the yarn cache folder worked fine. This solution is preferred over Jenkins’ Job Cacher plugin.

My overall impression of Jenkins on an in-house machine is that it’s considerably more flexible than a 3rd-party service like CircleCI. With Digital Ocean, we can easily scale up the memory or CPU power. Jenkins is less abstract than other solutions and behaves exactly how it says it will. And lastly, the new Blue Ocean UI is sick.

pongz

A friend and I made Pong in Javascript using HTML5 canvas. I then thought that it would be interesting to try it again and flex my C++ muscles.

The Github project: https://github.com/ryanbennettvoid/pongz

I didn’t plan on doing low-level pixel rendering, nor do I ever plan to in the near or distant future. So I chose the SFML library, which is a beautiful concoction of simple abstractions for managing windows, input, graphics, audio, networking and more.

Installing the lib was as easy as sudo apt-get install libsfml-dev since I use Mint Linux for development.

CMake. The first thing I wanted to do was get the CMake config in check. What I found interesting about including SFML in the project was that the program binaries had to be compiled first, then the SFML libs would be linked afterwards. I suck with compilers and stuff, so I normally try to stick with header-only libraries. Nonetheless, I eventually cracked the code:

Architecture. Although my buddy and I managed to recreate Pong without classes, I would have great anxiety if I were to expand upon the code we wrote. So I made classes that made sense for the game.

Game is the high-level God object that glues everything together.

Ball is… the ball. And Player represents both the human player and the computer player.

Tickable is a common interface for all rendered objects. A class that inherits Tickable must implement the tick method.

Passed into the tick method are the Game object, the frame number, an Event object for tracking input, and a Window object for rendering.

There is a strange thing happening here: a void pointer for the Game object. Why not just set the type as Game? Because the Game object itself is Tickable (a self-referencing class must use a pointer), and for whatever reason, the compiler didn’t want me to use a Game* pointer either. Yep, my solution is dangerous and scary and stuff… but it works when I cast it back.

I then needed what other languages call “instanceof”. But it looks like C++ doesn’t have a straight answer to “how do you figure out the type of an object”. So I made all the objects inherit from an abstract class called Object that forced the implementation of a toString() method, so that each object could spit out a human-readable indicator of its type. That came in handy when I needed to get all the players, for example.

I still needed to use dynamic_cast to at least be sure that an object was actually an… Object. Is this the best way to do the job? Probably not. Actually, I’m pretty certain it’s not. But it worked.

Logic. The physics of Pong are surprisingly simple. The speed of the ball pretty much doesn’t change, only the direction does. And the direction only changes when the ball collides with something. How do we define a “collision”? There’s a moment when the ball is about to go off-screen. At that moment, I invert the X or Y direction and it works out.
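
In JS-ish pseudocode (not the actual C++ from the repo), the idea is roughly:

```javascript
// illustrative sketch of the bounce logic
const tickBall = (ball, screen) => {
  // invert direction when the ball is about to leave the screen
  if (ball.x <= 0 || ball.x + ball.size >= screen.width) {
    ball.dirX = -ball.dirX
  }
  if (ball.y <= 0 || ball.y + ball.size >= screen.height) {
    ball.dirY = -ball.dirY
  }
  ball.x += ball.dirX * ball.speed
  ball.y += ball.dirY * ball.speed
}
```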

There’s some slightly more complicated (and admittedly buggy) code for taking into consideration the player positions too.

Conclusion. I didn’t want to spend too much time on this project- just see what I could knock out in a few hours.

One thing I didn’t get around to doing was making the animations framerate-independent. When you just render everything as fast as possible, things can speed up and slow down depending on performance. What I’d do is have a global “delta” value which would be the duration between frames, and all displacements would be multiplied by that value. A bigger delta means bigger movements and vice-versa. In the end, the game speed would be independent from the framerate.
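
Sketched out (the ball object and its fields are just illustrative):

```javascript
// framerate-independent movement using a delta value
let lastTime = Date.now()

const onFrame = (ball) => {
  const now = Date.now()
  const delta = (now - lastTime) / 1000 // seconds elapsed since the last frame
  lastTime = now

  // displacements are scaled by delta, so speed becomes units-per-second
  ball.x += ball.dirX * ball.speed * delta
  ball.y += ball.dirY * ball.speed * delta
}
```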

I really dig the SFML API. The renderable objects inherit from the sf::Transformable class, which has all the goodies you need to get/set the size, position, rotation of an object and perform boundary operations such as contains/intersects. I didn’t use the methods nearly as much as I could have, but next time I will just have my own classes inherit from the SFML classes directly.

From an architectural standpoint, I’ll try to avoid void* pointers and probably pointers in general, in favor of the ugly-but-safe unique_ptr and such. References are preferred.

Stylistically, I would normally place my starting curly braces on the same line as the parameters and space everything out more, but I went for a more mainstream style that I often see in other C++ code bases.

I’m a pretty big advocate of TDD nowadays, but I’m not too familiar with the TDD tools/ecosystem for C++. Something for later.

Trying Out Computer Vision

I’ve wanted to start working with computer vision for the last few years, but for one reason or another, didn’t pursue it. Now I am.

Why computer vision? Because the ability to turn pixels into something meaningful is a major stride towards smarter software. Identifying objects visually has a wide range of useful applications.

I’ve seen countless videos on YouTube about using OpenCV for motion tracking, face detection, and other awesome stuff. The code didn’t seem very verbose or difficult to understand.

So I started by installing OpenCV. Although I don’t like polluting my host system with installed libraries/dependencies and would rather use Docker to handle compilations, I didn’t want to add complexity to something I had never done before, so I just followed the introductory instructions to install OpenCV on Linux.

After building the first demo project successfully, I felt that the first hurdle was over. I could now start building my own computer vision program!

The first step was to get some sort of image source for input. It could be loaded from an image or video file (which is great for reproducing results in tests), but that’s also kinda boring, so I used my webcam stream instead. How hard could it be? Not hard at all.

Ludicrously simple.

Then I checked out some documentation to see how I could apply filters to the image.

End result:

Let’s see what I can cook up after further experimentation!

I used to hate TDD…

Nowadays, I’d never build a production app without Test Driven Development.

The reason I love TDD boils down to one fundamental belief: if you can’t precisely describe the expected behavior of your system, you have no business touching the code.

With TDD, you first describe how the system works. Then you run the test (which should initially fail so that you can rule out false-positives), then you modify the code until all the tests pass.

In my early days before implementing TDD, I’d often spend hours staring at the screen, perplexed. My mind would go blank because I didn’t know where one task began and another ended.

How do you know when a task is finished? Well with TDD it’s simple: your test passes all the assertions. But without TDD, you’d need to manually test out each piece of code with each possible input, and it’s easy to fudge that up because we’re humans and we’re impatient.

Unit Tests are used to test individual functions, while Integration Tests are used to test combinations of functions. Unit testing strongly encourages you to break down the system into individual pure functions (“pure” meaning that there are no side effects and the output is derived directly from the input).
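
A tiny illustration (a made-up example, not from any real project):

```javascript
const assert = require('assert')

// a pure function: no side effects, output derived only from the input
const applyDiscount = (price, percent) => price - price * (percent / 100)

// a unit test that describes the expected behavior
assert.strictEqual(applyDiscount(100, 25), 75)
assert.strictEqual(applyDiscount(80, 0), 80)
```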

In the end, TDD is awesome because it’s a natural way of converting business requirements into code- and when combined with Continuous Integration (automatic builds on each commit/merge)- it gives you the peace of mind to get some sleep knowing that the apocalypse was postponed for at least another day.

Large Datasets – Stuff I’m Learning

I usually prioritize code maintainability over performance, but for the first time, I’m experiencing the other side of the coin.

I currently work on software that’s expected to handle at least 350k rows by 50 columns, sanitize the data, “join” several datasets together, and perform calculations on that data. The data is frequently transformed, much like in a spreadsheet. Each transformation is expected to finish in a reasonable amount of time (seconds).

Here are a few of the hurdles I’ve run into:

1.) Memory limitations. Importing a 100MB CSV file into a JSON object can easily expand to gigs of memory because each “row” object repeats all the header names as keys.

There’s no simple way around that, but you can at least alleviate some of the overhead during conversion by using read/write streams and processing each line as it comes in. (Otherwise you’d need to store the whole original object in memory while creating a modified object on the side. Before garbage collection kicks in, you might at some point have 2 giant objects in memory.)
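
A sketch of the streaming approach (the file names and processRow are illustrative):

```javascript
const fs = require('fs')
const readline = require('readline')

const input = fs.createReadStream('data.csv')
const output = fs.createWriteStream('data-clean.csv')
const lines = readline.createInterface({ input })

lines.on('line', (line) => {
  // processRow is a placeholder for your own sanitization logic;
  // transform one row at a time instead of holding the whole file in memory
  output.write(processRow(line) + '\n')
})

lines.on('close', () => output.end())
```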

I like pure functions and immutability. Unfortunately, they tend to require creating a copy of the object because they can’t have side effects.

2.) Loops. When sanitizing big data, you’ll run loops through each item, checking for conditions and modifying objects. Computers are fast nowadays, but this is one area where it’s easy to destroy performance. When iterating over 17.5 million pieces of data, we needed to take extra caution not to stuff too much logic into the “hot loops”. One extra instruction per loop could easily double the processing time. Giving the loop a cheap reason to skip an iteration early is a great way to improve this area. Caching, where applicable, can be a lifesaver too. Also, we opted for Maps over arrays, since key lookups are effectively constant time rather than a linear scan.
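
For example (the rows array and its id field are illustrative):

```javascript
// looking up rows by id
const rows = [] // imagine hundreds of thousands of { id, ...fields } objects

// linear scan: every lookup walks the array
const findByIdSlow = (id) => rows.find((row) => row.id === id)

// build a Map once, then each lookup is effectively constant time
const rowsById = new Map(rows.map((row) => [row.id, row]))
const findByIdFast = (id) => rowsById.get(id)
```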

3.) Database inserts shouldn’t be huge. Using MongoDB, we tried to insert giant documents that were greater than 16MB. Mongo didn’t like it (which is ironic since the name “mongo” is derived from the word “humongous”). So we had to do it the “right” way: break the doc down into numerous parts and insert them individually.

4.) Database inserts can be ludicrously slow… or blazingly fast. With MongoDB, the traditional way of inserting a document is to instantiate the model and run save() on it. That’s fine for most occasions, but not when inserting hundreds of thousands of rows at a time. In that case, using MyModel.collection.insert(arrayOfDocs) is the way to go. That will send everything over the wire at once and skip schema validation. BUT, that array must be no more than 16MB in size (even though the docs will be inserted individually) because… ♬ you can’t always get what you want. ♬ So I needed to chop the huge array of docs into chunks of 10,000 docs, with the assumption that a given 10k chunk won’t exceed 16MB, then insert those chunks. It’s still pretty fast. So far it has worked out.
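
A sketch of the chunking (MyModel is the Mongoose model mentioned above; everything else is illustrative):

```javascript
// chop a huge array of docs into chunks and bulk-insert each chunk
const CHUNK_SIZE = 10000

const insertInChunks = async (docs) => {
  for (let i = 0; i < docs.length; i += CHUNK_SIZE) {
    const chunk = docs.slice(i, i + CHUNK_SIZE)
    // goes straight to the driver, skipping Mongoose schema validation
    await MyModel.collection.insert(chunk)
  }
}
```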

It feels like I’ve traveled back to the 90’s when you actually had to pay attention to memory consumption and find clever ways to overcome software/hardware limitations.

Promises vs Async Await – Revisited

In my previous post, I touted how Async Await syntax is less verbose than traditional “then” syntax. While that’s kinda true, it’s kinda not.

Lately, I’ve been having Node.js issues when using AA syntax because of “unhandled promise rejections”. Thank goodness Node warns you about these, otherwise they’d just disappear into the ether.

This is how to properly throw an error:
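
(The message is just a placeholder.)

```javascript
// throw an Error object so the stack trace is preserved
throw new Error('failed to load the config file')
```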

NOT this way:
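
(Same placeholder message, thrown as a plain string.)

```javascript
// throwing a plain string loses the stack trace
throw 'failed to load the config file'
```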

If you throw a string, you won’t see the stack trace (line numbers) when the error is logged. You need to throw an Error object which can take in a string message.

Also, this is the proper way to use async await:
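
(A sketch; fetchUser is an illustrative async function.)

```javascript
const loadUser = async (id) => {
  try {
    return await fetchUser(id)
  } catch (err) {
    // handle the error here, or re-throw it for the caller to deal with
    console.error(err)
    throw err
  }
}
```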

Any usage of await should be wrapped in a try-catch statement. If it’s not, the error won’t propagate up the stack like a normal exception; it’ll just remain stuck, as an unhandled rejection, in the scope of the function invocation that threw it.

Promises vs Async Await

Too many times within the last few weeks have I gone over some JS code and seen Async Await syntax. Even in some Github readme docs, code examples occasionally use it.

For a while, I thought that Promises were fine and Async Await (which I will now abbreviate as “AA”) was unnecessary, but I’m now convinced that for everyday use, AA is a great alternative- but only if you go all in.

Before (using Promises):
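
(A sketch; getUser is an illustrative function that returns a Promise.)

```javascript
const printUser = (id) => {
  return getUser(id).then((user) => {
    console.log(user.name)
  })
}
```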

After (using Async Await):
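
(The same sketch, rewritten.)

```javascript
const printUser = async (id) => {
  const user = await getUser(id)
  console.log(user.name)
}
```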

There are pretty much 2 differences to note:

1.) The surrounding function must be declared as an async function.

2.) The invocation of the promise function must be prepended with await

It makes async code look synchronous, which can vastly enhance readability without the need to worry about callbacks.

Drawbacks to Async Await? Inconsistency. You can’t just replace usage of then with usage of await; you also have to make the surrounding function async. So if you want consistency, you’d either need to make all functions async or use traditional then syntax… or do what most humans would do and just deal with the inconsistency.

A further question would be: when should you use Promises and when not? Should promises only be reserved for async code? If so, it may become a pain to figure out how a function should be invoked. I’ve experimented with “always use AA for both sync and async functions” and found it to be a bit of boilerplate to get going, but quite pleasant to use.

Brushing Up On Algorithms

TL;DR: In my spare time, I now experiment with various algorithms and data structures.

Although I’ve hacked together a few ranking algorithms here and there, I never did pursue traditional sorting/recursion algorithms and data structures. Well now after wrapping up a few projects, I’m suddenly craving something new…

A week ago, I started working on a repo called fun-with-algos to compile a list of popular algorithms, including Binary Search, QuickSort, Fibonacci and more, based on descriptions and pseudocode from Wikipedia articles and other references. Some of the algorithms really started to click for me, while others not so much. Nonetheless, I started to see patterns.

Any good search algorithm- e.g. Binary Search- requires that the array be already sorted so that it can use less-than/greater-than comparisons to filter out irrelevant entries.
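
A quick sketch of Binary Search (not necessarily the exact version in the repo):

```javascript
// binary search over a sorted array; returns the index or -1
const binarySearch = (sortedArr, target) => {
  let low = 0
  let high = sortedArr.length - 1
  while (low <= high) {
    const mid = Math.floor((low + high) / 2)
    if (sortedArr[mid] === target) return mid
    if (sortedArr[mid] < target) {
      low = mid + 1  // discard the lower half
    } else {
      high = mid - 1 // discard the upper half
    }
  }
  return -1
}
```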

Sorting algorithms tend to have at least one anchor point: a value that every other value is compared against and then swapped to the left or right of (in QuickSort, the pivot).

With recursion, the input usually starts at a specified value then decreases with each iteration until it reaches one or zero, at which point recursion terminates and a final value is returned.
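
Fibonacci is the classic example of that shape (again, a sketch):

```javascript
const fib = (n) => {
  if (n <= 1) return n // base case: recursion terminates at one or zero
  return fib(n - 1) + fib(n - 2)
}
```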

Then I tried creating a custom Array Flatten algorithm without the help of map, reduce or any other high level prototype methods. Although I needed a funky hack to get rid of circular references, it passed the tests. To my surprise, the solutions on StackOverflow varied drastically and the official ECMAScript standard seemed rather complicated for my needs.
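
A simplified sketch of the idea (minus the circular-reference handling my version needed):

```javascript
// flatten nested arrays without map/reduce or other prototype helpers
// (this version does not guard against circular references)
const flatten = (input) => {
  const result = []
  for (let i = 0; i < input.length; i++) {
    if (Array.isArray(input[i])) {
      const inner = flatten(input[i])
      for (let j = 0; j < inner.length; j++) {
        result.push(inner[j])
      }
    } else {
      result.push(input[i])
    }
  }
  return result
}

// flatten([1, [2, [3, 4]], 5]) -> [1, 2, 3, 4, 5]
```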

I will continue to add more algorithms to the repo as I discover them.

Save The Trees… Quad Trees

Afterwards, I moved on to a data structure. Linked lists are hardly any more useful than arrays, but tree structures are pretty interesting. A few years ago, I created a location-based photo sharing app called ThatSnap. Although the app no longer exists, the lessons continue to resonate in my work.

I remember needing to use a Quad Tree structure to quickly query objects on the map within a specified bounding box (the viewable area of the device). Sometimes there could be hundreds or thousands of markers on the map. You only need to render what’s visible. And on top of that, you’d want to cluster the markers in order to reduce visual chaos and further improve performance.

At the time, I had pulled some implementations for Swift (iOS) and Java (Android) from GitHub. With some minimal tweaking, I was able to get Quad Trees working on both versions of the app. But I never actually needed to understand how the structure worked.

Well for the last few days, I’ve dug deep into Quad Trees and I now see just how beautiful they are- both visually and theoretically. Check out the repo.

It’s actually fairly simple. You start with one huge quad. It has a max capacity of 4 items (or whatever we want). Before you insert a 5th item, the quad splits into 4 equally sized child quads. You then proceed to fill those up, ensuring that each point actually belongs in the quadrant it’s placed in. Rinse and repeat.
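
In JS-flavored pseudocode (the repo itself is C++), insertion looks roughly like this:

```javascript
// bounds are { x, y, w, h }; points are { x, y }
const contains = (b, p) =>
  p.x >= b.x && p.x < b.x + b.w && p.y >= b.y && p.y < b.y + b.h

class Quad {
  constructor(bounds, capacity = 4) {
    this.bounds = bounds
    this.capacity = capacity
    this.points = []
    this.children = null // becomes [nw, ne, sw, se] after a split
  }

  split() {
    const { x, y, w, h } = this.bounds
    const hw = w / 2
    const hh = h / 2
    this.children = [
      new Quad({ x, y, w: hw, h: hh }, this.capacity),
      new Quad({ x: x + hw, y, w: hw, h: hh }, this.capacity),
      new Quad({ x, y: y + hh, w: hw, h: hh }, this.capacity),
      new Quad({ x: x + hw, y: y + hh, w: hw, h: hh }, this.capacity),
    ]
    // re-home existing points into whichever child quad each one belongs in
    this.points.forEach((p) => this.children.some((c) => c.insert(p)))
    this.points = []
  }

  insert(point) {
    if (!contains(this.bounds, point)) return false
    if (!this.children && this.points.length < this.capacity) {
      this.points.push(point)
      return true
    }
    if (!this.children) this.split() // split into 4 equally sized child quads
    return this.children.some((child) => child.insert(point))
  }
}
```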

When you query within a specified bounding box, you first identify each overlapping quad within that range, then you find the overlapping points within those quads. The process is repeated recursively for child quads, just as it is for insertion.

Although I’ve primarily used the data structure for geolocation/mapping, I am curious about applications regarding computer vision and collision detection.

As a side note, this is my first non-trivial use of C++. It wasn’t too big of a jump from Java-which I had used for the last 3 years for Android dev- but I was always intimidated by the C++ build system and compiler mumbo jumbo. After fiddling with CMake, that part became simple. Classes, abstract classes, static methods and data types were all mostly the same as Java.

It has been a blast so far. Let’s see how much deeper down the rabbit hole I go.

Go here for the fun-with-algos repo and here for the quad-tree-cpp repo.

Dipping My Toes Into Machine Learning

TL;DR: I’ve created my first practice machine learning project- using a dataset of movies to predict ratings based on various movie attributes (starting with the release year).

Blue line = predicted average rating. Red line = actual average rating. Green line = number of samples for year. All values normalized. The predictions are surprisingly accurate.

Over the last few years, I’ve been hearing more and more buzz about machine learning. I felt that it was about time I nailed down the basics. The only hurdle I needed to overcome was finding a free dataset with a large enough sample size. Thankfully, I discovered themoviedb.org, which grants developers a beefy database of movies for free after signing up.

Well that wasn’t the only hurdle. They say that you can only do data science with Python or else all of your limbs get chopped off. I chose to use Javascript since there are plenty of ML libraries for it now and I could get started faster. I’ll make sure to actively monitor my limbs to ensure no dismemberment occurs.

Then I stumbled upon an article on hackernoon.com (which I’m beginning to increasingly adore, along with medium.com blogs) that walked me through a barebones Linear Regression implementation. After putting two and two together, I created a small web app that predicts movie ratings by year. The brains of the program were fairly simple:
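
Something along these lines (a sketch of simple linear regression, not the exact code from the repo):

```javascript
// fit y = slope * x + intercept, where x is the release year and y the rating
const linearRegression = (xs, ys) => {
  const n = xs.length
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n

  let numerator = 0
  let denominator = 0
  for (let i = 0; i < n; i++) {
    numerator += (xs[i] - meanX) * (ys[i] - meanY)
    denominator += (xs[i] - meanX) ** 2
  }

  const slope = numerator / denominator
  const intercept = meanY - slope * meanX
  return (x) => slope * x + intercept // returns a predict function
}
```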


For a rough idea of the error margin, I included the expected ratings and source sample sizes for each year (something I wish all widely published statistical charts would include). All values are normalized to look pretty.

Moving forward I’d like to:

  • Increase the sample size. I’ve restricted it to 20,000 movies, but with some tweaks, I can potentially access many more items from the DB. (Edit 4/14/18: Now fetches all 350,000+ movies and faster w/ AWS Lambda)
  • Experiment with different algorithms. Linear Regression is pretty good for conditions where the output tends to be linear, but I wonder if another algo can produce even better results. (Edit 4/14/18: Now using Polynomial Regression for a curved output. Will test other methods.)
  • Stream samples to the learning algo without needing the feed it the complete dataset at once. At that point, I’d like to animate the chart to illustrate the predictions becoming increasingly accurate. (Edit 4/14/18: Chart now animates as data is fetched, but still need an ML implementation that allows incremental learning.)
  • Possibly use TensorFlow, as it seems to be the industry standard for ML and they recently released a JS lib.

Check out the code for the project here:

https://github.com/ryanbennettvoid/machine-learning-movies