TL;DR: I’ve created my first practice machine learning project- using a dataset of movies to predict ratings based on various movie attributes (starting with the release year).
Blue line = predicted average rating. Red line = actual average rating. Green line = number of samples for year. All values normalized. The predictions are surprisingly accurate.
Over the last few years, I’ve been hearing more and more buzz about machine learning. I felt that it was about time I nailed down the basics. The only hurdle I needed to overcome was finding a free dataset with a large enough sample size. Thankfully, I discovered themoviedb.org which grants developers a beefy database of movies for free after signing up.
Then I stumbled upon an article on hackernoon.com (which I’m beginning to increasingly adore, along with medium.com blogs) that walked me through a barebones basic Linear Regression implementation. After putting two and two together, I created a small web app that predicts movie ratings by the year. The brains of the program was fairly simple:
const ml = require( 'ml-regression' ), SLR = ml.SLR;
const x = [ 1999, 2000, 2001 ]; // list of years
const y = [ 7.1, 6.9, 7,5 ]; // list of average ratings per year
const regressionModel = new SLR( x, y );
// predicted average rating for 2018
const prediction = regressionModel.predict( 2018 );
For a rough idea of the error margin, I included the ´expected ratings and source sample sizes for each year (something I wish all widespread statistical charts would include). All values are normalized to look pretty.
Moving forward I’d like to:
Increase the sample size. I’ve restricted it to 20,000 movies, but with some tweaks, I can potentially access many more items from the DB.(Edit 4/14/18: Now fetches all 350,000+ movies and faster w/ AWS Lambda)
- Experiment with different algorithms. Linear Regression is pretty good for conditions where the output tends to be linear, but I wonder if another algo can produce even better results. (Edit 4/14/18: Now using Polynomial Regression for a curved output. Will test other methods.)
- Stream samples to the learning algo without needing the feed it the complete dataset at once. At that point, I’d like to animate the chart to illustrate the predictions becoming increasingly accurate. (Edit 4/14/18: Chart now animates as data is fetched, but still need an ML implementation that allows incremental learning.)
- Possibly use TensorFlow. as it seems to be the industry standard for ML and they recently released a JS lib.
Check out the code for the project here: