- Predict the rating a user would give a certain movie, to one decimal place.
- Compare the predicted number with the real rating that we know (training data).
- Multiply the error by itself (square it) and add the result to a running sum.
- Divide the sum of the squared errors by the number of predictions made.
- Take the square root of that average.
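The steps above can be sketched as a small function (a minimal illustration; the function and variable names are my own, not from any particular library):

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error over paired predictions and known ratings."""
    total = 0.0
    for predicted, actual in zip(predictions, actuals):
        error = predicted - actual   # compare the prediction with the known rating
        total += error * error       # square the error and add it to the sum
    mean = total / len(predictions)  # divide by the number of predictions made
    return math.sqrt(mean)           # take the square root of that average
```

For example, `rmse([3.5, 4.0], [4.0, 3.0])` is `sqrt((0.25 + 1.0) / 2)`, roughly 0.79.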
In the picture above, you can see a blue line showing a strong reduction of the RMSE on the training set. It fits a logarithmic curve almost perfectly. However (and I don't have enough red probe points in the graph to demonstrate this), the performance on data that is unknown to the algorithm (a probe set to test against, or the qualifying set) gradually deteriorates: the predictions it actually has to make get worse, much worse, so the RMSE on the probe and qualifying sets grows as training progresses. In other words, the parameters are overfitting the training set very heavily, yielding negative returns on the other data sets.
Looking at the projection of the red line, the RMSE on the separate probe set (ratings that are known, but excluded from the training set) does not appear to converge at any point. My conclusion is that I need to think of something else.
In some later iterations that I'm not showing here, it's interesting to see that the training-set RMSE follows a logarithmic curve, whilst the RMSE on the probe/qualifying set is neither linear nor logarithmic: it starts out roughly linear, and may become logarithmic after some time. At the moment I'm at position 1120 on the leaderboard, but I've only scratched the surface of this problem; I submitted those results based on a matrix factorization (MF) algorithm with only 16 features. For comparison, some other people have submitted results using over 600 features, calculated over 120 or so epochs (training runs for one feature).
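For reference, a matrix factorization of the kind mentioned above, trained one feature at a time over a number of epochs, can be sketched roughly like this. This is only an illustrative sketch in that spirit; the function name, hyperparameter values, and update rule are my own assumptions, not the actual submission code:

```python
def train_mf(ratings, n_users, n_items, n_features=16,
             epochs_per_feature=120, lr=0.001, reg=0.02):
    """Sketch of feature-at-a-time matrix factorization via SGD.

    ratings: list of (user_index, item_index, rating) triples.
    Returns user-feature and item-feature matrices U and V, so that
    the predicted rating for (u, i) is the dot product of U[u] and V[i].
    """
    # User and item feature matrices, initialised to a small constant.
    U = [[0.1] * n_features for _ in range(n_users)]
    V = [[0.1] * n_features for _ in range(n_items)]
    for f in range(n_features):                # train one feature at a time
        for _ in range(epochs_per_feature):    # an epoch = one pass per feature
            for u, i, r in ratings:
                # Prediction uses all features, trained or not.
                pred = sum(U[u][k] * V[i][k] for k in range(n_features))
                err = r - pred
                uf, vf = U[u][f], V[i][f]
                # Gradient step on feature f only, with L2 regularisation.
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V
```

On a toy rating matrix this drives the training error down quickly; the overfitting described above only shows up once you hold ratings out and measure the RMSE on those instead.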