In tonight's run I ran the entire algorithm with a couple of different factors. I did not manage to improve my ranking significantly, but I did get closer to the probe rating by training over the entire data set, including the probe. I'm now researching at which point overfitting starts to occur. I reckon it is related to the learning rate, and I reckon it starts after about ( 50 cycles + current-feature-num ).
If you only look at probe RMSE improvements to decide when to switch to the next feature, this may never become apparent. I've run a large number of different configurations, always using probe RMSE as the early stopping criterion (stop once it no longer improves), but I now reckon that this may be too late. I'm now looking at stopping early once the hill has been reached, using a simple calculation (as above). I'll probably post about those results in the future.
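To make the stopping rule concrete, here is a minimal sketch of "stop once the probe RMSE no longer improves". The function name, the `min_improvement` parameter and the RMSE numbers are made up for illustration; in a real run the sequence would be produced by training one epoch at a time and measuring the probe after each.

```python
def train_until_probe_stalls(probe_rmse_per_epoch, min_improvement=0.0):
    """Consume one probe RMSE per epoch and stop at the first epoch that
    fails to improve on the best value so far by more than min_improvement.

    Returns (epochs_run, best_rmse). The epoch that triggers the stop is
    counted in epochs_run, since it had to be trained to be measured.
    """
    best = float("inf")
    epochs_run = 0
    for rmse in probe_rmse_per_epoch:
        epochs_run += 1
        if best - rmse <= min_improvement:
            break  # no (sufficient) improvement: move on to the next feature
        best = rmse
    return epochs_run, best


# Made-up probe RMSE values for a single feature.
example_curve = [0.950, 0.940, 0.935, 0.934, 0.9345, 0.936]
epochs_run, best_rmse = train_until_probe_stalls(example_curve)
```

Note that this rule only fires one epoch after the best value, which is exactly why it may be "too late" if the overfitting starts earlier.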
I've noticed that as training progresses through the features, the point of highest improvement shifts to later epochs, so you need more epochs to get there. The first feature shows improvement immediately after the first epoch, but the 10th, for example, only shows improvement after 16-20 epochs, and also "stops" 16-20 epochs later than the first.
I've tried squeezing out as much RMSE improvement as possible by using different parameters and global effect reductions, but the improvement flattens out completely after about 32 features. Actually, applying the movie/user bias global effect worsens the result significantly, so I've turned that off. Instead, I'm relying on training the biases together with the features for the moment.
Using K = 0.03f or K = 0.01f also makes no difference in the long run. The weird thing is that the probe RMSE for the first feature isn't really affected by it at all, and in the long run I'm seeing the same numbers come back.
I've also tried gradually modifying K and lrate, and that had no effect either. Note that in all these experiments, the probe RMSE is what decides whether to continue or not. I used to rely on a fixed MAX_EPOCH setting to switch to the next feature, and with that I did get differences. The following sketch is instrumental in my explanation:
The graph shows roughly how the improvements build up and decay. The x-axis shows the epochs, the y-axis the probe RMSE improvement. The vertical dashed line is the cut-off point where training stops. It shows that as long as training is stopped at the cut-off point, the parameters are irrelevant, since the area under the curve doesn't seem to change much. I should have drawn the smaller bell curve higher, so that the area under it makes clear that both have the same total gain (which is the case in my current setup).
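The "same area" argument can be checked with a few lines. This is only a sketch with made-up numbers: the area under the improvement curve up to the cut-off is the sum of the per-epoch improvements, and that sum is simply the probe RMSE before the feature minus the probe RMSE at the cut-off, whatever the shape of the curve in between.

```python
import math


def improvements(rmse_by_epoch):
    """Per-epoch probe RMSE improvement; rmse_by_epoch[0] is the value
    before the feature starts training."""
    return [prev - cur for prev, cur in zip(rmse_by_epoch, rmse_by_epoch[1:])]


def total_gain(rmse_by_epoch, cutoff):
    """Area under the improvement curve for the first `cutoff` epochs."""
    return sum(improvements(rmse_by_epoch)[:cutoff])


# Two made-up curves: a squeezed bell (improves early) and a wide one
# (improves late). Both start and end at the same probe RMSE.
squeezed = [1.00, 0.97, 0.95, 0.95]
wide = [1.00, 0.99, 0.97, 0.95]

same_area = math.isclose(total_gain(squeezed, 3), total_gain(wide, 3))
```

So when stopping is always driven by the probe RMSE itself, two parameter sets that end at the same probe RMSE have collected the same total gain, and only the shape of the curve differs.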
A higher learning rate squeezes the bell curve together (allowing earlier stopping), but it also makes the total result less accurate (since overshoot may occur). K has a similar effect on the learning, since it holds it back (regularizes it). But overall it should not matter too much.
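For reference, this is roughly where lrate and K enter the training. The sketch below assumes the usual Funk-style update for one rating and one feature; the function name, the `baseline` argument and the default lrate of 0.001 are placeholders, not taken from my actual code.

```python
def train_rating(user_f, movie_f, rating, baseline, lrate=0.001, K=0.03):
    """One gradient step for a single (user, movie, rating) triple on the
    feature currently being trained.

    baseline is the prediction from everything already trained (biases and
    earlier features). lrate scales the whole step; K pulls both feature
    values back towards zero, which is the regularization.
    """
    err = rating - (baseline + user_f * movie_f)
    new_user_f = user_f + lrate * (err * movie_f - K * user_f)
    new_movie_f = movie_f + lrate * (err * user_f - K * movie_f)
    return new_user_f, new_movie_f, err


# With K > 0 the step is slightly smaller than with K = 0 for the same error.
with_k, _, _ = train_rating(0.1, 0.1, 4.0, 3.6, K=0.03)
without_k, _, _ = train_rating(0.1, 0.1, 4.0, 3.6, K=0.0)
```

A larger lrate makes every step bigger (the squeezed bell curve, with the risk of overshoot), while a larger K shrinks the step a little on every update.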
So my approach now is to find out at which point before the cut-off it is best to stop learning and move on to the next feature. My guess is that overfitting starts right after the training has passed the hill. By aligning my parameters, I'm going to try to put that hill at a predictable point, so that I can simply set a maximum epoch and then move on to the next feature.
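Locating the hill in a recorded run is straightforward. The sketch below uses made-up probe RMSE values and a hypothetical function name; it only finds the epoch with the largest improvement, it does not answer how soon after the hill one should stop, which is the part I still have to find out.

```python
def hill_epoch(rmse_by_epoch):
    """Return the 1-based epoch with the largest probe RMSE improvement.

    rmse_by_epoch[0] is the probe RMSE before the feature starts training,
    rmse_by_epoch[n] the value after epoch n.
    """
    gains = [prev - cur for prev, cur in zip(rmse_by_epoch, rmse_by_epoch[1:])]
    if not gains:
        raise ValueError("need at least one trained epoch")
    peak_index = max(range(len(gains)), key=gains.__getitem__)
    return peak_index + 1


# Made-up curve: slow start, a hill at the third epoch, then decay.
recorded = [1.000, 0.998, 0.990, 0.975, 0.968, 0.966, 0.966]
peak = hill_epoch(recorded)
```

Once the hill shows up at the same epoch across runs for a given feature, the maximum epoch can be set relative to that number instead of waiting for the probe RMSE to stall.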