Friday, October 16, 2009

Overfitting vs Overtraining

What is overfitting?

In statistics, overfitting occurs when a model is too complex and fails to predict real data. The model has learned the past data extremely well, but it cannot generalize, so the error on the data it is meant to predict grows.

Overfitting occurs in many domains. In statistics and in business, it appears when a model is not suitable for future predictions. Let's take a look at how this data changes over time:

This kind of data was collected over a period of time, and now it is time to predict the data for the next period. Suppose those values represent the number of items a company is going to sell. The company has to know how much stock to hold in order to meet its customers' needs. If the company overestimates the stock, more money will be spent and items could expire without being sold, so the company will lose money. If the company underestimates the stock, customers will be unhappy and many of them will look for an alternative (another company). The prediction error should therefore be minimized in order to get the full benefit from the business.

Choosing a suitable model is a hard thing to do. In general, many parameters are involved in making such a prediction: the trend, the number of customers, the time of year, the weather, and so on (depending on the business and the items being sold).

In the next picture, you'll see a bad example of a prediction. The model is too simple to be used in a business environment, and the error on the training set is large (more than 50%). The prediction cannot be a good one, and because the model suffers from this large error, we'll call it an underfitted model.
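The underfitting situation can be sketched numerically. Below, a hypothetical sales series with a curved trend is fitted with a constant (degree-0) model; the numbers are made up for illustration and are not taken from the figure:

```python
import numpy as np

# Hypothetical monthly sales with a curved upward trend
# (illustrative numbers, not the article's data).
months = np.arange(1, 13, dtype=float)
sales = 10 + 2 * months + 0.8 * months ** 2

# Fit a constant (degree-0) model: far too simple for curved data.
coeffs = np.polyfit(months, sales, deg=0)
predictions = np.polyval(coeffs, months)

# The relative training error stays large because a flat line
# cannot follow the trend.
relative_error = np.abs(predictions - sales) / sales
print(f"mean relative training error: {relative_error.mean():.0%}")
```

Even on the data it was trained on, the flat model misses badly, which is exactly what makes it an underfitted model.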


Another approach would be the other extreme: a model that fits every element of the past data (the training data).



Why is such a model not good?

Because it is simply not useful for making predictions. It has no power to generalize, because it was over-trained on the dataset for a long time in order to minimize the error on the training set. The real problem is that the training set is not the same as the real data that will be acquired in the future.


Then what is the difference between overfitting and overtraining?


As you may see, overfitting is a phenomenon that appears as a result of overtraining, but not only. For instance, overfitting can also occur when too many parameters are used to build the model. The curve of the model could be approximated by a polynomial of degree n; if n is too big, the model will be over-fitted. The same result can be obtained when using a neural network with too many nodes in the hidden layer. In that case the number of adjustable parameters (weights) increases, so the system can learn every input point exactly, hurting the generalization of the model. To be continued.
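The degree-n point can be shown with a small experiment. Here a degree-9 polynomial is driven through ten noisy training points (so its training error is essentially zero) and compared on held-out points against a degree-3 fit; the data is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def truth(x):
    # Underlying signal the model should recover.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 10)
x_test = np.linspace(0.05, 0.95, 10)   # held-out points off the training grid
y_train = truth(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = truth(x_test) + rng.normal(0, 0.2, x_test.size)

def errors(degree):
    # Fit a polynomial of the given degree; return (train MSE, test MSE).
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

train9, test9 = errors(9)   # interpolates all 10 points: over-fitted
train3, test3 = errors(3)   # captures the overall trend

print(f"degree 9: train MSE {train9:.2e}, test MSE {test9:.2e}")
print(f"degree 3: train MSE {train3:.2e}, test MSE {test3:.2e}")
```

The degree-9 fit wins on the training set but loses on the held-out points: minimizing training error alone is exactly the trap described above.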
