Friday, October 16, 2009

Overfitting vs Overtraining

What is overfitting?

In statistics, overfitting occurs when a model is too complex for the data it was built from and therefore predicts real data poorly. The model has learned the past data very well, but it cannot generalize, so its error on the data that has to be predicted increases.

Overfitting occurs in many domains. In statistics and in business, a model is overfitted when it is not suitable for future predictions. Let's take a look at how this data changes over time:

This data was collected over a period of time. Now it is time to predict what the data will look like in the next period. Let's suppose that the values represent the number of items a company is going to sell. The company has to know how much stock to hold in order to meet its customers' needs. If the company overestimates the stock, it spends more money and the items may expire without being sold, so the company loses money. If the company underestimates the stock, the customers will be angry, and many of them will look for an alternative (another company). The prediction error should therefore be minimized in order to get the full benefit from the business.

Choosing a suitable model is hard. In general, such a model has to take many parameters into account: trend, population, time of year, weather, etc. (depending on the business and on the items being sold).

The next picture shows a bad example of prediction. The model is too simple to be used in a business environment, and its error on the training set is large (more than 50%). Because the model cannot even fit the data it was built from, its predictions cannot be good either; we'll call it an underfitted model.

Another approach would be the other extreme: a model that fits every element of the past data (the training data).

Why is such a model not good?

Because it is not useful for making predictions. It cannot generalize, because it was over-trained on the dataset for a long time in order to minimize the error on the training set. The real problem is that the training set is not the same as the real data that will be acquired in the future.

Then what is the difference between overfitting and overtraining?

As you can see, overfitting can be a result of overtraining, but that is not its only cause. For instance, overfitting can also occur when too many parameters are used to build the model. The curve of the model can be approximated by a polynomial function of degree n. If n is too big, the model will be overfitted. The same thing happens with neural networks that have too many nodes in the hidden layer. In that case the number of adjustable parameters (weights) increases, so the system can learn every point given as input, at the expense of generalization. To be continued.
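The effect of the polynomial degree can be sketched in a few lines of Python (not the ActionScript used elsewhere on this blog). The numbers are made up for illustration: six training points that follow the trend 2x + 1 with alternating noise of +1 and -1, and one future point at x = 6. A straight line (2 parameters) is compared with the degree-5 polynomial that passes through all six training points.

```python
# Made-up training data: trend 2*x + 1 plus alternating noise of +1 / -1.
train = [(0, 2), (1, 2), (2, 6), (3, 6), (4, 10), (5, 10)]
next_x, next_true = 6, 13  # the next period, following the noise-free trend


def fit_line(points):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolate(points, x):
    """Lagrange polynomial of degree len(points) - 1; it hits every point."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


slope, intercept = fit_line(train)

# Largest error on the data the models were built from.
line_train_error = max(abs(y - (slope * x + intercept)) for x, y in train)
poly_train_error = max(abs(y - interpolate(train, x)) for x, y in train)

# Prediction for the next period.
line_next = slope * next_x + intercept
poly_next = interpolate(train, next_x)

print("max training error: line %.2f, degree-5 %.2f" % (line_train_error, poly_train_error))
print("prediction at x=6:  line %.2f, degree-5 %.2f, trend %d" % (line_next, poly_next, next_true))
```

On this toy data the degree-5 polynomial has zero training error, yet its prediction for x = 6 is -50, while the straight line predicts 12.4 against a trend value of 13. A perfect fit on the past data says nothing about the quality of the prediction.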

Monday, June 22, 2009

Summer moments, seaside travel

Bulgaria has become one of the most challenging seaside destinations in the Balkans. By challenging I mean one of the most impressive places, where tourism is practiced at a high level. The service is excellent, and tourists are welcomed in many 4- and 5-star all-inclusive hotels.

Friday, May 1, 2009

Win a vacation in Bulgaria

If you want to win a vacation in Bulgaria for 2 people (plus a child up to 12 years old), at a very good hotel, you should enter the contest run by TravelPlanner.

The Castiga o vacanta in Bulgaria ("Win a vacation in Bulgaria") contest is in its third edition and has grown beyond the forecasts made for it, with over 4200 participants.

What should you do?

Create an account and promote your personal link among your friends, as well as on your blog, site, forums, etc., as stipulated in the contest rules.

What is the goal?

The prize goes to the person who promotes their personal link best and is voted for by the most people, even several times each (but on different days). Cheating is not allowed, for example: changing my IP within the network and voting for myself as many times as I want (using a personal block of IPs, or the block of the network I subscribe to), or using a public proxy. How is this checked? The "degree of dispersion of the IPs" is verified.

People have already been disqualified in the current edition of the contest. For example, in the second edition there were people who gathered 18,000 votes in a single day, even though only promotion of the link was required, not a vote from the person. This should be an alarm signal for advertising hunters who rely on traffic-counting systems that cannot detect this type of fraud (my thoughts also turn to those at trafic.ro).

In the current edition, using blocks of IPs (changing the IP one after another) and manual voting are equally frowned upon, although the enormous amount of work a person can put in to vote for themselves from "as many IPs as they can manage" remains impressive. But a Romanian stays a Romanian (I recall Vlad Țepeș, who dressed his army in Turkish clothes because otherwise, in open battle, he would have lost): he cannot let go of the tricks that could help him win at any cost.

Vienna 2

Sunday, March 15, 2009

Vienna

Friday, February 20, 2009

Draw a custom function in ActionScript

Here is the sample. Type your custom function into the function input field. You may use any standard function (sin, cos, sqrt, etc.) and any arithmetic operation. You can then choose a precision between 0.01 and 10, and of course you can set the scale (the default is 10).

E.g.: type "sin(x)" into the input text field, then press Live Draw Function to see the result.
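The sampling behind such a plot can be sketched as follows. This is Python, not the ActionScript source of the sample, and it rests on assumptions: `precision` is taken to be the step between two x values, `scale` the number of pixels per unit, and the function is passed in as a callable rather than parsed from text. The name `sample_function` is hypothetical.

```python
import math


def sample_function(f, x_min, x_max, precision=0.1, scale=10):
    """Sample f every `precision` units and scale the points to pixels.

    Assumption: `precision` is the x step and `scale` is pixels per unit.
    """
    steps = int(round((x_max - x_min) / precision))
    points = []
    for i in range(steps + 1):
        # Multiply instead of accumulating to limit floating-point drift.
        x = x_min + i * precision
        y = f(x)
        # Screen coordinates grow downward, so y is negated.
        points.append((x * scale, -y * scale))
    return points


points = sample_function(math.sin, 0.0, math.pi, precision=0.5, scale=10)
for px, py in points:
    print("%.1f %.2f" % (px, py))
```

A smaller precision gives more points and a smoother curve at the cost of more drawing work, which is the trade-off the precision setting exposes.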

The source code can be viewed through the "view source" context menu.

PS: You may draw some circles using the mouse on the draw surface. It's not a bug, it's a feature! :D

Tuesday, February 17, 2009

Digit Recognizer - first step in ActionScript

I've implemented a first machine that recognizes digits. I used the U.S. Post Database of digits to train the machine, and I also used their database to test it. In each row, the first number is the digit and the next 256 numbers are the input matrix (16x16).
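Reading one such row can be sketched like this in Python (not the ActionScript of the project). The sketch assumes the values are separated by whitespace; the function name `parse_row` and the sample row are made up for illustration.

```python
def parse_row(line, size=16):
    """Split one row into (digit, size x size matrix).

    Assumption: values are whitespace-separated numbers.
    """
    values = line.split()
    expected = 1 + size * size
    if len(values) != expected:
        raise ValueError("expected %d values, got %d" % (expected, len(values)))
    digit = int(float(values[0]))  # tolerates "7" as well as "7.0"
    pixels = [float(v) for v in values[1:]]
    # Cut the flat list of 256 values into 16 rows of 16.
    matrix = [pixels[r * size:(r + 1) * size] for r in range(size)]
    return digit, matrix


# Made-up row: digit 7 followed by 256 zeros.
sample_row = "7 " + " ".join(["0"] * 256)
digit, matrix = parse_row(sample_row)
print(digit, len(matrix), len(matrix[0]))
```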

My first impression is that ActionScript is too weak for numeric computation. The machine has 256 input neurons, 20 hidden neurons in a single layer and 10 output neurons. I tried to increase the number of neurons in the hidden layer, but this turned out to be too much for ActionScript: the AIR application just froze after a few seconds of computation. All in all, with this configuration I managed to train the machine to recognize ~1800 of the 2007 test digits (the best case was 1815, but in general over 1800) after a few training runs (generally after the second one). See the example here. Training takes more than 10 seconds, but the test runs great (works on my machine).

I must admit that computing the results (even for as many tests as those in this file) is fast enough in ActionScript, so I am sure that a component developed in Flex could run under normal conditions if its weight matrix had been computed beforehand. I am afraid that a learning machine implemented in ActionScript would be inefficient, which is why I think the weight matrix computation in the backpropagation algorithm should be done in another layer, using another programming language (e.g. Java and OpenAMF).
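To see why recognition is cheap once the weights exist, here is a minimal Python sketch of a forward pass through a 256-20-10 network. It is not the NeuralNetwork class from this post: the sigmoid activation, the bias terms and the random placeholder weights are assumptions standing in for a matrix that would be trained elsewhere.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict(pixels, w_hidden, b_hidden, w_out, b_out):
    """Forward pass only; returns (most active output index, all outputs)."""
    hidden = layer(pixels, w_hidden, b_hidden)
    outputs = layer(hidden, w_out, b_out)
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return best, outputs


# Placeholder weights (assumption): a real component would load trained ones.
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


w_hidden, b_hidden = random_matrix(20, 256), [0.0] * 20
w_out, b_out = random_matrix(10, 20), [0.0] * 10
weight_count = 256 * 20 + 20 * 10

pixels = [0.0] * 256  # a blank 16x16 image
digit, outputs = predict(pixels, w_hidden, b_hidden, w_out, b_out)
print("weights:", weight_count, "predicted index:", digit)
```

One recognition costs 5320 multiplications plus 30 sigmoid evaluations, whereas training repeats a pass of this kind, plus the weight updates, for every training example on every run.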

For those who are skeptical, here is an example application for digit recognition: a replacement for captcha texts (draw a digit and we're sure you're not a robot).

Here is the source code for the NeuralNetwork class. Unfortunately, the comments were lost when the file was uploaded via FTP :D (coming up)

Monday, February 16, 2009

Neural Network in ActionScript 3

Keywords: neural network, backpropagation, supervised learning

ActionScript Neural Network = annet, a package developed to solve classification problems using supervised learning.

First problem implemented: XOR Problem
0 ^ 0 = 0
0 ^ 1 = 1
1 ^ 0 = 1
1 ^ 1 = 0

See the example here. You should first train the network by pressing the "Train" button.
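XOR is the classic first test because no single neuron can compute it; a hidden layer is required. The Python sketch below (not the annet package) shows a 2-2-1 network with hand-picked weights that reproduces the table above. The weights are an assumption chosen for illustration; backpropagation would have to find values of this kind by itself during training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def xor_net(a, b):
    """2-2-1 network with hand-picked weights (illustrative, not trained)."""
    h_or = sigmoid(20 * a + 20 * b - 10)   # fires if at least one input is 1
    h_and = sigmoid(20 * a + 20 * b - 30)  # fires only if both inputs are 1
    # Output: "or, but not and", which is exactly XOR.
    return sigmoid(20 * h_or - 20 * h_and - 10)


results = []
for a in (0, 1):
    for b in (0, 1):
        out = round(xor_net(a, b))
        results.append(out)
        print("%d ^ %d = %d" % (a, b, out))
```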

Coming up next: Digit Recognizer Component

Monday, January 26, 2009

Intellectual recession

I cannot understand people who like to see other people suffer. I am one of them. My plans are to destroy the enemy, a mental enemy that exists only in my head. It is time for change. Now is the time to change this, to become a better person.

Wednesday, January 21, 2009

Html Manager for Dummies

I've been working for a while on an Html Manager that has started to head in the right direction. At least that's how it seems to me. Let me tell you what it's about. It is a tool built in Flex that works with PHP objects (through AMFPHP) and helps me create sites very quickly, using several types of pages that contain certain types of elements.
For example, I have a page with some information, then a row of pictures, then more information below, a dynamically built form with which I store my data and collect information from users; on the right-hand side, a specific ad or a group of links from the site.
In practice a new site means a different design, that is CSS + header + footer. Beyond that, a small rearrangement of the site would not change the structure of the manager at all.
There are still advantages, because administration is done through a RIA built to make it easy to work with the database, the pages, and the relationships between them.
Take a look at http://ski-bulgaria.travelplanner.ro to convince yourself of the framework's usefulness.