What are training and test data and what are they used for?
In simplified terms, data sets in supervised learning are divided into training and test data in order to check how well algorithms can make predictions for unknown data.
To do this, an algorithm is first trained on the training data. It learns a function f(x) that assigns an output y to each input x. Based on this function, the algorithm can then make predictions for new inputs.
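As a rough illustration of "learning a function f(x)", the sketch below fits a straight line to a handful of training pairs using ordinary least squares. The helper name fit_line and the sample values are made up for this example; real algorithms learn far more complex functions, but the idea of deriving f from (x, y) pairs is the same.

```python
def fit_line(xs, ys):
    """Learn f(x) = slope * x + intercept from training pairs (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the inputs and how inputs and outputs vary together.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x

    def f(x):
        return slope * x + intercept

    return f


# Hypothetical training data that follows y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

f = fit_line(train_x, train_y)
prediction = f(5.0)  # prediction for an input that was not in the training data
print(prediction)    # prints 11.0
```

Once f has been learned, it can be called on any new input x, which is exactly what happens later with the test data.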
To find out how accurate these predictions are, one computes metrics that make the performance of the algorithm measurable. However, these cannot be determined reliably with the training data, because the algorithm already "knows" the training data and the results would be too optimistic.
This is where the test data comes into play, since it is unknown to the algorithm. One lets the algorithm make predictions for the individual inputs x of the test data and then compares these predictions with the known outputs y. In this way, it is possible to check the accuracy of the algorithm under realistic conditions. The prediction accuracy and other metrics can then be used to fine-tune the algorithm.
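The comparison of predictions with known outputs can be sketched as follows. The example assumes a deliberately simple classifier (a single cut-off value) and invented data; fit_threshold and accuracy are hypothetical helper names, not part of any library.

```python
def fit_threshold(xs, labels):
    """Learn a cut-off halfway between the mean inputs of class 0 and class 1.

    Assumes class 1 tends to have larger inputs than class 0.
    """
    class0 = [x for x, y in zip(xs, labels) if y == 0]
    class1 = [x for x, y in zip(xs, labels) if y == 1]
    threshold = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2
    return lambda x: 1 if x > threshold else 0


def accuracy(model, xs, labels):
    """Share of inputs for which the prediction equals the known output."""
    correct = sum(1 for x, y in zip(xs, labels) if model(x) == y)
    return correct / len(xs)


# Invented data: the model only ever sees the training part.
train_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [2.5, 5.5, 6.0, 8.5]
test_y = [0, 0, 1, 1]

model = fit_threshold(train_x, train_y)
train_accuracy = accuracy(model, train_x, train_y)
test_accuracy = accuracy(model, test_x, test_y)
print(train_accuracy, test_accuracy)  # prints 1.0 0.75
```

In this made-up case the accuracy on the training data is perfect, while the test data reveals one wrong prediction. This is why the metric measured on the test data is the more honest one.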
There is no fixed ratio of training data to test data; it depends on the algorithm used and on the amount of data available. However, a data set is often divided into 80% training data and 20% test data.
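A minimal sketch of such an 80/20 split, using only the Python standard library, might look like this. The function name split_train_test, the seed and the toy data are assumptions for illustration; the data is shuffled first so that the test part is not simply the tail of the original ordering.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical (x, y) pairs.
data = [(x, 2 * x + 1) for x in range(10)]

train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))  # prints 8 2
```

Every sample ends up in exactly one of the two parts, so the test data stays unknown to the algorithm during training.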
There are several methods for dividing the data set into training and test data. However, the basic idea remains the same. One of the best-known methods is cross-validation.
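In k-fold cross-validation, the data set is divided into k parts (folds); each fold serves once as test data while the remaining folds are used for training, and the k results are then compared or averaged. The following is a simplified sketch under stated assumptions: the helper names are hypothetical, the "model" merely predicts the mean of the training outputs, and the data is not shuffled, which one would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and repeat for every fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return scores


def fit_mean(xs, ys):
    """Toy model for illustration: always predict the mean training output."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and known outputs."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Invented data set with six samples, evaluated with k = 3.
xs = [0, 1, 2, 3, 4, 5]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

scores = cross_validate(xs, ys, 3, fit_mean, mean_abs_error)
print(scores)
print(sum(scores) / len(scores))
```

Because every sample is used for testing exactly once, the averaged score depends less on one particular split than a single 80/20 division does.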
Data plays a crucial role in the training of machine learning algorithms. Without enough data of appropriate quality, it is not possible to train an algorithm capable of making accurate predictions. Thus, an algorithm is only as good as its training data.
Sources (translated): Medium and V7