The need for balanced and high-quality training datasets

Before a machine learning model is developed, a training set of manually labeled data has to be designed. The goal of AI is to augment human performance, so AI is built on work done by humans. The development of an AI system begins with people asking questions about a specific business process. While there has been scientific progress in semi-supervised and unsupervised machine learning, the majority of market applications still require humans to label the training dataset.

The key principles of creating high-quality training datasets with human labeling


Accurate labeling

Balanced & appropriately sized

Sensitive use cases raise two major concerns: unbalanced data and algorithmic bias. Addressing algorithmic bias requires a manually labeled dataset, a diverse group of labelers, and impartiality in the original decision-making process.

Human-labeled training datasets are most valuable when there is no margin for poor algorithmic performance. For example, they should be favored over synthetically generated data when image processing is used to ensure safety, to assess industry risks, or in applications where fairness is at stake, such as job application screening or criminal justice.

Best practices for the size and sampling of a training set vary with the use case. An ideal training set maintains class balance, which means feeding the machine learning model a sufficient number of instances for each class it is trained on. In images collected in the real world, upholding strict class balance is nearly impossible. For example, street-view cameras may capture significantly more pedestrians or SUVs than bikes or fire hydrants.
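
As a rough illustration of how class imbalance can be measured, the Python sketch below counts the labels in a dataset and derives inverse-frequency class weights. The label counts are invented for the street-view example and are not real data. The function names are hypothetical. The weighting formula (total samples divided by the number of classes times the class count) is one common heuristic for compensating for imbalance, not a method this article prescribes.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of the dataset that each class represents."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def balanced_class_weights(labels):
    """Return inverse-frequency weights: total / (n_classes * class_count).

    Rare classes get weights above 1, common classes get weights below 1.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label counts for a street-view dataset (illustrative only).
labels = (
    ["pedestrian"] * 60
    + ["suv"] * 30
    + ["bike"] * 8
    + ["fire_hydrant"] * 2
)

shares = class_distribution(labels)
weights = balanced_class_weights(labels)

for cls in shares:
    print(f"{cls}: share={shares[cls]:.2f}, weight={weights[cls]:.3f}")
```

Weights like these can be passed to a training loss so that mistakes on rare classes count for more. Collecting or labeling additional examples of the rare classes is the alternative when reweighting is not enough.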