One of the most common idioms in the domains of Data Science, AI and Machine Learning is “Garbage in, garbage out”.
While a simple sentence on the surface, it elegantly captures one of the most pressing issues in these domains. It implies that low-quality data fed to any system will, except by chance, yield low-quality predictions and results.
In all application areas of Artificial Intelligence, data is crucial. With it, models and frameworks can be trained to assist humans in a myriad of ways. These models, however, need high quality data, especially annotations which are reliable and are representative of the ground truth. With high quality data, the system learns by optimally tuning its own parameters, using the data to provide valuable insights.
In the absence of such quality data, however, the tuning of these parameters is far from optimal, and the insights provided by such a system cannot be relied upon. No matter how good the model, if the data provided is of low quality, the time and resources poured into building a quality AI system will largely be wasted.
Since the quality of the data is so important, the first steps in building any machine learning system are crucial. The quality of the data is determined not only by its source but, equally importantly, by how the data is labeled and by the quality of the labeling process. Annotation quality is a key aspect of the data pipeline in machine learning, and all the steps that follow depend heavily on it.
Data being prepared for any task is prone to error, and the main reason is the human factor. As with any task involving humans, there are inherent biases and errors that need to be taken into account. Labeling any form of data, whether text, video or image, may elicit varied responses from different labelers. Many annotation tasks have no absolute answer, which is why an annotation process is required in the first place. That process, however, is itself occasionally prone to error. Two kinds of errors are commonly encountered:
Data drift: Data drift occurs when the distribution of annotation labels, or of the features of the data, changes slowly over time. It can lead to increasing error rates for machine learning models or rule-based systems. No data is static: ongoing annotation review is necessary to adapt downstream models and solutions as drift occurs. Drift is a slow and steady process over the course of annotation, and it may skew the data.
Anomalies: While data drift refers to slow changes in data, anomalies are step functions: sudden (and typically temporary) changes in data due to exogenous events. For example, in 2019-20, the COVID-19 pandemic led to anomalies in many naturally occurring data sets. It is important to have procedures in place to detect anomalies. When anomalies occur, it may be necessary to shift from automated solutions to human-based workflows. Compared to data drift, anomalies are considerably easier to detect and fix.
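One simple way to watch for drift in annotation labels is to compare the label distribution of a recent batch against an earlier reference batch. The sketch below uses total variation distance for this. The batch contents, the function names and the 0.2 threshold are all assumptions made for illustration, not part of any particular tool.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label in a batch."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(reference, current):
    """Half the sum of absolute frequency differences.

    0.0 means identical label distributions, 1.0 means no overlap at all.
    """
    p = label_distribution(reference)
    q = label_distribution(current)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in all_labels)


# Hypothetical label batches from two annotation periods.
earlier_batch = ["cat"] * 50 + ["dog"] * 50
later_batch = ["cat"] * 80 + ["dog"] * 20

DRIFT_THRESHOLD = 0.2  # assumed cut-off; tune per project
distance = total_variation_distance(earlier_batch, later_batch)
needs_review = distance > DRIFT_THRESHOLD
```

Comparing each batch against a fixed reference highlights slow drift, while a large jump between two adjacent batches is more likely to point to an anomaly.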
Quality Assurance Techniques
Various techniques can be employed to detect, fix and reduce the errors that occur in data annotation. These techniques ensure that the final deliverable data is of the highest possible quality, consistency and integrity. The following are some of those techniques:
Subsampling: A common statistical technique used to determine the distribution of data, this refers to randomly selecting and closely inspecting a subset of the annotated data to check for possible errors. If the sample is random and representative of the data, it can give a good indication of where errors are likely to occur.
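As a rough illustration of subsampling, the sketch below draws a random sample, counts how many sampled annotations a reviewer rejects, and reports an estimated error rate with an approximate 95% margin. The `is_correct` callback stands in for a human reviewer's verdict, and the function name and record fields are hypothetical.

```python
import math
import random


def estimate_error_rate(items, is_correct, sample_size, seed=0):
    """Review a random sample and estimate the error rate of the full set.

    Returns (error_rate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(items, sample_size)
    errors = sum(1 for item in sample if not is_correct(item))
    rate = errors / sample_size
    margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
    return rate, margin


# Hypothetical annotated records; "reviewed_label" plays the reviewer's verdict.
annotations = [
    {"id": i, "label": "cat", "reviewed_label": "dog" if i % 10 == 0 else "cat"}
    for i in range(1000)
]
rate, margin = estimate_error_rate(
    annotations,
    is_correct=lambda a: a["label"] == a["reviewed_label"],
    sample_size=200,
)
```

The normal approximation is only a rule of thumb; it becomes unreliable for very small samples or error rates close to zero.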
Setting a “Gold Standard/Set”: A selection of well-labeled images that accurately represent what the perfect ground truth looks like is called a gold set. These image sets are used as mini testing sets for human annotators, either as part of an initial tutorial or scattered across labeling tasks, to make sure that an annotator’s performance is not deteriorating, whether because of lapses on their part or because the instructions have changed. This sets a general benchmark for annotator effectiveness.
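Scoring an annotator against a gold set can be as simple as the sketch below. It assumes labels are stored as dictionaries keyed by item id; the ids, labels and function name are made up for the example.

```python
def gold_set_accuracy(annotator_labels, gold_labels):
    """Fraction of gold items the annotator labeled the same as the gold set.

    Both arguments map item id -> label. Only items that are in the gold set
    and were answered by the annotator are scored.
    """
    scored = [item for item in gold_labels if item in annotator_labels]
    if not scored:
        return None  # the annotator has not seen any gold items yet
    hits = sum(annotator_labels[item] == gold_labels[item] for item in scored)
    return hits / len(scored)


# Hypothetical gold set and one annotator's answers.
gold = {"img_1": "cat", "img_2": "dog", "img_3": "cat", "img_4": "bird"}
annotator = {"img_1": "cat", "img_2": "dog", "img_3": "dog", "img_9": "cat"}
accuracy = gold_set_accuracy(annotator, gold)  # 2 of the 3 scored items match
```

Tracking this score over time, rather than once, is what reveals whether an annotator's performance is slipping.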
Annotator Consensus: This is a means of assigning a ground truth value to the data by taking inputs from all the annotators and using the most likely annotation, typically the one most of them chose. It relies on the observation that collective decision-making often outperforms individual decision-making.
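A minimal form of annotator consensus is a majority vote. The sketch below assumes each item has a plain list of votes and treats ties as unresolved; real pipelines may break ties differently, and the function name is illustrative.

```python
from collections import Counter


def consensus_label(votes):
    """Return (label, agreement) for one item, or (None, agreement) on a tie.

    agreement is the share of annotators who chose the most common label.
    """
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(votes)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement  # tie: leave the item for a reviewer
    return top_label, agreement


label, agreement = consensus_label(["cat", "cat", "dog"])
```

Returning the agreement share alongside the label keeps the information needed to decide later which items deserve a second look.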
Using scientific methods to determine label consistency: Again inspired by statistical approaches, these methods use established formulas to quantify how consistently different annotators label the same data. Common measures include Cronbach’s alpha, pairwise F1, Fleiss’ kappa, and Krippendorff’s alpha. Each of these provides a generalizable measure of the quality, consistency and reliability of the labeled data.
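Of the measures listed, Fleiss' kappa is compact enough to sketch directly. The version below follows the standard definition and assumes every item was rated by the same number of annotators; the input layout (one row of category counts per item) and the sample ratings are choices made for this example.

```python
def fleiss_kappa(count_matrix):
    """Fleiss' kappa for a table of per-item category counts.

    count_matrix[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators
    (at least two).
    """
    n_items = len(count_matrix)
    n_raters = sum(count_matrix[0])
    if any(sum(row) != n_raters for row in count_matrix):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: average share of agreeing annotator pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_matrix
    ]
    observed = sum(per_item) / n_items

    # Chance agreement, from the overall proportion of each category.
    n_categories = len(count_matrix[0])
    proportions = [
        sum(row[j] for row in count_matrix) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        return 1.0  # only one category was ever used; agreement is trivial
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 4 items, 3 annotators, 2 categories.
ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(ratings)
```

A value near 1 indicates strong agreement beyond chance, while a value near 0 means the annotators agree about as often as chance would predict.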
Annotator Levels: This approach ranks annotators and assigns them to levels based on their labeling accuracy (which can be tested via the gold standard discussed above), giving higher weight to the annotations of higher-level annotators. This is especially useful for tasks that have high variance in their annotations, or that require a certain level of expertise: annotators who lack this expertise have a lower weight given to their annotations, whereas annotators with expertise have more influence on the final label.
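Weighting by annotator level can be sketched as a weighted vote. Here the weights are assumed to come from something like gold set accuracy; the annotator ids, weights, default weight and function name are all hypothetical.

```python
from collections import defaultdict


def weighted_consensus(votes, annotator_weights, default_weight=0.5):
    """Pick the label with the largest total annotator weight.

    votes maps annotator id -> label for a single item.
    annotator_weights maps annotator id -> weight; annotators without a
    recorded weight fall back to default_weight (an assumed value).
    """
    totals = defaultdict(float)
    for annotator, label in votes.items():
        totals[label] += annotator_weights.get(annotator, default_weight)
    return max(totals, key=totals.get)


# Hypothetical weights, e.g. each annotator's accuracy on the gold set.
weights = {"expert": 0.95, "novice_a": 0.40, "novice_b": 0.45}
votes = {"expert": "malignant", "novice_a": "benign", "novice_b": "benign"}
label = weighted_consensus(votes, weights)
```

In this example the single expert vote (0.95) outweighs the two novice votes (0.85 combined), so the weighted result differs from a plain majority vote.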
Edge case management and review: Mark edge cases for review by experts. An “edge case” can be identified either by thresholding the inter-rater metrics listed above, or by having individual annotators or reviewers flag it. This allows the most problematic data to be reviewed and corrected, as most anomalies occur in edge cases.
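The two routes to an edge case, a threshold on agreement and a manual flag, can be combined in a few lines. The sketch below uses the per-item majority share as a simple stand-in for an inter-rater metric; the 0.7 threshold, the item ids and the function name are assumptions for illustration.

```python
from collections import Counter


def flag_edge_cases(item_votes, min_agreement=0.7, manual_flags=()):
    """Return the sorted ids of items that should go to expert review.

    item_votes maps item id -> list of labels from different annotators.
    An item is flagged when the share of annotators choosing its most common
    label falls below min_agreement, or when it was flagged by hand.
    """
    flagged = set(manual_flags)
    for item_id, votes in item_votes.items():
        top_count = Counter(votes).most_common(1)[0][1]
        if top_count / len(votes) < min_agreement:
            flagged.add(item_id)
    return sorted(flagged)


# Hypothetical votes from three annotators per item.
item_votes = {
    "a": ["x", "x", "x"],  # full agreement
    "b": ["x", "y", "z"],  # no agreement
    "c": ["x", "x", "y"],  # two of three agree
}
review_queue = flag_edge_cases(item_votes, manual_flags=["a"])
```

With a 0.7 threshold, item "c" is flagged as well because two out of three falls just short; lowering the threshold would send fewer items to the experts.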
Automated (Deep learning based) Quality Assurance
The task of data annotation is very human in nature, as researchers and organizations are often looking specifically for human input. As a result, the quality of the labels depends on human judgement. There are certain approaches, however, that exploit the principles of deep learning to make this process easier, primarily by identifying data that may be prone to errors and picking it out for human review, ultimately ensuring higher quality.
Without delving too deep, fundamentally, this approach relies on actively training a deep learning framework on the data as it is being annotated, and then using this neural network to predict the labels / annotations on the upcoming unlabeled data.
If an adequate framework is selected and then trained on data with high-quality labels (e.g. the gold set mentioned above), the model will have little to no difficulty classifying or labeling the common cases. In cases where the labeling is difficult, i.e. an edge case, the model will have high uncertainty (or low confidence) in its prediction.
Interestingly, when a robust model has low confidence in a label, a human annotator will often find that item difficult as well.
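One common way to turn model uncertainty into a review queue is to measure the entropy of the predicted class probabilities and send high-entropy items to a human. The sketch below assumes the model's probabilities are already available as plain lists; the item ids, the 0.8-bit threshold and the function names are illustrative only.

```python
import math


def prediction_entropy(probabilities):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def select_for_review(predictions, max_entropy=0.8):
    """Return ids of items whose prediction is too uncertain to accept as is.

    predictions maps item id -> list of class probabilities from the model.
    max_entropy is an assumed threshold and should be tuned per task.
    """
    return [
        item_id
        for item_id, probs in predictions.items()
        if prediction_entropy(probs) > max_entropy
    ]


# Hypothetical model outputs for two unlabeled images, three classes each.
predictions = {
    "img_1": [0.97, 0.02, 0.01],  # confident: a common case
    "img_2": [0.40, 0.35, 0.25],  # uncertain: a likely edge case
}
to_review = select_for_review(predictions)
```

Only "img_2" ends up in the review queue here, since its probabilities are spread almost evenly across the classes.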
Whether you are in the tech industry or working on cutting-edge research, having high-quality data is of utmost importance. Regardless of whether your task is statistical or related to AI, an early focus on data quality will pay off in the long run.
At Ango AI, using a combination of the techniques mentioned above, we ensure that we only ship the highest quality labels to our customers. Whether by employing complex statistical methods to keep quality high, or cutting-edge deep learning frameworks to keep speed high and assist human annotators in review, we hold quality to the highest standards, subjecting it to numerous checks before it is finally delivered to you.