Lorenzo Gravina

“Since starting to work with Ango AI, being able to take advantage of their labeling service within a secure network allowed us to get better results, faster.”

Eren Bekin, Head of Innovation Unit at Anadolu Sigorta

Before meeting Ango AI, the leading Turkish insurance company Anadolu Sigorta was performing in-house labeling of its data using three different tools.

Anadolu Sigorta trains its machine learning models on text, image, and audio data. For each type, they needed a separate tool, and none of these tools integrated with one another or with the company’s existing infrastructure at large.

Since meeting Ango AI, Anadolu Sigorta has been annotating all of their data with Ango Hub On-Premise: a single, unified platform which integrates seamlessly with the rest of their infrastructure. This has saved them resources and dramatically increased the quality of their labels.

The Customer

Anadolu Sigorta is one of the leading insurance companies in Turkey. It was the country’s first ever insurance institution. It numbers thousands of employees and millions of customers, both locally and abroad.

Anadolu Sigorta is one of the technology leaders in the Turkish insurance market. They have an eye for innovation, and they adopted AI and machine learning models as soon as it was feasible.

Thanks to their adoption of AI, Anadolu Sigorta can process a large number of documents and images coming from their agents on the field, with only minimal human supervision. This optimizes resource utilization and dramatically speeds up the processes of appraisal and data entry.

Before Ango AI

Machine learning models need to be trained before being used. In other words, if we want a model to recognize cars, we first need to feed it thousands of pictures in which the car is highlighted (i.e., labeled).

In Anadolu Sigorta’s case, they mostly use models to appraise damage and similar entities from pictures, to automatically file documents, and to process audio. For this reason, their in-house AI team spends a considerable amount of time labeling images, documents, and audio to train their models.

Lack of versatility

Before Ango AI, to fulfill their data annotation needs, Anadolu Sigorta had to use three different labeling platforms: one for text, one for audio, and one for images. This stems from the fact that most platforms in this category specialize in a single data type, making it difficult for multimedia teams like Anadolu Sigorta’s to perform their labeling tasks cohesively.

The platforms did not communicate or integrate with one another, so it was impossible to get comprehensive information on the labeling tasks as a whole.

The platforms looked and worked differently from one another, so annotators had to retrain themselves each time they switched. This reduced label quality: an annotator’s muscle memory cannot kick in when every media type has a radically different interface. It also wasted time, since every hour spent retraining is an hour not spent labeling.

High Pricing

Since they were using three different platforms from three different vendors, they had to pay each vendor separately, incurring a high cost. The prevailing “single-media” model of labeling platforms puts multimedia teams at a disadvantage: they pay three times as much as a team working with a single media type.

After Ango AI

Anadolu Sigorta contacted Ango AI regarding their labeling needs in the first half of 2021. Once we understood their needs, we proposed they try our Ango Hub On-Premise solution, as it fit their needs perfectly.

Anadolu Sigorta has since installed Ango Hub On-Premise on their own servers and has been annotating data with it.

Since we do not collect any type of analytic data from our On-Premise installations for security reasons, we do not know for certain how much and for what they used our platform. They have, however, communicated to us that they labeled millions of data points, and that they intend to continue working with Ango Hub On-Premise for the foreseeable future.

Versatility

Now, with Ango Hub On-Premise, the AI team at Anadolu Sigorta only has to use one tool to annotate all of their data. Some of the advantages of this are:

Pricing

Since Anadolu Sigorta now uses only Ango Hub instead of three different platforms, they save considerable time and money.

Deployment

Before Ango AI, setting up and maintaining their three different products took considerable time and effort.

Installing the entire Ango Hub On-Premise product to Anadolu Sigorta’s servers took, in total, less than 30 minutes thanks to Ango Hub’s adoption of the latest state of the art in deployment infrastructure, explained in detail in our Ango Hub On-Premise attachment.

Plus, any updates released after the initial deployment took less than 15 minutes each to deploy and activate, resulting in virtually no downtime for Anadolu Sigorta.

Updates

Based on feedback from this customer, we have been able to push updates that helped not only them but every other customer as well. We created a direct line of communication between the customer and us, which allowed us to troubleshoot any issue they had almost instantly.

Before, this was not easily possible, as a change in their data labeling pipeline would have meant changing three different products.


Our data science and ML teams are working hard on delivering an all-new AI assistance feature to our customers: Magic Prediction. We aim to release this feature in the upcoming weeks. In the meantime, we’d like to share the progress made so far and explain exactly what Magic Prediction does.

The rise of data-centric ML has shown that many problems in ML model development can still only be solved effectively with supervised learning. Data annotation is therefore extremely relevant, especially for industrial AI projects. It is also labor-intensive, which means that even small improvements to the annotation process are extremely valuable. At Ango, constantly improving end-to-end data labeling processes is our north star.

To speed up the process of data labeling even further, we are pleased to introduce our new feature, which we call “Magic Prediction”. It is a simple yet effective technique, which we believe will speed up the data annotation pipeline drastically.

In a typical bounding-box annotation scenario, an annotator performs two tasks:

  1. Drawing a bounding box containing the object, and
  2. Selecting the class name among possible candidates.

With Magic Prediction, we eliminate this second step by automatically classifying the object inside the drawn bounding box. To make this happen, we train image classifiers as labeled data accumulates: as the number of annotated objects increases, we can train better-performing classifiers.
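The crop-then-classify flow is straightforward to sketch. The snippet below is a minimal illustration rather than our actual implementation: `crop_bbox`, `predict_class`, and the dummy classifier are all hypothetical names, and in production the classifier would be a network trained on previously annotated crops.

```python
import numpy as np

def crop_bbox(image, box):
    """Crop an HxWxC image array to an axis-aligned (x, y, w, h) box."""
    x, y, w, h = box
    return image[y:y + h, x:x + w]

def predict_class(crop, classifier):
    """Run the classifier on the cropped region and return the most
    likely class index; `classifier` maps a crop to class probabilities."""
    probs = classifier(crop)
    return int(np.argmax(probs))

# Toy usage: a dummy classifier that calls bright crops class 1.
image = np.zeros((100, 100, 3), dtype=np.uint8)
image[20:60, 30:70] = 255                      # a bright "object"
dummy = lambda c: [0.1, 0.9] if c.mean() > 127 else [0.9, 0.1]
label = predict_class(crop_bbox(image, (30, 20, 40, 40)), dummy)  # label == 1
```

The annotator only supplies the box; the class comes back from the classifier.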

Here’s a sample illustration of the class predictor:

Let us introduce our lovely Smoky. She is cute, and crazy at the same time;
and of course, she is a cat!


Experiments


To test the effectiveness of Magic Prediction, we conducted a number of experiments with a large variety of images, classes, and labeling conditions. We are publishing the results here.


Dataset


For the experiments, we used the COCO object detection and segmentation dataset. COCO contains 80 classes in total, but for the sake of simplicity, only the animal classes (sheep, bird, cow, horse, elephant, dog, giraffe, zebra, cat, and bear) were selected. The distribution of the classes is shown in the figure below:

Class Distribution of COCO Dataset (Only Animal Classes)


Results


To assess the classifier’s performance, we inspected correctly and wrongly classified samples from the test set. As the figures below show, if the animal is clearly visible and not occluded, it is correctly classified. However, if there is occlusion, if the animal is far from the camera, or if the lighting conditions are bad, the probability of the classifier making a mistake increases.

Correctly Classified Samples

Wrongly Classified Samples

The Effect of Training Size


It is useful to know the minimum number of data instances needed for training, and how frequently we should retrain our model. To find out, we trained the model with various sample sizes and measured its performance on the same test data.

The figure below plots accuracy against training sample size. With the maximum sample size, an 88.17% accuracy score is obtained. As expected, accuracy decreases as the sample size decreases. However, with only 250 annotated objects, the classifier still reaches 74.26% accuracy: lower, but still satisfactory. At 2,172 samples, accuracy reaches 83.02%. Note also that these classifiers are still basic and open to improvement, so these numbers can be read as a lower bound on performance in this setup.

Sensitivity to Bounding Box Size


So far, we have used COCO bounding-box annotations directly as annotator input. In this section, we discuss the effect of bounding-box tightness on accuracy. In the figure below, the classifier was tested at various tightness levels. With extremely tight or extremely loose bounding boxes, the classifier begins to make mistakes; up to a certain level, however, it tolerates loose bounding boxes well.

Sensitivity to Bounding Box Size
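For readers who want to reproduce a tightness experiment like the one above, here is one simple way to vary box tightness. The helper `scale_box` is a hypothetical name for illustration, not part of our product: it scales an annotator box about its center, so factors below 1 tighten it and factors above 1 loosen it.

```python
def scale_box(box, factor, img_w, img_h):
    """Scale an (x, y, w, h) box about its center by `factor`, clamped
    to the image bounds: factor < 1 tightens the box, factor > 1 loosens it."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    nw, nh = w * factor, h * factor
    nx, ny = max(0.0, cx - nw / 2), max(0.0, cy - nh / 2)
    return (nx, ny, min(nw, img_w - nx), min(nh, img_h - ny))

# The same box at three tightness levels, inside a 100x100 image:
box = (30, 20, 40, 40)
tight, normal, loose = (scale_box(box, f, 100, 100) for f in (0.7, 1.0, 1.5))
```

Cropping each scaled box and re-running the classifier gives an accuracy-vs-tightness curve like the one in the figure.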

Object Detection vs. Magic Prediction


After seeing the Magic Prediction tool, you may wonder: rather than a prediction tool, could an object detection model be used directly?

In general, object detection models are more complex than classifiers, which makes them more data-hungry: you need more annotated data to reach a given level of accuracy. In addition, the runtime complexity of object detection models is higher.

What’s Next?


We are still working on making our image classifiers as accurate as possible. As a next feature, we plan to add the ability to detect out-of-distribution cases. We also plan to combine the class predictor with our other AI assistance tools.


Author: Onur Aydın
Editor: Lorenzo Gravina
Technical Editor: Balaj Saleem

The Ango team prides itself on its incredible diversity. Our team has people coming from different countries, walks of life, and disciplines, each with unique interests and many stories to tell. We couldn’t just keep all these stories to ourselves. So welcome to People of Ango AI.

In this series of posts we will talk to the people who make up the Ango AI team, in conversations where they can share their stories, work, experiences, thoughts, and more.

Today we will be having a chat with our Machine Learning Engineer Balaj Saleem. Balaj was born and lives in Islamabad. He spent his childhood in many different cities in Pakistan, to then graduate from Bilkent University’s CS department in mid-2021. He has been working remotely with us ever since.

First of all, thank you for taking the time to have a chat with us. Could you please briefly introduce yourself to the readers, telling them a bit about your background and your work at Ango AI?

For quite some time I’ve been interested in solving problems that impact society in general and the tech industry specifically. After finishing high school, I felt the best way to do that was to polish myself in the domains I excelled in, and then put those skills to the best possible use. For that, I moved from Pakistan to Turkey, and in my last two years at Bilkent I developed a keen interest in AI and, more specifically, machine learning. The way it changed the whole paradigm to be data-centric, and the idea of systems tuning themselves to make sense of data, was truly fascinating. Interestingly, around the same time Ango AI was taking some of its first steps in the domain of AI, and I thought it would be amazing to contribute here with the skills and enthusiasm I had for the field. Since then I have done a lot of work on systems that leverage machine learning to make the tedious problem of data labeling faster, easier, and more approachable. I have gotten to develop, learn about, and contribute to some of the most cutting-edge technology in the field of machine learning, and have had the pleasure of helping teams globally meet their data needs in the best way possible.

At Ango, you’ve been spending time solving problems related to machine learning in general and data labeling specifically. What do you think is the state of data labeling right now in early 2022?

I think we’re at somewhat of a tipping point: the world of machine learning is starting to recognise the impact and importance of data, and more importantly of quality data. With numerous advocates of and movements towards “data-centric AI”, there is a spotlight on data labeling. ML teams all over the world are realizing that quality data is one of the most cumbersome yet important parts of any ML solution.

Yet research has traditionally had a “model-centric AI” orientation, so the field of data labeling, both academically and industrially, is still in its infancy. But work in the field is evolving at an exponential rate, and being at the forefront of such an evolution is truly remarkable for me personally.

What are some of the biggest challenges you have worked on at Ango AI? And what do you think are the biggest challenges facing data annotation in general right now?

There are two fundamentals I have tried to focus on in my work at Ango AI: Quality and Efficiency. Many of the problems I have worked on have been closely related to solving these challenges of data annotation. 

I have worked on solutions using various technical frameworks to ensure that data is labeled quickly and with pixel-perfect accuracy. Apart from this, I have tackled quite a few research-oriented challenges, ranging from predicting data labeling difficulty and duration to running deep learning algorithms directly in the browser rather than on the server.

I think one of the key problems faced by the data annotation industry is that annotation is usually treated as a sub-problem of an ML/AI product. The focus on the annotation process is often minimal, even though by some estimates it takes up 80% of a project’s time and effort.

Machine Learning is a field that’s been growing considerably in the last few years, both in breadth and depth. What are your thoughts on the increasing role ML models are playing in our lives? What are their positive and potentially negative effects? What do we need to be careful about?

The overwhelming inflow of data from a myriad of sources in the past few years and the support of tech giants such as Google and Facebook to pursue Machine Learning is one of the key factors in this growth. 

This makes machine learning far more accessible and well supported, which means it no longer has to benefit only a select few in the industry. As more and more organizations adopt ML and AI, we have solutions that affect major parts of our lives, from the way our search engines work to the way agriculture is approached. The possibilities are practically endless as long as the data is available.

However, one caveat is certainly the concentration of data in the hands of a few organizations. Compared to the scale of data collection, open access to that data is very limited, which not only hinders transparency about how data is used but also slows development in the field of AI and ML. As individuals, I think keeping a keen eye on how our data may be processed, and more importantly on what insights may be generated from it, are key issues we need to think about.

Moving now to a more personal side of the conversation: you’ve been working remotely at Ango for a while now, from a city that’s two time zones away from Ango’s main hub. How has the experience of working remotely been for you? What are its pros and cons for you?

I think working remotely has been truly delightful. I spent about four years in Turkey, and as amazing a host country as it was, I missed Pakistan. Being here, surrounded by family and friends I had been away from for quite some time, has certainly helped me get in touch with a crucial part of what makes me me.

A great thing is that the kind of work we do at Ango AI is pretty flexible. I get to work on my tasks on my own schedule, which certainly boosts productivity and promotes a work-family balance. It also helps me be more goal- and task-oriented, which lets me make sure I bring the best ideas and value to the company.


Rarely, the time in Pakistan can be a bit late for an important scheduled meeting, but that’s really once in a blue moon. The only true con, I would say, is that I miss socializing and interacting with our amazing, dynamic team. During my time in Ankara, due to Covid, I only got to meet the team once, and I have worked mostly remotely since. Everyone is extremely supportive and encouraging, in aspects of life both related to and beyond work, so I would truly be glad to meet them again and spend some time in the office.

Did you always envision a career in machine learning, or is it something you picked up along the way?

Machine learning was actually something I discovered quite late in my university years, and it all started with a simple group project that inferred the emotion of a face given a picture of it. We managed to match the right emotion to the picture with nearly 70% accuracy, and that truly opened my eyes to how impactful this work can be. After that I took more courses along the same lines, and finally incorporated it into my final-year project as well.

What do you enjoy tinkering with in your free time?

Not many people know this, but I’m really into making electronic music; it’s a great creative outlet for me. Although much of what I make goes to friends and family, or stays on my computer, there are some tracks I put on SoundCloud, ranging from chill lo-fi beats to more hip-hop / trap kind of stuff. Go ahead and find me on SoundCloud if you want!

Thank you again for your time today. Do you have any final words of advice for those who are looking to start a career in ML?

The pleasure is all mine. I would just say: keep being inquisitive, and share your passion for the discipline with like-minded people. The domain of ML is still evolving rapidly, so make sure you surround yourself with people who can give you the best kind of guidance.

Artificial intelligence and machine learning have seen massive growth in the past few years, reaching a market size of over 22 billion US dollars. While AI interacts closely with numerous domains, computer vision is, and has been, one of the most significant, acting as a catalyst for this immense growth.

At its core, computer vision is a field that deals with teaching and enabling systems to extract meaningful information, patterns, and insights from digital visual inputs such as images and videos. Using these visual inputs, complex algorithms and techniques are employed to allow machines to “see” by deriving information from a mathematical representation of the world. Often, these systems then take important actions based on these inputs and insights derived from them.

As humans, we spend many years of our lives understanding the visual stimuli that the world around us provides us, and picking out the information that is important to us. Along the same lines, machines need this visual stimuli, and an extensive amount of it, in order to determine patterns and pick out the most appropriate information. Without the biological apparatus that humans have, their best source of visual stimuli is data, more specifically: annotated data. 

Often, computer vision systems need a large amount of image-like data that has been carefully labeled and processed. These annotated images can contain a variety of elements, such as:

A detailed discussion of image annotation can be found in one of our prior blog posts.

These elements allow the system to separate the pixels, classes, and regions of importance within an image, effectively picking out the signal from the noise. Learning from these patterns, the system adapts over time, becoming better at recognizing them in unseen instances and providing deeper, more accurate insights.

Important Use Cases

While image annotation can be employed for countless applications, there are certain areas which can derive a lot of value from such annotated data and computer vision systems. As a result, most of the work in present times is being devoted to these fields.

While each of the areas mentioned below can have an extensive article upon their sub-application areas, we will give a high level overview of the healthcare, transportation and agriculture industries, and their use cases for image annotation.

Healthcare


The healthcare industry relies heavily on the visual detection of various conditions, diseases, and ailments, and can derive a lot of value from computer vision and machine learning systems built on quality annotated data. Specifically, the following areas within healthcare have seen rapid, effective adoption:


Transportation


Transportation is a massive sector that has often been one of the first ones to accommodate technological progress and innovation. Many sub-sectors have already seen incredible research, ranging from autonomous driving to parking occupancy detection. Here are a few such areas that have seen concrete results by using AI systems relying on annotated images and videos:

Agriculture


Agriculture is another key area that has benefited immensely from progress in computer vision and the availability of large-scale agricultural datasets. Camera surveillance of massive fields produces a large amount of data, which can be annotated and processed by AI systems to provide critical insights such as disease and pest detection, crop and yield monitoring, and livestock health monitoring:

Conclusion

With the countless opportunities that arise from the interaction of artificial intelligence and computer vision, it is evident that quality annotated data is key to any such application. As machines learn to “see” the world, they need the right direction and guidance, and that comes through annotated data. Once one has access to this data, the possibilities are endless.

The domains discussed above are only a subset of a massive industry that relies heavily on annotated images. Other use cases for image annotation and machine learning systems include manufacturing, mining, irrigation, construction, defect detection, workforce monitoring, product assembly, predictive maintenance, document classification and recognition, and much more.
Image annotation is one of Ango AI’s core offerings. Whether you need a platform to annotate images with your team, on the cloud or on-premises, or if you’re looking for a high-quality yet simple, fully-managed end-to-end data labeling solution, Ango AI provides it. Try our platform at hub.ango.ai, or contact us to learn more about how we can help you solve your data labeling needs.

When it comes to the world of AI, the word “learning” has a very specific meaning: it is the ability of a system to understand data. In the constantly evolving domain of machine learning, there are many learning approaches catering to different use cases. Two approaches, however, are most commonly employed: supervised learning and unsupervised learning.

There are, however, many other, less explored types of learning, such as reinforcement learning and semi-supervised learning. One of these is Active Learning, an approach that is rarely at the forefront of learning strategies but can be of immense use to many machine learning projects and tasks.

Fundamentally, Active Learning is an approach that aims to use the least amount of data to achieve the highest possible model performance. When following an Active Learning approach, the model chooses the data that it will learn the most from, and then trains on it. 

While traditional (passive) supervised learning trains the model on all of the training data in a single pass, Active Learning proceeds over several iterations:

  1. Choose initial training data (a small subset of all data)
  2. Train your model on the provided data.
  3. Check where in all the unlabeled data the model is most uncertain.
  4. Label this data using an oracle (a human or machine that can provide accurate labels).
  5. Repeat steps 2-4 until all data is exhausted, acceptable model performance is achieved, or time/budget constraints are reached.
The Active Learning loop.
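The loop above can be sketched in a few lines. The snippet below is a self-contained toy, not a production recipe: it uses a simple nearest-centroid classifier as the model, ground-truth labels as the oracle, and a margin-style uncertainty measure to pick each query; the helper `centroids_of` is a name invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: two 2-D Gaussian blobs; the ground-truth labels play the oracle.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def centroids_of(labeled):
    """'Train' a nearest-centroid classifier on the labeled indices."""
    return np.array([X[[i for i in labeled if y[i] == c]].mean(axis=0)
                     for c in (0, 1)])

labeled = [0, 1, 100, 101]                     # step 1: small seed set
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):
    model = centroids_of(labeled)              # step 2: train
    # Step 3: uncertainty = small gap between the two centroid distances.
    d = np.linalg.norm(X[pool, None, :] - model[None, :, :], axis=2)
    query = pool[int(np.argmin(np.abs(d[:, 0] - d[:, 1])))]
    labeled.append(query)                      # step 4: oracle labels it
    pool.remove(query)

model = centroids_of(labeled)                  # final model after the loop
preds = np.argmin(np.linalg.norm(X[:, None, :] - model[None, :, :], axis=2), axis=1)
accuracy = (preds == y).mean()
```

Even with only nine labeled points out of 200, the model classifies the pool well, which is exactly the promise of Active Learning.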

Types of Active Learning

Pool Based Active Learning

This is the most popular approach, commonly used when working on Active Learning projects.

The idea: given a large pool of unlabeled data, the model is initially trained on a labeled subset of it. These training samples are removed from the pool, and the remaining pool is repeatedly queried for the most informative data. Each time data is fetched and labeled, it is removed from the pool and the model trains on it. The pool is slowly exhausted as the model queries data and comes to understand the data’s distribution and structure better. This approach, however, is memory-hungry, since the whole pool must be kept available.

Stream Based Active Learning

This approach moves through the dataset sample by sample. Each time a new sample is presented, the model decides whether to query it for its label. Since not all of the data is available at once, performance over time is often not on par with the pool-based approach: the samples that happen to be queried may not be the ones that would give the active learner the most information.
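The stream-based decision rule can be sketched as a simple thresholded filter. The function below is an illustrative sketch, not a library API: each incoming prediction is queried only when the model's margin (top-1 minus top-2 probability) falls below a chosen threshold.

```python
import numpy as np

def stream_query(prob_stream, threshold=0.2):
    """Stream-based selective sampling: for each incoming sample, query the
    oracle only when the margin (top-1 minus top-2 probability) is below
    `threshold`; everything else flows past unlabeled."""
    queried = []
    for i, probs in enumerate(prob_stream):
        top2 = np.sort(np.asarray(probs))[-2:]
        if top2[1] - top2[0] < threshold:
            queried.append(i)
    return queried

# Three predictions arrive one by one: only the borderline one is queried.
stream = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
queried = stream_query(stream)                 # queried == [1]
```

The threshold trades labeling budget against coverage: lower it and fewer samples are sent to the oracle.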

Querying

The key to having a successful Active Learning model lies in selecting the most informative / useful samples of data for the model to train on. This process of “choosing” the data which would help a system learn the most is known as querying. The performance of an Active Learning model depends on the querying strategy.

There are many approaches to finding the most informative samples, and in practice they vary from case to case. A few, however, can be adapted to many use cases:

Uncertainty Sampling

Used for many classification tasks, and also known as the 1-vs-2 uncertainty comparison, this approach compares the probabilities of the two most likely classes for a given data point. Data points where the difference is small are usually the most confusing for the model, and hence the most useful to query.

This Active Learning strategy is effective for selecting unlabeled items near the decision boundary. These items are the most likely to be wrongly predicted, and therefore, the most likely to get a label that moves the decision boundary.

Another measure that can be used for uncertainty sampling is entropy, which is a measure of “surprise” in a data instance. Points with high entropy are likely to be the most surprising / confusing to the model, therefore knowing the labels for these points would be beneficial for the model.
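Both measures are a few lines of array code. The sketch below (illustrative helper names, not a specific library's API) computes the 1-vs-2 margin and the entropy for a batch of predicted probability rows, then selects the most uncertain point under each criterion:

```python
import numpy as np

def margin(probs):
    """1-vs-2 margin: gap between the two highest class probabilities
    per row; a small margin means an uncertain prediction."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs, eps=1e-12):
    """Prediction entropy per row; higher means more 'surprising'."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

# Three predictions: confident, borderline, and mildly spread out.
p = np.array([[0.95, 0.03, 0.02],
              [0.48, 0.47, 0.05],
              [0.40, 0.35, 0.25]])
by_margin = int(np.argmin(margin(p)))    # row 1: two classes nearly tied
by_entropy = int(np.argmax(entropy(p)))  # row 2: mass most spread out
```

Note that the two criteria can disagree: margin looks only at the top two classes, while entropy accounts for the whole distribution.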

A theoretical comparison of Active Learning vs supervised learning model performance. (source)

Query by Committee

Query by Committee is a querying approach to selectively sample in which disagreement amongst an ensemble of models is used to select data for labeling.

In other words, an array (committee) of models, which may differ in implementation, is set up for the same task. As they train, they start to comprehend the structure of the data. At some data points, however, the models in the committee are in high disagreement (i.e., different models assign starkly different classes or values). These points are chosen to be labeled by an oracle (usually a human), as they provide the most information for the models.
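One common way to quantify that disagreement is vote entropy: treat each model's prediction as a vote and compute the entropy of the vote distribution per sample. The sketch below uses an invented helper name and toy votes; vote entropy is one of several possible disagreement measures.

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Committee disagreement per sample: entropy of the distribution of
    class votes; `votes` is (n_models, n_samples) of integer class ids."""
    ent = np.zeros(votes.shape[1])
    for c in range(n_classes):
        frac = (votes == c).mean(axis=0)       # fraction voting class c
        nz = frac > 0
        ent[nz] -= frac[nz] * np.log(frac[nz])
    return ent

# Three models vote on four samples: unanimous, two 2-1 splits, fully split.
votes = np.array([[0, 0, 0, 1],
                  [0, 0, 1, 2],
                  [0, 1, 1, 0]])
query_index = int(np.argmax(vote_entropy(votes, n_classes=3)))  # sample 3
```

The unanimous sample scores zero, while the sample where every model disagrees scores highest and is the one sent to the oracle.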

Diversity Sampling

As the name suggests, this querying strategy selects unlabeled items from different parts of the problem space. Items far from the decision boundary, however, are unlikely to be wrongly predicted, so labeling them has little effect on the model when the human-provided label matches the model’s prediction. Diversity sampling is therefore often combined with uncertainty sampling, yielding a mix of queries the model is uncertain about that also come from different regions of the problem space.

Top right: one possible result from uncertainty sampling. If all the uncertainty is in one part of the problem space, however, giving these items labels will not have a broad effect on the model.
Bottom left: one possible result of diversity sampling.
Bottom right: one possible result from combining uncertainty sampling and diversity sampling. Adapted from Human-in-the-Loop Machine Learning by Robert Monarch.
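One simple way to implement diversity sampling, among several options, is greedy farthest-point (k-center) selection: repeatedly pick the point farthest from everything selected so far. The function below is an illustrative sketch with an invented name:

```python
import numpy as np

def farthest_point_sample(X, k, start=0):
    """Greedy k-center selection: repeatedly pick the point farthest
    from everything selected so far, spreading queries across the space."""
    selected = [start]
    dists = np.linalg.norm(X - X[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Three points clustered near the origin plus three far-flung corners:
# the spread-out points are selected before the redundant cluster members.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.1, 0.0],
              [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
picked = farthest_point_sample(X, k=4)         # covers all four corners
```

In a combined strategy, `X` would typically hold model feature embeddings, and the diversity score would be mixed with an uncertainty score before querying.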

Active Learning and Data Annotation

As the fundamentals above suggest, Active Learning reduces the total amount of data a model needs in order to perform well. The time and cost of the data labeling process are thus greatly reduced, since only a fraction of the dataset is labeled.

However, data annotation and model training are often handled separately, and by different organizations. The interaction between the two processes is therefore a challenge that is often hard to tackle, owing to the confidentiality and privacy of the data and processes involved.

Often, Active Learning is used together with online or iterative learning during the annotation process, in a human-in-the-loop approach: Active Learning fetches the most useful data, while iterative learning improves model performance as annotation continues, allowing a machine agent to assist the human annotators.

A practical example of this is using Active Learning for video annotation. In this task, consecutive frames are highly correlated, and each second contains many frames (24-30 on average). Labeling every frame would therefore be very time- and cost-intensive. It is more effective to select the frames the model is most uncertain about and label those, achieving better performance with far fewer annotated frames.

An intersection of Active Learning and Iterative Learning (Source)

Conclusion

Whether you are a data scientist labeling vast amounts of data for your projects, or an organization with a constant inflow of data to integrate into an AI system, labeling the right subset of that data before feeding it to the model will serve many of your needs, drastically reducing the time and cost required to obtain a well-performing model.

More than 9 researchers out of 10 who have attempted some work involving Active Learning claim that their expectations were met either fully or partially (source). 

At Ango AI we work with Active Learning and many more such techniques to ensure that the speed and the quality of our labels is kept as high as possible, employing the latest research in AI assistance. Our focus on improving labeling efficiency via AI assistance has led us to pursue the intersection of Iterative learning and Active Learning and their applications for quality data annotation.

Data-related tasks consume nearly 80% of the time of AI projects, making them a key factor in the machine learning pipeline. Within these data-related tasks, data labeling in particular takes, on average, up to one fourth of a project's time. Just like the stages that follow, related to model development and hyperparameter tuning, the process of data labeling comes with challenges of its own, making it one of the most difficult, time-consuming, and expensive tasks if not handled in the right manner.

It is often observed in the industry that the task of data labeling is tackled haphazardly by many organizations working to build an AI/ML pipeline, and that its complexity is underestimated. This is a pitfall that causes inadequate results, and it may be a contributing factor to the reality that, as reported by Harvard Business Review, only 8% of firms engage in core practices that support widespread adoption; most firms have run only ad hoc pilots or apply AI/ML in just a single business process.

So what exactly is it about data labeling that makes it such a challenging task? There are many facets to this question, and this article will break them down in detail.


Subject Matter Expertise

Subject matter expertise refers to the amount of domain knowledge a labeler has about the data being labeled. Fundamentally, data labeling is a task that employs human knowledge at its core in order to prepare data for a model to train on.

Often this data is of a nature that cannot be accurately labeled without expertise regarding its characteristics and complexity. This is the primary reason subject matter experts are required for many labeling tasks.

For instance, a task asking annotators to label images of tumors found in MRI scans would be very difficult to comprehend and label for someone without medical or radiological knowledge. This type of data is best understood by an expert radiologist or a doctor.

Consider another case where an organization wants to distinguish faulty architectural blueprints from robust ones. A qualified architect would do the best job identifying such blueprints; an unqualified labeler would certainly make many mistakes in this complex decision-making process.

The availability and inclusion of subject matter experts thus becomes a primary challenge of data labeling: not only can these experts be expensive, they are often very hard for organizations to access, as the experts and the organization frequently operate in mutually exclusive domains.


Subjectivity and Human Bias

Many machine learning tasks require data that is inherently subjective. There is sometimes no right or wrong answer, which makes the task fuzzy and up to the labeler's judgement. This introduces human bias into the labels, as labelers have to follow what seems to them like the best (or most logical) answer.

More technically, this concept is known as the induction of cognitive bias, which can manifest itself in various ways.


An example of subjectivity comes from one of the recent projects handled by Ango AI, which aimed at discerning which frames in a video were the most interesting. The use case was to then summarize the videos, including only the most interesting frames. As one may observe, the importance or significance of a frame depends completely on the labeler's discretion. A closely related problem this causes is low consistency, which is another challenge of data labeling.

Another use case is scene analysis from a still image. Two labelers might give starkly different labels to the same scene. For instance, observing the image below, one labeler may interpret the man holding the briefcase as giving the cogwheel to the robot while the scientist interprets the results, whereas another may see the scientist programming the robot to give the cogwheel to the man holding the briefcase.

In the simplest terms, this is the phenomenon captured by the widely used proverbial question of whether the glass is half full or half empty: the answer depends completely on who is observing.

Consistency

Consistency, with regard to data labeling, is the level of agreement on a label among the different individuals (or machines) that labeled a specific item (or row) of data. It applies when multiple labelers label a single piece of data. In general, high consistency is required for quality labeled data; however, maintaining it can be fairly challenging, partly due to the subjectivity and bias discussed above.

Beyond the aforementioned reasons, it is inherently human to make mistakes in tasks requiring judgement/discretion or logic and thus different labels for the same data item arise. This lowers consistency and demands consideration before the data can be delivered.
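As an illustrative sketch (the annotators and labels below are hypothetical), consistency for a single item can be quantified as the fraction of annotator pairs that agree on it:

```python
from itertools import combinations

def percent_agreement(labels):
    """Fraction of labeler pairs that agree on this item (pairwise agreement)."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical labels from three annotators on four data items.
items = [
    ["cat", "cat", "cat"],   # full agreement -> 1.0
    ["cat", "cat", "dog"],   # one dissenter  -> 1/3
    ["cat", "dog", "bird"],  # no agreement   -> 0.0
    ["dog", "dog", "dog"],
]

for i, labels in enumerate(items):
    print(i, round(percent_agreement(labels), 2))
```

Raw pairwise agreement does not correct for chance; chance-corrected metrics such as Fleiss' Kappa, discussed later in this article, are preferable for a project-level view.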

There are multiple ways to enhance consistency, such as clear labeling instructions, review cycles, and consensus among multiple labelers.

Privacy

With the growing adoption of outsourcing or crowdsourcing for data labeling, it is of the utmost importance to ensure the safety, privacy, and confidentiality of the data being labeled. Unauthorized access, deletion, and storage of data at an unauthorized location are concerns that need to be addressed by the labeling entity.

Often, organizations choose to run labeling on-premise to tackle this problem and ensure that no third party can access the data. This is the most effective way to ensure privacy; however, it comes with its own managerial and administrative overhead, as managing labeling on-premise and putting quality assurance measures in place is an extensive process.

The ideal way to tackle this challenge is to ensure that the firm labeling the data complies closely with privacy regulations and processes the data lawfully, fairly, and transparently, keeping all stakeholders informed. This removes the complex layer of workforce and project management, and allows the experts to label the finalized data. Things to look out for when ensuring data privacy include protection against unauthorized access, secure storage and transfer, and compliance with the relevant regulations.


Conclusion

Data labeling, especially at the large scale required today for many use cases, can be extremely challenging, with a variety of facets that need to be addressed. Without addressing these challenges, the data will either be of low quality (the pitfalls of which were discussed in our “Quality Assurance Techniques in Data Annotation” article) or incur extra layers of complexity and financial overhead.

Often it is best to outsource this task to firms you can trust: those that deliver quality and speed, and tackle all these challenges professionally. At Ango AI we provide such a service, ensuring that you get the highest-quality, consistent, and unbiased data, labeled by a handpicked and highly talented team of experts and subject to multiple cycles of review. Throughout the process we ensure transparent and effective communication, providing initial samples of well-labeled data, clear instructions, and the ability for any labeler to report issues within the data or the labeling process.

One of the most common idioms in the domains of Data Science, AI and Machine Learning is “Garbage in, garbage out”.

While a simple sentence on the surface, it elegantly captures one of the most pressing issues in these domains: low quality data fed to any system will, barring luck, give low quality predictions and results.

In all application areas of Artificial Intelligence, data is crucial. With it, models and frameworks can be trained to assist humans in a myriad of ways. These models, however, need high quality data, and in particular annotations that are reliable and representative of the ground truth. With high quality data, the system learns by optimally tuning its own parameters, using the data to provide valuable insights.

In the absence of such quality data, however, the tuning of these parameters is far from optimal, and the insights provided by such a system are in no way reliable. No matter how good the model, if the data provided is of low quality, the time and resources poured into building a quality AI system will surely be wasted.

Since the quality of the data is so important, the first steps in building any machine learning system are crucial. Data quality is determined not only by the data's source, but equally importantly by how the data is labeled and by the quality of that labeling process. The quality of annotations is a key aspect of the machine learning data pipeline: all the steps that follow depend highly upon it.

Machine Learning Workflow

Data Quality

Data being prepared for any task is prone to error, mainly due to the human factor. As with any task involving humans, there are inherent biases and errors that need to be taken into account. Labeling any form of data, whether text, video, or image, may elicit varied responses from different labelers. Due to the nature of many annotation tasks, there is often no absolute answer; hence, a well-defined annotation process is required. The annotation process itself, however, is occasionally prone to error. There are two kinds of errors commonly encountered:

Data drift: Data drift occurs when the distribution of annotation labels, or of the data's features, changes slowly over time. It can lead to increasing error rates for machine learning models or rule-based systems. No data is truly static: ongoing annotation review is necessary to adapt downstream models and solutions as drift occurs. It is a slow and steady process, taking place over the course of annotation, that may skew the data.

Anomalies: While data drift refers to slow changes in data, anomalies are step functions: sudden (and typically temporary) changes in data due to exogenous events. For example, in 2019-20, the COVID-19 pandemic led to anomalies in many naturally occurring data sets. It is important to have procedures in place to detect anomalies. When anomalies occur, it may be necessary to shift from automated solutions to human-based workflows. Compared to data drift, anomalies are considerably easier to detect and fix.
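As a rough illustration of how drift might be monitored (the label names, window split, and threshold below are arbitrary assumptions, not a standard), one can compare the label distribution of a recent batch against a baseline and alert when they diverge too far:

```python
from collections import Counter

def label_distribution(labels):
    """Relative frequency of each label in a batch."""
    total = len(labels)
    counts = Counter(labels)
    return {c: counts[c] / total for c in counts}

def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical, 1 = disjoint)."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0) - q.get(c, 0)) for c in classes)

def drift_alert(baseline_labels, recent_labels, threshold=0.2):
    """Flag drift when the recent batch's label mix strays too far from the baseline."""
    return total_variation(label_distribution(baseline_labels),
                           label_distribution(recent_labels)) > threshold

baseline = ["ok"] * 90 + ["damage"] * 10   # 90/10 split at project start
recent   = ["ok"] * 60 + ["damage"] * 40   # 60/40 split in the latest batch

print(drift_alert(baseline, recent))  # → True: the label mix has shifted
```

A gradual climb in this distance across batches suggests drift, while a sudden jump from one batch to the next points to an anomaly.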

An anomaly. Credit

Quality Assurance Techniques

Various techniques can be employed to detect, fix and reduce the errors that occur in data annotation. These techniques ensure that the final deliverable data is of the highest possible quality, consistency and integrity. The following are some of those techniques:

Subsampling: A common statistical technique used to estimate the distribution of data, this refers to randomly selecting and closely reviewing a subset of the annotated data to check for possible errors. If the sample is random and representative of the data, this gives a good indication of where errors are prone to occur.

Setting a “Gold Standard/Set”: A selection of well-labeled images that accurately represent what the perfect ground truth looks like is called a gold set. These image sets are used as mini testing sets for human annotators, either as part of an initial tutorial, or to be scattered across labeling tasks to make sure that an annotator’s performance is not deteriorating, either due to poor performance on their part, or to changing instructions. This sets a general benchmark for annotator effectiveness.

Annotator Consensus: This is a means of assigning a ground truth value to the data after taking input from all the annotators and using the most likely annotation. It relies on the well-known observation that collective decision-making tends to outperform individual decision-making.

Using scientific methods to determine label consistency: Again inspired by statistical approaches, these methods use established formulas to measure how consistently different annotators perform, such as Cronbach's Alpha, Pairwise F1, Fleiss' Kappa, and Krippendorff's Alpha. Each of these allows for a holistic and generalizable measure of the quality, consistency, and reliability of the labeled data.

Fleiss’ Kappa: κ = (p̄o − p̄e) / (1 − p̄e)
(where p̄o is the relative observed agreement among raters and p̄e is the hypothetical probability of chance agreement.)
Credit
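As an illustrative implementation (the rating table below is made up for the example), Fleiss' Kappa can be computed directly from a table of per-item category counts:

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for counts[i][j]: number of raters assigning item i to category j.
    Assumes every item is rated by the same number of raters."""
    N = len(counts)         # number of items
    n = sum(counts[0])      # raters per item
    k = len(counts[0])      # number of categories

    # Mean observed agreement p̄o: average per-item agreement P_i.
    p_o = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N

    # Chance agreement p̄e from the marginal category proportions.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    p_e = sum(pj * pj for pj in p)

    return (p_o - p_e) / (1 - p_e)

# Hypothetical table: 4 items, 3 raters each, 2 categories.
table = [
    [3, 0],  # all three raters chose category 0
    [0, 3],
    [2, 1],  # partial agreement
    [1, 2],
]
print(round(fleiss_kappa(table), 3))  # → 0.333
```

Perfect agreement on every item yields κ = 1, while agreement no better than chance yields κ near 0.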

Annotator Levels: This approach ranks annotators and assigns them to levels based on their labeling accuracy (which can be tested via the gold standard discussed above), giving higher weight to the annotations of quality annotators. This is especially useful for tasks with high variance in their annotations, or tasks that require a certain level of expertise: annotators who lack this expertise will have a lower weight given to their annotations, whereas annotators with expertise will have more influence on the final label given to the data.
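A minimal sketch of this weight-based aggregation (the annotator names, weights, and labels are hypothetical): each annotator's vote counts in proportion to a weight derived, for instance, from their gold-set accuracy:

```python
from collections import defaultdict

def weighted_vote(annotations, weights):
    """Pick the label with the largest total annotator weight.
    annotations: {annotator: label}; weights: {annotator: weight}, default 1.0."""
    totals = defaultdict(float)
    for annotator, label in annotations.items():
        totals[label] += weights.get(annotator, 1.0)
    return max(totals, key=totals.get)

# Hypothetical weights derived from each annotator's gold-set accuracy.
weights = {"expert": 0.95, "junior_1": 0.40, "junior_2": 0.40}
votes = {"expert": "tumor", "junior_1": "benign", "junior_2": "benign"}

print(weighted_vote(votes, weights))  # → tumor: 0.95 outweighs 0.40 + 0.40
```

Note that a plain majority vote would have chosen "benign" here; the weighting lets the more accurate annotator prevail.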

Edge case management and review: Mark edge cases for review by experts. Determining an “edge case” can be done either by thresholding the inter-rater metrics listed above, or flagging by individual annotators or reviewers. This allows for the data that is most problematic to be reviewed and corrected as most of the anomalies occur in edge cases.

Automated (Deep learning based) Quality Assurance

The task of data annotation is very human in nature, as researchers and organizations are often looking specifically for human input. This makes it so that the quality of the labels is dependent on human judgement. There are certain approaches, however, that exploit the principles of deep learning to make this process easier, primarily by identifying data that may be prone to errors, thus picking out data that should be reviewed by humans, ultimately ensuring higher quality.

Without delving too deep, fundamentally, this approach relies on actively training a deep learning framework on the data as it is being annotated, and then using this neural network to predict the labels / annotations on the upcoming unlabeled data.

If an adequate framework is selected and trained on data with high quality labels (e.g. the gold set mentioned above), the model will have little to no difficulty classifying or labeling the common cases. Where labeling is difficult, i.e. on an edge case, the framework will show high uncertainty (or low confidence) in its prediction.

Interestingly, it just so happens that often when a robust model has low confidence on a label, a human will also display the same trait.
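A minimal sketch of this confidence-based triage (the item identifiers, probabilities, and 0.7 threshold are assumptions for the example): items whose top predicted probability falls below a threshold are queued for human review, least confident first:

```python
def review_queue(predictions, threshold=0.7):
    """Send items whose top predicted probability falls below the threshold to human review.
    predictions: {item_id: class-probability list} from a hypothetical QA model."""
    queue = []
    for item_id, probs in predictions.items():
        confidence = max(probs)
        if confidence < threshold:
            queue.append((item_id, confidence))
    return sorted(queue, key=lambda t: t[1])  # least confident first

preds = {
    "doc_1": [0.97, 0.02, 0.01],  # common case, skip review
    "doc_2": [0.55, 0.30, 0.15],  # likely edge case
    "doc_3": [0.40, 0.35, 0.25],  # likely edge case
}
print(review_queue(preds))  # → [('doc_3', 0.4), ('doc_2', 0.55)]
```

Raising the threshold sends more items to human review, trading labeling speed for an extra margin of quality.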

Conclusion

Whether you are in the tech industry or working on cutting-edge research, having high quality data is of the utmost importance. Regardless of whether your task is statistical or AI-related, an early focus on data quality will pay off in the long run.

At Ango AI, using a combination of the techniques mentioned above, we ensure that we ship only the highest quality labels to our customers. Whether by employing complex statistical methods to keep quality high, or cutting-edge deep learning frameworks to keep speed high and assist human annotators in review, we hold quality to the highest standards, subjecting it to numerous checks before it is finally delivered to you.