Image classification and object recognition are among the most common applications of computer vision. Yet despite great effort, computers still perform poorly on cognitive tasks. Models need to understand not only the objects in an image, but also their interactions and relations. This is where Visual Genome comes into play, enabling deeper inferences in computer vision.
Data is essential to developing machine learning applications. To train a model that performs well on perceptual tasks, we need a very detailed dataset. Researchers at Stanford University created the Visual Genome dataset for this purpose. They completed the first version in December 2015, and they are still improving the dataset and releasing updated versions.
In this article, we will explain what Visual Genome is, examine its statistics, and compare it with other datasets.
If you already know what Visual Genome is and need a platform to label your images, Ango Hub is free and has all the tools you need to get started labeling. If instead you are looking to outsource your Visual Genome labeling, Ango Service is what you are looking for. Book a demo with us to learn more. But back to Visual Genome.
The Visual Genome dataset is a dataset of images labeled to a high level of detail, including not only the objects in the image, but the relations of the objects with one another. We will show the full detail of the Visual Genome dataset in the rest of this article.
To understand images deeply, the creators of the Visual Genome dataset believe 3 key concepts should apply to their dataset.
Visual Genome relied on over 33,000 crowd workers from Amazon Mechanical Turk to collect and verify the dataset. The data was collected over six months, followed by 15 months of experimentation on how to represent it.
For a data annotation service focused on quality at scale, fully ready to label in the Visual Genome style, check out Ango Service to cut the time you spend on experimentation and get your labels better, faster.
This is the first step of the annotation process. When labelers are asked to label data, they can sometimes overlook parts of an image and label only the most salient parts. That is why Visual Genome asked each labeler to localize three different regions with bounding boxes and write a description for each. You can see what these region descriptions and bounding boxes look like in Figure 1.
After a labeler completes the task, the image is sent, along with its descriptions and boxes, to another annotator, who is asked to perform the same job: adding three boxes and descriptions as different from the previous ones as possible. This process continues until the image has 50 region descriptions.
Each image in the Visual Genome dataset has 50 region descriptions on average, and each description is 1 to 16 words long.
After the region descriptions are done, workers extract features from the regions. They use the bounding boxes and region descriptions to detect objects, attributes, and relationships; you can see examples in Figures 2, 3, and 4. They also draw new bounding boxes to represent every object, attribute, and relationship.
In Figure 4, you can also see how relationships are represented formally. Each relationship uses the same format: relationship(subject, object). These relationships are not unique to a single image. For instance, if a “jumping over” action appears in another image in the dataset, it is represented by the same relationship, jumping_over(subject, object).
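The shared-predicate format above can be sketched as simple tuples. The predicate names, subjects, and objects below are made-up examples, not actual Visual Genome records:

```python
from collections import defaultdict

# Each relationship instance follows the relationship(subject, object) format,
# encoded here as a (predicate, subject, object) tuple. The examples are hypothetical.
relationships = [
    ("jumping_over", "man", "fire hydrant"),
    ("jumping_over", "dog", "fence"),   # same predicate reused in a different image
]

# Group instances by their shared predicate, as the dataset does conceptually.
by_predicate = defaultdict(list)
for predicate, subject, obj in relationships:
    by_predicate[predicate].append((subject, obj))

print(by_predicate["jumping_over"])
# → [('man', 'fire hydrant'), ('dog', 'fence')]
```

Because the predicate is shared across images, a model can learn what “jumping over” means from every occurrence in the dataset, not just one.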
During this extraction process, attention must be paid to two important properties: coverage and quality. When localizing an object with a bounding box, the box should be large enough to contain the whole object (coverage), yet as tight as possible to minimize error (quality).
When feature extraction is done, labelers create region graphs and scene graphs to formally represent each image. These graphs combine all the information in a scene in a more coherent way.
Region graphs contain information about a local part of an image. In Figure 5, you can see three different region graphs of an image, representing all the objects, attributes, and relationships in the bounding boxes they belong to. The arrows make it easier to identify the subject and object of each relationship (relationships are shown in green).
By combining all the region graphs of an image, a scene graph can be created. The graphs are joined through their common points, i.e. the objects shared between region graphs. You can see an example in Figure 6.
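The merge described above can be sketched as a union of small graphs, where a shared object acts as the joining point. The object and predicate names here are illustrative, and the dict-of-sets structure is our own simplification, not the dataset's actual file format:

```python
# Each region graph lists its objects and its (subject, predicate, object) triples.
region_graphs = [
    {"objects": {"man", "fire hydrant"},
     "relations": {("man", "jumping_over", "fire hydrant")}},
    {"objects": {"man", "shirt"},
     "relations": {("man", "wearing", "shirt")}},
]

# Union all region graphs; the shared object "man" is the common point
# that stitches the two local graphs into one scene graph.
scene_graph = {"objects": set(), "relations": set()}
for rg in region_graphs:
    scene_graph["objects"] |= rg["objects"]
    scene_graph["relations"] |= rg["relations"]

print(sorted(scene_graph["objects"]))
# → ['fire hydrant', 'man', 'shirt']
```

Because sets deduplicate automatically, the shared object appears once in the scene graph while keeping every relationship attached to it.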
Crowd workers also write question-answer pairs for the regions, as well as for the whole image. All of these are WH questions (what, where, who, why, when, how). Each image has 17 QA pairs on average and at least one question of each type. These QA pairs provide a great opportunity to study how well models can answer questions when given images or region descriptions as input. While preparing the pairs, it is essential to avoid ambiguous and speculative questions and to be precise and unique.
After annotation, all Visual Genome data goes through a verification stage. This stage is crucial for eliminating wrong labels and for checking whether question-answer pairs are vague (“This person seems to enjoy the sun.”), subjective (“The room looks dirty.”), or opinionated (“Being exposed to the hot sun like this may cause cancer.”).
All the descriptions made for images (objects, attributes, relationships, noun phrases from region descriptions, and QA pairs) are mapped to synsets in WordNet. WordNet is an English lexical database created at Princeton and commonly used in natural language processing. Visual Genome uses it to canonicalize meanings, resolving ambiguities in concepts and connecting the dataset to other resources used by the research community.
One of these ambiguities results from words having more than one meaning. For instance, a description mentioning a table should specify which meaning is intended, since an AI model can encounter different kinds of “tables” in other images. In other words, it should know whether the table is “a piece of furniture” or “a set of data arranged in rows and columns”. WordNet eliminates this problem by mapping the two meanings to different synsets (table.n.01 and table.n.02).
WordNet also defines a hierarchy among its words. During annotation, crowd workers are not forced to use particular words for any object: a “male human” may be labeled man, person, boy, or human. However, all of these words map to the same root, “person.n.01”. Thanks to WordNet, ambiguities resulting from different labeling choices can be eliminated as well.
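Canonicalization of this kind boils down to a lookup from surface labels to synset IDs. The synset IDs below follow WordNet's word.pos.sense_number naming, but the mapping table itself is a hand-written stand-in for illustration, not the real WordNet database (and the assignment of sense numbers to the two “table” meanings is illustrative):

```python
# Hypothetical label-to-synset table; real systems would query WordNet itself.
canonical = {
    "man": "person.n.01",
    "boy": "person.n.01",
    "human": "person.n.01",
    "table (furniture)": "table.n.02",
    "table (data)": "table.n.01",
}

def canonicalize(label):
    """Map a free-text label to its canonical synset; pass unknown labels through."""
    return canonical.get(label, label)

# Different labeling choices for the same concept collapse to one synset.
print(canonicalize("man") == canonicalize("boy"))   # → True
```

With every annotation reduced to a synset ID, two labelers writing “boy” and “man” no longer produce two distinct object categories.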
Visual Genome also analyzes the attributes in the dataset by constructing attribute graphs. Nodes in these graphs are unique attributes, and edges connect attributes that describe the same object.
You can see a subgraph of the 16 most frequently connected person-related attributes in figure 8(a).
From these graphs, objects and object types can be inferred from their distinctive attributes alone. This allows Visual Genome to identify objects based on selected characteristics.
In Figure 8(b) you can see three cliques. A clique forms when every pair of attributes in a group is connected, i.e. each pair describes at least one common object. From the attributes in the left clique we can infer that it represents an “animal”, and from the bottom-right clique that it describes “hair” traits.
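The attribute graph and the clique condition can be sketched in a few lines. The objects and attributes below are made-up examples chosen to mimic the “animal” clique:

```python
from itertools import combinations

# Toy data: each object maps to the set of attributes that describe it.
objects = {
    "horse": {"brown", "furry", "grazing"},
    "dog":   {"brown", "furry", "barking"},
    "hair":  {"brown", "long", "curly"},
}

# Build the edge set: two attributes are connected if they describe the same object.
edges = set()
for attrs in objects.values():
    for pair in combinations(sorted(attrs), 2):
        edges.add(pair)

def is_clique(group):
    """A clique requires every pair of attributes in the group to be connected."""
    return all(tuple(sorted(pair)) in edges for pair in combinations(group, 2))

print(is_clique({"brown", "furry"}))       # → True  (co-occur on horse and dog)
print(is_clique({"grazing", "barking"}))   # → False (never describe the same object)
```

Attributes like “furry” and “grazing” that only co-occur on animals end up densely connected to each other, which is exactly why a clique of them suggests the “animal” category.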
Each synset in the top 25 has at least 800 example images. As you can see from the top 3, Visual Genome has many images of “sports” and “players”.
As you can see in the width graph, there is a spike around the 1.0 point. It comes from objects that span the entire width of the image, like the sky, snow, ocean, and mountains. There is no such spike in the height graph, since objects that span an image vertically are not very common.
Each image includes 50 regions on average, each with a bounding box and a description.
These descriptions range from 1 to 16 words in length, with 5 words on average.
Visual Genome contains approximately 35 objects per image.
As you can see, although Visual Genome does not have as many images as other datasets, it outnumbers them in categories. So many categories may make training more difficult, but they can also make models more effective and less biased.
Attributes play a major role in the detailed description and disambiguation of objects in Visual Genome. The Visual Genome dataset contains 2.8 million total attributes.
Each image has 26 attributes on average. Figure 17 shows that the most common attributes are, in general, color descriptions; for humans specifically, they are states of motion (e.g. standing, walking).
Visual Genome has a total of 42,374 unique relationships and over 2.3 million (2,347,187) relationship instances. There are 21 relationships per image, on average.
Spatial relationships are the most common ones in the dataset. Relationships involving humans are more action-oriented.
1,773,258 question answering (QA) pairs were collected in the Visual Genome dataset. Each pair consists of a question and its correct answer regarding the content of an image. Every image has 17 QA pairs on average.
The angles of the regions are proportional to the number of questions. “What” questions are the most common type.
While “why” and “where” questions have the longest answers, factual questions like “what” and “how” have the shortest.
Setup: For this experiment, the 100 most common attributes were selected. Approximately 500 instances of each attribute were sampled, yielding about 50,000 attribute instances in total. This dataset was divided into training, validation, and test sets at rates of 80%, 10%, and 10%, respectively. After training on features from a convolutional neural network, the model makes predictions about the input image; you can see some examples in the figure below.
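The 80/10/10 split described above can be sketched as a shuffled slice of the instance list. Integer IDs stand in for the roughly 50,000 CNN-feature instances of the real experiment:

```python
import random

# Hypothetical instance IDs standing in for ~50,000 attribute instances.
instances = list(range(50_000))
random.Random(0).shuffle(instances)   # fixed seed so the split is reproducible

# Slice the shuffled list at the 80% and 90% marks.
n = len(instances)
train = instances[: int(0.8 * n)]
val   = instances[int(0.8 * n): int(0.9 * n)]
test  = instances[int(0.9 * n):]

print(len(train), len(val), len(test))
# → 40000 5000 5000
```

Shuffling before slicing matters: without it, any ordering in the source data (e.g. all instances of one attribute stored together) would leak into the split.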
In the second step of this experiment, a model is trained on object-attribute pairs jointly. Again, approximately 50,000 object-attribute pairs were used, divided into training, validation, and test sets. Only objects that occur at least 100 times and are associated with one of the 100 attributes were used for this second step.
The statistics show an improvement when object-attribute pairs are used jointly, since objects are the core building blocks of images.
A downside of these statistics is that the model is sometimes penalized for a correct answer. In Figure 22 you can see the white teddy bear in the yellow box. Although “stuffed” is a correct prediction, the model is penalized because the expected answer is “white”. This results from crowdsourced ground-truth answers, which sometimes do not include all valid answers.
The same approach as in the attribute prediction experiment applies here. For the setup, the 100 most common relationships were again selected, and 34,000 relationship instances along with 27,000 unique subject-relationship-object pairs were extracted. A two-step experiment was conducted, similar to the attribute prediction experiment.
Figure 26: (a) Example predictions from relationship prediction experiment (b) Example predictions from the joint object-relationship prediction experiment.
As you can see from the red boxes, this time the model has trouble differentiating between objects such as boy vs. woman and car vs. bus, even though it correctly distinguishes the salient objects.
Visual Genome enables deeper cognitive tasks that require enhanced inference, lighting the way for further improvements in computer vision. It was inspired by similar applications in computer vision, and it drew on related papers and studies in the field.
Ango AI provides data labeling solutions fully compatible and ready for Visual Genome training. Both our data labeling platform Ango Hub, and our fully-managed data annotation service Ango Service, are equipped with all you need to get started labeling, or outsource your labeling for Visual Genome. Book a demo with us to learn more.
“Visual Genome is a large, formalized knowledge representation for visual understanding and a more complete set of descriptions and question answers that grounds visual concepts to language.” 
 “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.” [Online]. Available: https://visualgenome.org/static/paper/Visual_Genome.pdf. [Accessed: 12-Aug-2022].
 “WordNet,” Princeton University. [Online]. Available: https://wordnet.princeton.edu/. [Accessed: 12-Aug-2022].
 “Visual Genome,” VisualGenome. [Online]. Available: https://visualgenome.org/. [Accessed: 12-Aug-2022].
Authors: İbrahim Orcan Ön and Onur Aydın
Technical Proofreading: Balaj Saleem
A short while ago, our team of data annotators labeled nearly 25 thousand images of faces, classifying them by age, gender, hair color, beard and mustache color (if present), and glasses. We then released the annotated dataset free to the public. You can download the annotated face classification dataset here.
The dataset consists of 23032 face images. Each image was labeled by two independent annotators. The assets for our Face Classification Dataset were taken from the open Flickr-Faces-HQ (FFHQ) dataset. We analyzed the annotations created by our team, and we show our findings in this article.
The first graph illustrates the gender distribution of the dataset. Our annotators labeled 10472 images “Male” and 12591 “Female”. In addition, 254 people were labeled “Not sure”. Compared to the total, the number of conflicts was minimal, so they can be ignored.
Conflicting results are added to the figures as a value of 0.5 for each label. To illustrate, if one annotator labels an image “Male” and the second labels it “Female”, the “conflict” sections of both the “Male” and “Female” columns are increased by 0.5.
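The bookkeeping described above can be sketched as two counters: agreements add 1 to the agreed label, and a disagreement adds 0.5 to the conflict count of each label involved. The label pairs below are made-up examples:

```python
from collections import Counter

# Hypothetical (annotator_1, annotator_2) label pairs for three images.
pairs = [("Male", "Male"), ("Female", "Female"), ("Male", "Female")]

counts = Counter()      # agreed labels
conflicts = Counter()   # 0.5 added to each side of a disagreement
for a, b in pairs:
    if a == b:
        counts[a] += 1
    else:
        conflicts[a] += 0.5
        conflicts[b] += 0.5

print(counts)      # → Counter({'Male': 1, 'Female': 1})
print(conflicts)   # → Counter({'Male': 0.5, 'Female': 0.5})
```

Splitting each disagreement as 0.5 + 0.5 keeps the column totals equal to the number of images, so the charts remain comparable to the agreed counts.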
876 images were labeled “Baby (0-2)” by both annotators, while 320 images were labeled “Baby (0-2)” by one annotator and “Child (3-9)” by the other. Conflict between the annotators peaked for the young and adult age groups. Since faces in these two groups differ the least, this conflict is expected. The conflicts in each age group can be seen in Figures 3 and 4.
The following graph shows the hair color distribution of the dataset. According to Figure 5, the most common hair color is “Brown”, at 31% of the total, followed by “Black” and “Blonde”. Since “Black” and “Brown” are similar colors, conflict peaks between them.
The next graph shows the beard color distribution of the dataset. According to the dataset, 82.9% of the assets have no beard and are labeled “No hair”. Conflict again peaks between the “Black” and “Brown” colors, as seen in the figure below.
The next graph shows the mustache color distribution of the dataset. 82.2% of the images have no mustache and are labeled “No hair”. As with hair color, conflict peaks between “Black” and “Brown”. Note that the beard and mustache graphs are close to one another.
The next graph shows the eye color distribution of the dataset. The most common eye color is “Brown”. Since eye colors are hard to distinguish, conflict peaks at the “Not visible” label, as can be seen in Figure 11.
The following graph shows the distribution of glasses in the dataset. It is clear from Figure 13 that most people wear no glasses.
Let’s analyze the dataset more deeply. For the following graphs, only the results from the first annotator were used.
Most men and women in the set are classified as adults. Although the number of “Young” women is very close to the number of “Adult” women, adults clearly dominate among men. Either the women in the dataset were, on average, younger, or they appeared younger to our annotators.
According to the hair color distribution graph, the most popular color in each age category is brown, except in the baby category. Most baby images in the dataset have blonde hair.
The following figures are very similar to each other. Figure 17 shows the beard color distribution by age group, and Figure 18 the mustache color distribution. Since babies have no beard or mustache, the baby columns are empty, as expected. Moreover, the vast majority of people have neither a beard nor a mustache.
The following figure shows the eye color by age group in the dataset. Brown was by far the most popular color in all age categories.
According to the following figure, prescription glasses are most common in the adult category, and people without glasses are the vast majority in every category.
According to the hair color distribution graph, black hair color is the most popular among males. Most females appear to have brown hair.
The following figures are very similar to each other. Figure 22 shows the beard color distribution by gender, and Figure 23 the mustache color distribution. Most people have neither a beard nor a mustache, and the most common color for both is black.
The following figure shows the eye color distribution by gender. It appears that most people have brown eyes, with blue in second place for all genders.
The following graph breaks down the images in the dataset by type of glasses. As Figure 25 shows, most prescription glasses wearers are male; however, most faces annotated in the set wear no glasses.
The following graph illustrates the conflict rate between annotators for each feature. Since eye color and age group are difficult to judge from an image, those features saw the most conflicts.
Ango AI provides data labeling solutions for AI teams of all sizes and industries. Our data labeling platform, Ango Hub, is used by dozens of industry-leading companies to label millions of data points monthly. Hub is the most versatile platform in the market, supporting 15+ file types and 20+ annotation tools. It’s also free to try here.
Ango AI also offers an end-to-end, fully managed data labeling service, Ango Service, used by customers all over the world to label data ranging from banking, to insurance, government, medical, and more. We know all of our annotators personally and do not outsource. Book a call with us to learn more.
Authors: Onur Aydın, Kıvanç Değirmenci