Image classification and object recognition are the most common applications in computer vision. Yet despite considerable effort, computers still perform poorly on cognitive tasks. Models need to understand not only the objects in an image, but also the interactions and relations between them. This is where Visual Genome comes into play, allowing deeper inferences from computer vision.
Data is essential in developing machine learning applications. To train a model that performs well at perceptual tasks, we need a very detailed dataset. Stanford University graduates created the Visual Genome dataset for this purpose. They completed the first version of the dataset in December 2015, and they are still improving it and releasing new versions.
In this article, we will explain what Visual Genome is, look at its statistics, and compare it with other datasets.
If you already know what Visual Genome is and need a platform to label your images, Ango Hub is free and has all the tools you need to get started labeling. If instead you are looking to outsource your Visual Genome labeling, Ango Service is what you are looking for. Book a demo with us to learn more. But back to Visual Genome.
What is the Visual Genome dataset?
The Visual Genome dataset is a dataset of images labeled to a high level of detail, including not only the objects in the image, but the relations of the objects with one another. We will show the full detail of the Visual Genome dataset in the rest of this article.
Principles of the Visual Genome Dataset
To understand images deeply, the creators of the Visual Genome dataset believe 3 key concepts should apply to their dataset.
- Grounding of visual concepts to language: Attributes and relationships in particular should be labeled with a high level of detail.
- Complete set of descriptions and QAs: Descriptions for an image should be high-level and dense. Labelers should describe all regions of an image instead of only focusing on the salient parts.
- Formalized representation: Visual Genome found a creative way to represent all the images formally. The relationships between objects and scene graphs of images are always represented in the same structure.
The Visual Genome Data Annotation Process
Visual Genome has benefited from over 33,000 crowd workers from Amazon Mechanical Turk to collect and verify the dataset. They collected the data for the set in six months, followed by 15 months of experimentation on the representation of the data.
For a data annotation service focused on quality at scale, fully ready to label in the Visual Genome style, check out Ango Service to cut the time you spend on experimentation and get your labels better, faster.
Creating Region Descriptions
This is the first step of the annotation process. When labelers are asked to label an image, they can overlook some parts of it and label only the most salient parts. That is why Visual Genome asked each labeler to localize 3 different regions with bounding boxes and write a description for each. Figure 1 shows an example of what these region descriptions and bounding boxes look like.
After a labeler completes the task, the image is sent to another annotator together with the existing descriptions and boxes, and that annotator is asked to do the same job: add 3 more boxes and descriptions that differ as much as possible from the previous ones. This process continues until the image has 50 region descriptions.
For each image in the Visual Genome dataset, there are 50 region descriptions on average. Each description has 1 to 16 words.
Adding Objects, Attributes, and Relationships
After the region descriptions are done, workers extract objects, attributes, and relationships from the regions, using the bounding boxes and region descriptions as a guide. You can see examples in Figures 2, 3, and 4. They also draw new bounding boxes to represent every object, attribute, and relationship.
In Figure 4, you can also see how relationships are represented formally. Each relationship uses the same format [relationship(subject, object)]. These relationships are not specific to a single image. For instance, if there is a “jumping over” action in another image in the dataset, it is represented by the same relationship [jumping_over(subject, object)].
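To make the relationship(subject, object) format concrete, here is a minimal Python sketch. The `Relationship` class, the `make_relationship` helper, and the example object names are assumptions made for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    predicate: str  # normalized relationship name, e.g. "jumping_over"
    subject: str    # the object that performs the relation
    obj: str        # the object the relation points to

    def formal(self) -> str:
        # Render in the relationship(subject, object) format
        return f"{self.predicate}({self.subject}, {self.obj})"


def make_relationship(phrase: str, subject: str, obj: str) -> Relationship:
    # Normalize a free-text phrase such as "jumping over" to "jumping_over"
    predicate = "_".join(phrase.lower().split())
    return Relationship(predicate, subject, obj)


# Two different images (made-up examples) share the same predicate
r1 = make_relationship("jumping over", "man", "fire_hydrant")
r2 = make_relationship("Jumping over", "dog", "fence")

print(r1.formal())                   # jumping_over(man, fire_hydrant)
print(r1.predicate == r2.predicate)  # True
```

Because the predicate is normalized once, the same relationship can be counted and compared across images regardless of how each worker typed it.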
During this extraction process, attention should be paid to 2 important properties: coverage and quality. When localizing an object with a bounding box, the box should be big enough to contain the whole object (coverage), and it should also be as tight as possible to eliminate errors (quality).
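One possible way to put numbers on coverage and tightness is sketched below. It assumes that a trusted reference box for the object is available and that boxes are (x1, y1, x2, y2) tuples with non-zero area. Both the reference box and the two measures are assumptions for this example; the article does not say how Visual Genome measured these properties.

```python
def area(box):
    # Area of an (x1, y1, x2, y2) box; empty boxes have area 0
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    # Area of the overlap between two boxes
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def coverage(annotated, reference):
    # Fraction of the object (reference box) that the annotated box contains
    return intersection(annotated, reference) / area(reference)


def tightness(annotated, reference):
    # Fraction of the annotated box that is actually filled by the object
    return intersection(annotated, reference) / area(annotated)


reference = (10, 10, 50, 50)  # hypothetical "true" extent of the object
loose = (0, 0, 100, 100)      # contains the object but is far too big
clipped = (20, 10, 50, 50)    # tight but cuts off part of the object

print(coverage(loose, reference), tightness(loose, reference))      # 1.0 0.16
print(coverage(clipped, reference), tightness(clipped, reference))  # 0.75 1.0
```

A good box scores close to 1.0 on both measures: the loose box fails on tightness, and the clipped box fails on coverage.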
Creating Region Graphs
When the feature extraction process is done, labelers create region graphs and scene graphs to formally represent every image. These graphs combine all the information in a scene in a more coherent way.
Region graphs contain information about a local part of an image. In Figure 5, you can see 3 different region graphs of an image, representing all the objects, attributes, and relationships in the bounding boxes they belong to. The arrows make it easier to tell the subject of a relationship from its object (relationships are represented in green).
Creating Scene Graphs
By combining all the region graphs of an image, a scene graph can be created. The region graphs are joined at their common points, that is, the objects they share. You can see an example in Figure 6.
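The merging step can be sketched in a few lines of Python. This is a simplified illustration that assumes each object already has a unique name shared across regions, so that a union of sets is enough to join the graphs. The dictionary layout and the example regions are made up for this sketch.

```python
def merge_region_graphs(region_graphs):
    # Union the objects, attributes, and relationships of all region graphs.
    # Objects with the same name are treated as the same node, which is what
    # links the regions together in the resulting scene graph.
    scene = {"objects": set(), "attributes": set(), "relationships": set()}
    for graph in region_graphs:
        for key in scene:
            scene[key] |= graph[key]
    return scene


# Two hypothetical region graphs that share the object "man"
region_1 = {
    "objects": {"man", "fire_hydrant"},
    "attributes": {("fire_hydrant", "yellow")},
    "relationships": {("jumping_over", "man", "fire_hydrant")},
}
region_2 = {
    "objects": {"man", "shorts"},
    "attributes": {("man", "standing")},
    "relationships": {("wearing", "man", "shorts")},
}

scene_graph = merge_region_graphs([region_1, region_2])
print(sorted(scene_graph["objects"]))     # ['fire_hydrant', 'man', 'shorts']
print(len(scene_graph["relationships"]))  # 2
```

In real annotations, deciding that two regions mention the same object is harder than comparing names, and would likely rely on the bounding boxes as well.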
Making Question-Answer Pairs
Crowd workers also write question-answer pairs, both for individual regions and for the whole image. All of these are WH questions (what, where, who, why, when, how). Each image has 17 QA pairs on average and at least 1 question of each type. These QA pairs make it possible to study how well models answer questions when given images or region descriptions as input. When writing the pairs, it is essential to avoid ambiguous and speculative questions and to be precise and unique.
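A check like "at least one question of each type" is easy to automate. The sketch below is an assumption about how such a check could look, not the tooling Visual Genome used. It classifies a question by its first word and reports which of the six types are still missing for an image.

```python
QUESTION_TYPES = ("what", "where", "who", "why", "when", "how")


def question_type(question):
    # Classify a question by its first word; returns None if it is not a
    # WH question. Prefix matching lets "What's" count as "what".
    words = question.strip().lower().split()
    if not words:
        return None
    return next((t for t in QUESTION_TYPES if words[0].startswith(t)), None)


def missing_types(qa_pairs):
    # Return the question types that have no question yet for this image
    seen = {question_type(question) for question, _answer in qa_pairs}
    return [t for t in QUESTION_TYPES if t not in seen]


# Made-up QA pairs for a single image
qa_pairs = [
    ("What color is the car?", "Red."),
    ("Where is the man standing?", "On the sidewalk."),
]

print(missing_types(qa_pairs))  # ['who', 'why', 'when', 'how']
```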
The Verification Process
After the annotation process, the verification stage for all Visual Genome data begins. This stage is crucial for eliminating wrong labels and for filtering out question-answer pairs that are vague (“This person seems to enjoy the sun.”), subjective (“room looks dirty”), or opinionated (“Being exposed to the hot sun like this may cause cancer”).
All the descriptions made for images (objects, attributes, relationships, noun phrases from region descriptions, and QA pairs) are mapped to synsets in WordNet. WordNet is a lexical database of English created at Princeton University, and it is commonly used in natural language processing applications. Visual Genome uses WordNet to canonicalize meanings, which resolves ambiguities in the concepts and connects the dataset to other resources used by the research community.
One of these ambiguities results from words having more than one meaning. For instance, if we write a description for a table, we should specify which meaning we intend, since an AI model can encounter different types of “tables” in other images. In other words, it should know whether the table is “a piece of furniture” or “a set of data arranged in rows and columns”. WordNet eliminates this problem by mapping these 2 meanings to different synsets (table.n.01 and table.n.02).
WordNet also organizes its words in a hierarchy. During the annotation process, crowd workers are not forced to use particular words for any object. For example, nothing requires a “male human” object to be called man, person, boy, or human. However, all these words should belong to the same root, “person.n.01”. Thanks to WordNet, ambiguities resulting from different labeling options can be eliminated as well.
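The idea of different labels sharing a root can be illustrated with a tiny hand-written hierarchy. The two dictionaries below are a toy stand-in for WordNet written for this example only. Apart from “person.n.01”, which the article mentions, the synset names and parent links are simplified assumptions and may not match the real database.

```python
# Toy excerpt standing in for WordNet; hand-written for illustration only.
WORD_TO_SYNSET = {
    "man": "man.n.01",
    "boy": "boy.n.01",
    "person": "person.n.01",
}

# Each synset points to its (assumed) parent in the hierarchy
PARENT = {
    "man.n.01": "adult.n.01",
    "adult.n.01": "person.n.01",
    "boy.n.01": "male.n.02",
    "male.n.02": "person.n.01",
}


def ancestors(synset):
    # Walk up the hierarchy, starting from the synset itself
    chain = [synset]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def belongs_to(word, root):
    # True if the word's synset is the root or one of its descendants
    return root in ancestors(WORD_TO_SYNSET[word])


labels = ["man", "boy", "person"]  # labels different workers might choose
print(all(belongs_to(word, "person.n.01") for word in labels))  # True
```

With a real WordNet interface, the same walk up the hierarchy would replace the two dictionaries.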
Visual Genome Attribute Graphs
Visual Genome also analyzes the attributes in the dataset by constructing attribute graphs. Nodes in these graphs are unique attributes, and an edge connects two attributes whenever they describe the same object.
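As a rough sketch of how such a graph could be built, the function below counts how often two attributes describe the same object. The input format (one list of attributes per annotated object) and the edge weights are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_attribute_graph(objects):
    # objects: one list of attributes per annotated object.
    # Returns a dict mapping an attribute pair (sorted alphabetically)
    # to the number of objects on which both attributes appear.
    edges = defaultdict(int)
    for attributes in objects:
        for a, b in combinations(sorted(set(attributes)), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Made-up attribute lists for three objects
objects = [
    ["white", "stuffed"],
    ["white", "furry"],
    ["stuffed", "white"],
]

graph = build_attribute_graph(objects)
print(graph)  # {('stuffed', 'white'): 2, ('furry', 'white'): 1}
```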
You can see a subgraph of the 16 most frequently connected person-related attributes in figure 8(a).
From these graphs, object types can be inferred from their characteristic attributes alone. This allows Visual Genome to identify objects based on selected characteristics.
In Figure 8(b) you can see 3 cliques. A clique is a group of attributes in which every pair is connected by at least one edge. From the attributes in the left clique, we can infer that it describes an “animal”, and from the bottom-right clique, that it describes “hair” traits.
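The clique condition (every pair of attributes is connected) can be checked directly. The edges and attribute names below are invented for the example and are not taken from Figure 8.

```python
from itertools import combinations


def is_clique(attributes, edges):
    # A group is a clique if every pair of its attributes shares an edge
    return all(frozenset(pair) in edges
               for pair in combinations(attributes, 2))


# Hypothetical undirected edges of an attribute graph
edges = {
    frozenset(("blonde", "curly")),
    frozenset(("curly", "long")),
    frozenset(("blonde", "long")),
    frozenset(("long", "furry")),
}

print(is_clique(["blonde", "curly", "long"], edges))           # True
print(is_clique(["blonde", "curly", "long", "furry"], edges))  # False
```

The second group fails because “furry” is connected to “long” only, not to “blonde” or “curly”.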
Statistics About Visual Genome
Each synset in the top 25 has at least 800 example images. As you can see from the top 3, Visual Genome has many images of “sports” and “players”.
Region Description Statistics
As you can see in the width graph, there is an increase around the 1.0 point. This increase comes from objects that span the entire width of the image, such as the sky, snow, the ocean, or mountains. The height graph shows no such increase, since objects that span an image vertically are not very common.
On average, each image includes 50 regions, each with a bounding box and a description.
These descriptions range from 1 to 16 words in length, with an average of 5 words.
Visual Genome contains approximately 35 objects per image.
As you can see, although Visual Genome does not have as many images as other datasets, it has more categories than they do. Having so many categories might make training more difficult, but it can also make models more effective and less biased.
Attributes play a major role in the detailed description and disambiguation of objects in Visual Genome. The Visual Genome dataset contains 2.8 million total attributes.
Each image has 26 attributes on average. Figure 17 shows that the most common attributes overall are color descriptions. For humans specifically, the most common attributes are state-of-motion words (e.g. standing, walking).
Visual Genome has a total of 42,374 unique relationships, with 2,347,187 relationship instances in total. There are 21 relationships per image, on average.
Spatial relationships are the most common ones in the dataset. Relationships involving humans are more action-oriented.
Question-Answer Pairs Statistics
1,773,258 question answering (QA) pairs were collected in the Visual Genome dataset. Each pair consists of a question and its correct answer regarding the content of an image. Every image has 17 QA pairs on average.
In the chart of question types, the angle of each region is proportional to the number of questions of that type. “What” questions are the most common type.
“Why” and “where” questions have the longest answers, while factual questions like “what” and “how” have the shortest.
Experiments with the Visual Genome Dataset
Setup: The attribute prediction experiment focuses on the 100 most common attributes. Approximately 500 instances of each attribute were sampled, yielding 50,000 attribute instances in total. This dataset is divided into a training set, a validation set, and a test set, with 80%, 10%, and 10% of the data respectively. Once the model has been trained on features from a convolutional neural network, it starts to make predictions about the input image. The figure below shows some examples.
In the second step of this experiment, a model is trained on object-attribute pairs jointly. Again, approximately 50,000 object-attribute pairs are used, divided into training, validation, and test sets. Only objects that occur at least 100 times and are associated with one of the 100 attributes are used in this second step.
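The numbers in this setup are easy to verify: 100 attributes with about 500 samples each give 50,000 instances, and an 80/10/10 split yields 40,000, 5,000, and 5,000 items. Below is a minimal sketch of such a split. The shuffling, the fixed seed, and the integer placeholders are assumptions, since the article does not describe how the split was drawn.

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=0):
    # Shuffle a copy of the items, then cut it into train/validation/test.
    # Whatever is left after train and validation becomes the test set.
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Placeholder ids standing in for 100 attributes x 500 sampled instances
instances = range(100 * 500)
train_set, val_set, test_set = split_dataset(instances)

print(len(train_set), len(val_set), len(test_set))  # 40000 5000 5000
```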
From the statistics, we can see the improvement when object-attribute statistics are used since objects are the core building blocks of the images.
A downside of these statistics is that the evaluation sometimes penalizes a correct answer. In Figure 22 you can see the white teddy bear in the yellow box. Although “stuffed” is a correct prediction, it is counted as wrong because the expected answer is “white”. This results from the crowdsourced ground truth answers, which sometimes do not include all valid answers.
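The effect of incomplete ground truth can be shown with a toy accuracy computation. The metric below (a prediction counts as correct if it appears in the set of accepted answers) is an assumption for illustration, since the article does not spell out the exact scoring used. The second prediction is invented to round out the example.

```python
def accuracy(predictions, valid_answers):
    # A prediction is correct if it is among the accepted answers
    hits = sum(pred in valid for pred, valid in zip(predictions, valid_answers))
    return hits / len(predictions)


predictions = ["stuffed", "red"]  # model output for two objects

# Crowdsourced ground truth lists only "white" for the teddy bear
crowd_truth = [{"white"}, {"red"}]
# A more complete ground truth would accept "stuffed" as well
full_truth = [{"white", "stuffed"}, {"red"}]

print(accuracy(predictions, crowd_truth))  # 0.5
print(accuracy(predictions, full_truth))   # 1.0
```

The model's output is identical in both cases; only the completeness of the ground truth changes the score.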
The relationship prediction experiment follows the same approach as the attribute prediction experiment. For the setup, the 100 most common relationships are again selected, and 34,000 relationships along with 27,000 unique subject-relationship-object triples are extracted. As in the attribute prediction experiment, the experiment is conducted in 2 steps.
Figure 26: (a) Example predictions from relationship prediction experiment (b) Example predictions from the joint object-relationship prediction experiment.
As the red boxes show, this time the model has trouble differentiating between similar objects such as boy vs. woman and car vs. bus, even though it can correctly differentiate the salient objects.
The Visual Genome Dataset in a Nutshell
Visual Genome allows for deeper cognitive tasks that require enhanced inference. It paves the way for further improvements in computer vision. It was inspired by similar applications in computer vision, and its creators also drew on related papers and studies in the field to build the dataset.
Ango AI provides data labeling solutions fully compatible and ready for Visual Genome training. Both our data labeling platform Ango Hub, and our fully-managed data annotation service Ango Service, are equipped with all you need to get started labeling, or outsource your labeling for Visual Genome. Book a demo with us to learn more.
“Visual Genome is a large, formalized knowledge representation for visual understanding and a more complete set of descriptions and question answers that grounds visual concepts to language.” 
 “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.” [Online]. Available: https://visualgenome.org/static/paper/Visual_Genome.pdf. [Accessed: 12-Aug-2022].
 “WordNet,” Princeton University. [Online]. Available: https://wordnet.princeton.edu/. [Accessed: 12-Aug-2022].
 “Visual Genome,” VisualGenome. [Online]. Available: https://visualgenome.org/. [Accessed: 12-Aug-2022].
Authors: İbrahim Orcan Ön and Onur Aydın
Technical Proofreading: Balaj Saleem