Harada-Osa-Kurose-Mukuta Lab.

Image Captioning

Image Caption Generation

Image caption generation is the task of generating a sentence that describes an input image. The spread of image-sharing services has led to an explosive increase in the number of images that can be collected on the Internet: for instance, Instagram held 40 billion images (as of September 2015) and 250 billion images had been uploaded to Facebook (as of September 2013). This abundance of data has made it feasible to build a model that generates descriptive sentences for general images by training it on a large number of image–caption pairs.

So, what is the benefit of generating a caption for an image? Consider the following image as an example. Image recognition technology can identify the labels of objects in the image (people, table, dinner), but it cannot capture the relationships between them. An image caption generation model, by contrast, can produce a detailed caption that expresses those relationships, such as "Group of people sitting at a table with a dinner."

Example of image caption generation [Ushiku+, ICCV 2015]
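
As a concrete illustration of the task (not the lab's own method), a publicly available pretrained captioning model can be run in a few lines with the Hugging Face transformers library; the model name and image file below are placeholders:

```python
# Minimal sketch: caption an image with an off-the-shelf pretrained model.
# "Salesforce/blip-image-captioning-base" is one publicly available model;
# "dinner_table.jpg" is a placeholder for any local image path or URL.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("dinner_table.jpg")
print(result[0]["generated_text"])  # e.g. "a group of people sitting at a table"
```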

Existing Research

There are two main approaches to image caption generation:

Reusing existing sentences means reusing captions that are already included in the dataset: the caption attached to the dataset image most similar to the input image is transferred to it. This method has the advantages of being relatively easy to implement and of always producing grammatically correct sentences, but it cannot generate any expression that does not already exist in the dataset, which limits its ability to handle novel images.
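
A minimal sketch of the retrieval step behind this approach, assuming image feature vectors (e.g., from a CNN) have already been extracted; the function and variable names are illustrative:

```python
import numpy as np

def reuse_caption(query_feat, dataset_feats, dataset_captions):
    """Return the caption of the dataset image most similar to the query.

    query_feat:       (d,) feature vector of the input image
    dataset_feats:    (n, d) feature vectors of the dataset images
    dataset_captions: list of n caption strings
    """
    # Cosine similarity between the query and every dataset image.
    q = query_feat / np.linalg.norm(query_feat)
    db = dataset_feats / np.linalg.norm(dataset_feats, axis=1, keepdims=True)
    return dataset_captions[int(np.argmax(db @ q))]
```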

Generating new sentences, on the other hand, is far less limited in expressiveness, but it is more challenging because the model itself must produce grammatically correct sentences. Recent advances in deep learning have driven rapid progress in both image recognition and text generation, and in image caption generation they have made it possible to create entirely new, grammatically correct sentences that describe the content of an image.
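
A minimal sketch of a generation model in this spirit: an encoder-decoder in the "show and tell" style, not any specific published model, assuming the image has already been encoded into a feature vector by a CNN:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Generate a sentence word by word from an image feature vector."""
    def __init__(self, feat_dim, vocab_size, embed=256, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)  # image feature -> initial RNN state
        self.embed = nn.Embedding(vocab_size, embed)
        self.rnn = nn.GRU(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image_feat, captions):
        # Condition the recurrent state on the image, then predict each next word.
        h0 = torch.tanh(self.init_h(image_feat)).unsqueeze(0)  # (1, B, hidden)
        emb = self.embed(captions)                             # (B, T, embed)
        out, _ = self.rnn(emb, h0)                             # (B, T, hidden)
        return self.out(out)                                   # per-step word logits
```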

Uniqueness and Achievements of This Lab on Image Captioning

Our lab recognized the importance of this task early on, before the emergence of deep learning, and has been publishing research results on image caption generation ever since. One of the challenges in image caption generation is the difficulty of producing natural sentences. We focused on generating "accurate" and "natural" captions using the "reuse of existing sentences" approach. Specifically, we proposed a method that searches the dataset for images similar to the one to be described and combines their captions, as well as a method that generates captions by combining key phrases (multi-key phrases) while taking grammar into account. These efforts made it possible to attach more natural and accurate captions to images and contributed significantly to research on image caption generation. Using the same technology, we also realized image search by full sentences rather than single words.

Focus on multi-key phrase generation method [Ushiku+, ACMMM 2012]
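
A deliberately toy sketch of the phrase-combination idea (not the lab's actual algorithm): candidate key phrases gathered from similar images are stitched together greedily, with a crude word-overlap score standing in for the learned grammar-aware model:

```python
def connection_score(prev_phrase, next_phrase):
    # Toy stand-in for a learned grammar/fluency model: count shared words.
    return len(set(prev_phrase.lower().split()) & set(next_phrase.lower().split()))

def combine_phrases(phrases, max_phrases=4):
    # Seed with the most specific (longest) phrase, then greedily append
    # the remaining phrase that connects best to what we have so far.
    sentence = [max(phrases, key=len)]
    remaining = [p for p in phrases if p is not sentence[0]]
    while remaining and len(sentence) < max_phrases:
        best = max(remaining, key=lambda p: connection_score(sentence[-1], p))
        sentence.append(best)
        remaining.remove(best)
    return " ".join(sentence)
```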

In recent years, deep learning-based image caption generation techniques have attracted attention. Their key feature is that they generate completely new sentences instead of reusing sentences from the dataset. Our lab is also actively working on caption generation with deep learning, leveraging the unique insights gained over the years. In 2015, we proposed "CoSMoS (Common Subspace for Model and Similarity)", which measures similarity by projecting the features of images and captions into the same space; this similarity can then be used to generate image captions with high accuracy. Combining it with image features from AlexNet, then the standard deep learning-based feature extractor, we achieved the world's highest performance at the time.
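
A simplified sketch of the common-space idea (not the exact CoSMoS formulation): two linear projections map image and caption features into one space, trained with a ranking loss so that matching pairs score higher than mismatched ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Project image and caption features into a shared space."""
    def __init__(self, img_dim, txt_dim, common_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.txt_proj = nn.Linear(txt_dim, common_dim)

    def forward(self, img_feat, txt_feat):
        return (F.normalize(self.img_proj(img_feat), dim=-1),
                F.normalize(self.txt_proj(txt_feat), dim=-1))

def ranking_loss(img_emb, txt_emb, margin=0.2):
    # Matching pairs sit on the diagonal of the similarity matrix;
    # push each above every mismatched pair by at least `margin`.
    sim = img_emb @ txt_emb.t()
    pos = sim.diag().unsqueeze(1)
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)
    return cost.mean()
```

The same learned space also supports the sentence-based image search mentioned above: embed a query sentence and rank images by their similarity to it.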

In addition, while traditional image caption generation methods could use global information, they often overlooked local information. To address this, we introduced "Spatial Pyramid VLAD Coding", which divides the image into several regions and integrates the information obtained from each region. This allows us to correctly describe content that depends on local information.
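
A simplified sketch of the coding step, assuming local descriptors with normalized (x, y) image coordinates and a pre-learned codebook; normalization details differ in the actual method:

```python
import numpy as np

def vlad(descriptors, codebook):
    # VLAD: for each codeword, sum the residuals of its assigned descriptors.
    k, d = codebook.shape
    v = np.zeros((k, d))
    assign = np.argmin(((descriptors[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            v[i] = (members - codebook[i]).sum(0)
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)

def spatial_pyramid_vlad(descriptors, positions, codebook, grid=(2, 2)):
    # Compute VLAD per image region and concatenate, preserving local layout.
    cells = []
    for gy in range(grid[1]):
        for gx in range(grid[0]):
            in_cell = ((positions[:, 0] * grid[0]).astype(int) == gx) & \
                      ((positions[:, 1] * grid[1]).astype(int) == gy)
            cells.append(vlad(descriptors[in_cell], codebook))
    return np.concatenate(cells)
```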

Traditional image caption generation methods could only describe facts and could not handle subjective impressions (i.e., sentiment). We addressed this by training not only a network that handles objects but also a network that handles sentiment, which allowed us to generate captions that include sentiment. Furthermore, we are working on generating a "story": multiple sentences that follow subjective emotional changes. In this research, specifying the desired emotional trajectory of the story as an "Emotion Arc" lets us generate a story about an image whose emotions change accordingly. Technology that generates a story from an image is expected to attract growing attention from the perspective of AI creativity and support for creators.

Image caption with sentiment [Andrew+, BMVC 2016]

Story generation about an image [Uehara+, WWW workshop 2022]
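
One way to picture an Emotion Arc is as a sequence of target emotions, one per sentence of the story. The sketch below is purely illustrative: `generate_sentence` is a hypothetical stub standing in for the trained image-conditioned generator:

```python
def generate_sentence(image_feat, emotion, history):
    # Hypothetical stub for a trained generator; returns a dummy sentence
    # so the sketch runs end to end.
    return f"<sentence about the image expressing {emotion}>"

def generate_story(image_feat, emotion_arc):
    # Generate one sentence per target emotion, conditioning on the story so far.
    story = []
    for emotion in emotion_arc:
        story.append(generate_sentence(image_feat, emotion, story))
    return " ".join(story)

print(generate_story(image_feat=None, emotion_arc=["joy", "anticipation", "sadness"]))
```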

Large-Scale Vision & Language Models

Since the release of GPT-3, research on Large Language Models (LLMs), which have enormous numbers of trainable parameters, has become very active, especially in the field of text generation. This trend toward scaling up has also reached Vision & Language research, leading to large-scale models such as BLIP-2 and LLaVA with billions of parameters. Our lab is also working on training large-scale Vision & Language models. In recent research, we focused on two shortcomings of existing large-scale Vision & Language models: they cannot explain their reasoning process, and they cannot engage in interactive dialogue with users. We therefore proposed "Chain-of-Reasoning (CoR)", a method that explains the reasoning process and asks questions when the inference is uncertain. Training such large models also requires high-performance computing with multi-node distributed learning; in this research, we used four DGX A100 (80GB) systems for fast, large-scale multi-node distributed training.
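
A generic sketch of multi-node data-parallel training with PyTorch DistributedDataParallel, the standard mechanism for this kind of run; the model, data, and launch parameters below are placeholders, not the lab's actual training setup:

```python
# Launch one process per GPU on each node, e.g. on 4 nodes with 8 GPUs each:
#   torchrun --nnodes=4 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<master-host>:29500 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")             # reads env vars set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for a large V&L model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                         # toy loop; gradients sync across nodes
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```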

Future Directions

The accuracy of image recognition is continually improving, and so is the performance of natural language generation models. Image caption generation is a task that involves two modalities, images and language; research that targets multiple modalities in this way is called "multimodal learning". Our lab is advancing research in multimodal learning that covers not only images and language but also sound, video, and other modalities.