For millennia, artists and writers have paired text with images, whether captioning a picture or illustrating a story. The practice goes back at least as far as ancient Egyptian tomb paintings from 2000 B.C., through classical Chinese paintings from 1000 A.D., to the popular memes of the 2010s.


Over the past few years, large Transformer-based models have taken the NLP world by storm, with bigger and better models streaming out as compute and data keep scaling up. However, the quality of the output is still lacking: the text reads coherently yet often shows little real comprehension, frequently missing context and semantics.

A similar problem shows up when generating text from images. Every image captioning paper to date can, to a certain degree, generate descriptive text by attending to different parts of an image. That is roughly the understanding and language level of a two-year-old child. It’s about time we try to grow beyond the cognitive abilities of a two-year-old and generate a story from an image with better implicit comprehension.

My research interest is generating meaningful yet non-descriptive texts from images. The ultimate goal is to design better image captioning systems that attempt to answer the following:

  1. How can we create statistical and machine learning models that generate non-descriptive and yet relevant image captions?
  2. How can we use statistical and machine learning models to better understand the context of an image beyond a laundry list description?

For my most recent project, Memefly, we attempt to generate memes from image inputs. We trained a GRU network par-injected with image embeddings extracted using a pre-trained InceptionV3 and with sentence embeddings, following the methods outlined in Show and Tell: A Neural Image Caption Generator [Vinyals ‘14], Where to put the Image in an Image Caption Generator [Tanti ‘17], and Dank Learning: Generating Memes Using Deep Neural Networks [Peirson ‘18]. We also implemented beam search, as suggested by Professor Andrew Ng. For training data, we scraped a small dataset of around 108 memes and 20,000 descriptions from various online resources. We implemented our network using the TensorFlow Keras framework, and our GitHub repo can be found here.
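To make the architecture concrete, here is a minimal sketch of a par-inject GRU captioner in TensorFlow Keras, in the spirit of the papers above. The vocabulary size, caption length, and layer widths are illustrative assumptions, not the exact values used in Memefly.

```python
# Minimal sketch of a par-inject GRU captioner (illustrative hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

VOCAB_SIZE = 10000   # assumed vocabulary size
MAX_LEN = 20         # assumed maximum caption length
EMBED_DIM = 256      # assumed embedding width

# Frozen InceptionV3 as the image feature extractor (2048-d pooled features);
# features are precomputed separately with cnn.predict on 299x299 inputs.
cnn = InceptionV3(include_top=False, pooling="avg", weights="imagenet")
cnn.trainable = False

# The image embedding is repeated and concatenated with the word embedding at
# every timestep ("par-inject"), then fed through a GRU to predict the next token.
image_input = layers.Input(shape=(2048,), name="image_features")
img_emb = layers.Dense(EMBED_DIM, activation="relu")(image_input)
img_seq = layers.RepeatVector(MAX_LEN)(img_emb)

text_input = layers.Input(shape=(MAX_LEN,), name="token_ids")
word_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(text_input)

x = layers.Concatenate()([img_seq, word_emb])
x = layers.GRU(256)(x)
output = layers.Dense(VOCAB_SIZE, activation="softmax")(x)

model = Model([image_input, text_input], output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

In Tanti’s terminology this is the par-inject arrangement: the image vector accompanies the word embedding at every timestep rather than only initializing the RNN state.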

During the training and evaluation process, we ran into and overcame several hurdles. The biggest challenge was that there is no good quantitative score for the quality of a generated meme, because each image comes with a diverse set of reference descriptions. Typical metrics such as perplexity or BLEU do not really measure how good a generated meme is, and neither does CIDEr, which was designed for image captioning tasks. The currently available NLP metrics work well for models that map onto a known target distribution, but they are quite insufficient for the kind of pseudo-generative model we are trying to build.
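To make the mismatch concrete, here is a toy example (the captions are made up, not from our dataset): a caption a human might happily accept for a Most Interesting Man image scores near zero under BLEU simply because it shares almost no n-grams with the references.

```python
# Toy illustration of why n-gram metrics are a poor fit for meme generation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "i don't always drink beer but when i do i prefer dos equis".split(),
    "i don't always test my code but when i do i do it in production".split(),
]
candidate = "he once wrote a meme so dank the internet stopped scrolling".split()

smooth = SmoothingFunction().method1
print(sentence_bleu(references, candidate, smoothing_function=smooth))
# Prints a score close to 0 even though a human might rate the caption highly.
```

BLEU rewards overlap with the reference phrasing, so any caption that departs from the references is penalized regardless of how funny or fitting it is.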

Image generation with GANs faces a similar problem: there is no statistical metric that reliably captures the quality of generated images. Researchers from Stanford released HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models [Zhou ‘19] and an associated website to tackle the problem on the image generation side. Unfortunately, no equivalent exists yet for language generation models. In the end, we relied on human evaluation, running inference on the small set of in-sample images plus a few out-of-sample images.

In-Sample Meme Generated

Out-of-Sample Meme Generated

One other challenge we faced was generating different memes from the same image. Our image captioning model is deterministic: it produces the same words given the same image and text input. We attempted three methods: (a) adding Gaussian noise to the image before feeding it into our model pipeline, (b) adding Gaussian noise to the image embedding after the image encoder, and (c) choosing among the top N candidates from beam search. We had little success with (a) and (b), and found anecdotally that adding Gaussian noise to images or image embeddings tends to produce more similar sentence outputs across different images. For example, our model tended to generate very similar descriptions for portrait-type meme images such as Most Interesting Man and Chuck Norris. We opted for (c) in the end, randomly choosing from the top 5 sentences generated by beam search to add variety to our results (a minimal sketch follows).
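Below is a minimal sketch of option (c), assuming a model with the same two inputs as the captioner sketched earlier (image features plus a padded token prefix); the special-token ids, beam width, and caption length are illustrative assumptions, and mapping token ids back to words is omitted for brevity.

```python
# Sketch of option (c): beam search, then a random pick among the top-N beams.
import random
import numpy as np

START, END, MAX_LEN = 1, 2, 20  # assumed start/end token ids and caption length

def beam_search(model, image_features, beam_width=5):
    beams = [([START], 0.0)]  # each beam is (token ids, cumulative log-prob)
    for _ in range(MAX_LEN):
        candidates = []
        for seq, score in beams:
            if seq[-1] == END or len(seq) >= MAX_LEN:
                candidates.append((seq, score))  # finished: carry over as-is
                continue
            padded = np.pad(seq, (0, MAX_LEN - len(seq)))[None, :]
            probs = model.predict([image_features[None, :], padded], verbose=0)[0]
            for tok in np.argsort(probs)[-beam_width:]:  # expand with top tokens
                candidates.append((seq + [int(tok)], score + np.log(probs[tok] + 1e-9)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [seq for seq, _ in beams]

def generate_meme(model, image_features, top_n=5):
    # Randomly choosing among the top-N beams adds variety for the same image.
    return random.choice(beam_search(model, image_features)[:top_n])
```

Because the beams are already sorted by log-probability, sampling uniformly from the top 5 keeps the output reasonably fluent while breaking the one-image-one-caption determinism.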

Thank you for reading and please feel free to comment below.