Image captioning is the process of generating a text description of an image, drawing on both Natural Language Processing (NLP) and Computer Vision. The task combines object detection, where each description is a single word such as "cat" or "skateboard", with whole-image captioning, where one predicted region covers the full image and is described by a phrase such as "cat riding a skateboard". To address the localization and description tasks together, we propose a Fully Convolutional Localization Network that processes an image with a single forward pass and can be trained end-to-end in a single round of optimization. To process an image, the input is first passed through a CNN. A Localization Layer then proposes regions; it incorporates a region detection network adapted from Faster R-CNN together with a captioning network. The model thus directly combines the Faster R-CNN framework for region detection with a long short-term memory (LSTM) network for captioning.
Keywords: Image Captioning, Recurrent Neural Network, Long Short-Term Memory, Convolutional Neural Network, Faster R-CNN, Natural Language Processing
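
As a rough illustration of the pipeline described above (CNN feature extraction, a localization layer that proposes regions, and an LSTM that generates words), the PyTorch sketch below wires these pieces together. It is a minimal sketch under stated assumptions: the small convolutional backbone, layer sizes, vocabulary size, and the single globally pooled region feature (used in place of true per-region ROI pooling) are illustrative choices, not the paper's actual configuration.

import torch
import torch.nn as nn

class DenseCaptionSketch(nn.Module):
    """Toy CNN -> localization head -> LSTM captioner (illustrative only)."""
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256, num_regions=16):
        super().__init__()
        # CNN backbone: extracts a convolutional feature map from the input image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stand-in for the localization layer: predicts 4 box coordinates and
        # 1 objectness score for each of a fixed number of candidate regions.
        self.region_head = nn.Conv2d(256, num_regions * 5, kernel_size=1)
        # Project a pooled region feature into the LSTM's input space.
        self.region_proj = nn.Linear(256, embed_dim)
        # Captioning network: an LSTM that emits one word per time step.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) word indices for teacher forcing.
        feats = self.cnn(images)                 # (B, 256, H', W')
        proposals = self.region_head(feats)      # (B, num_regions*5, H', W')
        # Global average pooling as a crude stand-in for per-region ROI pooling.
        pooled = feats.mean(dim=(2, 3))          # (B, 256)
        region_feat = self.region_proj(pooled)   # (B, embed_dim)
        # Condition the LSTM by prepending the region feature to the word embeddings.
        words = self.embed(captions)             # (B, T, embed_dim)
        seq = torch.cat([region_feat.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(seq)               # (B, T+1, hidden_dim)
        return proposals, self.word_out(hidden)  # box predictions, word logits

if __name__ == "__main__":
    model = DenseCaptionSketch()
    imgs = torch.randn(2, 3, 224, 224)
    caps = torch.randint(0, 1000, (2, 12))
    boxes, logits = model(imgs, caps)
    print(boxes.shape, logits.shape)

In the full model, the pooled feature would be replaced by per-proposal features (e.g. ROI pooling over each predicted box), so that the LSTM generates a separate caption for every region rather than one caption per image.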