It has been a while since I last read a pure image captioning paper. One day I checked the COCO leaderboard and found that the top two entries are both quite recent results. So I read their papers and found that both use policy gradient. Interesting.

So some basic background on image captioning: the standard way to train an image captioning model is to maximize \(P(w_{1:T}| I)\). Using the chain rule and taking the log, you can train the model by minimizing the cross-entropy loss of each word.
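Concretely, the chain rule factorizes the caption probability word by word, so maximizing the log-likelihood is the same as minimizing a sum of per-word cross-entropy terms:

\[\log P(w_{1:T}| I) = \sum_{t=1}^{T} \log P(w_t | w_{1:t-1}, I)\]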

However, this strategy has two shortcomings:

  1. The evaluation metric differs from the training loss. This is a common issue in many problems, like minimizing log-likelihood for segmentation but evaluating with mean IoU. Not only do we want a better evaluation metric, but the evaluation metric also has perceptual meaning. You can think of the log loss as putting the same weight on every word in the sentence; a human judging the captions, however, may focus more on whether the captions are informative. So, presumably, optimizing the metric directly can produce better captions.

  2. Exposure bias. During training, the input at each time step comes from the ground-truth caption. When generating captions, however, the input at the next time step is the previous word generated by the model. Once the model generates a “bad” word it has never seen during training, the error will propagate and result in a bad caption. Scheduled sampling is motivated by this problem (a minimal sketch follows the list).
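Scheduled sampling mixes the two regimes: with some probability, the model's own prediction is fed back in during training instead of the ground-truth word. A minimal sketch, assuming a hypothetical `decoder_step` and `embed` for an RNN decoder:

```python
import torch

def decode_with_scheduled_sampling(decoder_step, embed, gt_tokens, h, epsilon):
    """Run the decoder over a ground-truth caption, but with probability
    `epsilon` feed back the model's own previous prediction instead of the
    ground-truth word (scheduled sampling). `epsilon` is annealed upward
    during training."""
    B, T = gt_tokens.shape
    logits_all = []
    prev = gt_tokens[:, 0]                                # <BOS> token
    for t in range(1, T):
        logits, h = decoder_step(embed(prev), h)          # one RNN step
        logits_all.append(logits)
        use_model = torch.rand(B, device=gt_tokens.device) < epsilon
        model_word = logits.argmax(dim=-1)                # or sample from the softmax
        prev = torch.where(use_model, model_word, gt_tokens[:, t])
    return torch.stack(logits_all, dim=1), h              # (B, T-1, vocab), final state
```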

So the obvious idea is to match the training stage and the test stage. Since we want a better CIDEr score, let’s optimize CIDEr directly. Since the model generates its own words at test time, let it generate (sample) its own words during training too.

However, the sampling operation is not differentiable, so you cannot use normal backpropagation to get the gradients. Here comes the reinforcement learning part. RL is suited to game-like tasks. Say the game is Flappy Bird: the question is at which position (state) to jump (action) so as to achieve the highest score (reward). RL trains a model to choose the right action given a state so that it can eventually get a high reward.

In the case of image captioning, the state is the image plus the previously generated words, the action is choosing the next word, and the reward is CIDEr or another metric.


Policy gradient (PG) is a very fundamental algorithm in RL. You can check Karpathy’s blog to learn the basics of PG.

PS: The Bay Area Deep Learning School has a tutorial on reinforcement learning which covers PG and Q-learning, another popular RL algorithm. Highly recommended.

This is not the first work applying PG to image captioning: FAIR proposed MIXER, and after that, this paper applied actor-critic (an improved PG algorithm) to machine translation. The two recent papers here, however, still use plain PG; it remains to be seen whether the next state of the art will use AC.

The first paper is Self-critical Sequence Training for Image Captioning from IBM; it is currently the best on the COCO leaderboard. The second is Optimization of image description metrics using policy gradient methods, which is currently second on the leaderboard.


First, let me introduce the basic Policy gradient:

The objective is to maximize the expected reward, i.e., to minimize the loss:

\[L(\theta) = -\mathbb{E}_{w^s\sim p_\theta} [r(w^s)]\]

where \(w^s\) is the sampled caption, which can be separated into \(w_1^s\) through \(w_T^s\), and \(p_\theta\) is our model parameterized by \(\theta\).

By simple derivation, we can get the gradient of \(L\) with respect to \(\theta\).

\[\nabla_\theta L(\theta) = -\mathbb{E}_{w^s\sim p_\theta}[r(w^s)\nabla_\theta \log p_\theta(w^s) ]\]
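The derivation is just the log-derivative trick \(\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta\) applied inside the expectation:

\[\nabla_\theta \mathbb{E}_{w^s\sim p_\theta}[r(w^s)] = \sum_{w^s} r(w^s)\nabla_\theta p_\theta(w^s) = \sum_{w^s} p_\theta(w^s) r(w^s)\nabla_\theta \log p_\theta(w^s) = \mathbb{E}_{w^s\sim p_\theta}[r(w^s)\nabla_\theta \log p_\theta(w^s)]\]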

We now make a small change to this formula:

\[\nabla_\theta L(\theta) = -\mathbb{E}_{w^s\sim p_\theta}[(r(w^s) - b)\nabla_\theta \log p_\theta(w^s) ]\]

The \(b\) is a baseline. As long as \(b\) does not depend on \(w^s\), this change does not alter the gradient value. The proof is as follows:

\[\mathbb{E}_{w^s\sim p_\theta}[b\nabla_\theta \log p_\theta(w^s)] = b\sum_{w^s} \nabla_\theta p_\theta(w^s)\\ = b\nabla_\theta\sum_{w^s} p_\theta(w^s) \\ = b\nabla_\theta 1 = 0\]

In practice, we use a single sampled caption to estimate the gradient (just as SGD approximates GD).

\[\nabla_\theta L(\theta) \approx -(r(w^s) - b)\nabla_\theta \log p_\theta(w^s)\]
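In code, this single-sample estimate is usually implemented as a surrogate loss whose gradient matches the formula above. A minimal PyTorch-style sketch, assuming `log_probs` holds the per-word log-probabilities of the sampled caption:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    """Surrogate loss whose gradient is -(r(w^s) - b) * grad log p_theta(w^s).

    log_probs: (T,) log-probabilities of the sampled words (requires grad).
    reward, baseline: plain scalars, treated as constants w.r.t. theta.
    """
    advantage = reward - baseline
    return -advantage * log_probs.sum()   # calling .backward() gives the estimate above
```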

(We will come back to the baseline in a moment.) By the chain rule, we can get

\[\nabla_\theta L(\theta) = \sum_{t=1}^T \frac{\partial L(\theta) }{\partial s_t } \frac{\partial s_t }{\partial \theta}\]

Here, \(s_t\) is a vocabulary-sized vector containing the score of each word at time \(t\) (the input to the softmax layer).

\[\frac{\partial L(\theta)}{\partial s_t} \approx (r(w^s) - b)(p_\theta(w_t|h_t) - 1_{w_t^s})\]
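This is just the familiar softmax cross-entropy gradient, scaled by the advantage: for the logit \(s_{tj}\) of word \(j\) at time \(t\),

\[\frac{\partial (-\log p_\theta(w_t^s|h_t))}{\partial s_{tj}} = p_\theta(j|h_t) - 1_{\{j = w_t^s\}}\]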

For each sampled word \(w_t^s\), the entry of the second factor at \(w_t^s\), namely \(p_\theta(w_t^s|h_t) - 1\), is always negative. So if the first factor \(r(w^s) - b\) is greater than 0, the gradient is negative, and gradient descent pushes up the scores of the sampled words.

The purpose of the baseline is to reduce the variance of the gradient. Say the reward is always between 100 and 120. Without \(b\), every sampled result gets pushed up in every SGD update; the only difference is that sampled results with higher reward get pushed up a bit more. Even with a reward of 120, the push is not much stronger than with a reward of 100. You can imagine that such updates are very unstable.

However, if we have a baseline of 110, a result with reward 100 gets pushed down, because the gradient is positive. Another possible baseline is \(\mathbb{E}[r(w^s)]\), i.e. the average reward. When you sample a caption, you know whether it is above or below average: if it is above average, push it up, and vice versa.


The first paper defines the baseline in a quite interesting way: they use the reward of the caption \(\hat w\) produced by greedy decoding as the baseline. This avoids training a separate baseline function, which would involve choosing hyperparameters; the method is much simpler, and they claim it still reduces variance. The name “self-critical” comes from this design.

So, if the sampled result is worse than the greedy-decoding result, it gets pushed down, and vice versa.
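A minimal sketch of one self-critical update, assuming hypothetical `model.sample` (returns the caption and its per-word log-probabilities), `model.greedy_decode`, and a `cider` scorer:

```python
import torch

def scst_loss(model, image_feats, references, cider):
    """One self-critical sequence training step: the reward of the greedy
    caption serves as the baseline for the sampled caption."""
    # Sampled caption: keep per-word log-probabilities so gradients can flow.
    sampled_caption, log_probs = model.sample(image_feats)
    # Greedy caption: only used as a baseline, so no gradients are needed.
    with torch.no_grad():
        greedy_caption = model.greedy_decode(image_feats)
    reward = cider(sampled_caption, references)    # r(w^s)
    baseline = cider(greedy_caption, references)   # r(\hat{w})
    return -(reward - baseline) * log_probs.sum()
```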

In the paper, they also suggest that other choices of \(\hat w\) are possible, like an n-best list or beam search.

Whatever results they get, this choice of baseline is a very interesting idea, but not that intuitive. In general, the sampled result will mostly be worse than the greedy-decoding result, so the first term of the gradient is mostly negative. To improve \(\mathbb{E}[r(w^s) - r(\hat w)]\), improving \(r(w^s)\) is not the only way; you could also decrease \(r(\hat w)\). An extreme case is a random model: the rewards from sampling and from greedy search could be very close, and similarly bad. However, the paper shows that this does not happen. One possible reason is that they pretrain the model with the cross-entropy loss, which keeps it from diverging to a really bad solution.

Another point: the generation processes in training and testing are now largely coherent. However, during training the output is sampled, while during testing the output is produced by greedy decoding or beam search, so a small mismatch remains.

Some implementation details:

They use the Show, Attend and Tell model. The visual feature is the output of the last convolutional layer of ResNet-101. They do not crop or rescale the image; they feed in the original image and then use spatial adaptive average pooling to resize the feature map to 14x14x2048. (I omit the non-attention model in that paper.)
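A minimal sketch of that feature-extraction step with torchvision’s ResNet-101 (the paper’s exact preprocessing may differ):

```python
import torch
import torchvision

# ResNet-101 with the final average pooling and fc layers removed:
# the last conv layer gives (B, 2048, H/32, W/32) for an arbitrary input size.
resnet = torchvision.models.resnet101()                 # load ImageNet weights in practice
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
pool = torch.nn.AdaptiveAvgPool2d((14, 14))             # spatial adaptive average pooling

image = torch.randn(1, 3, 480, 640)                     # original resolution, no crop/rescale
with torch.no_grad():
    feats = pool(backbone(image))                       # (1, 2048, 14, 14)
att_feats = feats.flatten(2).transpose(1, 2)            # (1, 196, 2048) attention features
```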

They also tried different metrics as the reward. However, only optimizing CIDEr yields improvements on all the other metrics compared to the cross-entropy loss, which implicitly suggests that CIDEr is better than the other metrics.


The second paper is quite similar to SeqGAN. This paper defines \(Q\), the expected reward of choosing \(w_t\) given the previously generated \(t-1\) words and the image:

\[Q_\theta(w_{1:t-1}, w_t) = \mathbb{E}_{w_{t+1:T}}[R(w_{1:t-1}, w_t, w_{t+1:T})]\]

The purpose of using \(Q\) is to split up the contribution of each word, kind of like giving each word its own reward.

\[Q_\theta(w_{1:t-1}, w_t) \approx \frac{1}{K} \sum_{k=1}^K R(w_{1:t-1}, w_t, w_{t+1:T}^k)\]

In practice, \(Q\) is estimated by rollouts, and \(K\) here is chosen to be 3. (Recall that AlphaGo also uses rollouts to estimate values.)
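A minimal sketch of the rollout estimate, assuming a hypothetical `model.rollout` that samples a completion of a prefix from the current policy, and a `reward_fn` such as CIDEr against the references:

```python
def estimate_q(model, image_feats, prefix, reward_fn, K=3):
    """Monte-Carlo estimate of Q(w_{1:t-1}, w_t): average the rewards of K
    rollouts that complete the prefix w_{1:t} under the current policy."""
    total = 0.0
    for _ in range(K):
        completion = model.rollout(image_feats, prefix)   # samples w_{t+1:T}^k
        total += reward_fn(prefix + completion)           # R(w_{1:t}, w_{t+1:T}^k)
    return total / K
```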

In this way, the gradient becomes:

\[\frac{\partial L(\theta)}{\partial s_t} \approx (Q(w_{1:t-1}, w_t) - B_\phi(w_{1:t-1}))(p_\theta(w_t|h_t) - 1_{w_t})\]

(To be consistent with the first paper, I have slightly changed the notation of the second paper.) The differences here are that the single sequence-level reward becomes a different \(Q\) value for each word, and the constant \(b\) becomes a parametric function \(B_\phi\), a trainable function of the hidden vector at each time step. This is the same as what is in MIXER’s code.
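A minimal sketch of such a parametric baseline, assuming the decoder exposes its hidden state \(h_t\); as in MIXER, it can be trained by regressing the observed reward (or rollout Q value) with an MSE loss:

```python
import torch

class Baseline(torch.nn.Module):
    """B_phi(w_{1:t-1}): a scalar predicted from the decoder hidden state h_t."""
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_size, 1)

    def forward(self, h_t):
        # Detach so the baseline's regression loss does not update the decoder.
        return self.linear(h_t.detach()).squeeze(-1)

baseline = Baseline(hidden_size=512)
h_t = torch.randn(8, 512)                                 # hidden states for a batch
q_values = torch.rand(8)                                  # rollout-estimated Q values
baseline_loss = ((baseline(h_t) - q_values) ** 2).mean()  # MSE regression, as in MIXER
```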

The only thing I’m curious about is why they don’t use the Q value of the previous time step as the baseline: then \(Q(w_{1:t}, w_{t+1}) - Q(w_{1:t-1}, w_t)\) would be the contribution of \(w_{t+1}\).

The paper also uses other tricks, which I am not covering here. The captioning model they use is the Show and Tell model from Google, and the CNN is Inception-v3.

Conclusion:

These two papers are two different algorithms built on the same motivation. However, it is hard to compare them side by side, because their CNN-RNN architectures are different. There is a lot of possible future work, like what is mentioned in the conclusion of the second paper; for example, using a discriminator score as the reward, as in SeqGAN, instead of these metrics.