
Table 1: Classification and localization errors measured on the ILSVRC-2015 validation dataset. We
always used the last convolutional ReLU layer for visualization.
Model
Classification error (%) Localization error (%)
Top-1 Top-5 Top-1 Top 5
AlexNet 44.58 (44.2) 21.69 (20.8) 68.04 (68.3) 56.18 (56.6)
GoogLeNet 32.46 (31.9) 11.82 (11.3) 56.89 (60.09) 45.44 (49.34)
VGG-16 30.94 (30.38) 10.87 (10.89) 55.82 (56.51) 44.82 (46.41)
at the same time as localizing the important areas by overlaying the Grad-CAM heatmap on the
image created by Guided Backpropagation. In accordance with this, it can be calculated by taking the
element-wise product of the outputs of these two methods.
Lastly, Grad-CAM++ [3] is a proposed improvement of Grad-CAM which applies a ReLU on the
gradients
∂y
c
∂A
k
i,j
to filter out gradients that have a negative influence on the output class (similarly to
Guided Backpropagation). Figure 1 presents examples of images produced by the methods mentioned.
Throughout our experiments, we used VGG-16/VGG-16-BN, AlexNet and GoogLeNet (all three
pretrained on ImageNet), DenseNet (pretrained on NHS Chest-X-ray14 [7]) and trained a simple
three-layer convolutional network on MNIST (see Section 5). For the reproducibility tasks, we
used the ILSVRC 2015 validation dataset [8] that contains 50k images of 1,000 categories and the
corresponding bounding boxes. Chest-X-ray14 contains 112,120 X-ray images of 14 + 1 different
classes (14 of them representing detectable diseases and one implying "no findings"). As bounding
boxes are only available for 984 images, our experiment on medical images was restricted to them.
The images were resized to 224 × 224 and the bounding boxes were modified accordingly.
4 Reproducibility study
Localization ability
An intuitive application of Grad-CAM’s heatmaps is in localization tasks
where we are interested in not only the occurrence but also the location of an object. This task can
be approached with bounding boxes; it can be viewed as a supervised regression problem where the
label
y = (x
min
, y
min
, x
max
, y
max
)
is compared against ground truth bounding boxes. Generating
labeled bounding boxes can be costly, especially in fields where expertise is needed (e.g., medical
data). Because of this, it can be interesting to use them in a weak localization task where the network
is not explicitly trained on bounding boxes. Given an image, we can generate a heatmap and convert it
to a binary map by e.g. using a
15%
threshold. This binary image will then contain multiple clusters
around which bounding boxes may be drawn. We isolate the one with the largest area and compare it
with the true bounding box by computing the Jaccard similarity
J(box
1
, box
2
) =
|box
1
∩box
2
|
|box
1
∪box
2
|
, also
known as IoU score. We can then regard this as a binary classification problem where
(m, n)
is the
size of box
real
and positive predictions can be counted as:
box
predicted
≃ box
real
⇐⇒ J(box
predicted
, box
real
) ≥ min
0.5,
m · n
(m + 10)(n + 10)
(3)
Our results are shown in Table 1. The numbers generally lie within
±1%
of those found in the original
paper (these are in parentheses in the table). The slight differences can be attributed to the lack of
information about the layer that was used (in the case of VGG) and more importantly preprocessing
(image rescaling). We rescaled the images to
256 × 256
before applying a
224 × 224
center crop as
this is standard procedure for ImageNet. Overall, Grad-CAM provided fairly impressive localization
results, considering that the model was not explicitly trained for this task.
Pointing game
Pointing game is another technique for investigating Grad-CAM’s localization
ability. Originally introduced by Zhang et al. [9], this method extracts the maximally activated point
from a heatmap and checks whether it is within the bounding box of the target object category – which,
in this case, is a ground truth label. The localization accuracy is then defined as
Acc =
#Hits
#Hits+#Misses
where a point within the bounding box is counted as a hit. In the Grad-CAM paper [1], this metric –
henceforth referred to as recall – is extended with the fact that now the top-5 predictions are used
3