Grad-CAM is a visual explanation method for convolutional neural networks, introduced by Selvaraju et al. in 2016. This analysis evaluates the reproducibility of Grad-CAM's results and explores its effectiveness in medical imaging, particularly chest X-rays. The study includes novel experiments assessing Grad-CAM's localization capabilities compared to other techniques. Key findings highlight Grad-CAM's strengths and weaknesses, particularly in specialized datasets. This research is valuable for those interested in deep learning interpretability and medical applications.

Key Points

  • Evaluates the effectiveness of Grad-CAM in medical imaging tasks.
  • Compares Grad-CAM's localization capabilities with other explanation methods.
  • Introduces novel metrics for assessing the fidelity and contrastivity of Grad-CAM.
  • Analyzes reproducibility issues in the original Grad-CAM paper.
Dileesha A
Author:Rajmund Nagy, Doumitrou Daniil Nimara, Livia Qian
16 pages
Language:English
Type:Research Paper
Analysis and Evaluation of Grad-CAM Explanations
Rajmund Nagy
rajmundn@kth.se
Doumitrou Daniil Nimara
nimara@kth.se
Livia Qian
liviaq@kth.se
Abstract
In this project, we reimplement the paper Grad-CAM: Visual Explanations from
Deep Networks via Gradient-based Localization (2016), which introduced a visual
explanation method for convolutional neural networks. Our experiments focus
on evaluating the reproducibility of the results shown in the paper (e.g. localization
task, pointing game, comparison with occlusion maps); moreover, we propose
novel experiments in order to better understand the strengths and weaknesses of
this technique. In this regard, we 1) analyze Grad-CAM’s ability to explain chest
X-Rays (medicine is a field in which localization is of utmost importance) and
compare its localization capability with other explanation methods; 2) measure its
fidelity and contrastivity; and 3) introduce a new metric (to the best of our knowledge)
based on the notion of sensitivity. Our results support Grad-CAM’s
efficacy in CNNs and provide new information regarding its particularities; for
instance, we show that great network performance does not translate as smoothly to
good localization in the more specialized medical dataset (where we achieve results
comparable with other papers). Furthermore, our implementation of Grad-CAM++
provides a promising alternative, outperforming Grad-CAM in the aforementioned
difficult dataset. Lastly, our fidelity experiments suggest that the method might be
outperformed by non-CNN-based explanation methods when a large portion of the
network is non-convolutional.
1 Introduction
Despite the increase in the commercial use of deep learning, many neural networks are still treated as
black boxes. This is particularly problematic in tasks where mistakes are exceptionally costly (e.g.
self-driving cars). Many visual explanation methods have been developed in recent years to tackle
this issue. In this project, we investigate the paper "Grad-CAM: Visual Explanations from Deep
Networks via Gradient-based Localization" [1] from a reproducibility perspective and carry out three
new experiments to further evaluate the proposed technique. Our code is publicly available (see footnote 1).
The remainder of this report is structured as follows: Section 2 will introduce related work
in order to better conceptualize Grad-CAM’s reasoning, approaches and strengths. We will then
summarize Grad-CAM in more detail in Section 3. After providing a firmer understanding of the
method, we will analyze the original paper’s reproducibility in Section 4. In Section 5, we will
explore new experiments to examine Grad-CAM in novel ways, both quantitatively and qualitatively.
Finally, we will summarize our findings and share our greatest challenges in Sections 6 and 7.
2 Related work
In 2016, Zhou et al. [2] showed that global average pooling (GAP) layers can help CNNs retain
their ability to localize objects despite being trained only for image classification. They proposed
Class Activation Maps (CAM), which visualize a network’s attention on a given image when making
a certain prediction by combining feature maps in the final convolutional layer. However, as the
calculation of CAMs poses strict constraints on the network architecture, their technique cannot be
applied to most CNNs. In 2017, Selvaraju et al. [1] removed these constraints with Grad-CAM
(Gradient-weighted CAM) by using the gradient information to represent the importance of each
feature map. Two limitations of CAM explanations remained, namely that they often fail to capture
the entire object and that there is a consistent drop in localization performance when multiple instances
of the same class are present. Chattopadhyay et al. [3] addressed them both with Grad-CAM++,
where positive pixel-wise gradients are incorporated into the weights.
Figure 1: Exemplary visualization of the implemented methods on VGG-16 for one of the dog classes.
Grad-CAM++ focuses on more relevant dog features (face). See Appendix for more examples.
Footnote 1: GitHub Repository
HT20 Deep Learning Advanced, DD2412, KTH Royal Institute of Technology.
The method of Integrated Gradients [4] proposes a more thorough inspection of the input. It considers
the linear path between a baseline image (e.g. full black) and the actual input, and calculates the
importance of each spatial location by accumulating the pixel-wise gradients along this path. On
the other hand, SHAP (SHapley Additive exPlanations) [5] aims to quantify the contribution of a
pixel $z_i$ by taking its average marginal contributions across all possible feature subgroups (Shapley
value). The intuition is fairly simple. The features can be seen as agents cooperating in a game of
correctly classifying an image. Each agent can cooperate with $0, 1, \ldots, |\text{features}| - 1$ other features
toward the goal. Then, the importance of each agent can be viewed as their individual contribution to
the outcome, averaged over all possible groups. SHAP uses sampling techniques to measure these
quantities and return them as explanations.
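The averaging over coalitions described above can be illustrated with a minimal sketch that computes exact Shapley values by brute-force enumeration. This is not how SHAP itself works (it relies on sampling, as noted above); the function `shapley_values` and the toy game `toy_value` are hypothetical names introduced only for this illustration, and enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition.

    players: list of hashable feature identifiers.
    value:   function mapping a frozenset of players to a payoff.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        # A player can join coalitions of size 0, 1, ..., n - 1.
        for size in range(n):
            # Probability weight of a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical toy game: the payoff is 1.0 only when features "a" and "b"
# are both present; feature "c" never changes the outcome.
def toy_value(coalition):
    return 1.0 if {"a", "b"} <= coalition else 0.0


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)  # "a" and "b" share the payoff equally, "c" gets zero
```

In this toy game the two cooperating features each receive 0.5 and the irrelevant feature receives 0, so the attributions sum to the payoff of the full coalition.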
3 Methods
Grad-CAM [1] is a visual explanation method that can explain why a network has made a certain
prediction for a specific image. Given a pretrained CNN-based model, an image and a class of interest
$c$, it generates a heatmap from the relevant layer’s feature map activations by first forward propagating
the image and then backpropagating the gradients to the layer of interest. Before backpropagation,
the gradients should be set to zero for every class except $c$. The heatmap is defined as the linear
combination of the feature map activations. The weight belonging to a specific feature map $A^k$ is
denoted by the neuron importance weight $\alpha_k^c$:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} \qquad (1)$$

where $y^c$ is the class score belonging to $c$ and $Z$ is a normalization factor. Since we are only interested
in the features that have a positive influence on $c$, pixels with negative values can be canceled with
ReLU. The heatmap can then be calculated as

$$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left( \sum_k \alpha_k^c A^k \right) \qquad (2)$$
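Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration rather than our actual implementation: it assumes that the activations and the gradients of the class score with respect to them have already been extracted from the network (for example with framework hooks), and that $Z$ equals the number of spatial positions, so that Eq. (1) is a spatial average. The function name `grad_cam` is chosen only for this example.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heatmap from precomputed arrays.

    activations: feature maps A^k, shape (K, H, W).
    gradients:   d y^c / d A^k_ij, same shape (K, H, W).
    """
    # Eq. (1): average the gradients over all spatial positions
    # (assumption: Z = H * W), giving one weight per feature map.
    alpha = gradients.mean(axis=(1, 2))                    # shape (K,)
    # Eq. (2): weighted sum of the feature maps followed by ReLU.
    weighted = np.tensordot(alpha, activations, axes=1)    # shape (H, W)
    return np.maximum(weighted, 0.0)


# Tiny hand-made example with K = 2 feature maps of size 2 x 2.
A = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[1.0, 1.0], [1.0, 1.0]]])
G = np.stack([np.full((2, 2), 0.5), np.full((2, 2), -1.0)])
heatmap = grad_cam(A, G)
print(heatmap)  # [[0.  0. ] [0.5 1. ]]
```

The second feature map has a negative weight, so it lowers the combined map; the ReLU then removes the positions where the result is negative.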
Grad-CAM is a generalization of CAM [2] as it works on any convolutional layer (CAM only worked
on the last convolutional layer if it was followed by a single fully connected softmax classification
layer). This makes it applicable to a wide range of CNN families and capable of generating heatmaps
at different levels of detail. Another advantage is that it does not interfere with the base
network’s architecture, thus allowing for computational efficiency and adaptability.
Table 1: Classification and localization errors measured on the ILSVRC-2015 validation dataset. We
always used the last convolutional ReLU layer for visualization.

            Classification error (%)        Localization error (%)
Model       Top-1           Top-5           Top-1           Top-5
AlexNet     44.58 (44.2)    21.69 (20.8)    68.04 (68.3)    56.18 (56.6)
GoogLeNet   32.46 (31.9)    11.82 (11.3)    56.89 (60.09)   45.44 (49.34)
VGG-16      30.94 (30.38)   10.87 (10.89)   55.82 (56.51)   44.82 (46.41)

Guided Grad-CAM, a method also presented by Selvaraju et al. [1], is a combination of Grad-CAM
and Guided Backpropagation [6]. It shows the fine-grained details and the relevant edges in an image
at the same time as localizing the important areas by overlaying the Grad-CAM heatmap on the
image created by Guided Backpropagation. In accordance with this, it can be calculated by taking the
element-wise product of the outputs of these two methods.
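The element-wise product can be sketched as below. The sketch assumes that the coarse Grad-CAM heatmap and the Guided Backpropagation output are already available as arrays, and it uses nearest-neighbour upsampling purely to stay dependency-free; real implementations typically interpolate the heatmap (e.g. bilinearly) to the input resolution. The function name `guided_grad_cam` is hypothetical.

```python
import numpy as np


def guided_grad_cam(heatmap, guided_backprop):
    """Combine a coarse heatmap (h, w) with a Guided Backpropagation
    map (H, W, C), where H and W are integer multiples of h and w."""
    H, W = guided_backprop.shape[:2]
    h, w = heatmap.shape
    if H % h != 0 or W % w != 0:
        raise ValueError("image size must be a multiple of the heatmap size")
    # Nearest-neighbour upsampling to the image resolution (assumption;
    # chosen here only to avoid an extra interpolation dependency).
    upsampled = np.repeat(np.repeat(heatmap, H // h, axis=0), W // w, axis=1)
    # Element-wise product, broadcast over the colour channels.
    return guided_backprop * upsampled[..., None]


coarse = np.array([[0.0, 1.0],
                   [2.0, 0.0]])
fine = np.ones((4, 4, 3))        # stand-in for a Guided Backpropagation output
result = guided_grad_cam(coarse, fine)
print(result.shape)              # (4, 4, 3)
```

Regions where the heatmap is zero are suppressed entirely, while the fine-grained structure of the Guided Backpropagation map is preserved wherever the heatmap is positive.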
Lastly, Grad-CAM++ [3] is a proposed improvement of Grad-CAM which applies a ReLU on the
gradients $\frac{\partial y^c}{\partial A_{ij}^k}$ to filter out gradients that have a negative influence on the output class (similarly to
Guided Backpropagation). Figure 1 presents examples of images produced by the methods mentioned.
Throughout our experiments, we used VGG-16/VGG-16-BN, AlexNet and GoogLeNet (all three
pretrained on ImageNet), DenseNet (pretrained on NIH Chest-X-ray14 [7]) and trained a simple
three-layer convolutional network on MNIST (see Section 5). For the reproducibility tasks, we
used the ILSVRC 2015 validation dataset [8] that contains 50k images of 1,000 categories and the
corresponding bounding boxes. Chest-X-ray14 contains 112,120 X-ray images of 14 + 1 different
classes (14 of them representing detectable diseases and one implying "no findings"). As bounding
boxes are only available for 984 images, our experiment on medical images was restricted to them.
The images were resized to 224 × 224 and the bounding boxes were modified accordingly.
4 Reproducibility study
Localization ability
An intuitive application of Grad-CAM’s heatmaps is in localization tasks
where we are interested in not only the occurrence but also the location of an object. This task can
be approached with bounding boxes; it can be viewed as a supervised regression problem where the
label $y = (x_{\min}, y_{\min}, x_{\max}, y_{\max})$ is compared against ground truth bounding boxes. Generating
labeled bounding boxes can be costly, especially in fields where expertise is needed (e.g., medical
data). Because of this, it can be interesting to use the heatmaps in a weak localization task where the network
is not explicitly trained on bounding boxes. Given an image, we can generate a heatmap and convert it
to a binary map by e.g. using a 15% threshold. This binary image will then contain multiple clusters
around which bounding boxes may be drawn. We isolate the one with the largest area and compare it
with the true bounding box by computing the Jaccard similarity
$J(\text{box}_1, \text{box}_2) = \frac{|\text{box}_1 \cap \text{box}_2|}{|\text{box}_1 \cup \text{box}_2|}$, also
known as IoU score. We can then regard this as a binary classification problem where $(m, n)$ is the
size of $\text{box}_{\text{real}}$ and positive predictions can be counted as:

$$\text{box}_{\text{predicted}} \text{ matches } \text{box}_{\text{real}} \iff J(\text{box}_{\text{predicted}}, \text{box}_{\text{real}}) \geq \min\left(0.5, \frac{m \cdot n}{(m + 10)(n + 10)}\right) \qquad (3)$$
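The weak localization pipeline and the criterion of Eq. (3) can be sketched as follows. Several details are assumptions made for this illustration and are not confirmed by the text above: the 15% threshold is taken relative to the heatmap maximum, clusters are 4-connected, the "largest" cluster is the one with the most pixels, and boxes use an inclusive minimum and exclusive maximum. All function names are hypothetical.

```python
from collections import deque

import numpy as np


def largest_cluster_box(heatmap, threshold=0.15):
    """Binarize the heatmap and return the box (x_min, y_min, x_max, y_max)
    around its largest 4-connected cluster, or None if the map is empty."""
    if heatmap.max() <= 0:
        return None
    mask = heatmap >= threshold * heatmap.max()
    seen = np.zeros(mask.shape, dtype=bool)
    rows, cols = mask.shape
    best_size, best_box = 0, None
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first flood fill over one cluster.
            queue = deque([(r, c)])
            seen[r, c] = True
            size, r0, r1, c0, c1 = 0, r, r, c, c
            while queue:
                y, x = queue.popleft()
                size += 1
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_box = size, (c0, r0, c1 + 1, r1 + 1)
    return best_box


def iou(a, b):
    """Jaccard similarity of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_positive(pred, real):
    """Eq. (3): the IoU must reach min(0.5, mn / ((m + 10)(n + 10)))."""
    m, n = real[2] - real[0], real[3] - real[1]
    return iou(pred, real) >= min(0.5, (m * n) / ((m + 10) * (n + 10)))


heatmap = np.zeros((6, 6))
heatmap[1:4, 1:4] = 1.0      # a 3 x 3 cluster
heatmap[5, 5] = 0.5          # a separate single-pixel cluster
pred = largest_cluster_box(heatmap)
real = (1, 1, 5, 5)
print(pred, iou(pred, real), is_positive(pred, real))
# (1, 1, 4, 4) 0.5625 True
```

Note that for small ground truth boxes the second term of the minimum drops below 0.5, which relaxes the required overlap.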
Our results are shown in Table 1. The numbers generally lie within ±1% of those found in the original
paper (these are in parentheses in the table). The slight differences can be attributed to the lack of
information about the layer that was used (in the case of VGG) and, more importantly, preprocessing
(image rescaling). We rescaled the images to 256 × 256 before applying a 224 × 224 center crop as
this is standard procedure for ImageNet. Overall, Grad-CAM provided fairly impressive localization
results, considering that the model was not explicitly trained for this task.
Pointing game
The pointing game is another technique for investigating Grad-CAM’s localization
ability. Originally introduced by Zhang et al. [9], this method extracts the maximally activated point
from a heatmap and checks whether it is within the bounding box of the target object category which,
in this case, is a ground truth label. The localization accuracy is then defined as
$$\text{Acc} = \frac{\#\text{Hits}}{\#\text{Hits} + \#\text{Misses}}$$
where a point within the bounding box is counted as a hit. In the Grad-CAM paper [1], this metric
(henceforth referred to as recall) is extended with the fact that now the top-5 predictions are used
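The basic pointing-game accuracy defined above (without the top-5 extension) can be sketched as below. The sketch assumes one heatmap and one ground truth box per image, with heatmaps at image resolution and boxes given as $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ with an exclusive maximum; the function name `pointing_game_accuracy` is hypothetical.

```python
import numpy as np


def pointing_game_accuracy(heatmaps, boxes):
    """Fraction of heatmaps whose maximally activated point lies
    inside the corresponding box (x_min, y_min, x_max, y_max)."""
    hits = misses = 0
    for heatmap, (x_min, y_min, x_max, y_max) in zip(heatmaps, boxes):
        # Row (y) and column (x) of the maximally activated point.
        y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        if x_min <= x < x_max and y_min <= y < y_max:
            hits += 1
        else:
            misses += 1
    total = hits + misses
    return hits / total if total else 0.0


first = np.zeros((4, 4))
first[1, 2] = 1.0            # maximum at row 1, column 2
second = np.zeros((4, 4))
second[3, 3] = 1.0           # maximum at row 3, column 3
acc = pointing_game_accuracy([first, second], [(1, 0, 4, 3), (0, 0, 2, 2)])
print(acc)  # 0.5: the first point is a hit, the second a miss
```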

FAQs

What is Grad-CAM and how does it work?
Grad-CAM, or Gradient-weighted Class Activation Mapping, is a technique used to visualize the decisions made by convolutional neural networks. It generates heatmaps that highlight the regions of an image that contribute most to the model's predictions. By backpropagating the gradients from the output layer to the final convolutional layer, Grad-CAM assigns importance weights to the feature maps, allowing for a visual representation of the model's focus during classification.
What are the main findings regarding Grad-CAM's effectiveness in medical imaging?
The analysis found that Grad-CAM performs well in localizing diseases in chest X-ray images, achieving notable accuracy in weakly supervised localization tasks. However, the results indicate that its performance may not be as strong in specialized medical datasets compared to other explanation methods. The study suggests that while Grad-CAM is effective, it may be outperformed by techniques like Integrated Gradients and SHAP in certain contexts.
How does Grad-CAM compare to other visualization methods?
Grad-CAM is compared to other visualization techniques such as Integrated Gradients and SHAP in terms of fidelity and contrastivity. The study reveals that Grad-CAM exhibits high contrastivity, especially in deeper networks, but may be more sensitive to threshold values. This comparison highlights the strengths and weaknesses of each method, providing insights into their applicability in different scenarios.
What challenges were faced during the analysis of Grad-CAM?
The analysis encountered several challenges, including reproducibility issues related to hyperparameter specifications in the original Grad-CAM paper. Additionally, the computational demands of the experiments required significant resources, and the lack of detailed methodology in the original research complicated the replication of results. These challenges emphasize the need for clearer guidelines in future studies.
What novel experiments were conducted in this analysis?
The analysis introduced several novel experiments, including an evaluation of Grad-CAM's localization accuracy on medical images and a comparison of its performance with other explanation methods. Additionally, new metrics were developed to assess the sensitivity and fidelity of Grad-CAM's visualizations. These experiments aimed to provide a deeper understanding of Grad-CAM's capabilities and limitations.
What is the significance of the findings related to Grad-CAM's sensitivity?
The findings suggest that Grad-CAM can exhibit sensitivity in certain scenarios, meaning it assigns nonzero attribution to features that significantly influence predictions. This characteristic is crucial for understanding how well Grad-CAM aligns with the theoretical expectations of explanation methods. The exploration of sensitivity adds a new dimension to the evaluation of Grad-CAM and its potential applications.
How does the user study contribute to the evaluation of Grad-CAM?
The user study aimed to assess the perceived trustworthiness of Grad-CAM visualizations compared to other methods like Guided Backpropagation. It revealed that participants had varied opinions on the reliability of the visualizations, indicating that the choice of method can influence user interpretation. This aspect of the study highlights the importance of user perception in the effectiveness of visualization techniques.