Analysis and Evaluation of Grad-CAM Explanations

Rajmund Nagy

rajmundn@kth.se

Doumitrou Daniil Nimara

nimara@kth.se

Livia Qian

liviaq@kth.se

Abstract

In this project, we reimplement the paper Grad-CAM: Visual Explanations from

Deep Networks via Gradient-based Localization from 2016, which introduced a visual

explanation method for convolutional neural networks. Our experiments focus

on evaluating the reproducibility of the results shown in the paper (e.g. localization

task, pointing game, comparison with occlusion maps); moreover, we propose

novel experiments in order to better understand the strengths and weaknesses of

this technique. In this regard, we 1) analyze Grad-CAM’s ability to explain chest

X-Rays (medicine is a field in which localization is of utmost importance) and

compare its localization capability with other explanation methods; 2) measure its

fidelity and contrastivity; and 3) introduce a new metric (to the best of our knowledge)

based on the notion of sensitivity. Our results advocate for Grad-CAM’s

efficacy in CNNs and provide new information regarding its particularities; for

instance, we show that great network performance does not translate as smoothly to

good localization in the more specialized medical dataset (where we achieve results

comparable with other papers). Furthermore, our implementation of Grad-CAM++

provides a promising alternative, outperforming Grad-CAM in the aforementioned

difficult dataset. Lastly, our fidelity experiments suggest that the method might be

outperformed by non-CNN based explanation methods when a large portion of the

network is non-convolutional.

1 Introduction

Despite the increase in the commercial use of deep learning, many neural networks are still treated as

black boxes. This is particularly problematic in tasks where mistakes are exceptionally costly (e.g.

self-driving cars). Many visual explanation methods have been developed in recent years to tackle

this issue. In this project, we investigate the paper "Grad-CAM: Visual Explanations from Deep

Networks via Gradient-based Localization" [1] from a reproducibility perspective and carry out three

new experiments to further evaluate the proposed technique. Our code is publicly available in our GitHub repository.

The remainder of this report is structured as follows: Section 2 will introduce related work

in order to better conceptualize Grad-CAM’s reasoning, approaches and strengths. We will then

summarize Grad-CAM in more detail in Section 3. After providing a firmer understanding of the

method, we will analyze the original paper’s reproducibility in Section 4. In Section 5, we will

explore new experiments to examine Grad-CAM in novel ways, both quantitatively and qualitatively.

Finally, we will summarize our findings and share our greatest challenges in Sections 6 and 7.

2 Related work

In 2016, Zhou et al. [2] showed that global average pooling layers (GAP) can help CNNs retain

their ability to localize objects despite being only trained for image classification. They proposed

Class Activation Maps (CAM), which visualize a network’s attention on a given image when making

a certain prediction by combining feature maps in the final convolutional layer. However, as the

calculation of CAMs poses strict constraints on the network architecture, their technique cannot be

applied to most CNNs. In 2017, Selvaraju et al. [1] removed these constraints with Grad-CAM

(Gradient-weighted CAM) by using the gradient information to represent the importance of each

feature map. Two limitations of CAM explanations remained, namely that they often fail to capture

the entire object and that there is a consistent drop in localization performance when multiple instances

of the same class are present. Chattopadhyay et al. [3] addressed them both with Grad-CAM++,

where positive pixel-wise gradients are incorporated into the weights.

Figure 1: Exemplary visualization of the implemented methods on VGG-16 for one of the dog classes.

Grad-CAM++ focuses on more relevant dog features (face). See Appendix for more examples.

HT20 Deep Learning Advanced, DD2412, KTH Royal Institute of Technology.

The method of Integrated Gradients [4] proposes a more thorough inspection of the input. It considers

the linear path between a baseline image (e.g. full black) and the actual input, and calculates the

importance of each spatial location by accumulating the pixel-wise gradients along this path. On

the other hand, SHAP (SHapley Additive exPlanations) [5] aims to quantify the contribution of a pixel

by taking its average marginal contributions across all possible feature subgroups (Shapley

value). The intuition is fairly simple. The features can be seen as agents cooperating in a game of

correctly classifying an image. Each agent can cooperate with $0, 1, \ldots, |\text{features}| - 1$ other features

toward the goal. Then, the importance of each agent can be viewed as their individual contribution to

the outcome, averaged over all possible groups. SHAP uses sampling techniques to measure these

quantities and return them as explanations.
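
The game-theoretic intuition above can be made concrete with a toy example. The following is a minimal sketch that enumerates all coalitions exactly, which is only feasible for a handful of features; it is not the sampling procedure used by SHAP. The feature names and the scoring function `toy_score` are hypothetical and exist only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.

    `value` maps a frozenset of players to the payoff of that coalition.
    """
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        # A player can join coalitions of 0, 1, ..., n - 1 other players.
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p to this coalition.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


def toy_score(coalition):
    # Hypothetical "class score" as a function of the visible features.
    score = 0.0
    if "ear" in coalition:
        score += 0.2
    if "snout" in coalition:
        score += 0.3
    if "ear" in coalition and "snout" in coalition:
        score += 0.4  # interaction term, shared equally by both features
    return score


phi = shapley_values(["ear", "snout", "grass"], toy_score)
```

In this toy game the irrelevant feature receives zero attribution, and the attributions sum to the score of the full feature set, which is the additivity property that gives SHAP its name.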

3 Methods

Grad-CAM [1] is a visual explanation method that can argue for why a network has made a certain

prediction for a specific image. Given a pretrained CNN-based model, an image and a class of interest

$c$, it generates a heatmap from the relevant layer’s feature map activations by first forward propagating

the image and then backpropagating the gradients to the layer of interest. Before backpropagation,

the gradients should be set to zero for every class except $c$. The heatmap is defined as the linear

combination of the feature map activations. The weight belonging to a specific feature map $A^k$ is

denoted by the neuron importance weight $\alpha_k^c$:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{i,j}^k} \qquad (1)$$

where $y^c$ is the class score belonging to $c$ and $Z$ is a normalization factor. Since we are only interested

in the features that have a positive influence on $y^c$, pixels with negative values can be canceled with

ReLU. The heatmap can then be calculated as

$$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left( \sum_k \alpha_k^c A^k \right) \qquad (2)$$
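
To illustrate how Equations (1) and (2) fit together, here is a minimal sketch in plain Python. It assumes the feature map activations and the gradients of the class score with respect to them have already been extracted from the network (for example with framework hooks), and it assumes the normalization factor is the number of spatial locations in a feature map. The function name `grad_cam` and the toy values are illustrative only.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations, gradients: lists of K feature maps, each an H x W nested list.
    gradients[k][i][j] is assumed to hold the gradient of the class score
    with respect to activations[k][i][j].
    """
    height, width = len(activations[0]), len(activations[0][0])
    z = height * width  # assumed normalization factor: number of locations
    # Equation (1): one importance weight per feature map.
    alphas = [sum(sum(row) for row in grad) / z for grad in gradients]
    # Equation (2): weighted sum of the feature maps, followed by ReLU.
    heatmap = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * fmap[i][j]
    return [[max(0.0, v) for v in row] for row in heatmap]


# Two toy 2 x 2 feature maps: the first supports the class, the second opposes it.
activations = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[0.0, 1.0], [3.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(activations, gradients)
```

Locations where the opposing feature map dominates end up negative before the ReLU and are therefore zeroed in the final heatmap.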

Grad-CAM is a generalization of CAM [2] as it works on any convolutional layer (CAM only worked

on the last convolutional layer if it was followed by a single fully connected softmax classification

layer). This makes it applicable to a wide range of CNN families and capable of generating heatmaps

at different levels of detail. Another advantage is that it does not modify the base

network’s architecture, thus allowing for computational efficiency and adaptability.

Guided Grad-CAM, a method also presented by Selvaraju et al. [1], is a combination of Grad-CAM

and Guided Backpropagation [6]. It shows the fine-grained details and the relevant edges in an image

at the same time as localizing the important areas by overlaying the Grad-CAM heatmap on the

image created by Guided Backpropagation. In accordance with this, it can be calculated by taking the

element-wise product of the outputs of these two methods.

Table 1: Classification and localization errors measured on the ILSVRC-2015 validation dataset. We

always used the last convolutional ReLU layer for visualization.

Model        Classification error (%)         Localization error (%)
             Top-1           Top-5            Top-1           Top-5
AlexNet      44.58 (44.2)    21.69 (20.8)     68.04 (68.3)    56.18 (56.6)
GoogLeNet    32.46 (31.9)    11.82 (11.3)     56.89 (60.09)   45.44 (49.34)
VGG-16       30.94 (30.38)   10.87 (10.89)    55.82 (56.51)   44.82 (46.41)
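
The element-wise product that defines Guided Grad-CAM can be sketched as follows. Because the Grad-CAM heatmap is coarser than the input image, it has to be upsampled first; nearest-neighbour upsampling is used here purely as a simplifying assumption, and the function names are illustrative.

```python
def upsample_nearest(heatmap, out_h, out_w):
    """Resize a coarse heatmap to out_h x out_w by nearest-neighbour lookup."""
    in_h, in_w = len(heatmap), len(heatmap[0])
    return [
        [heatmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def guided_grad_cam(heatmap, guided_backprop):
    """Element-wise product of an upsampled heatmap and a Guided Backpropagation map."""
    out_h, out_w = len(guided_backprop), len(guided_backprop[0])
    coarse = upsample_nearest(heatmap, out_h, out_w)
    return [
        [coarse[i][j] * guided_backprop[i][j] for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy inputs: a 2 x 2 heatmap and a constant 4 x 4 Guided Backpropagation map.
heatmap = [[1.0, 0.0], [0.0, 2.0]]
guided = [[0.5] * 4 for _ in range(4)]
combined = guided_grad_cam(heatmap, guided)
```

The fine-grained map survives only where the heatmap is nonzero, which is how the combination keeps edges inside the class-relevant region.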

Lastly, Grad-CAM++ [3] is a proposed improvement of Grad-CAM which applies a ReLU on the

gradients $\partial y^c / \partial A_{i,j}^k$ to filter out gradients that have a negative influence on the output class (similarly to

Guided Backpropagation). Figure 1 presents examples of images produced by the methods mentioned.
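
The effect of rectifying the gradients before aggregating them can be seen in a small sketch. It illustrates only the ReLU step described above and assumes uniform averaging over spatial locations; the pixel-wise weighting coefficients of the full Grad-CAM++ formulation in [3] are not reproduced here. Both function names are illustrative.

```python
def averaged_weight(grad):
    """Plain average of the gradients, as in the Grad-CAM weight."""
    values = [g for row in grad for g in row]
    return sum(values) / len(values)


def rectified_weight(grad):
    """Average of ReLU(gradient): negative gradients are zeroed first."""
    values = [max(0.0, g) for row in grad for g in row]
    return sum(values) / len(values)


# A feature map whose positive and negative gradients cancel out.
grad = [[2.0, -2.0], [0.0, 0.0]]
plain = averaged_weight(grad)
rectified = rectified_weight(grad)
```

With plain averaging the feature map receives zero weight, whereas the rectified version still credits the location with a positive gradient.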

Throughout our experiments, we used VGG-16/VGG-16-BN, AlexNet and GoogLeNet (all three

pretrained on ImageNet), DenseNet (pretrained on NIH Chest-X-ray14 [7]) and trained a simple

three-layer convolutional network on MNIST (see Section 5). For the reproducibility tasks, we

used the ILSVRC 2015 validation dataset [8] that contains 50k images of 1,000 categories and the

corresponding bounding boxes. Chest-X-ray14 contains 112,120 X-ray images of 14 + 1 different

classes (14 of them representing detectable diseases and one implying "no findings"). As bounding

boxes are only available for 984 images, our experiment on medical images was restricted to them.

The images were resized to 224 × 224 and the bounding boxes were modified accordingly.

4 Reproducibility study

Localization ability

An intuitive application of Grad-CAM’s heatmaps is in localization tasks

where we are interested in not only the occurrence but also the location of an object. This task can

be approached with bounding boxes; it can be viewed as a supervised regression problem where the

label $y = (x_{\min}, y_{\min}, x_{\max}, y_{\max})$ is compared against ground truth bounding boxes. Generating

labeled bounding boxes can be costly, especially in fields where expertise is needed (e.g., medical

data). Because of this, it can be interesting to use Grad-CAM heatmaps in a weak localization task where the network

is not explicitly trained on bounding boxes. Given an image, we can generate a heatmap and convert it

to a binary map by e.g. using a 15% threshold. This binary image will then contain multiple clusters

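around which bounding boxes may be drawn. We isolate the one with the largest area and compare it

with the true bounding box by computing the Jaccard similarity

$$J(\text{box}_{\text{predicted}}, \text{box}_{\text{real}}) = \frac{|\text{box}_{\text{predicted}} \cap \text{box}_{\text{real}}|}{|\text{box}_{\text{predicted}} \cup \text{box}_{\text{real}}|},$$

also known as IoU score. We can then regard this as a binary classification problem where $(m, n)$ is the

size of $\text{box}_{\text{real}}$ and positive predictions can be counted as:

$$\text{box}_{\text{predicted}} \simeq \text{box}_{\text{real}} \iff J(\text{box}_{\text{predicted}}, \text{box}_{\text{real}}) \geq \min\left(0.5, \frac{m \cdot n}{(m + 10)(n + 10)}\right) \qquad (3)$$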

Our results are shown in Table 1. The numbers generally lie within ±1% of those found in the original

paper (these are in parentheses in the table). The slight differences can be attributed to the lack of

information about the layer that was used (in the case of VGG) and more importantly preprocessing

(image rescaling). We rescaled the images to 256 × 256 before applying a 224 × 224 center crop as

this is standard procedure for ImageNet. Overall, Grad-CAM provided fairly impressive localization

results, considering that the model was not explicitly trained for this task.
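
The weak localization procedure described in this subsection (thresholding, selecting the largest cluster, computing the IoU and applying criterion (3)) can be sketched in plain Python. This is not our evaluation code but a minimal illustration under stated assumptions: the threshold is taken relative to the heatmap maximum, clusters are 4-connected, "largest" refers to the area of the enclosing box, and boxes use inclusive pixel coordinates. All function names are illustrative.

```python
from collections import deque


def largest_box(heatmap, threshold=0.15):
    """Box (x_min, y_min, x_max, y_max) around the largest thresholded cluster."""
    h, w = len(heatmap), len(heatmap[0])
    cutoff = threshold * max(max(row) for row in heatmap)
    mask = [[v >= cutoff and v > 0 for v in row] for row in heatmap]
    seen = [[False] * w for _ in range(h)]
    best, best_area = None, 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search over one 4-connected cluster.
            queue = deque([(y, x)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            area = (x1 - x0 + 1) * (y1 - y0 + 1)
            if area > best_area:
                best, best_area = (x0, y0, x1, y1), area
    return best


def box_area(box):
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def iou(a, b):
    """Jaccard similarity of two inclusive pixel boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(0, iw) * max(0, ih)
    return inter / (box_area(a) + box_area(b) - inter)


def is_hit(pred, real):
    """Criterion (3): the IoU must reach min(0.5, mn / ((m + 10)(n + 10)))."""
    m = real[2] - real[0] + 1
    n = real[3] - real[1] + 1
    return iou(pred, real) >= min(0.5, m * n / ((m + 10) * (n + 10)))


heatmap = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0, 0.0, 0.3],
    [0.0, 0.8, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0, 0.0],
]
predicted = largest_box(heatmap)
real = (1, 1, 3, 2)
```

Note that for small ground truth boxes the second term of the minimum in (3) becomes the effective threshold, which makes the criterion considerably more lenient than a fixed IoU of 0.5.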

Pointing game

Pointing game is another technique for investigating Grad-CAM’s localization

ability. Originally introduced by Zhang et al. [9], this method extracts the maximally activated point

from a heatmap and checks whether it is within the bounding box of the target object category – which,

in this case, is a ground truth label. The localization accuracy is then defined as

$$\text{Acc} = \frac{\#\text{Hits}}{\#\text{Hits} + \#\text{Misses}}$$

where a point within the bounding box is counted as a hit. In the Grad-CAM paper [1], this metric –

henceforth referred to as recall – is extended with the fact that now the top-5 predictions are used

