MOTIVATION OF READING: interpretability for speech tasks
Link: http://arxiv.org/abs/2204.03852
Code: http://project.cslt.org/
1. Overview
Motivation of the work:
It is unclear whether any of these visualization tools are reliable when applied to speaker recognition, which makes the conclusions drawn from visualization not fully convincing.
Three CAM algorithms will be investigated: Grad-CAM++, Score-CAM and Layer-CAM. The main idea of these algorithms is to generate a saliency map by combining the activation maps (channels) of a convolutional layer.
2. Methodology
A class activation map (CAM) is a saliency map that shows the important regions used by the CNN to identify a particular class.
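To make the definition concrete, below is a minimal numpy sketch of the shared recipe (shapes and names are illustrative, not from the paper): a saliency map is a per-channel weighted sum of a layer's activation maps followed by ReLU, and the three algorithms differ only in how the weights are computed.

import numpy as np

def cam(activations, weights):
    # activations: (K, H, W) activation maps of one conv layer
    # weights: (K,) per-channel weights (algorithm-specific)
    m = np.tensordot(weights, activations, axes=1)  # weighted sum over channels -> (H, W)
    m = np.maximum(m, 0.0)                          # ReLU keeps positive evidence only
    return m / (m.max() + 1e-8)                     # normalize to [0, 1] for display

# toy stand-in for a conv layer's output on a (freq x time) input
A = np.random.rand(8, 20, 100)
saliency = cam(A, np.random.rand(8))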
2.1 Grad-CAM and Grad-CAM++
Grad-CAM: weights each activation map by the global-average-pooled gradient of the target score with respect to that map.
Grad-CAM++: replaces the global pooling with location-wise coefficients derived from higher-order gradient terms, which better handles multiple salient regions.
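A hedged sketch of the two weighting rules, assuming the activations and gradients of the chosen layer are already available as numpy arrays; the Grad-CAM++ branch uses the common exponential-output approximation, where higher-order derivatives reduce to powers of the first-order gradient.

import numpy as np

def grad_cam_weights(grads):
    # Grad-CAM: global-average-pool the gradients over space -> (K,) weights
    return grads.mean(axis=(1, 2))

def grad_cam_pp_weights(acts, grads, eps=1e-8):
    # Grad-CAM++ closed form under the exponential-output approximation
    g2, g3 = grads ** 2, grads ** 3
    denom = 2.0 * g2 + acts.sum(axis=(1, 2), keepdims=True) * g3
    alpha = g2 / (denom + eps)                        # (K, H, W) location coefficients
    return (alpha * np.maximum(grads, 0.0)).sum(axis=(1, 2))

Either weight vector can then be fed to the generic cam() combination sketched above.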
2.2 Score-CAM
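Score-CAM is gradient-free: each activation map, normalized to [0, 1] and used as a mask on the input, is weighted by how much it raises the target score. A minimal sketch, assuming score_fn is a hypothetical callable returning the target-class score and that the activation maps are already upsampled to the input size:

import numpy as np

def score_cam(x, acts, score_fn):
    # x: (H, W) input features; acts: (K, H, W) upsampled activation maps
    base = score_fn(x)
    w = []
    for a in acts:
        mask = (a - a.min()) / (a.max() - a.min() + 1e-8)  # normalize map to [0, 1]
        w.append(score_fn(x * mask) - base)                # score increase when masked
    w = np.asarray(w)
    w = np.exp(w - w.max())
    w /= w.sum()                                           # softmax over channels
    return np.maximum((w[:, None, None] * acts).sum(axis=0), 0.0)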
2.3 Layer-CAM
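Layer-CAM drops the per-channel weight altogether: each activation value is gated element-wise by the ReLU of its own gradient and then summed over channels, which is what allows sharp maps even at shallow layers. A minimal sketch under the same array conventions as above:

import numpy as np

def layer_cam(acts, grads):
    # acts, grads: (K, H, W) activations and their gradients w.r.t. the target score
    m = (np.maximum(grads, 0.0) * acts).sum(axis=0)  # location-wise gating, sum channels
    return np.maximum(m, 0.0)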
3. Experiment
Speaker model
3.1 Single-speaker experiment
Grad-CAM++ and Score-CAM tend to regard all the speech segments as important, while Layer-CAM produces more selective and localized patterns.
This shows that the three CAM algorithms indeed find salient regions. For example, in the insertion experiment, the curves of the CAM algorithms are clearly much higher than that of random masking, indicating that the regions exposed earlier by the CAMs are indeed more important than random regions.
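A sketch of the insertion-style evaluation described here (score_fn and the step count are assumptions, not the paper's exact protocol): reveal time-frequency bins in decreasing saliency order, re-score after each step, and compare the resulting curve against one built from a random order.

import numpy as np

def insertion_curve(x, saliency, score_fn, steps=20):
    # start from all-zero features and progressively reveal the most salient bins
    order = np.argsort(saliency.ravel())[::-1]        # most salient first
    revealed = np.zeros_like(x)
    curve = []
    for chunk in np.array_split(order, steps):
        revealed.ravel()[chunk] = x.ravel()[chunk]
        curve.append(score_fn(revealed))
    return np.asarray(curve)  # its mean/area is the insertion AUC; higher is better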
3.2 Multi-speaker experiment
In the multi-speaker experiment, we concatenate an utterance of the target speaker with one or two utterances of other interfering speakers, and draw the saliency map.
A denotes the target speaker while B denotes the interfering speaker.
Layer-CAM shows surprisingly good performance: it can accurately locate the segments of the target speaker, and mask non-target speakers almost perfectly. In comparison, Grad-CAM++ and Score-CAM are very weak in detecting non-target speakers.
It can be seen that Layer-CAM gains much better AUCs than the other two CAMs.
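Reading this AUC as frame-level detection of the target speaker, below is a small sketch of how it could be computed, assuming the concatenation boundaries give ground-truth frame labels (all shapes and values are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

T_a, T_b = 300, 300                               # frames of target A and interferer B
saliency = np.random.rand(80, T_a + T_b)          # (freq, time) map; stand-in values
frame_score = saliency.mean(axis=0)               # pool over frequency -> one score/frame
labels = np.r_[np.ones(T_a), np.zeros(T_b)]       # 1 = target-speaker frame
print("frame-level AUC:", roc_auc_score(labels, frame_score))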
3.3 Localization and recognition
Since Layer-CAM can localize target speakers, it can be used as a tool for joint localization and recognition: first identify where the target speaker resides, then perform speaker recognition using only the located segments. The assumption is that this outperforms using the entire utterance.
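A minimal sketch of this localize-then-recognize pipeline; embed() and the 0.5 threshold are assumptions for illustration, not the paper's exact settings:

import numpy as np

def localize_then_recognize(feats, saliency, enroll_emb, embed, thresh=0.5):
    # feats, saliency: (freq, time) arrays; embed: hypothetical feature->embedding function
    frame_score = saliency.mean(axis=0)
    frame_score /= frame_score.max() + 1e-8
    keep = frame_score >= thresh                  # frames attributed to the target speaker
    emb = embed(feats[:, keep])                   # recognize on located segments only
    return emb @ enroll_emb / (np.linalg.norm(emb) * np.linalg.norm(enroll_emb) + 1e-8)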
OBSERVATIONS:
1. Layer-CAM, in contrast to Grad-CAM++ and Score-CAM, delivers a remarkable performance improvement, and this holds for the saliency maps at all layers.
2. Although the saliency maps produced by Layer-CAM are informative at all layers, the one from S2 seems the most discriminative. One possibility is that the saliency map of S2 is more conservative and retains more regions compared to the ones obtained from higher layers.
3. We find that for Layer-CAM, aggregating saliency maps from different layers can improve performance.
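The notes do not restate the paper's exact fusion rule; one plausible reading, sketched below, is to normalize each layer's map to [0, 1] and combine them element-wise:

import numpy as np

def aggregate_layers(maps):
    # maps: list of (H, W) saliency maps from different layers, upsampled to one size
    norm = [m / (m.max() + 1e-8) for m in maps]
    return np.mean(norm, axis=0)                  # element-wise average across layers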