MOTIVATION OF READING: interpretability for speech tasks
Link: http://arxiv.org/abs/2204.03852
Code: http://project.cslt.org/
1. Overview
Motivation of the work:
It is unclear whether any of these visualization tools are reliable when applied to speaker recognition, which makes the conclusions drawn from visualization not fully convincing.
Three CAM algorithms will be investigated: Grad-CAM++, Score-CAM and Layer-CAM. The main idea of these algorithms is to generate a saliency map by combining the activation maps (channels) of a convolutional layer.
2. Methodology
A class activation map (CAM) is a saliency map that shows the important regions used by the CNN to identify a particular class.
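To make the definition concrete, below is a minimal numpy sketch of the shared recipe (shapes and names are illustrative, not from the paper): a saliency map is a per-channel weighted sum of a layer's activation maps followed by ReLU, and the three algorithms differ only in how the weights are computed.

import numpy as np

def cam(activations, weights):
    # activations: (K, H, W) activation maps of one conv layer
    # weights: (K,) per-channel weights (algorithm-specific)
    m = np.tensordot(weights, activations, axes=1)  # weighted sum over channels -> (H, W)
    m = np.maximum(m, 0.0)                          # ReLU keeps positive evidence only
    return m / (m.max() + 1e-8)                     # normalize to [0, 1] for display

# toy stand-in for a conv layer's output on a (freq x time) input
A = np.random.rand(8, 20, 100)
saliency = cam(A, np.random.rand(8))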
2.1 Grad-CAM and Grad-CAM++
Grad-CAM: weights each activation map by the global-average-pooled gradient of the target score with respect to that map.
Grad-CAM++: replaces the global pooling with location-wise coefficients derived from higher-order gradient terms, which better handles multiple salient regions.
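A hedged sketch of the two weighting rules, assuming the activations and gradients of the chosen layer are already available as numpy arrays; the Grad-CAM++ branch uses the common exponential-output approximation, where higher-order derivatives reduce to powers of the first-order gradient.

import numpy as np

def grad_cam_weights(grads):
    # Grad-CAM: global-average-pool the gradients over space -> (K,) weights
    return grads.mean(axis=(1, 2))

def grad_cam_pp_weights(acts, grads, eps=1e-8):
    # Grad-CAM++ closed form under the exponential-output approximation
    g2, g3 = grads ** 2, grads ** 3
    denom = 2.0 * g2 + acts.sum(axis=(1, 2), keepdims=True) * g3
    alpha = g2 / (denom + eps)                        # (K, H, W) location coefficients
    return (alpha * np.maximum(grads, 0.0)).sum(axis=(1, 2))

Either weight vector can then be fed to the generic cam() combination sketched above.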
2.2 Score-CAM
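Score-CAM is gradient-free: each activation map, normalized to [0, 1] and used as a mask on the input, is weighted by how much it raises the target score. A minimal sketch, assuming score_fn is a hypothetical callable returning the target-class score and that the activation maps are already upsampled to the input size:

import numpy as np

def score_cam(x, acts, score_fn):
    # x: (H, W) input features; acts: (K, H, W) upsampled activation maps
    base = score_fn(x)
    w = []
    for a in acts:
        mask = (a - a.min()) / (a.max() - a.min() + 1e-8)  # normalize map to [0, 1]
        w.append(score_fn(x * mask) - base)                # score increase when masked
    w = np.asarray(w)
    w = np.exp(w - w.max())
    w /= w.sum()                                           # softmax over channels
    return np.maximum((w[:, None, None] * acts).sum(axis=0), 0.0)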
2.3 Layer-CAM
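Layer-CAM drops the per-channel weight altogether: each activation value is gated element-wise by the ReLU of its own gradient and then summed over channels, which is what allows sharp maps even at shallow layers. A minimal sketch under the same array conventions as above:

import numpy as np

def layer_cam(acts, grads):
    # acts, grads: (K, H, W) activations and their gradients w.r.t. the target score
    m = (np.maximum(grads, 0.0) * acts).sum(axis=0)  # location-wise gating, sum channels
    return np.maximum(m, 0.0)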
3. Experiment
Speaker model
3.1 Single-speaker experiment
Grad-CAM++ and Score-CAM tend to regard all the speech segments as important, while Layer-CAM produces more selective and localized patterns.
This shows that the three CAM algorithms indeed find salient regions. For example, in the insertion experiment, the curves of the CAM algorithms are clearly much higher than that of random masking, indicating that the regions exposed earlier by the CAMs are indeed more important than random regions.
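A sketch of the insertion-style evaluation described here (score_fn and the step count are assumptions, not the paper's exact protocol): reveal time-frequency bins in decreasing saliency order, re-score after each step, and compare the resulting curve against one built from a random order.

import numpy as np

def insertion_curve(x, saliency, score_fn, steps=20):
    # start from all-zero features and progressively reveal the most salient bins
    order = np.argsort(saliency.ravel())[::-1]        # most salient first
    revealed = np.zeros_like(x)
    curve = []
    for chunk in np.array_split(order, steps):
        revealed.ravel()[chunk] = x.ravel()[chunk]
        curve.append(score_fn(revealed))
    return np.asarray(curve)  # its mean/area is the insertion AUC; higher is better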
3.2 Multi-speaker experiment
In the multi-speaker experiment, we concatenate an utterance of the target speaker with one or two utterances of other interfering speakers, and draw the saliency map.
A denotes the target speaker while B denotes the interfering speaker.
Layer-CAM shows surprisingly good performance: it can accurately locate the segments of the target speaker, and mask non-target speakers almost perfectly. In comparison, Grad-CAM++ and Score-CAM are very weak in detecting non-target speakers.
It can be seen that Layer-CAM gains much better AUCs than the other two CAMs.
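Reading this AUC as frame-level detection of the target speaker, below is a small sketch of how it could be computed, assuming the concatenation boundaries give ground-truth frame labels (all shapes and values are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

T_a, T_b = 300, 300                               # frames of target A and interferer B
saliency = np.random.rand(80, T_a + T_b)          # (freq, time) map; stand-in values
frame_score = saliency.mean(axis=0)               # pool over frequency -> one score/frame
labels = np.r_[np.ones(T_a), np.zeros(T_b)]       # 1 = target-speaker frame
print("frame-level AUC:", roc_auc_score(labels, frame_score))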
3.3 Localization and recognition
Since Layer-CAM can localize target speakers, it can be used as a tool for joint localization and recognition: first identify where the target speaker resides, then perform speaker recognition using only the located segments. The assumption is that this outperforms using the entire utterance.
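A minimal sketch of this localize-then-recognize pipeline; embed() and the 0.5 threshold are assumptions for illustration, not the paper's exact settings:

import numpy as np

def localize_then_recognize(feats, saliency, enroll_emb, embed, thresh=0.5):
    # feats, saliency: (freq, time) arrays; embed: hypothetical feature->embedding function
    frame_score = saliency.mean(axis=0)
    frame_score /= frame_score.max() + 1e-8
    keep = frame_score >= thresh                  # frames attributed to the target speaker
    emb = embed(feats[:, keep])                   # recognize on located segments only
    return emb @ enroll_emb / (np.linalg.norm(emb) * np.linalg.norm(enroll_emb) + 1e-8)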
OBSERVATIONS:
1. Layer-CAM, in contrast to Grad-CAM++ and Score-CAM, delivers a remarkable performance improvement, and this holds for the saliency maps at all layers.
2. Although the saliency maps produced by Layer-CAM are informative at all layers, the one from S2 seems the most discriminative. One possibility is that the saliency map of S2 is more conservative and retains more regions compared to the ones obtained from higher layers.
3. We find that for Layer-CAM, aggregating saliency maps from different layers can improve performance.
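The notes do not restate the paper's exact fusion rule; one plausible reading, sketched below, is to normalize each layer's map to [0, 1] and combine them element-wise:

import numpy as np

def aggregate_layers(maps):
    # maps: list of (H, W) saliency maps from different layers, upsampled to one size
    norm = [m / (m.max() + 1e-8) for m in maps]
    return np.mean(norm, axis=0)                  # element-wise average across layers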