Multimodal Cross-And Self-Attention Network For Speech Emotion Recognition (2021 ICASSP) Paper

dev_Lumin 2022. 1. 25. 20:28

Abstract

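Speech emotion recognition (SER) requires a thorough understanding of both the linguistic content of an utterance and how the speaker utters it.
How to fuse these two kinds of information is one of the key challenges in SER.
The paper proposes a new Multimodal Cross- and Self-Attention Network (MCSAN).
The core of MCSAN is to employ parallel cross- and self-attention modules.
- These modules are used to explicitly model both the intra- and inter-modal interactions of audio and text
MCSAN is evaluated on the IEMOCAP and MELD datasets.
Experiments show that the proposed model achieves state-of-the-art performance on both datasets.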

Introduction

Most recent SER studies focus only on acoustic information.
The textual information in speech is used less.
In general, two kinds of interaction must be considered when combining multimodal information:
- Intra-modal interactions
  - fine-grained feature interactions within a single modality
  - frame-to-frame relations within acoustic features
  - word-to-word relations within textual features
- Inter-modal interactions
  - relations between text and speech features
Recent work that combines acoustic and textual information for SER falls into roughly three categories:
- 1) Build an independent model for each modality and combine their outputs for the final emotion classification
  - e.g. [14]
  - A different architecture can be adopted for each modality to best fit its input
  - Intra-modal interactions can be captured, but inter-modal interactions are not explored
- 2) Use aligned audio and text as input
  - The aligned features are first combined and then fed into a time-dependent model for sequential learning
  - Inter-modal interactions are therefore captured throughout the whole process
  - However, providing the alignment information is costly
- 3) Use attention mechanisms to infer latent cross-modal relationships between audio and text
  - e.g. multi-hop mechanisms, or attention that learns the aligned speech frames for each word
  - However, none of these explicitly models both the intra- and inter-modal interactions of audio and text
MCSAN is proposed to address this problem.
It uses one cross-attention module and two self-attention modules.
The cross-attention module uses the cross-attention mechanism to pass information between audio and text.
With this design, MCSAN can explicitly model both the inter- and intra-modal interactions of audio and text.
Two datasets are used to demonstrate the effectiveness of MCSAN.
An ablation study is also conducted.

2. The Proposed Model

MCSAN first uses an audio encoder and a text encoder to encode the acoustic and textual features, respectively.
The encoded feature sequences are fed into the cross- and self-attention modules, which learn the inter- and intra-modal interactions of audio and text.
Finally, the outputs of these modules are concatenated and sent to a fully connected classifier for emotion prediction.

2.1 Audio Encoder

The input is the acoustic feature sequence of an utterance, of size $T^{'}_{a} \times d_a$.

$T^{'}_{a}$ : number of acoustic frames
$d_a$ : feature dimension
A CNN followed by an LSTM is used as the audio encoder.
- Two 1D temporal convolutional layers capture local patterns
Because $T^{'}_{a}$ is typically large, each convolutional layer is followed by a max-pooling layer that reduces the temporal resolution and eases subsequent learning.
A BiLSTM layer is then applied to capture the temporal dependencies within the sequence.
The forward and backward hidden states of the BiLSTM are averaged to obtain the encoded acoustic features.

$T_a$ : number of acoustic frames after the second pooling layer
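
To get a feel for how much the two pooling layers shrink the sequence, the arithmetic can be sketched as below. This is a rough illustration, not the paper's code: it assumes the convolutions preserve the sequence length ("same" padding, stride 1) and that each max-pooling layer uses kernel size 3 with stride 3 and no padding. The post only states the kernel size (section 3.2), so the stride is an assumption, and the function names are hypothetical.

```python
def pooled_length(num_frames: int, kernel: int = 3, stride: int = 3) -> int:
    """Output length of an unpadded 1D max-pooling layer (floor mode)."""
    if num_frames < kernel:
        return 0
    return (num_frames - kernel) // stride + 1


def encoder_output_length(num_frames: int, num_blocks: int = 2,
                          kernel: int = 3, stride: int = 3) -> int:
    """Length after `num_blocks` conv + max-pool blocks.

    The convolutions are assumed to keep the length unchanged,
    so only the pooling layers reduce the temporal resolution.
    """
    length = num_frames
    for _ in range(num_blocks):
        length = pooled_length(length, kernel, stride)
    return length


t_a = encoder_output_length(1000)  # 1000 is the max MFCC length from section 3.2
```

Under these assumptions a 1000-frame input is reduced to $T_a = 111$ frames before the BiLSTM.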

2.2 Text Encoder

$T_l$ : number of words
$d_l$ : feature dimension
Since $T_l$ is usually small, only a BiLSTM layer is used to encode the word-level textual features.

2.3 Cross-Attention Module

This module consists of a position embedding layer (omitted from Fig. 1 for simplicity) followed by a stack of $N$ cross-attention layers and feed-forward layers.
The position embedding layer injects temporal information into the feature sequence.
The insight behind this module is to use the cross-attention mechanism to learn the correlation between the two modalities and to let each modality propagate information to the other according to the learned correlation.
To learn the correlation between audio and text, each feature sequence is first transformed by linear projections into three views: query, key, and value.

The correlation between the two modalities is estimated by computing the dot products between the queries and keys of audio and text in a crossed way.
The result is scaled and normalized row-wise with softmax to obtain the attention weights.
The values of each feature sequence are then aggregated using the corresponding attention weights.
The description above is for single-head attention; the paper uses multi-head attention with $N$ heads.
Finally, the features of each modality are updated with the information propagated from the other modality.

In addition, a fully connected feed-forward layer is added after the cross-attention layer to further increase the representation capacity.
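
The steps above can be sketched in NumPy as single-head cross-attention. This is a minimal illustration under assumptions, not the paper's implementation: the projection matrices are random stand-ins for learned weights, the update is written as a residual sum, and position embeddings, multiple heads, and the feed-forward layer are left out. `cross_attend`, `softmax_rows`, and `random_projection` are hypothetical names.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_attend(target, source, w_q, w_k, w_v):
    """Update `target` (T_t, d) with information gathered from `source` (T_s, d)."""
    q = target @ w_q  # queries come from the modality being updated
    k = source @ w_k  # keys come from the other modality
    v = source @ w_v  # values come from the other modality
    weights = softmax_rows(q @ k.T / np.sqrt(q.shape[-1]))  # (T_t, T_s)
    # Residual update: an assumption about how propagated information is merged.
    return target + weights @ v, weights


def random_projection(rng, dim):
    """Random stand-in for a learned (dim, dim) projection matrix."""
    return rng.standard_normal((dim, dim)) / np.sqrt(dim)


rng = np.random.default_rng(0)
dim = 8
audio = rng.standard_normal((111, dim))  # T_a encoded audio frames
text = rng.standard_normal((12, dim))    # T_l encoded words

# Audio attends to text, and text attends to audio.
audio_updated, audio_to_text = cross_attend(
    audio, text, *(random_projection(rng, dim) for _ in range(3)))
text_updated, text_to_audio = cross_attend(
    text, audio, *(random_projection(rng, dim) for _ in range(3)))
```

Each row of `audio_to_text` is a distribution over the words, so every audio frame receives a weighted mix of text values, and the reverse holds for `text_to_audio`. No alignment between frames and words is needed as input.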

2.4 Self-Attention Module

A self-attention module is placed in parallel with the cross-attention module to capture the intra-modal interactions within audio and within text.

The overall process is as follows.
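
As a rough sketch of what self-attention over one modality looks like (an illustration under assumptions, not the paper's exact formulation), the block below runs multi-head self-attention on a single sequence. Queries, keys, and values all come from the same modality. The 8 heads follow section 3.2; the random weights, the output projection `w_o`, the residual update, and the function names are assumptions.

```python
import numpy as np


def softmax_rows(scores):
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over one sequence x of shape (T, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(projected):
        # (T, d_model) -> (num_heads, T, d_head)
        return projected.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, T)
    weights = softmax_rows(scores)
    heads = weights @ v                                   # (heads, T, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    # Residual connection around the attention output (an assumption).
    return x + merged @ w_o, weights


rng = np.random.default_rng(0)
d_model, num_heads = 16, 8
text = rng.standard_normal((12, d_model))  # T_l encoded words
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4))

text_out, text_weights = multi_head_self_attention(
    text, w_q, w_k, w_v, w_o, num_heads)
```

The same function would be applied separately to the encoded audio sequence, which is why two self-attention modules are needed.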

2.5 Classification

For the final classification, the outputs of each cross- and self-attention module are first summarized with a global max-pooling layer.

They are then concatenated to obtain the utterance-level representation.
Finally, a fully connected network and a softmax layer predict the underlying emotion.
Cross-entropy loss is used.
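
The classification head can be sketched as follows. This is a simplified illustration: a single linear layer stands in for the fully connected network, the weights are random, and the choice of four module outputs (cross-attended audio and text plus self-attended audio and text) is an assumption about how the outputs are counted. All names are hypothetical.

```python
import numpy as np


def utterance_representation(module_outputs):
    """Global max-pool each (T, d) sequence over time, then concatenate."""
    return np.concatenate([seq.max(axis=0) for seq in module_outputs])


def softmax(logits):
    """Softmax over a 1D logit vector, shifted for numerical stability."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -np.log(probs[label])


rng = np.random.default_rng(0)
dim, num_classes = 8, 7
# Four hypothetical module outputs with different sequence lengths.
outputs = [rng.standard_normal((t, dim)) for t in (111, 12, 111, 12)]

features = utterance_representation(outputs)  # shape (4 * dim,)
weight = rng.standard_normal((features.size, num_classes)) * 0.1
bias = np.zeros(num_classes)
probs = softmax(features @ weight + bias)
loss = cross_entropy(probs, label=2)
```

Global max-pooling removes the time axis, so sequences of different lengths (audio frames versus words) can be concatenated into one fixed-size vector.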

3. Experiments

3.1 Datasets

IEMOCAP is the most widely used dataset in SER.
7487 utterances in total, covering 7 emotions
- frustration, neutral, anger, sadness, excitement, happiness, surprise
10-fold cross-validation (8:1:1 split into training, validation, and test)
Evaluation metrics: weighted accuracy (WA) and unweighted accuracy (UA)
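
For reference, the two metrics can be computed as in the sketch below, using the usual SER definitions: WA is the overall accuracy over all utterances, and UA is the mean of the per-class recalls. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example: a model that always predicts the majority class.
y_true = ["neutral"] * 4 + ["anger"]
y_pred = ["neutral"] * 5

wa = weighted_accuracy(y_true, y_pred)    # 4 of 5 correct
ua = unweighted_accuracy(y_true, y_pred)  # recalls are 1.0 and 0.0
```

In this toy case WA is 0.8 while UA is only 0.5, which is why UA is reported alongside WA on class-imbalanced data.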
MELD is a newer multimodal dataset for emotion recognition in conversation.
13708 utterances with seven emotions (anger, disgust, fear, joy, neutral, sadness, and surprise) from 1433 dialogues of the classic TV series Friends
training (9989), validation (1109), test (2610)
Evaluation metric: average F1 score
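
A sketch of the averaged F1 score is given below. The post does not say which averaging is used; results on MELD are commonly reported as F1 weighted by class support, so both the weighted and the macro variant are shown. This is an assumption, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for every label that appears in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def average_f1(y_true, y_pred, weighted=True):
    """Support-weighted F1 if `weighted`, otherwise the plain (macro) mean."""
    scores = per_class_f1(y_true, y_pred)
    if weighted:
        support = Counter(y_true)
        return sum(scores[c] * support[c] for c in scores) / len(y_true)
    return sum(scores.values()) / len(scores)


y_true = ["joy", "joy", "joy", "anger"]
y_pred = ["joy", "joy", "anger", "anger"]

weighted_f1 = average_f1(y_true, y_pred, weighted=True)
macro_f1 = average_f1(y_true, y_pred, weighted=False)
```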

3.2 Implementation Details

40-dimensional MFCCs are extracted from the speech signals.
The window size and hop size are set to 25 ms and 10 ms, respectively.
The maximum length of the MFCC feature sequence is set to 1000.
z-normalization is applied before the features enter the audio encoder.
For the textual features, a word tokenizer is applied first.
Each word of the utterance is then embedded into a 300-dimensional vector using the GloVe model.
The number of stacked layers in the cross- and self-attention modules is 1.
The number of heads is 8.
The kernel size of the convolutional and max-pooling layers in the audio encoder is 3.
The Adam optimizer is used, with a learning rate of 0.001 on IEMOCAP and 0.0005 on MELD.
The batch size is 256.
Training runs for 30 epochs on IEMOCAP and 20 epochs on MELD.
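
The acoustic preprocessing described above can be sketched as follows, starting from an already extracted MFCC matrix. Several details are assumptions, since the post does not give them: statistics are computed per utterance and per coefficient, normalization happens before padding, and short utterances are zero-padded at the end. `prepare_mfcc` and `clip_seconds` are hypothetical names.

```python
import numpy as np

MAX_FRAMES = 1000  # maximum MFCC sequence length from the post
NUM_MFCC = 40      # MFCC dimension from the post
WINDOW_MS, HOP_MS = 25, 10


def clip_seconds(num_frames, window_ms=WINDOW_MS, hop_ms=HOP_MS):
    """Audio duration covered by `num_frames` analysis windows."""
    return (window_ms + (num_frames - 1) * hop_ms) / 1000


def prepare_mfcc(mfcc, max_frames=MAX_FRAMES, eps=1e-8):
    """Truncate to max_frames, z-normalize each coefficient, zero-pad the rest."""
    mfcc = mfcc[:max_frames]
    mean = mfcc.mean(axis=0, keepdims=True)
    std = mfcc.std(axis=0, keepdims=True)
    normalized = (mfcc - mean) / (std + eps)  # eps guards against zero variance
    padded = np.zeros((max_frames, mfcc.shape[1]), dtype=normalized.dtype)
    padded[: len(normalized)] = normalized
    return padded


rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(350, NUM_MFCC))  # fake 350-frame utterance
features = prepare_mfcc(raw)
max_duration = clip_seconds(MAX_FRAMES)
```

With a 25 ms window and a 10 ms hop, 1000 frames cover roughly 10 seconds of audio, so longer utterances are truncated.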

3.3 Baselines

3.4 Comparison to State-of-the-art Methods

IEMOCAP

MCSAN outperforms the baseline models.
CAN requires aligned audio and text as input, whereas the paper's model needs no alignment information thanks to the cross-attention mechanism.
AMH is a tri-modal version of MHA that integrates visual information into the MHA framework.
MCSAN outperforms AMH even though it uses only acoustic and textual information.
To further demonstrate the effectiveness of MCSAN, it is also evaluated on the MELD dataset.

Moreover, the paper's model outperforms a semi-supervised model built with a large amount of unlabeled data.

3.5 Ablation study

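Several experiments are run on the IEMOCAP dataset to evaluate some key factors of the proposed model.

Performance drops when only acoustic or only textual information is used as input.
- This suggests that fusing the two kinds of information effectively is important for an SER system
The effect of the attention modules is evaluated.
- Performance drops when either the self- or the cross-attention module is removed
- This shows that explicit modeling of both inter- and intra-modal interactions is necessary
- The same holds for the presence or absence of the cross-attention module
Architecture experiments
- Instead of placing the attention modules in parallel, they are combined sequentially in different orders
  - Neither the cross+self nor the self+cross combination does better than the parallel structure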
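
The difference between the parallel and sequential layouts can be illustrated with the toy sketch below. It is only a structural illustration under assumptions: the attention function has no learned projections, the update is a residual sum, and the features are summarized by global max-pooling. The function names are hypothetical.

```python
import numpy as np


def attend(queries_from, values_from):
    """Unparameterized scaled dot-product attention with a residual update."""
    scores = queries_from @ values_from.T / np.sqrt(queries_from.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return queries_from + weights @ values_from


def parallel_features(audio, text):
    """Cross- and self-attention both see the same encoder outputs."""
    branches = (
        attend(audio, text), attend(text, audio),   # cross-attention
        attend(audio, audio), attend(text, text),   # self-attention
    )
    return np.concatenate([seq.max(axis=0) for seq in branches])


def sequential_features(audio, text, order="cross+self"):
    """One module feeds the next, in either order."""
    if order == "cross+self":
        aud, txt = attend(audio, text), attend(text, audio)
        aud, txt = attend(aud, aud), attend(txt, txt)
    else:  # "self+cross"
        aud, txt = attend(audio, audio), attend(text, text)
        aud, txt = attend(aud, txt), attend(txt, aud)
    return np.concatenate([aud.max(axis=0), txt.max(axis=0)])


rng = np.random.default_rng(0)
audio = rng.standard_normal((111, 8))
text = rng.standard_normal((12, 8))

parallel = parallel_features(audio, text)
cross_then_self = sequential_features(audio, text, "cross+self")
self_then_cross = sequential_features(audio, text, "self+cross")
```

In the parallel layout the classifier receives separate inter-modal and intra-modal summaries, while in the sequential layouts the two kinds of interaction are mixed into a single pair of sequences.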
Finally, the capacity of the model is evaluated.
- Stacking more layers in the self- and cross-attention modules degrades performance
- This is probably due to overfitting: the dataset is too small to train larger models sufficiently

4. Conclusion

MCSAN is proposed for speech emotion recognition; placing cross- and self-attention in parallel yields good performance.
The authors plan to extend it to a tri-modal version in the future.
