[Paper Review] AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

CVPR 2024 Accepted paper.

Abstract

This paper proposes Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that captures the correspondence between the audio and visual modalities. In the first stage, representation learning is carried out with self-supervision to capture the intrinsic audio-visual correspondences. To obtain rich cross-modal representations, it uses contrastive learning and autoencoding objectives, and introduces a novel audio-visual complementary masking and feature fusion strategy. The learned representations are then passed to the second stage, which performs supervised deepfake classification. The method achieves SOTA on the FakeAVCeleb dataset with 98.6% accuracy and 99.1% AUC.

1. Introduction

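Generative AI for deepfakes has opened new possibilities for cultural content, but malicious use causes a range of social problems. As deepfake generation improves, detection technology has to improve with it.

In face-centric video, lip movements (visemes) and speech units (phonemes) are intrinsically correlated, so the audio-visual correspondence in real videos is intuitive, and the nuances carried by facial expression and vocal expression agree. By contrast, it is hard for a deepfake video to fully reproduce subtle cues such as the emotion carried by the audio.

This paper introduces a deepfake detection method that thoroughly learns representations of the audio and visual modalities. The proposed approach uses a novel complementary masking and cross-modal feature fusion strategy to explicitly capture audio-visual correspondences.

Earlier methods that train directly on labeled deepfake datasets for classification do not fully exploit audio-visual correspondence. To address this, the authors propose a two-stage pipeline consisting of self-supervised representation learning followed by supervised downstream classification.

In the representation learning stage, audio-visual representations are extracted from real face videos.

Inspired by CAV-MAE, the method also exploits the complementary nature of contrastive learning and autoencoding. To extract rich representations, a novel audio-visual complementary masking and fusion strategy, which supplements contrastive learning, is added inside the autoencoding objective.

In the classification stage, a classifier is trained to detect deepfake videos by finding where audio-visual cohesion is lacking.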

• We propose a novel self-supervised representation learning method that explicitly captures audio-visual correspondences in real videos. To learn the correspondences, we pursue a dual-objective of contrastive learning and autoencoding, and supplement it with a novel audio-visual complementary masking and fusion strategy.

• Qualitative analysis using t-SNE [58] shows a clear separation between the real and fake video embeddings at the end of the representation learning stage. This demonstrates the

• We propose a two-stage deepfake detection method comprising of the aforementioned representation learning stage followed by a deepfake classification stage. Our method yields state-of-the-art performance on deepfake detection when either or both the audio and visual contents are AI generated. We achieve 98.6% accuracy and 99.1% AUC on FakeAVCeleb, surpassing the existing audio-visual state-of-the-art by 14.9% and 9.9% respectively.

2. Related Works

2.1. Multi-Modal Representation Learning

Learning joint representations from multiple modalities has achieved SOTA performance on a variety of other tasks.

SyncNet: proposes a Siamese network that detects lip-sync errors between audio and video. Each modality is trained in a separate branch, and a contrastive loss enforces similarity in the encoding space.
CLIP: a zero-shot image classification model. It uses separate encoders for images and captions and finds matching pairs in the latent space.
AudioCLIP: extends CLIP to audio.

Self-supervised methods built on the Masked Autoencoder (MAE) framework have also emerged.

AV-MAE: a joint masked autoencoder for audio, visual, and joint audio-visual classification. It explores different encoding policies for dual-modality inputs and demonstrates the ability to decode one masked modality from the other.
CAV-MAE: adds a contrastive loss to the vanilla masked autoencoder to exploit audio-visual pair information.

This work takes inspiration from CAV-MAE and uses a dual contrastive-autoencoding objective. It differs from existing MAE approaches in that it (1) applies a complementary masking strategy after encoding, and (2) performs a cross-modal fusion that, for both modalities, takes the place of the shared learnable mask tokens used by MAEs.

2.2. Deepfake Detection

Visual-only methods.

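LipForensics: focuses on lip movements, which are hard to reproduce with generative models.
Other works consider the spatial and temporal domains, using cues such as head pose and eye blinking.
FTCN: combines a CNN with a transformer network to detect short- and long-term temporal incoherence. Spatial features are found with an attention-based module and fused with the temporal results.

Many ViT-based models have also appeared.

CViT: learnable features extracted by means of a CNN are fed to a ViT for classification.

RealForensics: uses a multi-modal pre-training pipeline to learn intrinsic representations so that the classifier can discriminate better. Only visual input is fed to the classifier.

Recent deepfake detection research has increasingly used both audio and video.

Audio-visual methods.

Emotions Don’t Lie: one of the first papers to use audio-visual multi-modality for deepfake detection. It proposes a Siamese network in which the features of each modality are passed to emotion recognition networks and the correspondence between the two modalities is compared.
Not made for each other: proposes the Modality Dissonance Score (MDS) network. A contrastive loss is computed on the uni-modal embeddings to measure the dissonance between audio and video.
Voice-Face matching Detection (VFD): uses a contrastive loss to model the homogeneity between face and voice.
Detecting deep-fake videos from phoneme-viseme mismatches: focuses on mismatches between phonemes and visemes (the dynamics of the mouth shape). By concentrating on the mouth region, it shows which specific movements deepfake models fail to produce.

More recently, the paradigm has been shifting toward fusing the modalities into a single feature.

AV-DFD: a joint audio-visual deepfake detection framework. The audio and visual streams are aligned and passed through a cross-attention mechanism.

AVFakeNet and AVoiD-DF exploit the encoding/decoding potential of ViTs and build feature fusion in the embedding space on the decoder side.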

3. Method

The proposed algorithm, AVFF, consists of two stages: (1) representation learning and (2) deepfake detection.

The goal of stage 1 is to obtain audio-visual representations through self-supervised learning on real face videos. Correspondences are learned through a contrastive learning objective together with a complementary masking and fusion strategy that sits inside an autoencoding objective. The complementary masking and fusion strategy takes the uni-modal audio and visual embeddings (a, v) and systematically masks them, which drives the learning of advanced embeddings (a′, v′) through MAE-style reconstruction. To instill cross-modal dependency, the tokens of one modality are used to learn the masked embeddings of the other modality via cross-modal token conversion networks.

In stage 2, a classifier uses the representations learned in stage 1 to distinguish real videos from fake ones.

Stage 1 thus serves as pre-training for the downstream deepfake detection task.

3.1. Preprocessing

Visual frames and the corresponding audio waveforms are sampled at 5 fps and 16 kHz, respectively.

To emphasize audio-visual correspondence, FaceX-Zoo is used to crop the face region and remove the background, which minimizes the influence of background variation.

The audio waveform is converted into a log-mel spectrogram with L frequency bins.

3.2. Representation Learning Stage

The main goal is to discover audio-visual feature correspondences in real videos.

Inspired by CAV-MAE, the method uses a dual self-supervised learning scheme that combines contrastive learning and autoencoding objectives.

The authors found that contrastive learning alone does little to help learn cross-modal correlations, so they supplement it with an autoencoding objective and add complementary masking and a cross-modal fusion strategy inside the autoencoding framework.

This yields rich cross-modal representations, which in turn enable better deepfake detection.

Input Tokenization.

The sampled audio and visual inputs are denoted $x_a$ and $x_v$.

$x_a$ is tokenized into $16\times16$ non-overlapping 2D patches (similar to Audio-MAE).

$x_v$ is tokenized into $2\times16\times16$ non-overlapping 3D spatio-temporal patches (similar to MARLIN).

Each tokenized representation is divided into 8 equal temporal slices (presumably this just means splitting it into 8 time intervals): $\mathbf{x_a} = \{ x_{a,t_i} \}_{i=1}^8, \mathbf{x_v} = \{ x_{v,t_i} \}_{i=1}^8$. The number 8 was chosen empirically. This slicing preserves the temporal correspondence between the modalities within each slice.
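The tokenization and slicing can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper names (`patchify_audio`, `patchify_video`, `temporal_slices`) and the toy input sizes are made up for the example and are not taken from the paper or its code.

```python
import numpy as np

NUM_SLICES = 8  # number of equal temporal slices, as stated in the review


def patchify_audio(spec, patch=16):
    """Split a (time, freq) spectrogram into non-overlapping patch x patch tokens.

    Returns an array of shape (time // patch, freq // patch, patch * patch).
    """
    t, f = spec.shape
    assert t % patch == 0 and f % patch == 0
    x = spec.reshape(t // patch, patch, f // patch, patch)
    return x.transpose(0, 2, 1, 3).reshape(t // patch, f // patch, patch * patch)


def patchify_video(clip, pt=2, ps=16):
    """Split a (frames, H, W, C) clip into non-overlapping pt x ps x ps tubes.

    Returns an array of shape (frames // pt, H // ps, W // ps, pt * ps * ps * C).
    """
    n, h, w, c = clip.shape
    assert n % pt == 0 and h % ps == 0 and w % ps == 0
    x = clip.reshape(n // pt, pt, h // ps, ps, w // ps, ps, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(n // pt, h // ps, w // ps, pt * ps * ps * c)


def temporal_slices(tokens, num_slices=NUM_SLICES):
    """Group tokens along the leading time axis into equal temporal slices.

    Returns an array of shape (num_slices, tokens_per_slice, token_dim).
    """
    assert tokens.shape[0] % num_slices == 0
    return tokens.reshape(num_slices, -1, tokens.shape[-1])


rng = np.random.default_rng(0)
spec = rng.normal(size=(128, 64))        # toy log-mel spectrogram: (time, freq)
clip = rng.normal(size=(16, 32, 32, 3))  # toy face clip: (frames, H, W, C)

x_a = temporal_slices(patchify_audio(spec))
x_v = temporal_slices(patchify_video(clip))
print(x_a.shape, x_v.shape)
```

With these toy sizes, slice $t_i$ of `x_a` and slice $t_i$ of `x_v` cover the same time interval, which is the property the slicing is meant to preserve.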

Feature Encoding.

The encoders $E_a, E_v$ encode $x_a, x_v$ and output the feature embeddings $a, v$, respectively.

$\text{pos}^e_p$ is a learnable positional embedding.

Formally, $\mathbf{p} = \{p_{t_i} \}_{i=1}^8 = E_p (x_p + \text{pos}^e_p), \quad \text{where} \quad p \in \{a, v\}$

Complementary Masking.

Binary masks $(M_a, M_v) \in \{0, 1\}$ are applied to the feature embeddings $a, v$ to mask 50% of the temporal slices. $M_a$ and $M_v$ are complementary: in every slice where $M_a$ is 1, $M_v$ is 0, and vice versa.

Consequently, every slice that is masked in the audio features is visible in the visual features.

The visible temporal slices are $\mathbf{p_{\text{vis}}} = \mathbf{M_p} \odot \mathbf{p}$

The masked temporal slices are $\mathbf{p_{\text{msk}}} = (\neg \mathbf{M_p}) \odot \mathbf{p}, \quad \text{where} \quad p \in \{a, v\}$

$\odot$ denotes the Hadamard product.

$\neg$ is the logical NOT operator.
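A minimal NumPy sketch of complementary masking over 8 temporal slices is given here. Choosing the visible slices uniformly at random is an assumption of this example, and the function names are hypothetical; the review only states that 50% of the slices are masked and that the two masks are complementary.

```python
import numpy as np


def complementary_masks(num_slices=8, rng=None):
    """Return binary masks (M_a, M_v) with 1 = visible and 0 = masked.

    Half of the slices are visible in audio; M_v is the exact complement.
    Random selection of the visible slices is an assumption of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    m_a = np.zeros(num_slices, dtype=int)
    visible = rng.choice(num_slices, size=num_slices // 2, replace=False)
    m_a[visible] = 1
    m_v = 1 - m_a
    return m_a, m_v


def split_visible_masked(p, mask):
    """Compute p_vis = M * p and p_msk = (NOT M) * p, broadcast over tokens and channels."""
    m = mask[:, None, None]
    return m * p, (1 - m) * p


rng = np.random.default_rng(0)
a = rng.normal(size=(8, 4, 16))  # toy audio embeddings: (slices, tokens, channels)
v = rng.normal(size=(8, 6, 16))  # toy visual embeddings

m_a, m_v = complementary_masks(8, rng)
a_vis, a_msk = split_visible_masked(a, m_a)
v_vis, v_msk = split_visible_masked(v, m_v)
```

Because the masks are complementary, every temporal index is visible in exactly one modality, so `a_vis + a_msk` recovers `a` and no slice is hidden from both streams at once.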

Cross-Modal Fusion.

The visible temporal slices $a_{\text{vis}}, v_{\text{vis}}$ are fed to learnable audio-to-visual (A2V) and visual-to-audio (V2A) networks, which generate their cross-modal temporal counterparts $v_a = \text{A2V}(a_{\text{vis}}), \quad a_v = \text{V2A}(v_{\text{vis}})$.

Specifically, $\mathbf{v_a} = \{ v_{a,t_i} = \text{A2V}(a_{t_i}), \; \forall t_i \text{ where } a_{t_i} \in \mathbf{a_{\text{vis}}} \}$, and $\mathbf{a_v}$ is defined analogously.

Each A2V/V2A network consists of a single-layer MLP that matches the number of tokens of the other modality, followed by a single transformer block.

The audio embedding $a^{\prime}$ is built by cross-modal fusion.

Each masked slice of the original feature $a$ is replaced by the slice of the cross-modal vector $a_v$ at the same temporal index.

The visual embedding $v^{\prime}$ is obtained in the same way.

In other words, the masked temporal slices of each modality are replaced by the cross-modal slices at the same temporal indices.
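The replacement step can be illustrated with a small NumPy sketch. The A2V and V2A networks are stood in for by fixed random linear maps over the token axis (`W_a2v`, `W_v2a`); this placeholder omits the transformer block and any training, and all names and sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N_A, N_V, D = 8, 4, 6, 16  # slices, audio tokens/slice, visual tokens/slice, channels (toy sizes)

a = rng.normal(size=(S, N_A, D))  # uni-modal audio embeddings
v = rng.normal(size=(S, N_V, D))  # uni-modal visual embeddings

# Complementary masks: 1 = visible, 0 = masked.
m_a = np.array([1, 0, 1, 0, 0, 1, 1, 0])
m_v = 1 - m_a

# Placeholder token-count-matching maps (stand-ins for the learned A2V / V2A networks).
W_a2v = rng.normal(size=(N_V, N_A)) / np.sqrt(N_A)
W_v2a = rng.normal(size=(N_A, N_V)) / np.sqrt(N_V)


def a2v(a_slice):
    """Map one audio slice (N_A, D) to a visual-shaped slice (N_V, D)."""
    return W_a2v @ a_slice


def v2a(v_slice):
    """Map one visual slice (N_V, D) to an audio-shaped slice (N_A, D)."""
    return W_v2a @ v_slice


def fuse(p, mask, other, convert):
    """Keep the visible slices of p and fill its masked slices from the other modality."""
    out = p.copy()
    for t in range(len(mask)):
        if mask[t] == 0:                # slice t is masked in this modality,
            out[t] = convert(other[t])  # so it is visible in the other one
    return out


a_prime = fuse(a, m_a, v, v2a)  # a': masked audio slices replaced by slices of a_v
v_prime = fuse(v, m_v, a, a2v)  # v': masked visual slices replaced by slices of v_a
```

The complementarity of the masks is what makes `other[t]` always available: a slice masked in one modality is by construction visible in the other.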

Decoding.

The uni-modal audio and visual decoders $G_a, G_v$ take $a^{\prime}, v^{\prime}$ as input and produce the audio and visual reconstructions $\hat{x}_a = G_a(a' + \text{pos}_g^a) \quad \text{and} \quad \hat{x}_v = G_v(v' + \text{pos}_g^v)$, where $\text{pos}_g^a, \text{pos}_g^v$ are learnable positional embeddings.

The decoders use a transformer-based architecture and reconstruct the masked content of both modalities from a mix of uni-modal and cross-modal slices.

Loss Functions.

Two objectives are used: an audio-visual contrastive loss and an autoencoding loss.

The bidirectional audio-visual contrastive loss is:

$L_c = - \sum_{p,q \in \{a,v\}, p \neq q} \frac{1}{2N} \sum_{i=1}^{N} \log \left( \frac{\exp \left( \frac{\| \bar{p}^{(i)} \|^T \| \bar{q}^{(i)} \|}{\tau} \right)}{\sum_{j=1}^{N} \exp \left( \frac{\| \bar{p}^{(i)} \|^T \| \bar{q}^{(j)} \|}{\tau} \right)} \right)$

Here $\bar{p}^{(i)}$ is the latent vector obtained by averaging the uni-modal embedding of the $i$-th data sample over the patch dimension.

$N$ is the number of samples, $\tau$ is the temperature parameter (a hyperparameter that scales similarities in contrastive learning), and $i, j$ are sample indices.

The audio-visual contrastive loss imposes a similarity constraint between the two embeddings.
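A NumPy sketch of this bidirectional loss is shown here. It reads $\| \cdot \|$ as L2 normalization of the mean-pooled vector, so each exponent is a temperature-scaled cosine similarity; that reading, the default `tau=0.07`, and the function names are assumptions of this example rather than details confirmed by the review.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def contrastive_loss(a, v, tau=0.07):
    """Bidirectional audio-visual contrastive loss.

    a, v: arrays of shape (N, tokens, D) holding uni-modal embeddings of N paired samples.
    """
    a_bar = l2_normalize(a.mean(axis=1))  # mean over the patch dimension, then normalize
    v_bar = l2_normalize(v.mean(axis=1))
    sim = a_bar @ v_bar.T / tau           # sim[i, j] = <a_i, v_j> / tau
    n = sim.shape[0]
    idx = np.arange(n)
    loss_a2v = -log_softmax(sim, axis=1)[idx, idx].sum()  # p = a, q = v
    loss_v2a = -log_softmax(sim, axis=0)[idx, idx].sum()  # p = v, q = a
    return float((loss_a2v + loss_v2a) / (2 * n))


rng = np.random.default_rng(0)
a = rng.normal(size=(4, 5, 8))
v = a + 0.01 * rng.normal(size=(4, 5, 8))      # well-aligned audio-visual pairs
loss_aligned = contrastive_loss(a, v)
loss_shuffled = contrastive_loss(a, v[::-1])   # pairs deliberately mismatched
```

The matched pairs sit on the diagonal of `sim`, so the loss is small when each sample is most similar to its own counterpart and grows when the pairing is broken.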

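The autoencoding loss $L_{\text{ae}}$ consists of a reconstruction loss and an adversarial loss (similar to MARLIN).

The reconstruction MSE loss $L_{\text{rec}}$ is computed between the inputs $(x_a, x_v)$ and their reconstructions $(\hat{x}_a, \hat{x}_v)$. Following the MAE approach, it is computed only over the masked tokens.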
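Restricting the MSE to masked slices can be sketched as follows. The function name `masked_mse` and the choice to sum the audio and visual terms are assumptions of this example; the review does not state how the two modalities are combined.

```python
import numpy as np


def masked_mse(x, x_hat, mask):
    """Mean squared error over masked temporal slices only.

    x, x_hat: arrays of shape (slices, tokens, dim).
    mask:     array of shape (slices,), with 1 = visible and 0 = masked.
    """
    masked = mask == 0
    diff = x[masked] - x_hat[masked]
    return float(np.mean(diff ** 2))


m_a = np.array([1, 0, 1, 0, 0, 1, 1, 0])
m_v = 1 - m_a
x_a = np.zeros((8, 4, 16))
x_v = np.zeros((8, 6, 16))

# Errors on visible slices are ignored by the loss.
xhat_a = x_a.copy()
xhat_a[m_a == 1] += 5.0
loss_visible_error = masked_mse(x_a, xhat_a, m_a)

# Errors on masked slices are penalized.
xhat_v = x_v.copy()
xhat_v[m_v == 0] += 2.0
loss_masked_error = masked_mse(x_v, xhat_v, m_v)

l_rec = loss_visible_error + loss_masked_error  # summing the modalities is an assumption
```

Since the decoder input at masked indices comes from the other modality, only penalizing those indices ties the reconstruction quality to the cross-modal path.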

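For the adversarial loss $L_{\text{adv}}$, a Wasserstein GAN loss is used to supplement the reconstruction.

Like the reconstruction loss, the adversarial loss is computed only over the masked tokens.

Here $D_p$ is the discriminator for each modality.

$L_{\text{adv}}^{(D)} \quad \text{and} \quad L_{\text{adv}}^{(G)}$ denote the adversarial losses for the discriminator and the generator, respectively.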

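In the generator training step, the overall training loss is as follows, where the $\lambda_*$ are weighting hyperparameters.

$L^{(G)} = \lambda_c L_c + \lambda_{\text{rec}} L_{\text{rec}} + \lambda_{\text{adv}} L_{\text{adv}}^{(G)}$

Because the autoencoding loss is computed over the masked temporal slices, it forces the decoder to learn from the other modality: at the masked indices, the decoder's input embeddings come from the other modality.

This strategy enforces audio-visual correspondence.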

3.3. Deepfake Classification Stage

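The goal of this stage is to detect a deepfake video when the audio, the video, or both are fake. To do so, it reuses the encoders and the cross-modal networks trained in the representation learning stage.

Training is supervised. The pipeline is shown in Fig. 3.

Since the learned representations encode a high degree of audio-visual correspondence, the classifier is expected to pick up on the lack of audio-visual cohesion in fake videos.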

Input Tokenization.

Same as in stage 1.

Feature Extraction.

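The tokenized inputs $(x_a, x_v)$ are passed through the stage-1 backbone, which yields the feature embeddings $(a, v)$ from the uni-modal encoders and the cross-modal embeddings $(a_v, v_a)$.

Here the cross-modal embeddings are computed from all temporal slices; no masking is used in this stage.

The two kinds of embeddings are concatenated to obtain $(f_a, f_v)$: $f_p = p \oplus p_q, \quad \forall p, q \in \{a, v\}, \, p \neq q$

$\oplus$ is the concatenation operator along the feature dimension.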
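The feature extraction without masking can be sketched in NumPy. As a stand-in for the trained A2V and V2A networks, this example uses fixed random linear maps over the token axis; those maps, the variable names, and the toy sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N_A, N_V, D = 8, 4, 6, 16  # slices, audio tokens/slice, visual tokens/slice, channels (toy sizes)

a = rng.normal(size=(S, N_A, D))  # uni-modal audio embeddings from the encoder
v = rng.normal(size=(S, N_V, D))  # uni-modal visual embeddings from the encoder

# Placeholder cross-modal maps (stand-ins for the pre-trained A2V / V2A networks).
W_a2v = rng.normal(size=(N_V, N_A)) / np.sqrt(N_A)
W_v2a = rng.normal(size=(N_A, N_V)) / np.sqrt(N_V)

# No masking here: cross-modal embeddings are computed for every temporal slice.
v_a = W_a2v @ a  # (S, N_V, D), visual-shaped embedding predicted from audio
a_v = W_v2a @ v  # (S, N_A, D), audio-shaped embedding predicted from video

# f_p = p (+) p_q: concatenate along the feature (last) dimension.
f_a = np.concatenate([a, a_v], axis=-1)  # (S, N_A, 2D)
f_v = np.concatenate([v, v_a], axis=-1)  # (S, N_V, 2D)
```

Each token in `f_a` thus carries both what the audio encoder saw and what the visual stream predicts the audio should look like at that time, which is the pairing a classifier can check for consistency.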

Classifier Network.

$Q$: classifier network

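It takes the fused embeddings of each modality as input and predicts whether the video is real or fake.

The classifier network consists of two uni-modal patch reduction networks, $(\Psi_a, \Psi_v)$, followed by a classifier head, $\Gamma$.

Their output embeddings are concatenated along the feature dimension and fed to the classifier head, which outputs the logits $l$.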

$l = Q(f_a, f_v) = \Gamma(\Psi_a(f_a) \oplus \Psi_v(f_v))$

Loss Function.

Cross-entropy loss is used.
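The classifier forward pass and its loss can be sketched as below. The internals of $\Psi$ and $\Gamma$ are not described in this review, so mean-pooling plus a linear projection for $\Psi$ and a single linear layer for $\Gamma$ are placeholders chosen for the example, as are all names and sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
D2, H = 32, 8  # fused feature width and reduced width (toy sizes)


def make_patch_reducer(in_dim, out_dim, rng):
    """Placeholder for Psi: mean-pool over all tokens, then apply a linear projection."""
    w = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda f: f.reshape(-1, in_dim).mean(axis=0) @ w


def make_head(in_dim, num_classes, rng):
    """Placeholder for Gamma: a single linear layer producing logits."""
    w = rng.normal(size=(in_dim, num_classes)) / np.sqrt(in_dim)
    return lambda z: z @ w


psi_a = make_patch_reducer(D2, H, rng)
psi_v = make_patch_reducer(D2, H, rng)
gamma = make_head(2 * H, 2, rng)


def classify(f_a, f_v):
    """l = Gamma(Psi_a(f_a) (+) Psi_v(f_v)); returns logits for (real, fake)."""
    return gamma(np.concatenate([psi_a(f_a), psi_v(f_v)], axis=-1))


def cross_entropy(logits, label):
    """Cross-entropy for one sample with an integer class label."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])


f_a = rng.normal(size=(8, 4, D2))  # toy fused audio features
f_v = rng.normal(size=(8, 6, D2))  # toy fused visual features
logits = classify(f_a, f_v)
loss = cross_entropy(logits, label=1)  # label 1 = fake in this sketch
```

The two reducers let the audio and visual streams keep different token counts until each is collapsed to a fixed-width vector, after which a single head can score the pair.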

Deepfake Classifier Inference Stage.

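At inference time, the video is split into blocks of length $T$, the sample duration used in training. The step size is the duration of one temporal slice, $\frac{T}{8}$.

Output logits are computed for each block, and the final classification is determined by the mean of the output logits.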
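The sliding-window inference can be sketched as follows. Measuring time in frames, dropping any tail shorter than one block, and taking the argmax of the mean logits as the decision are assumptions of this example; `block_logits_fn` is a hypothetical stand-in for the full stage-2 model.

```python
import numpy as np


def window_starts(num_frames, block_len, num_slices=8):
    """Start indices of length-block_len windows with stride block_len / num_slices."""
    step = block_len // num_slices
    return list(range(0, num_frames - block_len + 1, step))


def predict_video(block_logits_fn, num_frames, block_len):
    """Average per-block logits and return (mean_logits, predicted_class)."""
    logits = np.stack([block_logits_fn(s, s + block_len)
                       for s in window_starts(num_frames, block_len)])
    mean_logits = logits.mean(axis=0)
    return mean_logits, int(mean_logits.argmax())


def toy_scorer(start, end):
    """Hypothetical per-block model: early blocks look real, later blocks look fake."""
    return np.array([1.0, 0.0]) if start < 4 else np.array([0.0, 2.0])


starts = window_starts(num_frames=32, block_len=16)
mean_logits, pred = predict_video(toy_scorer, num_frames=32, block_len=16)
```

Using a stride of one temporal slice means consecutive blocks overlap by seven slices, so a short manipulated segment contributes to several block scores before the averaging.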

4. Experiments and Results

4.1. Implementation

Real videos from LRS3 are used for stage-1 representation learning.

FakeAVCeleb is used for stage-2 deepfake detection.

4.2. Evaluation and Discussion

The method is evaluated on intra-dataset performance, cross-manipulation generalization, and cross-dataset generalization.

It is compared against both multi-modal and uni-modal models.

Metrics: accuracy (ACC), average precision (AP), and area under the ROC curve (AUC) averaged across multiple runs with different random seeds.

For multi-modal algorithms, a sample is treated as fake if the audio, the video, or both are fake.

For uni-modal algorithms, a sample is treated as fake only if the visual stream is fake.
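The two labeling rules can be written down as a short sketch. The function names and the 0 = real, 1 = fake convention are assumptions made for the example.

```python
def multimodal_label(audio_fake: bool, video_fake: bool) -> int:
    """Label for multi-modal detectors: fake (1) if either modality is fake."""
    return int(audio_fake or video_fake)


def unimodal_visual_label(audio_fake: bool, video_fake: bool) -> int:
    """Label for visual-only detectors: fake (1) only if the visual stream is fake."""
    return int(video_fake)


# A clip with real video but fake audio is scored differently under the two rules.
rvfa_multi = multimodal_label(audio_fake=True, video_fake=False)
rvfa_visual = unimodal_visual_label(audio_fake=True, video_fake=False)
```

The difference matters for real-video, fake-audio clips: a visual-only detector has no way to see the manipulation, so such clips are not counted against it.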

Intra-Dataset Performance.

FakeAVCeleb: 70% is used for training and the rest for testing.

The proposed model surpasses both multi-modal and uni-modal models on FakeAVCeleb and sets a new SOTA.

Cross-Manipulation Generalization.

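The model is tested on data produced by generation methods it has not seen during training.

FakeAVCeleb consists of data generated with the RVFA, FVRA-WL, FVFA-FS, FVFA-GAN, and FVFA-WL methods. One category at a time is held out, the model is trained on all the others, and performance is then evaluated on the held-out method.

The model achieves the best or comparable performance in every case.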
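The leave-one-category-out protocol can be sketched in plain Python. The sample list is made up, the helper name is hypothetical, and how real videos are distributed between the splits is not covered by this review, so they are left out of the sketch.

```python
CATEGORIES = ["RVFA", "FVRA-WL", "FVFA-FS", "FVFA-GAN", "FVFA-WL"]


def leave_one_category_out(samples, held_out):
    """Split fake samples so that the held-out category appears only in the test set.

    samples: list of (sample_id, category) tuples.
    """
    train = [s for s in samples if s[1] != held_out]
    test = [s for s in samples if s[1] == held_out]
    return train, test


# Toy dataset: three fake clips per category (illustrative ids only).
samples = [(f"{cat}_{i}", cat) for cat in CATEGORIES for i in range(3)]

splits = {cat: leave_one_category_out(samples, cat) for cat in CATEGORIES}
for cat, (train, test) in splits.items():
    print(cat, len(train), len(test))
```

Each of the five runs trains on four manipulation types and tests on the fifth, so a good score reflects generalization to an unseen generation method rather than memorized artifacts.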

Cross-Dataset Generalization.

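Experiments are also run on data with a different distribution.

The model is tested on the KoDF dataset (using the protocol from [16]).

Similar to RealForensics.

It is also tested on the DF-TIMIT and DFDC datasets. Only models with open-source code are compared; models without it could not be included.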

Analysis on the Learned Representation.

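While training the downstream task, the authors noticed that the AUC was already remarkably high in the first 1 to 3 epochs. This observation prompted an analysis of the representations learned in stage 1.

Embeddings from each FakeAVCeleb category were randomly sampled and visualized with t-SNE, which showed a clear separation between real and fake.

The representations separate the two well even though stage 1 never saw any fake samples.

This explains the high AUC early in downstream training.

The embeddings also separate well by deepfake generation technique, which suggests the representations can pick up even subtle cues specific to each generation method.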

-End-
