
[Dataset Survey] Audio-Visual Dataset Survey

by 꿀개 2024. 11. 20.


LRS3

This dataset (Lip Reading Sentences 3), introduced by Afouras et al., consists exclusively of real videos: 5,594 clips spanning over 400 hours of TED and TEDx talks in English. The videos are processed so that every frame contains a face and the audio and visual streams are in sync.

https://mmai.io/datasets/lip_reading/
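Since the dataset's selling point is synchronized audio-visual streams, a quick sanity check is to load a clip and compare the durations of the two streams. A minimal sketch using torchvision (the clip path and directory layout below are hypothetical):

```python
# pip install torch torchvision av
import torchvision

def check_av_sync(path: str, tol_sec: float = 0.05) -> bool:
    """Load one clip and compare audio vs. video stream durations."""
    # read_video returns (video [T, H, W, C], audio [channels, samples], info)
    video, audio, info = torchvision.io.read_video(path, pts_unit="sec")
    video_dur = video.shape[0] / info["video_fps"]
    audio_dur = audio.shape[1] / info["audio_fps"]
    return abs(video_dur - audio_dur) < tol_sec

# Hypothetical path; LRS3 groups clips under per-speaker folders.
print(check_av_sync("lrs3/trainval/0Fi83BHQsMA/00001.mp4"))
```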

 


FakeAVCeleb - Google Form required for access

The FakeAVCeleb dataset is a deepfake detection dataset consisting of 20,000 video clips in total: 500 real videos sampled from VoxCeleb2 and 19,500 deepfake samples generated by applying different manipulation methods to the set of real videos. The dataset covers the following manipulation categories, with the deepfake algorithms used in each category indicated in parentheses:

• RVFA: Real Visuals - Fake Audio (SV2TTS)

• FVRA-FS: Fake Visuals - Real Audio (FaceSwap)

• FVFA-FS: Fake Visuals - Fake Audio (SV2TTS + FaceSwap)

• FVFA-GAN: Fake Visuals - Fake Audio (SV2TTS + FaceSwapGAN)

• FVRA-GAN: Fake Visuals - Real Audio (FaceSwapGAN)

• FVRA-WL: Fake Visuals - Real Audio (Wav2Lip)

• FVFA-WL: Fake Visuals - Fake Audio (SV2TTS + Wav2Lip)

https://github.com/DASH-Lab/FakeAVCeleb
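Since each category code already encodes which stream is manipulated, per-stream labels can be derived from the code alone. A minimal sketch based on the taxonomy above (the helper and its use of raw code strings are illustrative, not the repo's actual API):

```python
# Map FakeAVCeleb category codes to (visual_is_fake, audio_is_fake) labels,
# derived directly from the list above.
CATEGORY_LABELS = {
    "RVFA":     (False, True),   # Real Visuals - Fake Audio (SV2TTS)
    "FVRA-FS":  (True,  False),  # Fake Visuals - Real Audio (FaceSwap)
    "FVFA-FS":  (True,  True),   # SV2TTS + FaceSwap
    "FVFA-GAN": (True,  True),   # SV2TTS + FaceSwapGAN
    "FVRA-GAN": (True,  False),  # FaceSwapGAN
    "FVRA-WL":  (True,  False),  # Wav2Lip
    "FVFA-WL":  (True,  True),   # SV2TTS + Wav2Lip
}

def video_level_label(category: str) -> int:
    """Return 1 (fake) if either stream is manipulated, else 0 (real)."""
    visual_fake, audio_fake = CATEGORY_LABELS.get(category, (False, False))
    return int(visual_fake or audio_fake)
```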

 


KoDF - Google Form required for access

This is a large-scale dataset comprising real and synthetic videos of 400+ subjects speaking Korean. KoDF consists of 62K+ real videos and 175K+ fake videos synthesized using the following six algorithms: FaceSwap, DeepFaceLab, FaceSwapGAN, FOMM, ATFHP, and Wav2Lip. We use a subset of this dataset to evaluate the cross-dataset generalization performance of our model.

https://deepbrainai-research.github.io/kodf/
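With roughly 62K real vs. 175K fake videos (about a 1:2.8 imbalance), training on KoDF usually calls for rebalancing. A minimal sketch computing inverse-frequency class weights from the approximate counts quoted above:

```python
# Inverse-frequency class weights for KoDF's real/fake imbalance.
# Counts are the approximate totals quoted in the description above.
counts = {"real": 62_000, "fake": 175_000}
total = sum(counts.values())

# Weight each class inversely to its frequency so both contribute equally.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(weights)  # real ≈ 1.91, fake ≈ 0.68
```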

 

KoDF paper: Patrick Kwon*, Jaeseong You*, Gyuhyeon Nam, Sungwoo Park, Gyeongsu Chae (* equal contribution), arXiv:2103.10094.

 

DF-TIMIT

The Deepfake TIMIT dataset comprises deepfake videos manipulated using FaceSwapGAN. The real videos used for manipulation were sourced by sampling similar-looking identities from the VidTIMIT dataset. We use their higher-quality (HQ) version, which consists of 320 videos, to evaluate cross-dataset generalization performance.

https://zenodo.org/records/4068245
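An evaluation list for DF-TIMIT can be built by globbing the HQ fakes; the real counterparts come from VidTIMIT. A minimal sketch, assuming the archive extracts to a higher_quality/ folder of .avi files (the layout is an assumption; verify after download):

```python
from pathlib import Path

# Hypothetical extraction root; the Zenodo archive ships lower_quality and
# higher_quality sets (folder names are an assumption).
root = Path("DeepfakeTIMIT/higher_quality")

# Every clip under the HQ folder is a FaceSwapGAN fake -> label 1.
fake_clips = [(str(p), 1) for p in sorted(root.rglob("*.avi"))]
print(f"collected {len(fake_clips)} HQ fakes (expected 320 per the description)")

# Real counterparts (label 0) would be drawn from VidTIMIT for the same subjects.
```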

 


DFDC

The DeepFake Detection Challenge (DFDC) dataset is, besides FakeAVCeleb, another deepfake dataset that contains samples with fake audio. It consists of over 100K video clips in total, generated using deepfake algorithms such as MM/NN Face Swap, NTH, FaceSwapGAN, StyleGAN, and TTS Skins. We use a subset of 3,215 videos, as in [21, 22], to evaluate the model's cross-dataset generalization performance.

https://ai.meta.com/datasets/dfdc/
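Each DFDC training part ships a metadata.json mapping clip filenames to labels. A minimal sketch splitting one part into real and fake lists (the field names follow the public challenge release, but treat them as an assumption):

```python
import json
from pathlib import Path

part = Path("dfdc_train_part_0")  # hypothetical extraction directory

# metadata.json maps filenames to {"label": "REAL"|"FAKE", "split": ..., "original": ...}
with open(part / "metadata.json") as f:
    metadata = json.load(f)

real = [name for name, m in metadata.items() if m["label"] == "REAL"]
fake = [name for name, m in metadata.items() if m["label"] == "FAKE"]
print(f"{len(real)} real / {len(fake)} fake clips in this part")
```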

 


 

* The descriptions above are excerpted from the paper "AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection", CVPR, 2024.

https://arxiv.org/abs/2406.02951

 
