
[Tech Notes] Speaker Recognition: Survey and Study Notes

by 꿀개 2025. 3. 13.

Notes on things worth knowing, compiled during a technology survey for research.

 

Paper Survey

A review on speaker recognition: Technology and challenges

https://www.sciencedirect.com/science/article/pii/S0045790621000318

2021 survey paper

- Human speech can provide much information as the human voice forms a vital characteristic of an individual. Accent, language, speech, emotion, gender, and the speaker’s identity are some of the information contained in the human voice.

- Speaker recognition deals with recognizing the identity of people based on their voice.

- Since speech recognition deals with converting audio into text, its effectiveness depends heavily on the language and the text corpus.

- Speaker recognition involves the process of finding the identity of an unknown speaker and comparing his/her voice with those available on the database.

 

Speaker identification system framework

 

- Pre-processing is the first step in speech signal processing, and it involves converting an analogue signal into a digital signal. Interference due to noise often occurs during speech recording, causing the performance to degrade.

- The main objective in the pre-processing stage is to modify the speech signal to be suitable for feature extraction analysis.

- Feature extraction retains useful and relevant information about the speech signal by rejecting redundant and irrelevant information

- Authentication is one of the most popular biometric applications as it allows the users to identify an individual based on his/her voice. Usually, to authenticate the speaker, a combination of techniques is used, such as a password or facial recognition.
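The pre-processing and feature-extraction steps above can be sketched end to end. Below is a minimal NumPy implementation of log-mel filterbank features; the parameter values (16 kHz sampling, 25 ms frames with 10 ms hop, 40 mel bands) are common defaults, not values taken from the paper.

```python
import numpy as np

def mel_filterbank_features(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    """Toy log-mel filterbank extraction: pre-emphasis, framing,
    windowing, power spectrum, mel filterbank, log compression."""
    # Pre-emphasis boosts high frequencies attenuated during recording.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping 25 ms frames (400 samples at 16 kHz).
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum of each windowed frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filters, evenly spaced on the mel scale.
    def hz_to_mel(hz): return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel_to_hz(mel): return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log compression roughly mimics the ear's loudness response.
    return np.log(power @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)
```

Taking the DCT of these log-mel energies would yield MFCCs; modern embedding networks often consume the filterbank features directly.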

 

- Applications of speaker recognition

 

- Difference between speaker identification and verification

 

Deep speaker embeddings for Speaker Verification: Review and experimental comparison

https://www.sciencedirect.com/science/article/pii/S0952197623014161

2024 paper

 

- In speaker identification, invariant features persist consistently, distinguishing individual voices even amidst variations such as emotions or health conditions

 

- Difference between speaker identification and verification

- Speaker Identification (SI) is a task in which a person is recognized among multiple enrolled speakers.

- Speaker Verification (SV) involves assessing whether a person’s claimed identity is valid, leading to either acceptance or rejection based on the score.
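The SI/SV distinction above can be made concrete with a toy scorer: identification is a 1-of-N argmax over enrolled speakers, while verification is a thresholded binary decision. Cosine similarity on fixed-size embeddings stands in for a real back-end here; the speaker names, random embeddings, and threshold are all illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(test_emb, enrolled):
    """Speaker Identification (SI): 1-of-N decision -- return the
    enrolled speaker whose embedding best matches the test utterance."""
    return max(enrolled, key=lambda name: cosine(test_emb, enrolled[name]))

def verify(test_emb, claimed_emb, threshold=0.7):
    """Speaker Verification (SV): binary decision -- accept the
    claimed identity only if the score clears a threshold."""
    return cosine(test_emb, claimed_emb) >= threshold
```

In practice the threshold would be tuned on a development set, e.g. at the equal error rate (EER) operating point.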

 

- For almost a decade, the embeddings extracted from the generative probabilistic model were the dominant approach in automatic speaker recognition. However, in other machine learning domains, such as text or image processing, related embeddings were extracted through discriminatively trained Deep Neural Networks (DNNs). This deep learning paradigm was swiftly embraced in speech processing, allowing models to learn latent representations in a novel feature space.

 

- The performance of SV systems decreases significantly when the speech signal is damaged by interfering factors (interfering speech, background sounds, distortion of the transmission channel, and reverberation).

 

- In general, such an SV system consists of the following main parts:

• Signal preprocessing/Feature extraction: Commonly, the time-domain speech waveform is first converted into a spectro-temporal representation mimicking cochlear processing in the human ear. Then, a compressed spectral representation, usually Mel Filterbank (FB) short-time energy features or Mel-frequency Cepstral Coefficients (MFCCs), is computed from the spectra. Alternatively, the raw time-domain signal can be used as input, and the network will extract features.

• Speaker embedding network: A speaker embedding DNN takes FB or MFCC features as input and produces a compact vector representation of the speaker information in the speech. The network is trained with speaker labels to minimize the cross-entropy loss. In the training phase, DNN learns to distinguish different speakers at the frame level/utterance and to extract unique speaker features from the last hidden layer.

• Back-end Probabilistic Linear Discriminant Analysis (PLDA) model: Although DNN speaker embeddings are already speaker discriminative, researchers found that the embeddings can be further improved by using a PLDA model at the backend. The PLDA model is trained using the speaker embeddings extracted from speech signal samples and the corresponding speaker labels.

• PLDA scoring: Given the embeddings of two utterances, the PLDA model also produces a score. This score represents the likelihood ratio between the hypothesis that the two utterances (embeddings) are from the same speaker and the hypothesis that the two utterances are from different speakers.
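As a rough illustration of how the PLDA back-end turns two embeddings into a same-speaker vs. different-speaker likelihood ratio, here is a heavily simplified scorer. It assumes independent dimensions with known between-speaker (`sigma_b2`) and within-speaker (`sigma_w2`) variances; a real PLDA back-end estimates full covariance matrices from labeled embeddings via EM, so this is a sketch of the scoring rule only.

```python
import numpy as np

def plda_llr(x1, x2, sigma_b2=1.0, sigma_w2=0.5):
    """Log-likelihood ratio for 'same speaker' vs 'different speaker'
    under a simplified PLDA model with independent dimensions:
        x = s + e,  s ~ N(0, sigma_b2) (speaker),  e ~ N(0, sigma_w2) (noise).
    Same-speaker pairs share s, so (x1, x2) are correlated;
    different-speaker pairs are not."""
    a = sigma_b2 + sigma_w2   # Var(x_i) under both hypotheses
    b = sigma_b2              # Cov(x1, x2) under H_same only
    # Per-dimension 2x2 Gaussian log-density terms; the 2*pi constants cancel.
    det_same, det_diff = a * a - b * b, a * a
    quad_same = (a * (x1 ** 2 + x2 ** 2) - 2 * b * x1 * x2) / det_same
    quad_diff = (x1 ** 2 + x2 ** 2) / a
    llr = -0.5 * (quad_same - quad_diff) \
          - 0.5 * (np.log(det_same) - np.log(det_diff))
    return float(np.sum(llr))
```

A positive score favors the same-speaker hypothesis; the verification decision thresholds this score.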

 

- Creating discriminative speaker representations is probably the most important and challenging aspect of the SV problem.

- Currently, the trend is to use DNNs to learn a compact vector representation of the speaker as an alternative to the i-vectors. This compact representation of the speaker is often referred to as the speaker embedding.

 

- Challenges to overcome when working with DNN architectures:

An appropriate time scale for such a signal representation needs to be adopted. One strategy involves learning speaker-specific features at a short-time level corresponding to speech frames of 100 to 400 ms (Li et al., 2017a).

 

- All the partial frame-level features are then averaged to extract a single embedding from the recording (this is the case of so-called d-vectors)

- An alternative strategy, which has become more popular among researchers, is learning a speaker embedding on a longer time scale of 2 to 10 s.

 

- An alternative approach is to use models originally developed for computer vision tasks to conduct speaker modelling. Time–frequency representations (e.g., spectrograms) of speech utterances (or other audio segments) are treated as images and serve as input to the first (two-dimensional) CNN layer.

 

- How embeddings are obtained in a DNN architecture

In the training phase, the DNN learns to classify different speakers at the frame level, and speaker-specific characteristics are formed from a latent signal representation at the last hidden layer. In the evaluation phase, a d-vector is obtained by averaging these speaker-specific features over the utterances of the same speaker. The d-vectors are extracted for each utterance and compared with the reference to make a verification decision. Cosine similarity scoring is applied between each of the speaker models and a given test d-vector.
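The d-vector extraction and scoring procedure just described can be sketched as follows. The frame-level features stand in for last-hidden-layer DNN activations; the shapes are illustrative.

```python
import numpy as np

def d_vector(frame_features):
    """Average frame-level speaker features (e.g., last-hidden-layer
    activations, shape (n_frames, dim)) into one utterance-level
    vector, then length-normalize: the d-vector."""
    v = frame_features.mean(axis=0)
    return v / np.linalg.norm(v)

def cosine_score(enroll_dvec, test_dvec):
    """Cosine similarity between enrollment and test d-vectors;
    both are unit length, so a dot product suffices."""
    return float(enroll_dvec @ test_dvec)
```

For enrollment, d-vectors from several utterances of the same speaker would typically be averaged into a single speaker model before scoring.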

 

Voxblink: A Large Scale Speaker Verification Dataset on Camera

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446780

 


 

Deep Audio-visual Learning: A Survey

https://link.springer.com/article/10.1007/s11633-021-1293-0

 

 

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

https://arxiv.org/abs/2301.06375

 


https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=538

 


 

Our results indicate that using both modalities brings the best of both worlds, enjoying the low error rate comparable to the audio-only model on the low noise regime, whereas being consistently better than both vision-only and audio-only baselines even if noise level increases.

https://github.com/IIP-Sogang/olkavs-avspeech

 

 

KMSAV: Korean Multi-speaker Spontaneous Audio-Visual Speech Recognition Dataset

https://github.com/etri/kmsav?tab=readme-ov-file

 

 

The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset

https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html

 
