Oralbek Bayazov (1), Anel Aidos (2), Jeong Won Kang (†), Assel Mukasheva (††)

- (1) School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan.
- (2) School of Science and Humanities, Nazarbayev University, Astana, Kazakhstan.
 
            
            
            
            
            
            
            
               
                  
Key words

Voice biometrics, Wav2Vec 2.0, Spoof detection, LSTM, Federated learning, Multimodal authentication, Deep learning
             
            
          
         
            
                  1. Introduction       	
                The increasing reliance on digital platforms for banking, education, healthcare,
                  and communication has significantly amplified the need for robust and seamless authentication
                  systems. Voice biometric authentication is gaining momentum due to its convenience,
                  contactless operation, and potential for integration into everyday technologies such
as smartphones, virtual assistants, and call centers. Unlike fingerprint or facial
                  recognition systems, voice-based methods do not require physical contact or camera
                  access, making them especially useful in low-resource or privacy-sensitive contexts.
                  
               
Nevertheless, the practicality of voice biometrics is hindered by multiple factors. Variations in microphones, environmental noise, changes in a speaker's health or emotional state, and, most critically, spoofing attacks can undermine system performance. Spoofing, through replayed recordings or synthetic speech, presents a unique challenge because it can closely imitate legitimate input.
               
Artificial Intelligence, and particularly deep learning, has opened new avenues to enhance the accuracy and adaptability of voice authentication systems[1,2]. Unlike traditional signal processing techniques, deep neural networks can learn hierarchical representations of speech signals, enabling them to generalize across different conditions. This study seeks to evaluate and compare three distinct neural architectures, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Wav2Vec 2.0, for their efficacy in authenticating speakers under challenging conditions.
               
             
            
                  2. Materials and Methods	
               This study adopts a rigorous mathematical and experimental framework to evaluate the
                  robustness of CNN, LSTM, and Wav2Vec 2.0 models for voice biometric authentication.
                  CNN was chosen for its ability to capture local spectral patterns from spectrograms,
                  LSTM for modeling long-term temporal dependencies in speech, and Wav2Vec 2.0 as a
                  state-of-the-art transformer that learns contextual representations directly from
                  raw audio. The methodology is divided into several subsections covering data representation,
                  model architectures, and evaluation metrics. 
               
               
                     2.1 Data Representation
Each voice sample can be represented as a discrete-time signal:

$$x = \left[x(1),\, x(2),\, \dots,\, x(T)\right]$$

Here, $x(t)$ denotes the raw speech waveform, and $T$ is the total number of sampled points.
                  
For the CNN and LSTM models, these signals are transformed into mel-spectrograms and mel-frequency cepstral coefficients (MFCCs):

$$S(f,\tau)=\left|\operatorname{STFT}\{x(t)\}\right|^{2},\qquad \mathrm{MFCC}=\operatorname{DCT}\!\left(\log\!\left(\operatorname{Mel}\!\left(S(f,\tau)\right)\right)\right)$$

where STFT denotes the short-time Fourier transform, producing a time-frequency representation, while MFCC captures perceptually relevant spectral features. In contrast, Wav2Vec 2.0 directly consumes the raw waveform without handcrafted transformations.
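As an illustration of this front-end, the following is a minimal sketch of mel-spectrogram and MFCC extraction, assuming the librosa library and a placeholder 16 kHz file path ("audio.wav"); the frame sizes and filterbank settings are illustrative and not necessarily those used in the experiments.

```python
import librosa

# Load a mono waveform at 16 kHz (the path is a placeholder for illustration).
x, sr = librosa.load("audio.wav", sr=16000, mono=True)

# Mel-spectrogram: |STFT|^2 mapped onto a mel filterbank, then log-compressed.
mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=512, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)        # input representation for the CNN

# MFCCs: DCT of the log mel energies, a compact input for the LSTM.
mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)

print(log_mel.shape, mfcc.shape)          # (n_mels, frames), (n_mfcc, frames)
```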
                  
                
               
                     2.2 Convolutional Neural Network (CNN)
CNNs are designed to capture local spatial features from spectrograms. A convolutional layer can be mathematically expressed as:

$$h_{i,\: j}^{(l)}=\sigma\!\left(\sum_{m}\sum_{n} w_{m,\: n}^{(l)}\, x_{i+m,\: j+n}^{(l-1)}+b^{(l)}\right)$$

where $w_{m,\: n}^{(l)}$ are the convolutional kernel weights, $b^{(l)}$ is the bias, and $\sigma$ denotes the ReLU activation function. This formulation describes how each feature map is generated by applying learnable filters to the spectrogram, enabling the extraction of edges, frequency bands, and temporal structures.
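The sketch below shows how such a convolutional classifier over log-mel inputs can be assembled in PyTorch; the layer counts, channel widths, and number of speakers are assumptions for illustration, not the configuration evaluated in this study.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small 2-D CNN over (1, n_mels, frames) log-mel spectrograms."""
    def __init__(self, n_speakers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),                # fixed-size summary of the feature maps
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.features(x)                             # (B, 32, 4, 4)
        return self.classifier(z.flatten(1))             # (B, n_speakers) logits

model = SpectrogramCNN(n_speakers=10)
logits = model(torch.randn(8, 1, 64, 200))               # batch of 8 dummy spectrograms
```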
                  
                
               
                     2.3 Long Short-Term Memory (LSTM)
LSTM networks are recurrent architectures specialized in modeling sequential dependencies. The state updates can be written as:

$$
\begin{aligned}
f_{t} &= \sigma\!\left(W_{f}\left[h_{t-1},\, x_{t}\right]+b_{f}\right)\\
i_{t} &= \sigma\!\left(W_{i}\left[h_{t-1},\, x_{t}\right]+b_{i}\right)\\
o_{t} &= \sigma\!\left(W_{o}\left[h_{t-1},\, x_{t}\right]+b_{o}\right)\\
c_{t} &= f_{t}\odot c_{t-1}+i_{t}\odot\tanh\!\left(W_{c}\left[h_{t-1},\, x_{t}\right]+b_{c}\right)\\
h_{t} &= o_{t}\odot\tanh\!\left(c_{t}\right)
\end{aligned}
$$

Here, $f_{t}$, $i_{t}$, and $o_{t}$ represent the forget, input, and output gates, respectively, while $c_{t}$ is the memory cell state. This mechanism allows the network to selectively retain or discard information, making it suitable for speech data with long-term temporal dependencies.
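A minimal PyTorch sketch of an LSTM classifier over MFCC sequences follows; the hidden size, layer count, and number of speakers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MFCCLSTM(nn.Module):
    """LSTM over (frames, n_mfcc) sequences; the final hidden state feeds the classifier."""
    def __init__(self, n_mfcc: int, n_speakers: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)           # h_n: (num_layers, B, hidden)
        return self.classifier(h_n[-1])      # logits from the top layer's final state

model = MFCCLSTM(n_mfcc=13, n_speakers=10)
logits = model(torch.randn(8, 200, 13))      # 8 utterances, 200 frames, 13 MFCCs each
```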
                  
                
               
                     2.4 Wav2Vec 2.0
Wav2Vec 2.0 operates on raw audio using a transformer-based encoder and employs self-supervised pretraining. Its objective is expressed through a contrastive loss function:

$$\mathcal{L}_{t}=-\log\frac{\exp\!\left(\operatorname{sim}\!\left(c_{t},\, q_{t}\right)/\kappa\right)}{\sum_{\tilde{q}\in Q_{t}}\exp\!\left(\operatorname{sim}\!\left(c_{t},\,\tilde{q}\right)/\kappa\right)}$$

where $c_{t}$ is the contextualized representation of the masked time step, $q_{t}$ is the correct (positive) quantized target, $\tilde{q}\in Q_{t}$ ranges over $q_{t}$ together with negative distractor samples, $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\kappa$ is a temperature parameter. This objective ensures that the model learns discriminative features directly from raw audio without relying on hand-crafted transformations.
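For illustration, the sketch below extracts contextual Wav2Vec 2.0 features with the Hugging Face transformers implementation and the public facebook/wav2vec2-base checkpoint; the checkpoint choice, pooling, and downstream classifier are assumptions, not the fine-tuning setup of this study.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Public pretrained checkpoint (assumed here purely for illustration).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)                      # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, frames, 768) contextual features

# Mean-pool the frame features into one utterance embedding for a speaker classifier.
embedding = hidden.mean(dim=1)                     # (1, 768)
```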
                  
                
               
                     2.5 Evaluation Metrics
                  
                  
To quantify performance, three main metrics are considered:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{FAR}=\frac{FP}{FP+TN},\qquad \mathrm{FRR}=\frac{FN}{FN+TP}$$

Accuracy measures overall classification performance, while the False Acceptance Rate (FAR) evaluates how often spoofed or unauthorized voices are incorrectly accepted. The False Rejection Rate (FRR) measures how often genuine users are rejected.
                  
                  In addition, robustness is defined as the relative performance drop when the models
                     are evaluated under noisy or spoofed conditions compared to clean scenarios. This
                     provides a more practical assessment of real-world deployment.
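For completeness, a minimal sketch of how these quantities can be computed from verification scores is given below; the threshold, the label convention (1 = genuine, 0 = impostor/spoof), and the example numbers are illustrative assumptions.

```python
import numpy as np

def far_frr_accuracy(scores: np.ndarray, labels: np.ndarray, threshold: float):
    """Accuracy, FAR and FRR from similarity scores; accept when score >= threshold."""
    accept = scores >= threshold
    tp = np.sum(accept & (labels == 1)); tn = np.sum(~accept & (labels == 0))
    fp = np.sum(accept & (labels == 0)); fn = np.sum(~accept & (labels == 1))
    accuracy = (tp + tn) / len(labels)
    far = fp / max(fp + tn, 1)            # unauthorized or spoofed inputs wrongly accepted
    frr = fn / max(fn + tp, 1)            # genuine users wrongly rejected
    return accuracy, far, frr

def robustness_drop(acc_clean: float, acc_degraded: float) -> float:
    """Relative accuracy drop between clean and noisy/spoofed conditions."""
    return (acc_clean - acc_degraded) / acc_clean

print(robustness_drop(0.92, 0.85))        # hypothetical clean vs. degraded accuracies
```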
                  
                
               
                     2.6 Data Augmentation
                  To improve robustness and generalization, several data augmentation techniques were
                     applied to the training datasets. These augmentations simulate real-world variability
                     and adversarial conditions:
                  
                  
- Additive Gaussian Noise:

$$x'(t)=x(t)+n(t),\qquad n(t)\sim\mathcal{N}\!\left(0,\,\sigma^{2}\right)$$

where $x(t)$ is the clean speech signal and $\sigma^{2}$ is the noise variance. This method simulates microphone and environmental noise.

- Reverberation: Convolution of the signal with a room impulse response (RIR):

$$x'(t)=x(t)*h_{\mathrm{RIR}}(t)$$

- Pitch Shifting: Frequency modification applied via phase vocoder:

$$X'(f)=X(f/\alpha)$$

where $\alpha$ is the pitch scaling factor.

- Speed Perturbation: Temporal scaling applied to the waveform:

$$x'(t)=x(\beta t)$$

where $\beta$ is the temporal scaling factor.

- Background Speech Mixing: Random segments from other speakers are linearly mixed:

$$x'(t)=(1-\lambda)\,x(t)+\lambda\, y(t)$$

where $y(t)$ is another speaker's audio and $\lambda$ is a mixing coefficient.
                  These augmentation strategies increase dataset variability, reduce overfitting, and
                     enhance spoofing resistance by exposing models to challenging acoustic conditions
                     during training. Similar to other works where noise augmentation was applied to improve
                     the robustness of CNN-based models in medical imaging tasks [3], our study incorporated Gaussian noise, reverberation, and pitch shifting to simulate
                     realistic acoustic environments.
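A minimal numpy sketch of three of these augmentations (additive noise, reverberation, and background mixing) is shown below; the noise level, the toy impulse response, and the mixing weight are illustrative assumptions, and pitch shifting or speed perturbation would typically be delegated to a library such as librosa (librosa.effects.pitch_shift, librosa.effects.time_stretch).

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """x'(t) = x(t) + n(t), n ~ N(0, sigma^2): simulates microphone and ambient noise."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def add_reverberation(x: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve the signal with a room impulse response and trim to the original length."""
    return np.convolve(x, rir, mode="full")[: len(x)]

def mix_background(x: np.ndarray, y: np.ndarray, lam: float = 0.2) -> np.ndarray:
    """x'(t) = (1 - lambda) x(t) + lambda y(t) with another speaker's audio y."""
    y = np.resize(y, x.shape)             # crop or tile the interfering segment
    return (1.0 - lam) * x + lam * y

x = rng.standard_normal(16000)            # placeholder 1 s waveform at 16 kHz
rir = np.exp(-np.linspace(0.0, 8.0, 800)) # toy exponentially decaying impulse response
x_aug = mix_background(add_reverberation(add_gaussian_noise(x), rir), rng.standard_normal(16000))
```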
                  
                  Figure 1 shows the spectrogram comparison of clean and augmented speech signals. The first
                     spectrogram represents a clean speech waveform with distinct harmonic structures.
                     The second spectrogram illustrates the effect of additive Gaussian noise, where random
                     high-frequency components blur the formant patterns. The third spectrogram demonstrates
                     reverberation, introducing temporal smearing caused by simulated room impulse responses.
                     The fourth spectrogram shows pitch shifting, where the spectral bands are displaced
                     due to frequency scaling. Finally, the fifth spectrogram presents speed perturbation,
                     which compresses or stretches temporal features, altering the rhythm of speech. Together,
                     these augmentations increase data variability and simulate realistic acoustic environments
                     for model training.
                  
                  
                        
                        
Fig. 1. Examples of spectrograms after different augmentation techniques
                      
                
             
            
                  3. Results and Discussion	
               The comparative results reveal significant differences in how each architecture handles
                  noise and spoofing. CNN models, while effective on clean spectrograms, suffered considerable
                  performance degradation when noise or distortions were introduced. Their reliance
                  on static visual patterns limited adaptability. 
               
               LSTM networks showed superior noise handling due to their ability to model time-series
                  dynamics. However, they struggled with spoofed inputs, especially those generated
                  via high-quality TTS, suggesting their temporal memory alone is insufficient for spoof
                  detection.
               
Wav2Vec 2.0, as hypothesized, delivered the highest accuracy overall. Its ability to process raw audio signals allowed it to learn deep representations resilient to distortion and background interference[2,7]. In clean conditions, it achieved 92 percent accuracy, with minimal performance drop under noisy conditions. However, even Wav2Vec 2.0 misclassified certain high-fidelity synthetic voices as genuine, indicating that spoofing remains a system-wide vulnerability[1].
               
As shown in Figure 2, Wav2Vec 2.0 leads in all three key metrics (accuracy, noise robustness, and spoof resistance) compared to CNN and LSTM.
               
               
                     
                     
Fig. 2. Model comparison of performance for three key metrics
                   
               These results align with recent studies showing the superiority of transformer-based
                  models in speech processing tasks[3]. However, the persistent vulnerability to spoofing across all models confirms findings
                  by other researchers that neural networks, regardless of their depth, can be deceived
                  by audio crafted to imitate human speech[7,8].
               
               Figure 3 presents the spoof resistance distribution across models, demonstrating that while
                  Wav2Vec 2.0 performs better, it still accepts over 25 percent of spoofed samples.
                  This confirms the need for integrating explicit spoof detection modules, such as Light
                  Convolutional Neural Networks (LCNN), and training models with adversarial examples
                  crafted from state-of-the-art voice cloning tools[6,8].
               
Furthermore, incorporating multi-modal biometric fusion, e.g., combining voice with facial or behavioral signals, could significantly reduce spoofing risk[9]. Alternatively, privacy-aware architectures such as federated learning may allow decentralized training on user devices, mitigating the risk of centralized data leakage[10].
               
               The key evaluation metrics are summarized in Table 1, enabling a side-by-side comparison of the three architectures in numerical terms.
               
               
                     
                     
Fig. 3. Model comparison by performance metrics
                   
               
                     
                     
Table 1. Summary of Model Evaluation Metrics
                  
                  
                        
                           
| Model       | Accuracy | Noise Robustness | Spoof Resistance |
| CNN         | 78%      | 65%              | 52%              |
| LSTM        | 85%      | 78%              | 67%              |
| Wav2Vec 2.0 | 92%      | 88%              | 75%              |
                     
                  
                
               
                     3.1 Cross-Dataset Evaluation
                  To evaluate the generalization capability of the models, a cross-dataset experiment
                     was conducted: models were trained on the Mozilla Common Voice dataset and tested
                     on VoxCeleb. Results show a significant performance drop for CNN and LSTM due to overfitting
                     to spectrogram-specific features. Wav2Vec 2.0 demonstrated stronger generalization,
                     maintaining 84% accuracy compared to 92% on in-domain data.
                  
                
               
                     3.2 Spoof Generalization
                  An additional experiment was carried out where models were trained with one type of
                     text-to-speech (TTS) synthesis and tested with another unseen TTS system. Both CNN
                     and LSTM models failed to adapt, showing spoof acceptance rates above 40%. Wav2Vec
                     2.0 reduced the error but still accepted 28% of spoofed samples, highlighting the
                     challenge of unseen synthetic voices.
                  
                
               
                     3.3 Adversarial Attack Robustness
To simulate targeted spoofing, adversarial perturbations were generated using the Fast Gradient Sign Method (FGSM):

$$x_{\mathrm{adv}}=x+\epsilon\cdot\operatorname{sign}\!\left(\nabla_{x}\mathcal{L}\!\left(f(x),\, y\right)\right)$$

where $\epsilon$ is the perturbation budget. Even with a small $\epsilon=0.01$, CNN and LSTM misclassified 45% and 39% of samples, respectively, while Wav2Vec 2.0 showed improved robustness with only 22% error.
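A minimal PyTorch sketch of this attack is given below; model stands for any of the differentiable classifiers above, and the amplitude clamp is an illustrative assumption about the waveform range.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x: torch.Tensor, y: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """x_adv = x + epsilon * sign(grad_x L(f(x), y)) for a batch of inputs x with labels y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(-1.0, 1.0).detach()   # keep samples in a valid amplitude range
```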
                  
Figure 4 compares the accuracy of the CNN, LSTM, and Wav2Vec 2.0 models across four experimental conditions: clean speech, noisy speech, cross-dataset evaluation, and adversarial attacks. Wav2Vec 2.0 consistently outperforms the other models, showing stronger robustness to noise, dataset variability, and adversarial perturbations.
                  
                  
                        
                        
Fig. 4. Model performance under different experimental conditions
                      
                
               
                     3.4 Discussion
                  The comparative results presented in the previous section demonstrate that Wav2Vec
                     2.0 outperforms CNN and LSTM models across multiple evaluation metrics. However, a
                     deeper analysis reveals crucial insights into why these models behave differently
                     under clean, noisy, and spoofed conditions.
                  
                  CNNs perform well on clean mel spectrograms due to their ability to extract spatial
                     features. However, they lack temporal modeling capabilities, making them more vulnerable
                     to variations in speech dynamics and environmental changes. When Gaussian noise or
                     reverb is introduced, CNN performance drops sharply, indicating overreliance on static
                     patterns[11].
                  
                  LSTM networks, being recurrent in nature, demonstrate stronger resistance to temporal
                     distortion[12]. They can adapt to background conversations or inconsistent pacing in speech. However,
                     their inability to detect anomalies in spectral patterns leads to a higher false acceptance
                     rate during spoofing attacks, especially when using high-quality TTS inputs[13].
                  
                  Wav2Vec 2.0 exhibits clear advantages by processing raw waveform data directly. Its
                     transformer-based architecture allows it to extract multi-level, contextualized features[2,14]. This robustness contributes to its strong noise resistance and superior generalization.
                     However, it too struggles with advanced spoofing, misclassifying some AI-generated
                     voices as genuine[1,7,15].
                  
                  Further analysis of the confusion matrices indicates that while all models can distinguish
                     between genuine and replayed inputs fairly well, they struggle when presented with
                     deepfake voices generated using state-of-the-art TTS systems[6,16]. Figure 5 shows the confusion matrix for Wav2Vec 2.0 under spoofing conditions.
                  
                  
                        
                        
Fig. 5. Confusion matrix of Wav2Vec 2.0 under spoofing conditions (TTS vs. genuine)
                      
                  Moreover, latency analysis showed that Wav2Vec 2.0 requires more computational resources
                     due to the transformer layers, which could limit its deployment on edge devices[17].
                  
                  To summarize, while Wav2Vec 2.0 is a promising solution, deploying it in real-world
                     scenarios requires complementing it with specialized anti-spoofing modules and optimizing
                     it for lightweight inference.
                  
                  
                        3.4.1 Error Analysis
                     A detailed examination of the error cases shows that CNN frequently accepted spoofed
                        samples with stable spectral envelopes, indicating its overreliance on static features.
                        LSTM demonstrated difficulty when spoofed voices contained long pauses or irregular
                        temporal dynamics, which disrupted its sequence modeling. Wav2Vec 2.0 performed better
                        overall but still misclassified advanced TTS-based voices, especially those reproducing
                        natural coarticulation and prosodic variations. These findings highlight that spoof
                        detection remains an open challenge across all architectures.
                     
                   
                  
                        3.4.2 Latency and Computational Efficiency
                     Alongside recognition accuracy, inference time was also evaluated. CNN achieved the
                        fastest processing (≈12 ms per sample) owing to its lightweight convolutional layers,
                        while LSTM required slightly longer (≈18 ms) due to sequential recurrence. Wav2Vec
                        2.0, despite offering the highest accuracy, was the slowest (≈45 ms), primarily because
                        of its transformer layers and contextual embeddings. This trade-off indicates that
                        while Wav2Vec 2.0 is most robust, CNN and LSTM may still be preferable for deployment
                        on resource-constrained or edge devices. Recent works also emphasize optimization-oriented
                        approaches in software engineering, such as the use of low-code platforms to improve
                        efficiency[17].
                     
                   
                
               
                     3.5 Mathematical Robustness Analysis
                  To provide a deeper theoretical perspective, the robustness of voice biometric models
                     can be formalized using mathematical definitions and performance bounds.
                  
                  
                        3.5.1 Signal-to-Noise Ratio (SNR) and Degradation
The impact of noise on speech signals is quantified through the SNR:

$$\mathrm{SNR}=10\log_{10}\frac{\sum_{t} x(t)^{2}}{\sum_{t}\left(x'(t)-x(t)\right)^{2}}$$

where $x(t)$ is the clean signal and $x'(t)$ is the noisy version. A higher SNR corresponds to clearer input, while a lower SNR indicates stronger noise contamination. Model robustness can be expressed as the relative drop in accuracy with decreasing SNR.
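As a small sketch of this definition, assuming equal-length numpy arrays for the clean and degraded signals:

```python
import numpy as np

def snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """SNR = 10 * log10( sum x(t)^2 / sum (x'(t) - x(t))^2 ), in decibels."""
    noise_energy = np.sum((noisy - clean) ** 2)
    return 10.0 * np.log10(np.sum(clean ** 2) / noise_energy)
```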
                     
                   
                  
                        3.5.2 False Acceptance and Rejection Bounds
                     The overall reliability of authentication can be expressed via the FAR and FRR. A
                        robust system minimizes both rates simultaneously. However, in practice, there exists
                        a trade-off: reducing one rate typically increases the other. To summarize overall
                        system performance, the Equal Error Rate (EER) is commonly used. 
                     
The EER is defined as the operating point where FAR equals FRR:

$$\mathrm{EER}=\mathrm{FAR}\!\left(\tau^{*}\right)=\mathrm{FRR}\!\left(\tau^{*}\right)$$

where $\tau^{*}$ is the decision threshold that balances the two errors. When an exact equality is not attainable due to discrete thresholds, $\tau^{*}$ is chosen as

$$\tau^{*}=\arg\min_{\tau}\left|\mathrm{FAR}(\tau)-\mathrm{FRR}(\tau)\right|$$

with the corresponding approximation:

$$\mathrm{EER}\approx\frac{\mathrm{FAR}\!\left(\tau^{*}\right)+\mathrm{FRR}\!\left(\tau^{*}\right)}{2}$$
                     This measure provides a single scalar value to compare biometric systems and is widely
                        reported in speaker verification studies.
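The following sketch computes the EER by sweeping thresholds over observed scores, following the approximation above; the synthetic score distributions and the label convention (1 = genuine, 0 = impostor) are illustrative assumptions.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return (FAR + FRR) / 2 at the threshold minimizing |FAR - FRR|."""
    thresholds = np.unique(scores)
    genuine, impostor = scores[labels == 1], scores[labels == 0]
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.7, 0.1, 500), rng.normal(0.4, 0.1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(equal_error_rate(scores, labels))   # toy example with synthetic score distributions
```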
                     
                   
                  
                        3.5.3 Spoof Detection under Adversarial Perturbations
Spoofing attacks can be formalized as adversarial perturbations to the input signal:

$$x'=x+\delta,\qquad\left\|\delta\right\|_{\infty}\le\epsilon$$

where $\delta$ is the perturbation bounded by $\epsilon$. The adversarial objective maximizes classification error:

$$\max_{\left\|\delta\right\|_{\infty}\le\epsilon}\;\mathcal{L}\!\left(f(x+\delta),\, y\right)$$
                     CNN and LSTM models exhibit higher sensitivity to small perturbations, while transformer-based
                        models like Wav2Vec 2.0 provide partial robustness but remain vulnerable when 𝜖 is
                        sufficiently large.
                     
                   
                  
                        3.5.4 Robust Training Objective
To mitigate these vulnerabilities, adversarial training modifies the loss function:

$$\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\lambda\,\mathcal{L}_{\mathrm{adv}}$$

where $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy loss, $\mathcal{L}_{\mathrm{adv}}$ penalizes misclassification under adversarial perturbations, and $\lambda$ balances the two objectives. This formulation ensures that models not only fit clean data but also resist spoofed and noisy inputs.
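A minimal training-step sketch of this objective is shown below, reusing the fgsm_attack helper sketched in Section 3.3; the weighting λ, the perturbation budget, and the optimizer are illustrative assumptions rather than the settings used in the experiments.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon: float = 0.01, lam: float = 0.5):
    """One optimization step of L = L_CE(clean) + lambda * L_adv(FGSM-perturbed)."""
    model.train()
    x_adv = fgsm_attack(model, x, y, epsilon)   # adversarial copy of the batch (Section 3.3)
    optimizer.zero_grad()
    loss_ce = F.cross_entropy(model(x), y)
    loss_adv = F.cross_entropy(model(x_adv), y)
    loss = loss_ce + lam * loss_adv
    loss.backward()
    optimizer.step()
    return float(loss.item())
```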
                     
                   
                
               
                     3.6 Ethical and Privacy Considerations
                  The deployment of AI-driven voice biometric systems raises several ethical, legal,
                     and privacy challenges that must not be overlooked.
                  
                  
                        3.6.1 Data Ownership and Consent
Voice recordings are inherently personal and can reveal far more than just identity, including health status, emotions, or even mental state. It is imperative that users retain control over their data. Consent must be explicit, revocable, and informed. Systems must be designed with privacy-by-default and privacy-by-design principles[18].
                     
                   
                  
                        3.6.2 Bias and Fairness
                     Bias in training data is a serious concern. Voice datasets such as Common Voice and
                        VoxCeleb, while large, may not be fully balanced in terms of gender, age, dialect,
or accent. This can lead to biased model behavior, for instance, better recognition accuracy for male over female speakers or for native English speakers over non-native ones.
                        Ensuring equitable performance requires diverse training data and continuous auditing
                        for fairness[19].
                     
                   
                  
                        3.6.3 Secure and Private Learning Approaches
                     Federated Learning offers a solution by enabling decentralized training. In this framework,
                        models are trained locally on user devices without transmitting raw audio to central
                        servers. This not only preserves privacy but also reduces the risk of data leaks and
                        attacks on central repositories[9,20].
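As a purely illustrative sketch of this idea, the snippet below shows a FedAvg-style weighted average of locally trained model parameters; it is not the full protocol described in [20], and the aggregation details are assumptions.

```python
import torch

@torch.no_grad()
def federated_average(client_states, client_sizes):
    """Average client state_dicts, weighting each client by its local dataset size."""
    total = float(sum(client_sizes))
    return {
        key: sum((n / total) * state[key].float() for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# Server round (sketch): clients train locally on-device and send only their weights;
# the server aggregates them and broadcasts the result, e.g.
# global_model.load_state_dict(federated_average(states, sizes))
```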
                     
                   
                  
                        3.6.4 Regulatory Compliance
                     Deployment in real-world systems must comply with data protection regulations such
                        as GDPR, CCPA, or national laws. Explainable AI and audit trails must be in place
                        to ensure accountability and legal transparency[21,22].
                     
                   
                
             
            
                  4. Conclusion and Future Work	
               This study presented a comparative evaluation of CNN, LSTM, and Wav2Vec 2.0 architectures
                  for voice biometric authentication under clean, noisy, and spoofed conditions. Wav2Vec
                  2.0 consistently outperformed the other models in accuracy and robustness, although
                  none of the approaches achieved complete resistance to high-quality spoofing attacks.
               
               In addition to the main findings, several broader conclusions can be drawn:
               ∙ Accuracy vs. Efficiency Trade-off: While CNN and LSTM provide faster inference suitable
                  for deployment on edge devices, their robustness against spoofing remains limited.
                  Wav2Vec 2.0, although computationally heavier, delivers superior generalization and
                  noise resilience, suggesting its suitability for cloud-based or hybrid systems.
               
               ∙ Vulnerability to Emerging Attacks: All models demonstrated weaknesses against novel
                  TTS systems and adversarial perturbations, confirming that spoof detection remains
                  one of the most critical bottlenecks in voice biometrics.
               
               ∙ Importance of Data Diversity: Cross-dataset experiments revealed that limited domain
                  coverage in training data reduces generalization. This emphasizes the necessity of
                  large-scale, diverse, and continuously updated datasets for robust biometric authentication.
               
               Future research directions include several promising avenues:
               ∙ Hybrid Architectures: Combining CNN’s efficient feature extraction, LSTM’s sequential
                  modeling, and transformer-based contextual representations could lead to improved
                  balance between robustness and efficiency.
               
               ∙ Adversarial and Spoof-Aware Training: Integrating adversarial training strategies
                  and explicit spoof detection modules (e.g., LCNNs or spectro-temporal anomaly detectors)
                  to mitigate vulnerabilities.
               
               ∙ Multimodal Biometric Fusion: Exploring fusion of voice with facial recognition,
                  lip movements, or behavioral biometrics to enhance overall security.
               
               ∙ On-Device Deployment: Investigating lightweight transformer variants (e.g., DistilWav2Vec,
                  quantization, pruning) for mobile and embedded systems.
               
               ∙ Privacy-Preserving Learning: Applying federated learning and differential privacy
                  methods to protect sensitive voice data while maintaining high accuracy.
               
               Ultimately, future systems must not only achieve technical robustness but also comply
                  with ethical, legal, and privacy requirements to ensure trustworthy real-world deployment.
               
             
          
         
            
                  Acknowledgements
               
                  This research was supported by the Science Committee of the Ministry of Science and
                  Higher Education of the Republic of Kazakhstan (grant no. BR28712579).
                  
                  			
               
             
            
                  
                     References
                  
                     
                        
                        H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi and N. Evans, “Automatic speaker
                           verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,”
                           2022. arXiv preprint arXiv:2202.12233

 
                     
                        
S. Novoselov, G. Lavrentyeva, A. Avdeeva, V. Volokhov and A. Gusev, “Robust speaker recognition with transformers using wav2vec 2.0,” 2022. arXiv preprint arXiv:2203.15095

 
                     
                        
                        A. Mukasheva, D. Koishiyeva, Z. Suimenbayeva, S. Rakhmetulayeva, A. Bolshibayeva and
                           G. Sadikova, “Comparison Evaluation of Unet-Based Models with Noise Augmentation for
                           Breast Cancer Segmentation on Ultrasound Images,” Eastern-European Journal of Enterprise
Technologies, vol. 125, no. 9, 2023. 10.15587/1729-4061.2023.289044

 
                     
                        
                        N. Vaessen and D. A. Van Leeuwen, “Fine-tuning wav2vec2 for speaker recognition,”
                           In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
                           Processing (ICASSP), IEEE, pp. 7967-7971, 2022. 10.1109/ICASSP43922.2022.9746952

 
                     
                        
                        K. Li, C. Baird and D. Lin, “Defend data poisoning attacks on voice authentication,”
                           IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 4, pp. 1754-1769,
                           2023. 10.1109/TDSC.2023.3289446

 
                     
                        
                        J. W. Lee, E. Kim, J. Koo and K. Lee, “Representation selective self-distillation
and wav2vec 2.0 feature exploration for spoof-aware speaker verification,” 2022. arXiv preprint arXiv:2204.02639

 
                     
                        
                        S. Salturk and N. Kahraman, “Deep learning-powered multimodal biometric authentication:
                           integrating dynamic signatures and facial data for enhanced online security,” Neural
                           Computing and Applications, vol. 36, no. 19, pp. 11311-11322, 2024. 10.1007/s00521-024-09690-2

 
                     
                        
                        K. Merit and M. Beladgham, “Enhancing Biometric Security with Bimodal Deep Learning
                           and Feature-level Fusion of Facial and Voice Data,” Journal of Telecommunications
                           and Information Technology, vol. 98, no. 4, pp. 31-42, 2024. 10.26636/jtit.2024.4.1754

 
                     
                        
Y. Elbayoumi, “Applying machine learning and deep learning in the voice biometrics technology,” Master’s Thesis, Bahcesehir University, January 2024. https://www.researchgate.net/publication/380131916

 
                     
                        
                        K. Koutini, H. Eghbal-zadeh and G. Widmer, “Receptive field regularization techniques
                           for audio classification and tagging with deep convolutional neural networks,” IEEE/ACM
                           Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1987-2000, 2021.
                           10.1109/TASLP.2021.3082307

 
                     
                        
                        T. N. Sainath, O. Vinyals, A. Senior and H. Sak, “Convolutional, long short-term memory,
                           fully connected deep neural networks,” In 2015 IEEE international conference on acoustics,
                           speech and signal processing (ICASSP) IEEE, pp. 4580-4584. 2015. 10.1109/ICASSP.2015.7178838

 
                     
                        
                        G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang and W. Xu, “Dolphinattack: Inaudible voice
                           commands,” In Proceedings of the 2017 ACM SIGSAC conference on computer and communications
                           security, IEEE, pp. 103-117. 2017. 10.1145/3133956.3134052

 
                     
                        
                        A. Mohamed, H.-Y. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff,
                           et al., “Self-supervised speech representation learning: A review,” IEEE Journal of
                           Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179-1210, 2022. 10.1109/JSTSP.2022.3207050

 
                     
                        
                        X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, et
                           al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM
                           Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507-2522, 2023.
                           10.1109/TASLP.2023.3285283

 
                     
                        
                        S. Tuli and N. K. Jha, “EdgeTran: Device-aware co-search of transformers for efficient
                           inference on mobile edge platforms,” IEEE Transactions on Mobile Computing, vol. 23,
                           no. 6, pp. 7012-7029, 2023. 10.1109/TMC.2023.3328287

 
                     
                        
                        S. Dhar, J. Guo, J. Liu, S. Tripathi, U. Kurup and M. Shah, “A survey of on-device
                           machine learning: An algorithms and learning theory perspective,” ACM Transactions
                           on Internet of Things, vol. 2, no. 3, pp. 1-49, 2021. 10.1145/3450494

 
                     
                        
                        E. Seitzhan, A. Bissembayev, A. Mukasheva, H. S. Park and J. W. Kang, “A Study on
                           the Optimization Efficiency of Software Development with Low-Code Platforms,” Transactions
                           of the Korean Institute of Electrical Engineers, vol. 74, no. 5, pp. 957-968, 2025.
                           10.5370/KIEE.2025.74.5.957

 
                     
                        
                        L. H. X. Ng, A. C. M. Lim, A. X. W. Lim and A. Taeihagh, “Digital ethics for biometric
                           applications in a smart city,” Digital Government: Research and Practice, vol. 4,
                           no. 4, pp. 1-6, 2023. 10.1145/3630261

 
                     
                        
                        A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R.
                           Rickford, D. Jurafsky and S. Goel, “Racial disparities in automated speech recognition,”
                           Proceedings of the national academy of sciences, vol. 117, no. 14, pp. 7684-7689,
                           2020. 10.1073/pnas.1915768117

 
                     
                        
                        K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon,
                           et al., “Towards federated learning at scale: System design,” Proceedings of machine
                           learning and systems, vol. 1, pp. 374-388, 2019. https://proceedings.mlsys.org/paper_files/paper/2019/file/7b770da633baf74895be22a8807f1a8f-Paper.pdf

 
                     
                        
P. Voigt and A. Von dem Bussche, “The EU General Data Protection Regulation (GDPR): A Practical Guide,” 1st ed., Cham: Springer International Publishing, 2017. 10.1007/978-3-319-57959-7

 
                     
                        
                        S. Wachter, B. Mittelstadt and C. Russell, “Counterfactual explanations without opening
                           the black box: Automated decisions and the GDPR,” Harvard Journal of Law & Technology,
                           vol. 31, no. 2, pp. 841-887, 2017. 10.2139/ssrn.3063289

 
                   
                
             
About the Authors
            
            He received the B.S. degree in Computer Systems and Software from Kazakh-British Technical
               University (KBTU), Almaty, Kazakhstan, in 2022. Since 2024, he has been pursuing the
               M.S. degree in Information Systems at the School of Information Technology and Engineering,
               KBTU. He is currently working as a freelance Java Backend developer. His research
               interests include machine learning, artificial intelligence, and voice biometric authentication.
            
            
            She is studying at Nazarbayev University’s School of Sciences and Humanities and is
               currently in her junior year as a sociology student. Her research interests include
               a wide variety of subjects, including quantitative and qualitative research, as well
               as policy implementation.
            
            
            He received his B.S., M.S., and Ph.D. degrees in electronic engineering from Chung-Ang
               University, Seoul, Korea, in 1995, 1997, and 2002, respectively. In March 2008, he
               joined the Korea National University of Transportation, Republic of Korea, where he
               currently holds the position of Professor in the Department of Transportation System
               Engineering, the Department of SMART Railway System, and the Department of Smart Railway
               and Transportation Engineering.
            
            
She received the B.S., M.S., and Ph.D. degrees from Satbayev University, Almaty, Kazakhstan, in 2004, 2014, and 2020, respectively. In September 2023, she joined Kazakh-British Technical University, where she is currently a professor in the School of Information Technology and Engineering. Her research interests include Big Data, cyber security, machine learning, and the comparative study of deep learning methods.