Oralbek Bayazov (1), Anel Aidos (2), Jeong Won Kang (†), Assel Mukasheva (††)

- (1) School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan.
- (2) School of Science and Humanities, Nazarbayev University, Astana, Kazakhstan.
 
            
            
            
            
            
            
            
               
                  
Key words

Voice biometrics, Wav2Vec 2.0, Spoof detection, LSTM, Federated learning, Multimodal authentication, Deep learning
             
            
          
         
            
                  1. Introduction       	
                The increasing reliance on digital platforms for banking, education, healthcare,
                  and communication has significantly amplified the need for robust and seamless authentication
                  systems. Voice biometric authentication is gaining momentum due to its convenience,
                  contactless operation, and potential for integration into everyday technologies such
as smartphones, virtual assistants, and call centers. Unlike fingerprint or facial
                  recognition systems, voice-based methods do not require physical contact or camera
                  access, making them especially useful in low-resource or privacy-sensitive contexts.
                  
               
Nevertheless, the practicality of voice biometrics is hindered by multiple factors. Variations in microphones, environmental noise, changes in a speaker's health or emotional state, and, most critically, spoofing attacks can undermine system performance. Spoofing, through replayed recordings or synthetic speech, presents a unique challenge because it can closely imitate legitimate input.
               
Artificial Intelligence, and particularly deep learning, has opened new avenues to enhance the accuracy and adaptability of voice authentication systems[1,2]. Unlike traditional signal processing techniques, deep neural networks can learn hierarchical representations of speech signals, enabling them to generalize across different conditions. This study seeks to evaluate and compare three distinct neural architectures, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Wav2Vec 2.0, for their efficacy in authenticating speakers under challenging conditions.
               
             
            
                  2. Materials and Methods	
               This study adopts a rigorous mathematical and experimental framework to evaluate the
                  robustness of CNN, LSTM, and Wav2Vec 2.0 models for voice biometric authentication.
                  CNN was chosen for its ability to capture local spectral patterns from spectrograms,
                  LSTM for modeling long-term temporal dependencies in speech, and Wav2Vec 2.0 as a
                  state-of-the-art transformer that learns contextual representations directly from
                  raw audio. The methodology is divided into several subsections covering data representation,
                  model architectures, and evaluation metrics. 
               
               
                     2.1 Data Representation
Each voice sample can be represented as a discrete-time signal:

$$x = \left[x(1),\, x(2),\, \dots,\, x(T)\right]$$

Here, $x(t)$ denotes the raw speech waveform, and $T$ is the total number of sampled points.
                  
For the CNN and LSTM models, these signals are transformed into mel-spectrograms and mel-frequency cepstral coefficients (MFCCs):

$$S(f,\tau)=\left|\operatorname{STFT}\{x(t)\}\right|^{2},\qquad \mathrm{MFCC}=\operatorname{DCT}\!\left(\log\!\left(\operatorname{Mel}\!\left(S(f,\tau)\right)\right)\right)$$

where STFT denotes the short-time Fourier transform, producing a time-frequency representation, while MFCC captures perceptually relevant spectral features. In contrast, Wav2Vec 2.0 directly consumes the raw waveform without handcrafted transformations.
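As an illustration of this front-end, the following is a minimal sketch of mel-spectrogram and MFCC extraction, assuming the librosa library and a placeholder 16 kHz file path ("audio.wav"); the frame sizes and filterbank settings are illustrative and not necessarily those used in the experiments.

```python
import librosa

# Load a mono waveform at 16 kHz (the path is a placeholder for illustration).
x, sr = librosa.load("audio.wav", sr=16000, mono=True)

# Mel-spectrogram: |STFT|^2 mapped onto a mel filterbank, then log-compressed.
mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=512, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)        # input representation for the CNN

# MFCCs: DCT of the log mel energies, a compact input for the LSTM.
mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)

print(log_mel.shape, mfcc.shape)          # (n_mels, frames), (n_mfcc, frames)
```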
                  
                
               
                     2.2 Convolutional Neural Network (CNN)
CNNs are designed to capture local spatial features from spectrograms. A convolutional layer can be mathematically expressed as:

$$h_{i,\: j}^{(l)}=\sigma\!\left(\sum_{m}\sum_{n} w_{m,\: n}^{(l)}\, x_{i+m,\: j+n}^{(l-1)}+b^{(l)}\right)$$

where $w_{m,\: n}^{(l)}$ are the convolutional kernel weights, $b^{(l)}$ is the bias, and $\sigma$ denotes the ReLU activation function. This formulation describes how each feature map is generated by applying learnable filters to the spectrogram, enabling the extraction of edges, frequency bands, and temporal structures.
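The sketch below shows how such a convolutional classifier over log-mel inputs can be assembled in PyTorch; the layer counts, channel widths, and number of speakers are assumptions for illustration, not the configuration evaluated in this study.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small 2-D CNN over (1, n_mels, frames) log-mel spectrograms."""
    def __init__(self, n_speakers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),                # fixed-size summary of the feature maps
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.features(x)                             # (B, 32, 4, 4)
        return self.classifier(z.flatten(1))             # (B, n_speakers) logits

model = SpectrogramCNN(n_speakers=10)
logits = model(torch.randn(8, 1, 64, 200))               # batch of 8 dummy spectrograms
```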
                  
                
               
                     2.3 Long Short-Term Memory (LSTM)
LSTM networks are recurrent architectures specialized in modeling sequential dependencies. The state updates can be written as:

$$
\begin{aligned}
f_{t} &= \sigma\!\left(W_{f}\left[h_{t-1},\, x_{t}\right]+b_{f}\right)\\
i_{t} &= \sigma\!\left(W_{i}\left[h_{t-1},\, x_{t}\right]+b_{i}\right)\\
o_{t} &= \sigma\!\left(W_{o}\left[h_{t-1},\, x_{t}\right]+b_{o}\right)\\
c_{t} &= f_{t}\odot c_{t-1}+i_{t}\odot\tanh\!\left(W_{c}\left[h_{t-1},\, x_{t}\right]+b_{c}\right)\\
h_{t} &= o_{t}\odot\tanh\!\left(c_{t}\right)
\end{aligned}
$$

Here, $f_{t}$, $i_{t}$, and $o_{t}$ represent the forget, input, and output gates, respectively, while $c_{t}$ is the memory cell state. This mechanism allows the network to selectively retain or discard information, making it suitable for speech data with long-term temporal dependencies.
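A minimal PyTorch sketch of an LSTM classifier over MFCC sequences follows; the hidden size, layer count, and number of speakers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MFCCLSTM(nn.Module):
    """LSTM over (frames, n_mfcc) sequences; the final hidden state feeds the classifier."""
    def __init__(self, n_mfcc: int, n_speakers: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)           # h_n: (num_layers, B, hidden)
        return self.classifier(h_n[-1])      # logits from the top layer's final state

model = MFCCLSTM(n_mfcc=13, n_speakers=10)
logits = model(torch.randn(8, 200, 13))      # 8 utterances, 200 frames, 13 MFCCs each
```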
                  
                
               
                     2.4 Wav2Vec 2.0
Wav2Vec 2.0 operates on raw audio using a transformer-based encoder and employs self-supervised pretraining. Its objective is expressed through a contrastive loss function:

$$\mathcal{L}_{t}=-\log\frac{\exp\!\left(\operatorname{sim}\!\left(c_{t},\, q_{t}\right)/\kappa\right)}{\sum_{\tilde{q}\in Q_{t}}\exp\!\left(\operatorname{sim}\!\left(c_{t},\,\tilde{q}\right)/\kappa\right)}$$

where $c_{t}$ is the contextualized representation of the masked time step, $q_{t}$ is the correct (positive) quantized target, $\tilde{q}\in Q_{t}$ ranges over $q_{t}$ together with negative distractor samples, $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\kappa$ is a temperature parameter. This objective ensures that the model learns discriminative features directly from raw audio without relying on hand-crafted transformations.
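For illustration, the sketch below extracts contextual Wav2Vec 2.0 features with the Hugging Face transformers implementation and the public facebook/wav2vec2-base checkpoint; the checkpoint choice, pooling, and downstream classifier are assumptions, not the fine-tuning setup of this study.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Public pretrained checkpoint (assumed here purely for illustration).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)                      # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, frames, 768) contextual features

# Mean-pool the frame features into one utterance embedding for a speaker classifier.
embedding = hidden.mean(dim=1)                     # (1, 768)
```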
                  
                
               
                     2.5 Evaluation Metrics
                  
                  
To quantify performance, three main metrics are considered:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \mathrm{FAR}=\frac{FP}{FP+TN},\qquad \mathrm{FRR}=\frac{FN}{FN+TP}$$

Accuracy measures overall classification performance, while the False Acceptance Rate (FAR) evaluates how often spoofed or unauthorized voices are incorrectly accepted. The False Rejection Rate (FRR) measures how often genuine users are rejected.
                  
                  In addition, robustness is defined as the relative performance drop when the models
                     are evaluated under noisy or spoofed conditions compared to clean scenarios. This
                     provides a more practical assessment of real-world deployment.
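For completeness, a minimal sketch of how these quantities can be computed from verification scores is given below; the threshold, the label convention (1 = genuine, 0 = impostor/spoof), and the example numbers are illustrative assumptions.

```python
import numpy as np

def far_frr_accuracy(scores: np.ndarray, labels: np.ndarray, threshold: float):
    """Accuracy, FAR and FRR from similarity scores; accept when score >= threshold."""
    accept = scores >= threshold
    tp = np.sum(accept & (labels == 1)); tn = np.sum(~accept & (labels == 0))
    fp = np.sum(accept & (labels == 0)); fn = np.sum(~accept & (labels == 1))
    accuracy = (tp + tn) / len(labels)
    far = fp / max(fp + tn, 1)            # unauthorized or spoofed inputs wrongly accepted
    frr = fn / max(fn + tp, 1)            # genuine users wrongly rejected
    return accuracy, far, frr

def robustness_drop(acc_clean: float, acc_degraded: float) -> float:
    """Relative accuracy drop between clean and noisy/spoofed conditions."""
    return (acc_clean - acc_degraded) / acc_clean

print(robustness_drop(0.92, 0.85))        # hypothetical clean vs. degraded accuracies
```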
                  
                
               
                     2.6 Data Augmentation
                  To improve robustness and generalization, several data augmentation techniques were
                     applied to the training datasets. These augmentations simulate real-world variability
                     and adversarial conditions:
                  
                  
- Additive Gaussian Noise:

$$x'(t)=x(t)+n(t),\qquad n(t)\sim\mathcal{N}\!\left(0,\,\sigma^{2}\right)$$

where $x(t)$ is the clean speech signal and $\sigma^{2}$ is the noise variance. This method simulates microphone and environmental noise.

- Reverberation: Convolution of the signal with a room impulse response (RIR):

$$x'(t)=x(t)*h_{\mathrm{RIR}}(t)$$

- Pitch Shifting: Frequency modification applied via phase vocoder:

$$X'(f)=X(f/\alpha)$$

where $\alpha$ is the pitch scaling factor.

- Speed Perturbation: Temporal scaling applied to the waveform:

$$x'(t)=x(\beta t)$$

where $\beta$ is the temporal scaling factor.

- Background Speech Mixing: Random segments from other speakers are linearly mixed:

$$x'(t)=(1-\lambda)\,x(t)+\lambda\, y(t)$$

where $y(t)$ is another speaker's audio and $\lambda$ is a mixing coefficient.
                  These augmentation strategies increase dataset variability, reduce overfitting, and
                     enhance spoofing resistance by exposing models to challenging acoustic conditions
                     during training. Similar to other works where noise augmentation was applied to improve
                     the robustness of CNN-based models in medical imaging tasks [3], our study incorporated Gaussian noise, reverberation, and pitch shifting to simulate
                     realistic acoustic environments.
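A minimal numpy sketch of three of these augmentations (additive noise, reverberation, and background mixing) is shown below; the noise level, the toy impulse response, and the mixing weight are illustrative assumptions, and pitch shifting or speed perturbation would typically be delegated to a library such as librosa (librosa.effects.pitch_shift, librosa.effects.time_stretch).

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """x'(t) = x(t) + n(t), n ~ N(0, sigma^2): simulates microphone and ambient noise."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def add_reverberation(x: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve the signal with a room impulse response and trim to the original length."""
    return np.convolve(x, rir, mode="full")[: len(x)]

def mix_background(x: np.ndarray, y: np.ndarray, lam: float = 0.2) -> np.ndarray:
    """x'(t) = (1 - lambda) x(t) + lambda y(t) with another speaker's audio y."""
    y = np.resize(y, x.shape)             # crop or tile the interfering segment
    return (1.0 - lam) * x + lam * y

x = rng.standard_normal(16000)            # placeholder 1 s waveform at 16 kHz
rir = np.exp(-np.linspace(0.0, 8.0, 800)) # toy exponentially decaying impulse response
x_aug = mix_background(add_reverberation(add_gaussian_noise(x), rir), rng.standard_normal(16000))
```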
                  
                  Figure 1 shows the spectrogram comparison of clean and augmented speech signals. The first
                     spectrogram represents a clean speech waveform with distinct harmonic structures.
                     The second spectrogram illustrates the effect of additive Gaussian noise, where random
                     high-frequency components blur the formant patterns. The third spectrogram demonstrates
                     reverberation, introducing temporal smearing caused by simulated room impulse responses.
                     The fourth spectrogram shows pitch shifting, where the spectral bands are displaced
                     due to frequency scaling. Finally, the fifth spectrogram presents speed perturbation,
                     which compresses or stretches temporal features, altering the rhythm of speech. Together,
                     these augmentations increase data variability and simulate realistic acoustic environments
                     for model training.
                  
                  
                        
                        
Fig. 1. Examples of spectrograms after different augmentation techniques
                      
                
             
            
                  3. Results and Discussion	
               The comparative results reveal significant differences in how each architecture handles
                  noise and spoofing. CNN models, while effective on clean spectrograms, suffered considerable
                  performance degradation when noise or distortions were introduced. Their reliance
                  on static visual patterns limited adaptability. 
               
               LSTM networks showed superior noise handling due to their ability to model time-series
                  dynamics. However, they struggled with spoofed inputs, especially those generated
                  via high-quality TTS, suggesting their temporal memory alone is insufficient for spoof
                  detection.
               
Wav2Vec 2.0, as hypothesized, delivered the highest accuracy overall. Its ability to process raw audio signals allowed it to learn deep representations resilient to distortion and background interference[2,7]. In clean conditions, it achieved 92 percent accuracy, with minimal performance drop under noisy conditions. However, even Wav2Vec 2.0 misclassified certain high-fidelity synthetic voices as genuine, indicating that spoofing remains a system-wide vulnerability[1].
               
As shown in Figure 2, Wav2Vec 2.0 leads in all three key metrics (accuracy, noise robustness, and spoof resistance) compared to CNN and LSTM.
               
               
                     
                     
Fig. 2. Model comparison of performance for three key metrics
                   
               These results align with recent studies showing the superiority of transformer-based
                  models in speech processing tasks[3]. However, the persistent vulnerability to spoofing across all models confirms findings
                  by other researchers that neural networks, regardless of their depth, can be deceived
                  by audio crafted to imitate human speech[7,8].
               
               Figure 3 presents the spoof resistance distribution across models, demonstrating that while
                  Wav2Vec 2.0 performs better, it still accepts over 25 percent of spoofed samples.
                  This confirms the need for integrating explicit spoof detection modules, such as Light
                  Convolutional Neural Networks (LCNN), and training models with adversarial examples
                  crafted from state-of-the-art voice cloning tools[6,8].
               
Furthermore, incorporating multi-modal biometric fusion, e.g., combining voice with facial or behavioral signals, could significantly reduce spoofing risk[9]. Alternatively, privacy-aware architectures such as federated learning may allow decentralized training on user devices, mitigating the risk of centralized data leakage[10].
               
               The key evaluation metrics are summarized in Table 1, enabling a side-by-side comparison of the three architectures in numerical terms.
               
               
                     
                     
Fig. 3. Model comparison by performance metrics
                   
               
                     
                     
Table 1. Summary of Model Evaluation Metrics
                  
                  
                        
                           
| Model       | Accuracy | Noise Robustness | Spoof Resistance |
| CNN         | 78%      | 65%              | 52%              |
| LSTM        | 85%      | 78%              | 67%              |
| Wav2Vec 2.0 | 92%      | 88%              | 75%              |
                     
                  
                
               
                     3.1 Cross-Dataset Evaluation
                  To evaluate the generalization capability of the models, a cross-dataset experiment
                     was conducted: models were trained on the Mozilla Common Voice dataset and tested
                     on VoxCeleb. Results show a significant performance drop for CNN and LSTM due to overfitting
                     to spectrogram-specific features. Wav2Vec 2.0 demonstrated stronger generalization,
                     maintaining 84% accuracy compared to 92% on in-domain data.
                  
                
               
                     3.2 Spoof Generalization
                  An additional experiment was carried out where models were trained with one type of
                     text-to-speech (TTS) synthesis and tested with another unseen TTS system. Both CNN
                     and LSTM models failed to adapt, showing spoof acceptance rates above 40%. Wav2Vec
                     2.0 reduced the error but still accepted 28% of spoofed samples, highlighting the
                     challenge of unseen synthetic voices.
                  
                
               
                     3.3 Adversarial Attack Robustness
To simulate targeted spoofing, adversarial perturbations were generated using the Fast Gradient Sign Method (FGSM):

$$x_{\mathrm{adv}}=x+\epsilon\cdot\operatorname{sign}\!\left(\nabla_{x}\mathcal{L}\!\left(f(x),\, y\right)\right)$$

where $\epsilon$ is the perturbation budget. Even with a small $\epsilon=0.01$, CNN and LSTM misclassified 45% and 39% of samples, respectively, while Wav2Vec 2.0 showed improved robustness with only 22% error.
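A minimal PyTorch sketch of this attack is given below; model stands for any of the differentiable classifiers above, and the amplitude clamp is an illustrative assumption about the waveform range.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x: torch.Tensor, y: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """x_adv = x + epsilon * sign(grad_x L(f(x), y)) for a batch of inputs x with labels y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(-1.0, 1.0).detach()   # keep samples in a valid amplitude range
```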
                  
Figure 4 compares the accuracy of the CNN, LSTM, and Wav2Vec 2.0 models across four experimental conditions: clean speech, noisy speech, cross-dataset evaluation, and adversarial attacks. Wav2Vec 2.0 consistently outperforms the other models, showing stronger robustness to noise, dataset variability, and adversarial perturbations.
                  
                  
                        
                        
Fig. 4. Model performance under different experimental conditions
                      
                
               
                     3.4 Discussion
                  The comparative results presented in the previous section demonstrate that Wav2Vec
                     2.0 outperforms CNN and LSTM models across multiple evaluation metrics. However, a
                     deeper analysis reveals crucial insights into why these models behave differently
                     under clean, noisy, and spoofed conditions.
                  
                  CNNs perform well on clean mel spectrograms due to their ability to extract spatial
                     features. However, they lack temporal modeling capabilities, making them more vulnerable
                     to variations in speech dynamics and environmental changes. When Gaussian noise or
                     reverb is introduced, CNN performance drops sharply, indicating overreliance on static
                     patterns[11].
                  
                  LSTM networks, being recurrent in nature, demonstrate stronger resistance to temporal
                     distortion[12]. They can adapt to background conversations or inconsistent pacing in speech. However,
                     their inability to detect anomalies in spectral patterns leads to a higher false acceptance
                     rate during spoofing attacks, especially when using high-quality TTS inputs[13].
                  
                  Wav2Vec 2.0 exhibits clear advantages by processing raw waveform data directly. Its
                     transformer-based architecture allows it to extract multi-level, contextualized features[2,14]. This robustness contributes to its strong noise resistance and superior generalization.
                     However, it too struggles with advanced spoofing, misclassifying some AI-generated
                     voices as genuine[1,7,15].
                  
                  Further analysis of the confusion matrices indicates that while all models can distinguish
                     between genuine and replayed inputs fairly well, they struggle when presented with
                     deepfake voices generated using state-of-the-art TTS systems[6,16]. Figure 5 shows the confusion matrix for Wav2Vec 2.0 under spoofing conditions.
                  
                  
                        
                        
Fig. 5. Confusion matrix of Wav2Vec 2.0 under spoofing conditions (TTS vs. genuine)
                      
                  Moreover, latency analysis showed that Wav2Vec 2.0 requires more computational resources
                     due to the transformer layers, which could limit its deployment on edge devices[17].
                  
                  To summarize, while Wav2Vec 2.0 is a promising solution, deploying it in real-world
                     scenarios requires complementing it with specialized anti-spoofing modules and optimizing
                     it for lightweight inference.
                  
                  
                        3.4.1 Error Analysis
                     A detailed examination of the error cases shows that CNN frequently accepted spoofed
                        samples with stable spectral envelopes, indicating its overreliance on static features.
                        LSTM demonstrated difficulty when spoofed voices contained long pauses or irregular
                        temporal dynamics, which disrupted its sequence modeling. Wav2Vec 2.0 performed better
                        overall but still misclassified advanced TTS-based voices, especially those reproducing
                        natural coarticulation and prosodic variations. These findings highlight that spoof
                        detection remains an open challenge across all architectures.
                     
                   
                  
                        3.4.2 Latency and Computational Efficiency
                     Alongside recognition accuracy, inference time was also evaluated. CNN achieved the
                        fastest processing (≈12 ms per sample) owing to its lightweight convolutional layers,
                        while LSTM required slightly longer (≈18 ms) due to sequential recurrence. Wav2Vec
                        2.0, despite offering the highest accuracy, was the slowest (≈45 ms), primarily because
                        of its transformer layers and contextual embeddings. This trade-off indicates that
                        while Wav2Vec 2.0 is most robust, CNN and LSTM may still be preferable for deployment
                        on resource-constrained or edge devices. Recent works also emphasize optimization-oriented
                        approaches in software engineering, such as the use of low-code platforms to improve
                        efficiency[17].
                     
                   
                
               
                     3.5 Mathematical Robustness Analysis
                  To provide a deeper theoretical perspective, the robustness of voice biometric models
                     can be formalized using mathematical definitions and performance bounds.
                  
                  
                        3.5.1 Signal-to-Noise Ratio (SNR) and Degradation
The impact of noise on speech signals is quantified through the SNR:

$$\mathrm{SNR}=10\log_{10}\frac{\sum_{t} x(t)^{2}}{\sum_{t}\left(x'(t)-x(t)\right)^{2}}$$

where $x(t)$ is the clean signal and $x'(t)$ is the noisy version. A higher SNR corresponds to clearer input, while a lower SNR indicates stronger noise contamination. Model robustness can be expressed as the relative drop in accuracy with decreasing SNR.
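As a small sketch of this definition, assuming equal-length numpy arrays for the clean and degraded signals:

```python
import numpy as np

def snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """SNR = 10 * log10( sum x(t)^2 / sum (x'(t) - x(t))^2 ), in decibels."""
    noise_energy = np.sum((noisy - clean) ** 2)
    return 10.0 * np.log10(np.sum(clean ** 2) / noise_energy)
```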
                     
                   
                  
                        3.5.2 False Acceptance and Rejection Bounds
                     The overall reliability of authentication can be expressed via the FAR and FRR. A
                        robust system minimizes both rates simultaneously. However, in practice, there exists
                        a trade-off: reducing one rate typically increases the other. To summarize overall
                        system performance, the Equal Error Rate (EER) is commonly used. 
                     
The EER is defined as the operating point where FAR equals FRR:

$$\mathrm{EER}=\mathrm{FAR}\!\left(\tau^{*}\right)=\mathrm{FRR}\!\left(\tau^{*}\right)$$

where $\tau^{*}$ is the decision threshold that balances the two errors. When an exact equality is not attainable due to discrete thresholds, $\tau^{*}$ is chosen as

$$\tau^{*}=\arg\min_{\tau}\left|\mathrm{FAR}(\tau)-\mathrm{FRR}(\tau)\right|$$

with the corresponding approximation:

$$\mathrm{EER}\approx\frac{\mathrm{FAR}\!\left(\tau^{*}\right)+\mathrm{FRR}\!\left(\tau^{*}\right)}{2}$$
                     This measure provides a single scalar value to compare biometric systems and is widely
                        reported in speaker verification studies.
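The following sketch computes the EER by sweeping thresholds over observed scores, following the approximation above; the synthetic score distributions and the label convention (1 = genuine, 0 = impostor) are illustrative assumptions.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return (FAR + FRR) / 2 at the threshold minimizing |FAR - FRR|."""
    thresholds = np.unique(scores)
    genuine, impostor = scores[labels == 1], scores[labels == 0]
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.7, 0.1, 500), rng.normal(0.4, 0.1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(equal_error_rate(scores, labels))   # toy example with synthetic score distributions
```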
                     
                   
                  
                        3.5.3 Spoof Detection under Adversarial Perturbations
Spoofing attacks can be formalized as adversarial perturbations to the input signal:

$$x'=x+\delta,\qquad\left\|\delta\right\|_{\infty}\le\epsilon$$

where $\delta$ is the perturbation bounded by $\epsilon$. The adversarial objective maximizes classification error:

$$\max_{\left\|\delta\right\|_{\infty}\le\epsilon}\;\mathcal{L}\!\left(f(x+\delta),\, y\right)$$
                     CNN and LSTM models exhibit higher sensitivity to small perturbations, while transformer-based
                        models like Wav2Vec 2.0 provide partial robustness but remain vulnerable when 𝜖 is
                        sufficiently large.
                     
                   
                  
                        3.5.4 Robust Training Objective
To mitigate these vulnerabilities, adversarial training modifies the loss function:

$$\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\lambda\,\mathcal{L}_{\mathrm{adv}}$$

where $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy loss, $\mathcal{L}_{\mathrm{adv}}$ penalizes misclassification under adversarial perturbations, and $\lambda$ balances the two objectives. This formulation ensures that models not only fit clean data but also resist spoofed and noisy inputs.
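A minimal training-step sketch of this objective is shown below, reusing the fgsm_attack helper sketched in Section 3.3; the weighting λ, the perturbation budget, and the optimizer are illustrative assumptions rather than the settings used in the experiments.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon: float = 0.01, lam: float = 0.5):
    """One optimization step of L = L_CE(clean) + lambda * L_adv(FGSM-perturbed)."""
    model.train()
    x_adv = fgsm_attack(model, x, y, epsilon)   # adversarial copy of the batch (Section 3.3)
    optimizer.zero_grad()
    loss_ce = F.cross_entropy(model(x), y)
    loss_adv = F.cross_entropy(model(x_adv), y)
    loss = loss_ce + lam * loss_adv
    loss.backward()
    optimizer.step()
    return float(loss.item())
```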
                     
                   
                
               
                     3.6 Ethical and Privacy Considerations
                  The deployment of AI-driven voice biometric systems raises several ethical, legal,
                     and privacy challenges that must not be overlooked.
                  
                  
                        3.6.1 Data Ownership and Consent
Voice recordings are inherently personal and can reveal far more than just identity, including health status, emotions, or even mental state. It is imperative that users retain control over their data. Consent must be explicit, revocable, and informed. Systems must be designed with privacy-by-default and privacy-by-design principles[18].
                     
                   
                  
                        3.6.2 Bias and Fairness
                     Bias in training data is a serious concern. Voice datasets such as Common Voice and
                        VoxCeleb, while large, may not be fully balanced in terms of gender, age, dialect,
or accent. This can lead to biased model behavior, for instance, better recognition accuracy for male over female speakers or for native English speakers over non-native ones.
                        Ensuring equitable performance requires diverse training data and continuous auditing
                        for fairness[19].
                     
                   
                  
                        3.6.3 Secure and Private Learning Approaches
                     Federated Learning offers a solution by enabling decentralized training. In this framework,
                        models are trained locally on user devices without transmitting raw audio to central
                        servers. This not only preserves privacy but also reduces the risk of data leaks and
                        attacks on central repositories[9,20].
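As a purely illustrative sketch of this idea, the snippet below shows a FedAvg-style weighted average of locally trained model parameters; it is not the full protocol described in [20], and the aggregation details are assumptions.

```python
import torch

@torch.no_grad()
def federated_average(client_states, client_sizes):
    """Average client state_dicts, weighting each client by its local dataset size."""
    total = float(sum(client_sizes))
    return {
        key: sum((n / total) * state[key].float() for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# Server round (sketch): clients train locally on-device and send only their weights;
# the server aggregates them and broadcasts the result, e.g.
# global_model.load_state_dict(federated_average(states, sizes))
```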
                     
                   
                  
                        3.6.4 Regulatory Compliance
                     Deployment in real-world systems must comply with data protection regulations such
                        as GDPR, CCPA, or national laws. Explainable AI and audit trails must be in place
                        to ensure accountability and legal transparency[21,22].
                     
                   
                
             
            
                  4. Conclusion and Future Work	
               This study presented a comparative evaluation of CNN, LSTM, and Wav2Vec 2.0 architectures
                  for voice biometric authentication under clean, noisy, and spoofed conditions. Wav2Vec
                  2.0 consistently outperformed the other models in accuracy and robustness, although
                  none of the approaches achieved complete resistance to high-quality spoofing attacks.
               
               In addition to the main findings, several broader conclusions can be drawn:
               ∙ Accuracy vs. Efficiency Trade-off: While CNN and LSTM provide faster inference suitable
                  for deployment on edge devices, their robustness against spoofing remains limited.
                  Wav2Vec 2.0, although computationally heavier, delivers superior generalization and
                  noise resilience, suggesting its suitability for cloud-based or hybrid systems.
               
               ∙ Vulnerability to Emerging Attacks: All models demonstrated weaknesses against novel
                  TTS systems and adversarial perturbations, confirming that spoof detection remains
                  one of the most critical bottlenecks in voice biometrics.
               
               ∙ Importance of Data Diversity: Cross-dataset experiments revealed that limited domain
                  coverage in training data reduces generalization. This emphasizes the necessity of
                  large-scale, diverse, and continuously updated datasets for robust biometric authentication.
               
               Future research directions include several promising avenues:
               ∙ Hybrid Architectures: Combining CNN’s efficient feature extraction, LSTM’s sequential
                  modeling, and transformer-based contextual representations could lead to improved
                  balance between robustness and efficiency.
               
               ∙ Adversarial and Spoof-Aware Training: Integrating adversarial training strategies
                  and explicit spoof detection modules (e.g., LCNNs or spectro-temporal anomaly detectors)
                  to mitigate vulnerabilities.
               
               ∙ Multimodal Biometric Fusion: Exploring fusion of voice with facial recognition,
                  lip movements, or behavioral biometrics to enhance overall security.
               
               ∙ On-Device Deployment: Investigating lightweight transformer variants (e.g., DistilWav2Vec,
                  quantization, pruning) for mobile and embedded systems.
               
               ∙ Privacy-Preserving Learning: Applying federated learning and differential privacy
                  methods to protect sensitive voice data while maintaining high accuracy.
               
               Ultimately, future systems must not only achieve technical robustness but also comply
                  with ethical, legal, and privacy requirements to ensure trustworthy real-world deployment.
               
             
          
         
            
                  Acknowledgements
               
                  This research was supported by the Science Committee of the Ministry of Science and
                  Higher Education of the Republic of Kazakhstan (grant no. BR28712579).
                  
                  			
               
             
            
                  
                     References
                  
                     
                        
                        H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi and N. Evans, “Automatic speaker
                           verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,”
                           2022. arXiv preprint arXiv:2202.12233

 
                     
                        
S. Novoselov, G. Lavrentyeva, A. Avdeeva, V. Volokhov and A. Gusev, “Robust speaker recognition with transformers using wav2vec 2.0,” 2022. arXiv preprint arXiv:2203.15095

 
                     
                        
                        A. Mukasheva, D. Koishiyeva, Z. Suimenbayeva, S. Rakhmetulayeva, A. Bolshibayeva and
                           G. Sadikova, “Comparison Evaluation of Unet-Based Models with Noise Augmentation for
                           Breast Cancer Segmentation on Ultrasound Images,” Eastern-European Journal of Enterprise
Technologies, vol. 125, no. 9, 2023. 10.15587/1729-4061.2023.289044

 
                     
                        
                        N. Vaessen and D. A. Van Leeuwen, “Fine-tuning wav2vec2 for speaker recognition,”
                           In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
                           Processing (ICASSP), IEEE, pp. 7967-7971, 2022. 10.1109/ICASSP43922.2022.9746952

 
                     
                        
                        K. Li, C. Baird and D. Lin, “Defend data poisoning attacks on voice authentication,”
                           IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 4, pp. 1754-1769,
                           2023. 10.1109/TDSC.2023.3289446

 
                     
                        
                        J. W. Lee, E. Kim, J. Koo and K. Lee, “Representation selective self-distillation
and wav2vec 2.0 feature exploration for spoof-aware speaker verification,” 2022. arXiv preprint arXiv:2204.02639

 
                     
                        
                        S. Salturk and N. Kahraman, “Deep learning-powered multimodal biometric authentication:
                           integrating dynamic signatures and facial data for enhanced online security,” Neural
                           Computing and Applications, vol. 36, no. 19, pp. 11311-11322, 2024. 10.1007/s00521-024-09690-2

 
                     
                        
                        K. Merit and M. Beladgham, “Enhancing Biometric Security with Bimodal Deep Learning
                           and Feature-level Fusion of Facial and Voice Data,” Journal of Telecommunications
                           and Information Technology, vol. 98, no. 4, pp. 31-42, 2024. 10.26636/jtit.2024.4.1754

 
                     
                        
Y. Elbayoumi, “Applying machine learning and deep learning in the voice biometrics technology,” Master’s Thesis, Bahcesehir University, January 2024. https://www.researchgate.net/publication/380131916

 
                     
                        
                        K. Koutini, H. Eghbal-zadeh and G. Widmer, “Receptive field regularization techniques
                           for audio classification and tagging with deep convolutional neural networks,” IEEE/ACM
                           Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1987-2000, 2021.
                           10.1109/TASLP.2021.3082307

 
                     
                        
                        T. N. Sainath, O. Vinyals, A. Senior and H. Sak, “Convolutional, long short-term memory,
                           fully connected deep neural networks,” In 2015 IEEE international conference on acoustics,
                           speech and signal processing (ICASSP) IEEE, pp. 4580-4584. 2015. 10.1109/ICASSP.2015.7178838

 
                     
                        
                        G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang and W. Xu, “Dolphinattack: Inaudible voice
                           commands,” In Proceedings of the 2017 ACM SIGSAC conference on computer and communications
                           security, IEEE, pp. 103-117. 2017. 10.1145/3133956.3134052

 
                     
                        
                        A. Mohamed, H.-Y. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff,
                           et al., “Self-supervised speech representation learning: A review,” IEEE Journal of
                           Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179-1210, 2022. 10.1109/JSTSP.2022.3207050

 
                     
                        
                        X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, et
                           al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM
                           Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507-2522, 2023.
                           10.1109/TASLP.2023.3285283

 
                     
                        
                        S. Tuli and N. K. Jha, “EdgeTran: Device-aware co-search of transformers for efficient
                           inference on mobile edge platforms,” IEEE Transactions on Mobile Computing, vol. 23,
                           no. 6, pp. 7012-7029, 2023. 10.1109/TMC.2023.3328287

 
                     
                        
                        S. Dhar, J. Guo, J. Liu, S. Tripathi, U. Kurup and M. Shah, “A survey of on-device
                           machine learning: An algorithms and learning theory perspective,” ACM Transactions
                           on Internet of Things, vol. 2, no. 3, pp. 1-49, 2021. 10.1145/3450494

 
                     
                        
                        E. Seitzhan, A. Bissembayev, A. Mukasheva, H. S. Park and J. W. Kang, “A Study on
                           the Optimization Efficiency of Software Development with Low-Code Platforms,” Transactions
                           of the Korean Institute of Electrical Engineers, vol. 74, no. 5, pp. 957-968, 2025.
                           10.5370/KIEE.2025.74.5.957

 
                     
                        
                        L. H. X. Ng, A. C. M. Lim, A. X. W. Lim and A. Taeihagh, “Digital ethics for biometric
                           applications in a smart city,” Digital Government: Research and Practice, vol. 4,
                           no. 4, pp. 1-6, 2023. 10.1145/3630261

 
                     
                        
                        A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R.
                           Rickford, D. Jurafsky and S. Goel, “Racial disparities in automated speech recognition,”
                           Proceedings of the national academy of sciences, vol. 117, no. 14, pp. 7684-7689,
                           2020. 10.1073/pnas.1915768117

 
                     
                        
                        K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon,
                           et al., “Towards federated learning at scale: System design,” Proceedings of machine
                           learning and systems, vol. 1, pp. 374-388, 2019. https://proceedings.mlsys.org/paper_files/paper/2019/file/7b770da633baf74895be22a8807f1a8f-Paper.pdf

 
                     
                        
P. Voigt and A. Von dem Bussche, “The EU General Data Protection Regulation (GDPR): A Practical Guide,” 1st ed., Cham: Springer International Publishing, 2017. 10.1007/978-3-319-57959-7

 
                     
                        
                        S. Wachter, B. Mittelstadt and C. Russell, “Counterfactual explanations without opening
                           the black box: Automated decisions and the GDPR,” Harvard Journal of Law & Technology,
                           vol. 31, no. 2, pp. 841-887, 2017. 10.2139/ssrn.3063289

 
                   
                
             
About the Authors
            
            He received the B.S. degree in Computer Systems and Software from Kazakh-British Technical
               University (KBTU), Almaty, Kazakhstan, in 2022. Since 2024, he has been pursuing the
               M.S. degree in Information Systems at the School of Information Technology and Engineering,
               KBTU. He is currently working as a freelance Java Backend developer. His research
               interests include machine learning, artificial intelligence, and voice biometric authentication.
            
            
            She is studying at Nazarbayev University’s School of Sciences and Humanities and is
               currently in her junior year as a sociology student. Her research interests include
               a wide variety of subjects, including quantitative and qualitative research, as well
               as policy implementation.
            
            
            He received his B.S., M.S., and Ph.D. degrees in electronic engineering from Chung-Ang
               University, Seoul, Korea, in 1995, 1997, and 2002, respectively. In March 2008, he
               joined the Korea National University of Transportation, Republic of Korea, where he
               currently holds the position of Professor in the Department of Transportation System
               Engineering, the Department of SMART Railway System, and the Department of Smart Railway
               and Transportation Engineering.
            
            
She received the B.S., M.S., and Ph.D. degrees from Satbayev University, Almaty, Kazakhstan, in 2004, 2014, and 2020, respectively. In September 2023, she joined Kazakh-British Technical University, where she is currently a professor in the School of Information Technology and Engineering. Her research interests include Big Data, cyber security, machine learning, and the comparative study of deep learning methods.