Ho Sun Shon¹, Kong Vungsovanreach², Seok Joong Yun³, Jin Woo Oh⁴, Tae Gun Kang⁵, Kyung Ah Kim†
¹ Medical Research Institute, College of Medicine, Chungbuk National University, Korea.
² Dept. of Big Data, Chungbuk National University, Korea.
³ Dept. of Urology, College of Medicine, Chungbuk National University and Chungbuk National University Hospital, Korea.
⁴ Dept. of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.
⁵ Institute for Trauma Research, College of Medicine, Korea University, Korea.
† Corresponding author
Copyright © The Korean Institute of Electrical Engineers (KIEE)
Key words
Kidney tumor, Stage classification, Deep learning, Feature extraction
1. Introduction
Analyzing genomic data on the gene expression underlying biological phenomena is challenging because the number of genes far exceeds the number of patients. Recently, various studies have been conducted utilizing such bio data. In particular, the efficiency, accuracy, and speed of research are being enhanced by AI technology in the field of biotechnology for data storage, purification, and analysis (1).
We aim to contribute to the early diagnosis, prognosis, and prediction of cancer in
patients by extracting significant genes using genetic data from kidney cancer and
developing a classification model based on the extracted genes. Kidney cancer is a
rapidly increasing cancer and is often referred to as a silent cancer. The likelihood that the classic symptoms, flank pain, hematuria (blood in the urine), and a palpable abdominal mass, will all appear is only 10-15%. Kidney cancer typically presents with no noticeable symptoms, and in 3 out of 10 cases it has already metastasized to other organs by the time it is found.
Therefore, implementing appropriate treatment based on the tumor stage of kidney cancer
patients is a crucial task that demands a strategic approach.
Kidney cancer is a primary tumor that occurs in the kidney, and renal cell carcinoma,
a malignant tumor, accounts for more than 90% of cases. Kidney cancer typically does
not show symptoms in the early stages, often already reaching a progressive stage
by the time of diagnosis. According to data released by the National Cancer Information
Center in 2022, kidney cancer accounted for 2.4% of all new cancer cases in Korea
in 2020, ranking 10th in incidence (2). The incidence of kidney cancer is higher in men than in women, and it occurs most frequently
in people in their 60s. Additionally, kidney cancer places a significant disease burden
due to a decline in the quality of life resulting from disease symptoms, treatment-related
adverse events, and the subsequent increase in medical costs. Risk factors for kidney
cancer include environmental habits, lifestyle factors, genetic predispositions, and
existing kidney disease. Among these, lifestyle factors like smoking, obesity, high
blood pressure, and dietary habits can be contributing causes (3). Recently, Korean researchers developed an algorithm to predict kidney cancer recurrence, and ongoing research focuses on extracting features and implementing classification algorithms using neighborhood component analysis and genomic data (4)-(6). Machine learning algorithms are being applied to various biodata analyses,
including RNA sequencing, DNA methylation analysis in breast invasive carcinoma, thyroid
carcinoma, and kidney renal papillary cell carcinoma data from The Cancer Genome Atlas
(TCGA) (7). A data mining algorithm was employed to extract cancer-related genes by integrating
the data. Additionally, a study predicted the risk of 20 cancers by applying machine
learning techniques to analyze genetic big data (8). A Bayesian classifier has been utilized to classify proteins based on sequence and
structure information, enhancing the functional prediction performance of genes by
integrating diverse protein and gene-related information using a Bayesian network
(9). Research efforts have also been directed towards accurately predicting major mutations
responsible for spinal muscular atrophy, hereditary nasal polyposis, colorectal cancer,
and autism. This is achieved by applying deep learning technology to predict the patient's
disease state through the analysis of mutations present in the gene sequence (10). Various methods for extracting features from gene expression data have been studied, most recently methods based on deep learning and statistical techniques (11)-(13). Deep learning algorithms are also widely used for feature extraction from biomedical images (14).
In this study, we extracted significant genes from gene expression datasets using
two algorithms: autoencoder (AE) and variational autoencoder (VAE). We then compared
and analyzed the tumor stage classification performance of kidney cancer. Classification
analysis based on tumor stage allows for the analysis of complex data, such as gene
expression data, and improves classification accuracy. This approach can serve as
a foundation for analyzing other gene expression data, and various machine learning
algorithms can be employed to analyze medical data.
2. Materials and methods
This section describes the dataset as well as all of the techniques applied in this
study. Fig. 1 depicts the overall research flow, from dataset collection to model
evaluation.
Fig. 1. An end-to-end experimental flow of our deep learning framework used for staging
the kidney tumor
2.1 Dataset
This study used a gene expression dataset obtained from the TCGA website, which
provides access to a wide range of biomedical datasets, including mRNA data. The dataset,
which included information from 1,157 kidney cancer patients, was meticulously prepared
for analysis. Although the original dataset included both gene expression and clinical
data, only gene expression data was used in this study. To ensure accuracy and consistency,
the dataset was cleaned and preprocessed to remove missing, duplicate, and invalid
values, as well as clinical data. Tables 1 and 2 provide detailed statistics about the dataset, including the number of samples for each stage and information on the data features. The "Before Cleaning" row refers
to the original dataset, which was downloaded from the TCGA website without any preprocessing,
whereas the "After Cleaning" row refers to the dataset after various cleaning and
preprocessing techniques were applied. This meticulous preparation of the gene expression
dataset contributed to the subsequent analyses producing reliable and meaningful results.
Table 1. Tumor stage data statistics before and after data cleaning
Condition       | Stage1 | Stage2 | Stage3 | Stage4 | Invalid | Total
Before Cleaning | 528    | 183    | 261    | 146    | 39      | 1,157
After Cleaning  | 477    | 153    | 228    | 115    | 0       | 973
Table 2. Gene expression data statistics before and after data cleaning
Condition       | Number of samples | Gene expression features | Clinical data features | Total features
Before Cleaning | 1,157             | 60,483                   | 29                     | 60,512
After Cleaning  | 973               | 58,722                   | 0                      | 58,722
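As an illustration of this cleaning step, a minimal pandas sketch is given below. The file name and the column conventions (a tumor_stage label and clinical_* column prefixes) are hypothetical, since the exact TCGA download format is not reproduced here.

```python
# Hypothetical sketch of the cleaning step; file and column names are assumed.
import pandas as pd

df = pd.read_csv("tcga_kidney_expression.csv")  # assumed file: 1,157 samples

# Keep only rows whose tumor stage is one of the four valid stages.
valid_stages = {"Stage1", "Stage2", "Stage3", "Stage4"}
df = df[df["tumor_stage"].isin(valid_stages)]

# Drop duplicate samples and any gene columns containing missing values.
df = df.drop_duplicates()
df = df.dropna(axis=1)

# Discard clinical columns so that only gene expression features remain.
clinical_cols = [c for c in df.columns if c.startswith("clinical_")]
df = df.drop(columns=clinical_cols)

print(df.shape)  # expected to approach (973, 58722) plus the stage label
```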
2.2 Correlation test
Naturally, a gene expression dataset consists of a large number of features, and each
feature represents a unique gene of a patient. Among this massive number of features,
some are highly correlated, while others are weakly correlated or not correlated at all. By performing a correlation test on the dataset, we can significantly improve the performance of the classification models (15), since redundant features can hamper them. We therefore used a Pearson correlation coefficient-based method to identify and remove redundant features, keeping only the 1,000 least redundant features. The Pearson correlation coefficient is defined as:
$r=\frac{\sum_{i}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i}\left(x_i-\bar{x}\right)^2 \sum_{i}\left(y_i-\bar{y}\right)^2}}$
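In code, this filtering step might look as follows. The redundancy of each feature is scored here by its strongest absolute correlation with any other feature; the exact ranking rule used in our pipeline may differ in detail, and for tens of thousands of genes the correlation matrix would need to be computed in chunks.

```python
# Sketch: keep the k features with the lowest redundancy, where redundancy is
# a feature's strongest absolute Pearson correlation with any other feature.
# X is assumed to be a samples-by-genes pandas DataFrame.
import numpy as np
import pandas as pd

def least_redundant(X: pd.DataFrame, k: int = 1000) -> pd.DataFrame:
    corr = X.corr(method="pearson").abs()   # pairwise |r| between features
    np.fill_diagonal(corr.values, 0.0)      # ignore self-correlation
    redundancy = corr.max(axis=1)           # strongest correlation per feature
    keep = redundancy.nsmallest(k).index    # k least redundant features
    return X[keep]

X_filtered = least_redundant(X, k=1000)
```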
2.3 Feature selection
After the features were filtered using the correlation test, feature selection was applied as a second method to reduce the dimensionality of the features.
Feature selection can help in getting rid of irrelevant data as well as dealing with
noise in the dataset (16), (17). Two feature selection techniques were selected and combined to choose a list of
highly relevant features from thousands of features. These techniques are Least Absolute
Shrinkage and Selection Operator (LASSO) and Analysis of Variance (ANOVA). LASSO is a linear regression method that uses shrinkage to drive regression coefficients toward zero (18). ANOVA is a well-known statistical hypothesis test used to determine whether there are significant differences in the means of two or more groups (19).
Three experiments, using only LASSO, using only ANOVA, and combining LASSO and ANOVA, were conducted to find the feature selection strategy that gave the best classification result. By evaluating the results produced by these
three strategies, we can observe that combining LASSO with ANOVA can improve classification
performance. We combined these two techniques' results by selecting the first 500
relevant features from the intersection of both results.
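The combination step can be sketched with scikit-learn as follows; the LASSO alpha value and the use of the integer stage label as a regression target for LASSO are illustrative assumptions, not settings reported in the text.

```python
# Sketch of combining LASSO and ANOVA feature selection; alpha is assumed.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import Lasso

# LASSO: keep features whose coefficients were not shrunk to zero
# (the integer stage label y is treated as a regression target here).
lasso = Lasso(alpha=0.01).fit(X_filtered, y)
lasso_idx = set(np.flatnonzero(lasso.coef_))

# ANOVA: rank features by the F-statistic across the four stage groups.
anova = SelectKBest(score_func=f_classif, k="all").fit(X_filtered, y)
anova_rank = np.argsort(anova.scores_)[::-1]

# Intersect both results and take the first 500 features in ANOVA order.
selected = [i for i in anova_rank if i in lasso_idx][:500]
X_selected = X_filtered.iloc[:, selected]
```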
2.4 Feature extraction
When a dataset contains many features, two popular methods can be used for dimension reduction: the feature selection technique described earlier and feature extraction. The main goal of feature extraction is to create a smaller version of the original dataset while keeping the meaning of the original dataset (20). Besides dimension reduction, faster training and lower computational cost are well-known advantages of feature extraction (21)-(23). This research used two feature extraction techniques: AE and VAE.
2.4.1 AE
An AE is an unsupervised artificial neural network that compresses and encodes the data and then reconstructs the original data from the encoded representation (24). It is widely used for dimension reduction and noise removal. This neural network consists of three components: the encoder, the bottleneck, and the decoder (25). The encoder compresses the original dataset into a lower-dimensional version while trying to preserve its original meaning. The bottleneck stores the compressed version of the data. The decoder decompresses the data from the bottleneck to reconstruct the original data. The AE learns by observing the reconstruction error and attempting to minimize it so that the original and reconstructed data look as similar as possible.
Fig. 2. The architecture of the AE model (26)
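A minimal PyTorch sketch of this encoder-bottleneck-decoder structure is shown below; the 256-unit hidden layer, learning rate, and epoch count are assumptions, while the bottleneck matches the 50 dimensions used in this study.

```python
# Minimal AE sketch; hidden size and training settings are assumptions.
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, n_in=500, n_latent=50):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 256), nn.ReLU(),
            nn.Linear(256, n_latent),        # bottleneck: compressed features
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_in),            # reconstruction of the input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # reconstruction error

# x_train: float tensor of shape (n_samples, 500) with the selected features.
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x_train), x_train)  # minimize reconstruction error
    loss.backward()
    opt.step()
```

After training, the 50-dimensional encoder output serves as the extracted feature set passed to the classifiers.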
2.4.2 VAE
A VAE follows the same encoder-bottleneck-decoder structure but modifies the bottleneck component to make the model generative. A VAE learns the probability distribution of the data instead of mapping each input to a fixed point (27). Two latent vectors, the mean and the variance, are learned in the bottleneck component, so new data can be generated by sampling a random value from that distribution.
Fig. 3. The architecture of the VAE model (26)
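The following sketch, mirroring the AE above, shows how the bottleneck learns a mean and a log-variance and samples from them via the reparameterization trick; the layer sizes are again assumptions.

```python
# VAE sketch: the bottleneck outputs a mean and log-variance instead of a
# single code, and the latent vector is sampled from that distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_in=500, n_latent=50):
        super().__init__()
        self.enc = nn.Linear(n_in, 256)
        self.fc_mu = nn.Linear(256, n_latent)       # latent mean
        self.fc_logvar = nn.Linear(256, n_latent)   # latent log-variance
        self.dec = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(), nn.Linear(256, n_in)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * epsilon.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    mse = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kld
```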
2.5 Synthetic minority over-sampling technique
Our dataset is skewed or imbalanced, which means that the number of observations for
each class is not equal or close to each other. Table 1, which describes the data statistics, shows that tumor stage 1 observations cover around 50% of the dataset, leaving the other 50% for the remaining three tumor stages. Imbalanced data can cause problems for machine learning models by biasing the model toward classes with more samples (28)-(30). To solve this problem, we applied an oversampling technique called the Synthetic Minority Over-sampling Technique (SMOTE) to generate a resampled dataset containing the same number of observations for all classes (31). Prior to applying SMOTE, the training dataset exhibited an unequal distribution across
the kidney tumor stages. Stage 1 had the highest count with 312 records, followed
by Stage 3 with 189, Stage 2 with 105, and Stage 4 being the least represented with
76 instances. In a bid to rectify this imbalance, we applied the SMOTE technique with
the "auto" hyperparameter, which adopts an adaptive sampling strategy. After the SMOTE
application, the total count of records for all tumor stages combined equaled 1248,
with each individual stage-Stage 1 to Stage 4-having a balanced representation of
312 records.
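With the imbalanced-learn package, this resampling step reduces to a few lines; the random_state value below is an arbitrary choice added for reproducibility.

```python
# Oversample the training split so every stage has 312 records.
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy="auto", random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
# Before: Stage1=312, Stage2=105, Stage3=189, Stage4=76 (total 682).
# After:  312 records per stage (total 1,248).
```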
2.6 Classifiers
We applied nine widely used classification algorithms, allowing us to assess how well they performed against one another. These methods are logistic regression (LR) (32), support vector machine (SVM) (33), decision tree (DT) (34), random forest (RF) (35), k-nearest neighbor (KNN) (36), naïve Bayes (NB) (37), AdaBoost (ADA) (38), XGBoost (XGB) (39), and the stochastic gradient descent classifier (SGD) (40). For the selected classifiers, the hyperparameter configurations are as follows:
For LR, the settings are C=1, penalty="l2", and solver="liblinear". SVM is set with
C=1, kernel="rbf", and gamma="scale". DT is configured with max_depth=5. RF uses n_estimators=100
and max_depth=5. KNN utilizes n_neighbors=5. NB operates with default settings. ADA
uses n_estimators=50. XGB is set with n_estimators=100 and learning_rate=0.1. SGD uses max_iter=1000 and tol=1e-3 and is wrapped in probability calibration.
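These configurations can be expressed with scikit-learn and the xgboost package as sketched below; probability=True for the SVM is an added assumption, needed so that class probabilities (and hence AUC) can be computed.

```python
# The reported classifier configurations, expressed as a sketch.
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

classifiers = {
    "LR":  LogisticRegression(C=1, penalty="l2", solver="liblinear"),
    "SVM": SVC(C=1, kernel="rbf", gamma="scale", probability=True),
    "DT":  DecisionTreeClassifier(max_depth=5),
    "RF":  RandomForestClassifier(n_estimators=100, max_depth=5),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB":  GaussianNB(),          # default settings
    "ADA": AdaBoostClassifier(n_estimators=50),
    "XGB": XGBClassifier(n_estimators=100, learning_rate=0.1),
    # SGD is wrapped in probability calibration so predict_proba is available.
    "SGD": CalibratedClassifierCV(SGDClassifier(max_iter=1000, tol=1e-3)),
}
```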
2.7 Model evaluation metrics
This section discusses the evaluation metrics for all the classifiers listed above.
The seven most popular evaluation metrics were used to compare the performance of
those classifiers, including accuracy, recall, precision, the f1-score, sensitivity,
specificity, and area under the curve (AUC) (41). To balance the model performance across classes, we use macro-averaged precision,
recall, and f1-score to get the overall average value for each metric. Besides these metrics, sensitivity and specificity were also calculated to capture the true positive and true negative rates, respectively. In the formulas below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
$\begin{aligned}
\text{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN} \\
\text{Precision} &= \frac{TP}{TP+FP} \\
\text{Recall} &= \frac{TP}{TP+FN} \\
\text{F1-score} &= \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall}+\text{Precision}} \\
\text{Sensitivity} &= \frac{TP}{TP+FN} \\
\text{Specificity} &= \frac{TN}{TN+FP}
\end{aligned}$
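A sketch of how these metrics can be computed for the four-class problem with scikit-learn follows; since scikit-learn provides no specificity function, it is derived from the per-class confusion matrix and macro-averaged, matching the treatment of the other metrics.

```python
# Macro-averaged metrics for the four-stage classification problem.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_proba):
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)                    # per-class true positives
    fp = cm.sum(axis=0) - tp            # per-class false positives
    fn = cm.sum(axis=1) - tp            # per-class false negatives
    tn = cm.sum() - (tp + fp + fn)      # per-class true negatives
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "recall":      recall_score(y_true, y_pred, average="macro"),
        "precision":   precision_score(y_true, y_pred, average="macro"),
        "f1":          f1_score(y_true, y_pred, average="macro"),
        "sensitivity": np.mean(tp / (tp + fn)),   # macro true-positive rate
        "specificity": np.mean(tn / (tn + fp)),   # macro true-negative rate
        "auc":         roc_auc_score(y_true, y_proba, multi_class="ovr"),
    }
```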
3. Experiment results
This section presents the result of the experiment from the beginning to the end,
including feature extraction with deep learning techniques and classification results
with multiple classification algorithms.
3.1 Feature extraction with deep learning
We reduced the total number of features from 58,722 to 1,000 by filtering with the correlation test, and then from 1,000 to 500 by applying the feature selection techniques. From these 500 features, we applied feature extraction to reduce the dataset dimension to 50 features. We used two feature extraction algorithms, AE and VAE, to conduct the experiment. Based on the experimental results, the original and reconstructed datasets were observed to be similar, which demonstrates the effectiveness of the feature extraction. Fig. 4 shows the models' loss values, and Fig. 5 visualizes the reconstructed data points compared to the original data points when using the AE and VAE models. Of the two algorithms, the AE model produced the reconstructed dataset most similar to the original dataset.
Fig. 4. Training and validation loss of (a) AE and (b) VAE as feature extraction techniques
Fig. 5. Sample data reconstruction using (a) AE and (b) VAE
3.2 Classification results
This section shows the results of the nine machine learning classifiers used to classify patient kidney tumor stages. Before training the classifiers, we preprocessed the data with correlation analysis, feature selection, feature extraction, and oversampling to improve classification performance. Tables 3 and 4 report the evaluation metrics of all classifiers when using the AE and VAE, respectively, for feature extraction, with and without the oversampling technique applied. In these two tables, the "Sampling" column indicates whether the oversampling technique was applied.
Table 3 shows the performance of all classifiers when using the AE model to extract the features. In this case, the support vector machine outperformed the other classifiers in most evaluation metrics, with XGBoost, naïve Bayes, and decision tree showing the next best results. We can see that SVM improves significantly after applying the oversampling technique to the dataset.
Table 3. Evaluation of prediction models using AE as feature extraction
Classifier | Sampling | Accuracy | Recall | Precision | F1-Score | Sensitivity | Specificity | AUC
LR         | Yes      | 0.890    | 0.860  | 0.863     | 0.858    | 0.860       | 0.964       | 0.968
LR         | No       | 0.873    | 0.814  | 0.844     | 0.825    | 0.814       | 0.957       | 0.955
SVM        | Yes      | 0.984    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.985
SVM        | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.973
DT         | Yes      | 0.964    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.983
DT         | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.960
RF         | Yes      | 0.952    | 0.945  | 0.945     | 0.944    | 0.945       | 0.983       | 0.983
RF         | No       | 0.904    | 0.847  | 0.903     | 0.860    | 0.847       | 0.966       | 0.958
KNN        | Yes      | 0.846    | 0.829  | 0.814     | 0.817    | 0.829       | 0.950       | 0.948
KNN        | No       | 0.822    | 0.762  | 0.794     | 0.769    | 0.762       | 0.939       | 0.925
NB         | Yes      | 0.973    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.983
NB         | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.960
ADA        | Yes      | 0.753    | 0.728  | 0.587     | 0.630    | 0.728       | 0.925       | 0.930
ADA        | No       | 0.829    | 0.726  | 0.639     | 0.669    | 0.726       | 0.942       | 0.912
XGB        | Yes      | 0.963    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.983
XGB        | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.963
SGD        | Yes      | 0.870    | 0.838  | 0.839     | 0.831    | 0.838       | 0.958       | 0.953
SGD        | No       | 0.884    | 0.826  | 0.867     | 0.838    | 0.826       | 0.959       | 0.950
Moreover, a different result is shown in Table 4, which reports the performance when the VAE was used to extract the features. The performance changed when we changed the feature extraction algorithm; in this case, it decreased compared to the AE model. The table shows that SVM still outperforms most of the classifiers in terms of AUC value, followed by XGBoost, naïve Bayes, and decision tree.
Table 4. Evaluation of prediction models using VAE as feature extraction
Classifier | Sampling | Accuracy | Recall | Precision | F1-Score | Sensitivity | Specificity | AUC
LR         | Yes      | 0.863    | 0.810  | 0.865     | 0.833    | 0.810       | 0.949       | 0.928
LR         | No       | 0.849    | 0.812  | 0.834     | 0.815    | 0.812       | 0.946       | 0.918
SVM        | Yes      | 0.918    | 0.886  | 0.929     | 0.904    | 0.886       | 0.967       | 0.943
SVM        | No       | 0.918    | 0.886  | 0.929     | 0.904    | 0.886       | 0.967       | 0.945
DT         | Yes      | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
DT         | No       | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
RF         | Yes      | 0.870    | 0.844  | 0.888     | 0.830    | 0.844       | 0.952       | 0.935
RF         | No       | 0.901    | 0.877  | 0.912     | 0.880    | 0.877       | 0.961       | 0.938
KNN        | Yes      | 0.678    | 0.590  | 0.654     | 0.581    | 0.590       | 0.885       | 0.843
KNN        | No       | 0.634    | 0.610  | 0.611     | 0.588    | 0.610       | 0.880       | 0.818
NB         | Yes      | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
NB         | No       | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
ADA        | Yes      | 0.791    | 0.653  | 0.619     | 0.610    | 0.653       | 0.927       | 0.875
ADA        | No       | 0.702    | 0.653  | 0.559     | 0.553    | 0.653       | 0.906       | 0.875
XGB        | Yes      | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.915
XGB        | No       | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
SGD        | Yes      | 0.860    | 0.794  | 0.869     | 0.810    | 0.794       | 0.949       | 0.915
SGD        | No       | 0.829    | 0.772  | 0.808     | 0.774    | 0.772       | 0.941       | 0.901
Tables 3 and 4 show the evaluation results of all classifiers as an average over the four tumor classes. Fig. 6 adds more detail to the area under the curve (AUC) values listed in the above tables by showing the receiver operating characteristic (ROC) curve and AUC value for each tumor stage, without averaging. We only show the ROC curves for the case where the AE is used for feature extraction, since we already observed that the AE outperformed the other feature extraction method.
Fig. 6. ROC and AUC for each classifier when using AE as the feature extraction and
SMOTE as an oversampling
4. Discussion and conclusion
The survival rate of kidney cancer patients who receive early treatment is higher than that of those treated at an advanced stage (42). This is why technology is widely used to help patients receive health information and treatment as soon as possible, based on patient data such as clinical and biomedical data (43), (44).
In this study, we used different algorithms to detect and classify the stage of kidney
tumors. An open dataset published on the TCGA portal was cleaned, preprocessed, and
used to build classification models. A correlation analysis, feature extraction, feature
selection, and oversampling technique were also used for the purpose of boosting the
classifiers’ performance.
With this gene expression dataset, we observed from the results that the AE outperformed
the VAE as a feature extraction technique. Classifier performance has significantly
improved in most evaluation metrics, including accuracy, recall, precision, and f1-score.
So, feature extraction and feature selection can improve classification models by
reducing the dimensionality of the input data, removing irrelevant and redundant features,
and highlighting the most important features for the specific task. This leads to
more efficient and accurate models and faster training and inference times. The above
result shows the enhancement of the classification model when different techniques,
including feature selection, feature extraction, and resampling, are applied.
In conclusion, the classification of kidney tumor stages using feature extraction
techniques such as AE and VAE in conjunction with various classifiers such as LR,
RF, DT, SVM, and others has been shown to be an effective method for accurately classifying
kidney tumors. The combination of feature extraction techniques and classifiers was used to extract the most important features from the gene expression data and build models that can accurately classify tumors into different stages.
The results obtained from these methods have been promising, with high accuracy and
precision in classifying tumor stages. This is particularly important for the early
detection and treatment of kidney tumors, which can improve patient outcomes and reduce
healthcare costs. Furthermore, using multiple classifiers such as LR, RF, DT, and
SVM allows us to compare the performance of the different models and select the best
one for a specific dataset.
To continue contributing to the medical sector, we plan to improve classification
performance by combining both clinical and biomedical data of patients for training
classifiers. Moreover, we plan to detect and identify the list of genes that contribute the most to kidney cancer. As a future study, we intend to utilize explainable artificial intelligence (XAI) to explain which features of the gene expression data contributed to the predictive models, so that users can understand and trust the results of machine learning analysis. XAI can be used to describe AI models, their expected impacts, and potential biases, and is useful for characterizing model accuracy, fairness, transparency, and outcomes in AI-based decision-making. With an accurate list of essential genes, a doctor could pay more attention to those genes rather than spending time on less essential ones.
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00245300, No. 2020R1I1A1A01065199, 2020R1I1A3062508) and by the "Regional Innovation Strategy (RIS)" program through the NRF funded by the MOE (2021RIS-001).
References
L. A. Gottlieb, A. Kontorovich, R. Krauthgamer, 2016, Adaptive metric dimensionality
reduction, Theoretical Computer Science, Vol. 620, pp. 105-118
National Cancer Center. Available online: https://ncc.re.kr/index (accessed 17 August 2023)
H. Chi, I. H. Chang, 2018, The Overdiagnosis of Kidney Cancer in Koreans and the Active
Surveillance on Small Renal Mass, Korean J Urol Oncol, Vol. 16, No. 1, pp. 15-24
A. M. Ali, H. Zhuang, A. Ibrahim, O. Rehman, M. Huang, A. Wu, Nov. 2018, A machine
learning approach for the classification of kidney cancer subtypes using miRNA genome
data, Appl Sci, Vol. 8, No. 2422, pp. 1-14
H. M. Kim, S. J. Lee, S. J. Park, I. Y. Choi, S. Hong, 2021, Machine Learning Approach
to Predict the Probability of Recurrence of Renal Cell Carcinoma After Surgery: Prediction
Model Development Study, JMIR Med Inform, Vol. 9, No. 3
A. J. Peired, R. Campi, M. L. Angelotti, G. Antonelli, C. Conte, E. Lazzeri, F. Becherucci, L. Calistri, S. Serni, P. Romagnani, 2021, Sex and Gender Differences in Kidney Cancer: Clinical and Experimental Evidence, Cancers, Vol. 13, No. 18, Art. 4588
Genomic Data Commons. Available online: https://portal.gdc.cancer.gov (accessed 17 August 2023)
B. J. Kim, S. H. Kim, 2018, Prediction of inherited genomic susceptibility to 20 common
cancer types by a supervised machine-learning method, Proc Natl Acad Sci USA, Vol.
115, No. 6, pp. 1322-1327
O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, D. Botstein, 2003, A Bayesian
framework for combining heterogeneous data sources for gene function prediction (in
S. cerevisiae), Proc Natl Acad Sci USA, Vol. 100, No. 14, pp. 8348-8353
N. E. M. Khalifa, M. H. N. Taha, D. E. Ali, A. Slowik, A. E. Hassanien, 2020, Artificial
intelligence technique for gene expression by tumor RNA-Seq data: a novel optimized
deep learning approach, IEEE Access, Vol. 8, pp. 22874-22883
H. S. Shon, K. O. Kim, E. J. Cha, K. A. Kim, 2020, Classification of Kidney Cancer
Data based on Feature Extraction Methods, The Transactions of the Korean Institute
of Electrical Engineers, Vol. 69, No. 7, pp. 1061-1066
H. S. Shon, E. Batbaatar, E. J. Cha, T. G. Kang, S. G. Choi, K. A. Kim, 2022, Deep
Autoencoder based Classification for Clinical Prediction of Kidney Cancer, The Transactions
of the Korean Institute of Electrical Engineers, Vol. 71, No. 10, pp. 1393-1404
H. S. Shon, E. Batbaatar, K. O. Kim, E. J. Cha, K. A. Kim, 2020, Classification of
kidney cancer data using cost-sensitive hybrid deep learning approach, Symmetry, Vol.
12, No. 1, pp. 1-21
Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014, Deep generative stochastic networks
trainable by backprop, Proceeding of the 31st International Conference on Machine
Learning, Vol. 32, pp. 226-234
B. Kalaiselvi, M. Thangamani, 2020, An efficient Pearson correlation based improved
random forest classification for protein structure prediction techniques, Measurement,
Vol. 162
I. Jain, V. K. Jain, R. Jain, 2018, Correlation feature selection based improved-binary
particle swarm optimization for gene selection and cancer classification, Applied
Soft Computing, Vol. 62, pp. 203-215
Z. M. Hira, D. F. Gillies, 2015, A review of feature selection and feature extraction
methods applied on microarray data, Adv Bioinformatics, Vol. 2015
R. Tibshirani, 1996, Regression shrinkage and selection via the lasso, Journal of
the Royal Statistical B, Vol. 58, No. 1, pp. 267-288
E. R. Girden, 1992, ANOVA: Repeated Measures, Sage
I. Guyon, M. Nikravesh, S. Gunn, L. A. Zadeh, 2008, Feature extraction: foundations
and applications, Springer
A. Nakra, M. Duhan, 2020, Feature Extraction and Dimensionality Reduction Techniques with Their Advantages and Disadvantages for EEG-Based BCI System: A Review, IUP Journal of Computer Sciences, Vol. 14
X. Zhang, W. Yang, X. Tang, J. Liu, 2018, A fast learning method for accurate and robust lane detection using two-stage feature extraction with YOLO v3, Sensors, Vol. 18, No. 12, Art. 4308
M. Oravec, 2014, Feature extraction and classification by machine learning methods
for biometric recognition of face and iris, Proceedings of ELMAR-2014
P. Baldi, 2011, Autoencoders, unsupervised learning, and deep architectures, Proceedings of Machine Learning Research, Vol. 27
M. Sewak, S. K. Sahay, H. Rathore, 2020, An overview of deep learning architecture
of deep neural networks and autoencoders, Journal of Computational and Theoretical
Nanoscience, Vol. 17, No. 1, pp. 182-188
L. Weng, From Autoencoder to Beta-VAE, Available from: https://lilianweng.github.io/posts/2018-08-12-vae/
D. P. Kingma, M. Welling, 2013, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114
S. M. A. Elrahman, A. Abraham, 2013, A review of class imbalance problem, Journal
of Network and Innovative Computing, Vol. 1, pp. 332-340
K. M. Hasib, M. Iqbal, F. M. Shah, J. A. Mahmud, M. H. Popel, M. Showrov, S. Ahmed, O. Rahman, 2020, A survey of methods for managing the classification and solution of data imbalance problem, arXiv preprint arXiv:2012.11870
D. Li, C. Liu, S. C. Hu, 2010, A learning method for the class imbalance problem with
medical data sets, Computers in biology and medicine, Vol. 40, No. 5, pp. 509-518
N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, 2002, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357
D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, M. Klein, 2002, Logistic regression,
Springer
W. S. Noble, 2006, What is a support vector machine?, Nature biotechnology, Vol. 24,
No. 12, pp. 1565-1567
B. Charbuty, A. Abdulazeez, 2021, Classification based on decision tree algorithm
for machine learning, Journal of Applied Science and Technology Trends, Vol. 2, No.
1, pp. 20-28
Y. Qi, 2012, Random forest for bioinformatics. Ensemble machine learning: Methods
and applications, Springer, pp. 307-323
N. S. Altman, 1992, An introduction to kernel and nearest-neighbor nonparametric regression,
The American Statistician, Vol. 46, No. 3, pp. 175-185
K. P. Murphy, 2006, Naive bayes classifiers, University of British Columbia, Vol.
18, No. 60, pp. 1-8
T. Hastie, S. Rosset, J. Zhu, H. Zou, 2009, Multi-class adaboost, Statistics and its
Interface, Vol. 2, No. 3, pp. 349-360
T. Chen, C. Guestrin, 2016, XGBoost: A scalable tree boosting system, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794
S. Ruder, 2016, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747
2020, Performance evaluation of supervised machine learning algorithms in prediction of heart disease, 2020 IEEE International Conference for Innovation in Technology, IEEE
M. Viscaino, J. T. Bustos, P. Muñoz, C. A. Cheein, F. A. Cheein, 2021, Artificial
intelligence for the early detection of colorectal cancer: A comprehensive review
of its advantages and misconceptions, World Journal of Gastroenterology, Vol. 27,
No. 38, pp. 6399-
A. N. Richter, T. M. Khoshgoftaar, 2018, A review of statistical and machine learning methods for modeling cancer risk using structured clinical data, Artificial Intelligence in Medicine, Vol. 90, pp. 1-14
S. R. Stahlschmidt, B. Ulfenborg, J. Synnergren, 2022, Multimodal deep learning for biomedical data fusion: a review, Briefings in Bioinformatics, Vol. 23, No. 2, Art. bbab569
Author Biographies
Ho Sun Shon
2010 : Ph.D. in Computer Science, Chungbuk National University, Korea.
2012 to present : Visiting professor, Medical Research Institute, School of Medicine, Chungbuk National University, Korea.
Kong Vungsovanreach
2022 to present : Ph.D. student in Big Data, Chungbuk National University, Korea.
Research interests : AI-driven techniques in object detection, recognition, segmentation, classification, and data analytics.
Seok Joong Yun
2004 : Ph.D. in Medicine, Chungbuk National University, Korea.
2005 to present : Professor, Department of Urology, College of Medicine, Chungbuk National University, Korea.
2008-2009 : Visiting Professor, Department of Cancer Biology, MD Anderson Cancer Center, Houston, Texas.
Jin Woo Oh
2020 to present : Graduate student, Department of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.
Tae Gun Kang
2000 : Ph.D. in Industrial Engineering, Dongguk University, Korea.
2021 to present : Research professor, Institute for Trauma Research, College of Medicine, Korea University, Korea.
Kyung Ah Kim
2001 : Ph.D. in Biomedical Engineering, Chungbuk National University, Korea.
2005 to present : Professor, Department of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.