Ho Sun Shon¹, Kong Vungsovanreach², Seok Joong Yun³, Jin Woo Oh⁴, Tae Gun Kang⁵, Kyung Ah Kim†
¹ Medical Research Institute, College of Medicine, Chungbuk National University, Korea.
² Dept. of Big Data, Chungbuk National University, Korea.
³ Dept. of Urology, College of Medicine, Chungbuk National University and Chungbuk National University Hospital, Korea.
⁴ Dept. of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.
⁵ Institute for Trauma Research, College of Medicine, Korea University, Korea.
† Corresponding author
Copyright © The Korean Institute of Electrical Engineers (KIEE)
Key words
Kidney tumor, Stage classification, Deep learning, Feature extraction
1. Introduction
Analyzing genomic data on the gene expression underlying biological phenomena is challenging because the number of genes far exceeds the number of patients. Recently, various studies have been conducted utilizing such bio data. In particular, the efficiency, accuracy, and speed of research are being enhanced by AI technology in the field of biotechnology for data storage, purification, and analysis (1).
We aim to contribute to the early diagnosis, prognosis, and prediction of cancer in
patients by extracting significant genes using genetic data from kidney cancer and
developing a classification model based on the extracted genes. Kidney cancer is a
rapidly increasing cancer and is often referred to as a silent cancer. The likelihood that the classic symptoms, flank pain, hematuria (blood in the urine), and a palpable abdominal mass, will all appear is only 10-15%. Kidney cancer typically presents with no noticeable symptoms, and in 3 out of 10 cases it has already metastasized to other organs by the time it is found.
Therefore, implementing appropriate treatment based on the tumor stage of kidney cancer
patients is a crucial task that demands a strategic approach.
Kidney cancer is a primary tumor that occurs in the kidney, and renal cell carcinoma,
a malignant tumor, accounts for more than 90% of cases. Kidney cancer typically does
not show symptoms in the early stages, often already reaching a progressive stage
by the time of diagnosis. According to data released by the National Cancer Information
Center in 2022, kidney cancer accounted for 2.4% of all new cancer cases in Korea
in 2020, ranking 10th in incidence (2). The incidence of kidney cancer is higher in men than in women, and it occurs most frequently
in people in their 60s. Additionally, kidney cancer places a significant disease burden
due to a decline in the quality of life resulting from disease symptoms, treatment-related
adverse events, and the subsequent increase in medical costs. Risk factors for kidney
cancer include environmental habits, lifestyle factors, genetic predispositions, and
existing kidney disease. Among these, lifestyle factors like smoking, obesity, high
blood pressure, and dietary habits can be contributing causes (3). Recently, Korean researchers developed an algorithm to predict kidney cancer recurrence, and ongoing research focuses on extracting features and implementing classification algorithms using neighborhood component analysis and genomic data (4)-(6). Machine learning algorithms are being applied to various biodata analyses,
including RNA sequencing, DNA methylation analysis in breast invasive carcinoma, thyroid
carcinoma, and kidney renal papillary cell carcinoma data from The Cancer Genome Atlas
(TCGA) (7). A data mining algorithm was employed to extract cancer-related genes by integrating
the data. Additionally, a study predicted the risk of 20 cancers by applying machine
learning techniques to analyze genetic big data (8). A Bayesian classifier has been utilized to classify proteins based on sequence and
structure information, enhancing the functional prediction performance of genes by
integrating diverse protein and gene-related information using a Bayesian network
(9). Research efforts have also been directed towards accurately predicting major mutations
responsible for spinal muscular atrophy, hereditary nasal polyposis, colorectal cancer,
and autism. This is achieved by applying deep learning technology to predict the patient's
disease state through the analysis of mutations present in the gene sequence (10). Various methods for extracting features from gene expression data have been studied, most recently methods based on deep learning and statistical techniques (11)-(13). Deep learning algorithms are also widely used for feature extraction from biomedical images (14).
In this study, we extracted significant genes from gene expression datasets using
two algorithms: autoencoder (AE) and variational autoencoder (VAE). We then compared
and analyzed the tumor stage classification performance of kidney cancer. Classification
analysis based on tumor stage allows for the analysis of complex data, such as gene
expression data, and improves classification accuracy. This approach can serve as
a foundation for analyzing other gene expression data, and various machine learning
algorithms can be employed to analyze medical data.
2. Materials and methods
This section describes the dataset as well as all of the techniques applied in this
study. Fig. 1 depicts the overall research flow, from dataset collection to model
evaluation.
Fig. 1. An end-to-end experimental flow of our deep learning framework used for staging
the kidney tumor
2.1 Dataset
This study used a gene expression dataset obtained from the TCGA website, which
provides access to a wide range of biomedical datasets, including mRNA data. The dataset,
which included information from 1,157 kidney cancer patients, was meticulously prepared
for analysis. Although the original dataset included both gene expression and clinical
data, only gene expression data was used in this study. To ensure accuracy and consistency,
the dataset was cleaned and preprocessed to remove missing, duplicate, and invalid
values, as well as clinical data. Tables 1 and 2 provide detailed statistics about the dataset, including the number of samples for each stage and information on the data features. The "Before Cleaning" row refers
to the original dataset, which was downloaded from the TCGA website without any preprocessing,
whereas the "After Cleaning" row refers to the dataset after various cleaning and
preprocessing techniques were applied. This meticulous preparation of the gene expression
dataset contributed to the subsequent analyses producing reliable and meaningful results.
Table 1. Tumor stage data statistics before and after data cleaning
Condition       | Stage1 | Stage2 | Stage3 | Stage4 | Invalid | Total
Before Cleaning | 528    | 183    | 261    | 146    | 39      | 1,157
After Cleaning  | 477    | 153    | 228    | 115    | 0       | 973
Table 2. Gene expression data statistics before and after data cleaning
Condition       | Number of samples | Gene expression features | Clinical data features | Total features
Before Cleaning | 1,157             | 60,483                   | 29                     | 60,512
After Cleaning  | 973               | 58,722                   | 0                      | 58,722
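As an illustration of this cleaning step, a minimal pandas sketch is given below. The file name and the column conventions (a tumor_stage label and clinical_* column prefixes) are hypothetical, since the exact TCGA download format is not reproduced here.

```python
# Hypothetical sketch of the cleaning step; file and column names are assumed.
import pandas as pd

df = pd.read_csv("tcga_kidney_expression.csv")  # assumed file: 1,157 samples

# Keep only rows whose tumor stage is one of the four valid stages.
valid_stages = {"Stage1", "Stage2", "Stage3", "Stage4"}
df = df[df["tumor_stage"].isin(valid_stages)]

# Drop duplicate samples and any gene columns containing missing values.
df = df.drop_duplicates()
df = df.dropna(axis=1)

# Discard clinical columns so that only gene expression features remain.
clinical_cols = [c for c in df.columns if c.startswith("clinical_")]
df = df.drop(columns=clinical_cols)

print(df.shape)  # expected to approach (973, 58722) plus the stage label
```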
2.2 Correlation test
Naturally, a gene expression dataset consists of a large number of features, and each
feature represents a unique gene of a patient. Among this massive number of features,
some are highly correlated, while others are weakly correlated or not correlated at all. By performing a correlation test on the dataset, we can significantly improve the performance of the classification models (15), since redundant features can hamper them. We therefore used a Pearson correlation coefficient-based method to identify and remove redundant features, keeping only the 1,000 least redundant features. The Pearson correlation coefficient is defined as:
$r=\frac{\sum_{i}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i}\left(x_i-\bar{x}\right)^2 \sum_{i}\left(y_i-\bar{y}\right)^2}}$
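In code, this filtering step might look as follows. The redundancy of each feature is scored here by its strongest absolute correlation with any other feature; the exact ranking rule used in our pipeline may differ in detail, and for tens of thousands of genes the correlation matrix would need to be computed in chunks.

```python
# Sketch: keep the k features with the lowest redundancy, where redundancy is
# a feature's strongest absolute Pearson correlation with any other feature.
# X is assumed to be a samples-by-genes pandas DataFrame.
import numpy as np
import pandas as pd

def least_redundant(X: pd.DataFrame, k: int = 1000) -> pd.DataFrame:
    corr = X.corr(method="pearson").abs()   # pairwise |r| between features
    np.fill_diagonal(corr.values, 0.0)      # ignore self-correlation
    redundancy = corr.max(axis=1)           # strongest correlation per feature
    keep = redundancy.nsmallest(k).index    # k least redundant features
    return X[keep]

X_filtered = least_redundant(X, k=1000)
```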
2.3 Feature selection
After the features were filtered using the correlation test, feature selection was applied as a second method to reduce the dimensionality of the features.
Feature selection can help in getting rid of irrelevant data as well as dealing with
noise in the dataset (16), (17). Two feature selection techniques were selected and combined to choose a list of
highly relevant features from thousands of features. These techniques are Least Absolute
Shrinkage and Selection Operator (LASSO) and Analysis of Variance (ANOVA). LASSO is a linear regression method that uses shrinkage to drive regression coefficients toward zero (18). ANOVA is a well-known statistical hypothesis test used to determine whether there are significant differences in the means of two or more groups (19).
Three experiments, using only LASSO, using only ANOVA, and combining LASSO and ANOVA, were conducted to find the feature selection strategy that gave the best classification result. By evaluating the results produced by these
three strategies, we can observe that combining LASSO with ANOVA can improve classification
performance. We combined these two techniques' results by selecting the first 500
relevant features from the intersection of both results.
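The combination step can be sketched with scikit-learn as follows; the LASSO alpha value and the use of the integer stage label as a regression target for LASSO are illustrative assumptions, not settings reported in the text.

```python
# Sketch of combining LASSO and ANOVA feature selection; alpha is assumed.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import Lasso

# LASSO: keep features whose coefficients were not shrunk to zero
# (the integer stage label y is treated as a regression target here).
lasso = Lasso(alpha=0.01).fit(X_filtered, y)
lasso_idx = set(np.flatnonzero(lasso.coef_))

# ANOVA: rank features by the F-statistic across the four stage groups.
anova = SelectKBest(score_func=f_classif, k="all").fit(X_filtered, y)
anova_rank = np.argsort(anova.scores_)[::-1]

# Intersect both results and take the first 500 features in ANOVA order.
selected = [i for i in anova_rank if i in lasso_idx][:500]
X_selected = X_filtered.iloc[:, selected]
```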
2.4 Feature extraction
When a dataset contains many features, two popular methods can be used for dimension reduction: the feature selection technique described earlier and feature extraction. The main goal of feature extraction is to create a smaller version of the original dataset while keeping the meaning of the original dataset (20). Besides dimension reduction, faster training and lower computational cost are well-known advantages of feature extraction (21)-(23). This research used two feature extraction techniques: AE and VAE.
2.4.1 AE
An AE is an unsupervised artificial neural network that compresses and encodes the data and then reconstructs the original data from the encoded representation (24). It is widely used for dimension reduction and noise removal. This neural network consists of three components: the encoder, the bottleneck, and the decoder (25). The encoder compresses the original dataset into a lower-dimensional version while trying to preserve its original meaning. The bottleneck stores the compressed version of the data. The decoder decompresses the data from the bottleneck to reconstruct the original data. The AE learns by observing the reconstruction error and attempting to minimize it so that the original and reconstructed data look as similar as possible.
Fig. 2. The architecture of the AE model (26)
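A minimal PyTorch sketch of this encoder-bottleneck-decoder structure is shown below; the 256-unit hidden layer, learning rate, and epoch count are assumptions, while the bottleneck matches the 50 dimensions used in this study.

```python
# Minimal AE sketch; hidden size and training settings are assumptions.
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, n_in=500, n_latent=50):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 256), nn.ReLU(),
            nn.Linear(256, n_latent),        # bottleneck: compressed features
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_in),            # reconstruction of the input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # reconstruction error

# x_train: float tensor of shape (n_samples, 500) with the selected features.
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x_train), x_train)  # minimize reconstruction error
    loss.backward()
    opt.step()
```

After training, the 50-dimensional encoder output serves as the extracted feature set passed to the classifiers.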
2.4.2 VAE
A VAE follows the same encoder-bottleneck-decoder structure but modifies the bottleneck component to make the model generative. A VAE learns the probability distribution of the data instead of mapping each input to a fixed point (27). Two latent vectors, the mean and the variance, are learned in the bottleneck component, so new data can be generated by sampling a random value from that distribution.
Fig. 3. The architecture of the VAE model (26)
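The following sketch, mirroring the AE above, shows how the bottleneck learns a mean and a log-variance and samples from them via the reparameterization trick; the layer sizes are again assumptions.

```python
# VAE sketch: the bottleneck outputs a mean and log-variance instead of a
# single code, and the latent vector is sampled from that distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_in=500, n_latent=50):
        super().__init__()
        self.enc = nn.Linear(n_in, 256)
        self.fc_mu = nn.Linear(256, n_latent)       # latent mean
        self.fc_logvar = nn.Linear(256, n_latent)   # latent log-variance
        self.dec = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(), nn.Linear(256, n_in)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * epsilon.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    mse = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kld
```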
2.5 Synthetic minority over-sampling technique
Our dataset is skewed or imbalanced, which means that the number of observations for
each class is not equal or close to each other. Table 1, which describes the data statistics, shows that tumor stage 1 observations cover around 50% of the dataset, leaving the other 50% for the remaining three tumor stages. Imbalanced data can cause problems for machine learning models by biasing the model toward classes with more samples (28)-(30). To solve this problem, we applied an oversampling technique called the Synthetic Minority Over-sampling Technique (SMOTE) to generate a resampled dataset containing the same number of observations for all classes (31). Prior to applying SMOTE, the training dataset exhibited an unequal distribution across
the kidney tumor stages. Stage 1 had the highest count with 312 records, followed
by Stage 3 with 189, Stage 2 with 105, and Stage 4 being the least represented with
76 instances. In a bid to rectify this imbalance, we applied the SMOTE technique with
the "auto" hyperparameter, which adopts an adaptive sampling strategy. After the SMOTE
application, the total count of records for all tumor stages combined equaled 1248,
with each individual stage-Stage 1 to Stage 4-having a balanced representation of
312 records.
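With the imbalanced-learn package, this resampling step reduces to a few lines; the random_state value below is an arbitrary choice added for reproducibility.

```python
# Oversample the training split so every stage has 312 records.
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy="auto", random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
# Before: Stage1=312, Stage2=105, Stage3=189, Stage4=76 (total 682).
# After:  312 records per stage (total 1,248).
```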
2.6 Classifiers
We applied nine widely used classification algorithms, allowing us to assess how well they performed against one another. These methods are logistic regression (LR) (32), support vector machine (SVM) (33), decision tree (DT) (34), random forest (RF) (35), k-nearest neighbor (KNN) (36), naïve Bayes (NB) (37), AdaBoost (ADA) (38), XGBoost (XGB) (39), and the stochastic gradient descent classifier (SGD) (40). For the selected classifiers, the hyperparameter configurations are as follows:
For LR, the settings are C=1, penalty="l2", and solver="liblinear". SVM is set with
C=1, kernel="rbf", and gamma="scale". DT is configured with max_depth=5. RF uses n_estimators=100
and max_depth=5. KNN utilizes n_neighbors=5. NB operates with default settings. ADA
uses n_estimators=50. XGB is set with n_estimators=100 and learning_rate=0.1. SGD uses max_iter=1000 and tol=1e-3 and is wrapped in probability calibration.
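These configurations can be expressed with scikit-learn and the xgboost package as sketched below; probability=True for the SVM is an added assumption, needed so that class probabilities (and hence AUC) can be computed.

```python
# The reported classifier configurations, expressed as a sketch.
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

classifiers = {
    "LR":  LogisticRegression(C=1, penalty="l2", solver="liblinear"),
    "SVM": SVC(C=1, kernel="rbf", gamma="scale", probability=True),
    "DT":  DecisionTreeClassifier(max_depth=5),
    "RF":  RandomForestClassifier(n_estimators=100, max_depth=5),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB":  GaussianNB(),          # default settings
    "ADA": AdaBoostClassifier(n_estimators=50),
    "XGB": XGBClassifier(n_estimators=100, learning_rate=0.1),
    # SGD is wrapped in probability calibration so predict_proba is available.
    "SGD": CalibratedClassifierCV(SGDClassifier(max_iter=1000, tol=1e-3)),
}
```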
2.7 Model evaluation metrics
This section discusses the evaluation metrics for all the classifiers listed above.
The seven most popular evaluation metrics were used to compare the performance of
those classifiers, including accuracy, recall, precision, the f1-score, sensitivity,
specificity, and area under the curve (AUC) (41). To balance the model performance across classes, we use macro-averaged precision,
recall, and f1-score to get the overall average value for each metric. Besides these metrics, sensitivity and specificity were also calculated to capture the true positive and true negative rates, respectively. In the formulas below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
$\begin{aligned}
\text{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN} \\
\text{Precision} &= \frac{TP}{TP+FP} \\
\text{Recall} &= \frac{TP}{TP+FN} \\
\text{F1-score} &= \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall}+\text{Precision}} \\
\text{Sensitivity} &= \frac{TP}{TP+FN} \\
\text{Specificity} &= \frac{TN}{TN+FP}
\end{aligned}$
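A sketch of how these metrics can be computed for the four-class problem with scikit-learn follows; since scikit-learn provides no specificity function, it is derived from the per-class confusion matrix and macro-averaged, matching the treatment of the other metrics.

```python
# Macro-averaged metrics for the four-stage classification problem.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_proba):
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)                    # per-class true positives
    fp = cm.sum(axis=0) - tp            # per-class false positives
    fn = cm.sum(axis=1) - tp            # per-class false negatives
    tn = cm.sum() - (tp + fp + fn)      # per-class true negatives
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "recall":      recall_score(y_true, y_pred, average="macro"),
        "precision":   precision_score(y_true, y_pred, average="macro"),
        "f1":          f1_score(y_true, y_pred, average="macro"),
        "sensitivity": np.mean(tp / (tp + fn)),   # macro true-positive rate
        "specificity": np.mean(tn / (tn + fp)),   # macro true-negative rate
        "auc":         roc_auc_score(y_true, y_proba, multi_class="ovr"),
    }
```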
3. Experiment results
This section presents the result of the experiment from the beginning to the end,
including feature extraction with deep learning techniques and classification results
with multiple classification algorithms.
3.1 Feature extraction with deep learning
We reduced the total number of features from 58,722 to 1,000 by filtering with the correlation test, and then from 1,000 to 500 by applying the feature selection techniques. From these 500 features, we applied feature extraction to reduce the dataset dimension to 50 features. We used two feature extraction algorithms, AE and VAE, to conduct the experiment. Based on the experimental results, the original and reconstructed datasets were observed to be similar, which demonstrates the effectiveness of the feature extraction. Fig. 4 shows the models' loss values, and Fig. 5 visualizes the reconstructed data points compared to the original data points when using the AE and VAE models. Of the two algorithms, the AE model produced the reconstructed dataset most similar to the original dataset.
Fig. 4. Training and validation loss of (a) AE and (b) VAE as feature extraction techniques
Fig. 5. Sample data reconstruction using (a) AE and (b) VAE
3.2 Classification results
This section shows the results of the nine machine learning classifiers used to classify patient kidney tumor stages. Before training the classifiers, we preprocessed the data with correlation analysis, feature selection, feature extraction, and oversampling to improve classification performance. Tables 3 and 4 report the evaluation metrics of all classifiers when using the AE and VAE, respectively, for feature extraction, with and without the oversampling technique applied. In these two tables, the "Sampling" column indicates whether the oversampling technique was applied.
Table 3 shows the performance of all classifiers when using the AE model to extract the features. In this case, the support vector machine outperformed the other classifiers in most evaluation metrics, with XGBoost, naïve Bayes, and decision tree showing the next best results. We can see that SVM improves significantly after applying the oversampling technique to the dataset.
Table 3. Evaluation of prediction models using AE as feature extraction
Classifier | Sampling | Accuracy | Recall | Precision | F1-Score | Sensitivity | Specificity | AUC
LR         | Yes      | 0.890    | 0.860  | 0.863     | 0.858    | 0.860       | 0.964       | 0.968
LR         | No       | 0.873    | 0.814  | 0.844     | 0.825    | 0.814       | 0.957       | 0.955
SVM        | Yes      | 0.984    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.985
SVM        | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.973
DT         | Yes      | 0.964    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.983
DT         | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.960
RF         | Yes      | 0.952    | 0.945  | 0.945     | 0.944    | 0.945       | 0.983       | 0.983
RF         | No       | 0.904    | 0.847  | 0.903     | 0.860    | 0.847       | 0.966       | 0.958
KNN        | Yes      | 0.846    | 0.829  | 0.814     | 0.817    | 0.829       | 0.950       | 0.948
KNN        | No       | 0.822    | 0.762  | 0.794     | 0.769    | 0.762       | 0.939       | 0.925
NB         | Yes      | 0.973    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.983
NB         | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.960
ADA        | Yes      | 0.753    | 0.728  | 0.587     | 0.630    | 0.728       | 0.925       | 0.930
ADA        | No       | 0.829    | 0.726  | 0.639     | 0.669    | 0.726       | 0.942       | 0.912
XGB        | Yes      | 0.963    | 0.974  | 0.987     | 0.980    | 0.974       | 0.992       | 0.983
XGB        | No       | 0.959    | 0.936  | 0.963     | 0.948    | 0.936       | 0.984       | 0.963
SGD        | Yes      | 0.870    | 0.838  | 0.839     | 0.831    | 0.838       | 0.958       | 0.953
SGD        | No       | 0.884    | 0.826  | 0.867     | 0.838    | 0.826       | 0.959       | 0.950
Moreover, a different result is shown in Table 4, which reports the performance when the VAE was used to extract the features. The performance changed when we changed the feature extraction algorithm; in this case, it decreased compared to the AE model. The table shows that SVM still outperforms most of the classifiers in terms of AUC value, followed by XGBoost, naïve Bayes, and decision tree.
Table 4. Evaluation of prediction models using VAE as feature extraction
Classifier | Sampling | Accuracy | Recall | Precision | F1-Score | Sensitivity | Specificity | AUC
LR         | Yes      | 0.863    | 0.810  | 0.865     | 0.833    | 0.810       | 0.949       | 0.928
LR         | No       | 0.849    | 0.812  | 0.834     | 0.815    | 0.812       | 0.946       | 0.918
SVM        | Yes      | 0.918    | 0.886  | 0.929     | 0.904    | 0.886       | 0.967       | 0.943
SVM        | No       | 0.918    | 0.886  | 0.929     | 0.904    | 0.886       | 0.967       | 0.945
DT         | Yes      | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
DT         | No       | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
RF         | Yes      | 0.870    | 0.844  | 0.888     | 0.830    | 0.844       | 0.952       | 0.935
RF         | No       | 0.901    | 0.877  | 0.912     | 0.880    | 0.877       | 0.961       | 0.938
KNN        | Yes      | 0.678    | 0.590  | 0.654     | 0.581    | 0.590       | 0.885       | 0.843
KNN        | No       | 0.634    | 0.610  | 0.611     | 0.588    | 0.610       | 0.880       | 0.818
NB         | Yes      | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
NB         | No       | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
ADA        | Yes      | 0.791    | 0.653  | 0.619     | 0.610    | 0.653       | 0.927       | 0.875
ADA        | No       | 0.702    | 0.653  | 0.559     | 0.553    | 0.653       | 0.906       | 0.875
XGB        | Yes      | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.915
XGB        | No       | 0.918    | 0.903  | 0.929     | 0.903    | 0.903       | 0.968       | 0.935
SGD        | Yes      | 0.860    | 0.794  | 0.869     | 0.810    | 0.794       | 0.949       | 0.915
SGD        | No       | 0.829    | 0.772  | 0.808     | 0.774    | 0.772       | 0.941       | 0.901
Tables 3 and 4 show the evaluation results of all classifiers as an average over the four tumor classes. Fig. 6 adds more detail to the area under the curve (AUC) values listed in the above tables by showing the receiver operating characteristic (ROC) curve and AUC value for each tumor stage, without averaging. We only show the ROC curves for the case where the AE is used for feature extraction, since we already observed that the AE outperformed the other feature extraction method.
Fig. 6. ROC and AUC for each classifier when using AE as the feature extraction and
SMOTE as an oversampling
4. Discussion and conclusion
The survival rate of kidney cancer patients who receive early treatment is higher than that of those treated at an advanced stage (42). This is why technology is widely used to help patients receive health information and treatment as soon as possible, based on patient data such as clinical and biomedical data (43), (44).
In this study, we used different algorithms to detect and classify the stage of kidney
tumors. An open dataset published on the TCGA portal was cleaned, preprocessed, and
used to build classification models. A correlation analysis, feature extraction, feature
selection, and oversampling technique were also used for the purpose of boosting the
classifiers’ performance.
With this gene expression dataset, we observed from the results that the AE outperformed
the VAE as a feature extraction technique. Classifier performance has significantly
improved in most evaluation metrics, including accuracy, recall, precision, and f1-score.
So, feature extraction and feature selection can improve classification models by
reducing the dimensionality of the input data, removing irrelevant and redundant features,
and highlighting the most important features for the specific task. This leads to
more efficient and accurate models and faster training and inference times. The above
result shows the enhancement of the classification model when different techniques,
including feature selection, feature extraction, and resampling, are applied.
In conclusion, the classification of kidney tumor stages using feature extraction
techniques such as AE and VAE in conjunction with various classifiers such as LR,
RF, DT, SVM, and others has been shown to be an effective method for accurately classifying
kidney tumors. The combination of feature extraction techniques and classifiers was used to extract the most important features from the gene expression data and build models that can accurately classify tumors into different stages.
The results obtained from these methods have been promising, with high accuracy and
precision in classifying tumor stages. This is particularly important for the early
detection and treatment of kidney tumors, which can improve patient outcomes and reduce
healthcare costs. Furthermore, using multiple classifiers such as LR, RF, DT, and
SVM allows us to compare the performance of the different models and select the best
one for a specific dataset.
To continue contributing to the medical sector, we plan to improve classification
performance by combining both clinical and biomedical data of patients for training
classifiers. Moreover, we plan to detect and identify the list of genes that contribute the most to kidney cancer. As a future study, we intend to utilize explainable artificial intelligence (XAI) to explain which features of the gene expression data contributed to the predictive models, so that users can understand and trust the results of machine learning analysis. XAI can be used to describe AI models, their expected impacts, and potential biases, and is useful for characterizing model accuracy, fairness, transparency, and outcomes in AI-based decision-making. With an accurate list of essential genes, a doctor could pay more attention to those genes rather than spending time on less essential ones.
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00245300, No. 2020R1I1A1A01065199, 2020R1I1A3062508) and by the "Regional Innovation Strategy (RIS)" program through the NRF funded by the MOE (2021RIS-001).
References
L. A. Gottlieb, A. Kontorovich, R. Krauthgamer, 2016, Adaptive metric dimensionality
reduction, Theoretical Computer Science, Vol. 620, pp. 105-118
National Cancer Center. Available online: https://ncc.re.kr/index (accessed 17 August 2023)
H. Chi, I. H. Chang, 2018, The Overdiagnosis of Kidney Cancer in Koreans and the Active
Surveillance on Small Renal Mass, Korean J Urol Oncol, Vol. 16, No. 1, pp. 15-24
A. M. Ali, H. Zhuang, A. Ibrahim, O. Rehman, M. Huang, A. Wu, Nov. 2018, A machine
learning approach for the classification of kidney cancer subtypes using miRNA genome
data, Appl Sci, Vol. 8, No. 2422, pp. 1-14
H. M. Kim, S. J. Lee, S. J. Park, I. Y. Choi, S. Hong, 2021, Machine Learning Approach
to Predict the Probability of Recurrence of Renal Cell Carcinoma After Surgery: Prediction
Model Development Study, JMIR Med Inform, Vol. 9, No. 3
A. J. Peired, R. Campi, M. L. Angelotti, G. Antonelli, C. Conte, E. Lazzeri, F. Becherucci, L. Calistri, S. Serni, P. Romagnani, 2021, Sex and Gender Differences in Kidney Cancer: Clinical and Experimental Evidence, Cancers, Vol. 13, No. 18, Art. 4588
Genomic Data Commons. Available online: https://portal.gdc.cancer.gov (accessed 17 August 2023)
B. J. Kim, S. H. Kim, 2018, Prediction of inherited genomic susceptibility to 20 common
cancer types by a supervised machine-learning method, Proc Natl Acad Sci USA, Vol.
115, No. 6, pp. 1322-1327
O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, D. Botstein, 2003, A Bayesian
framework for combining heterogeneous data sources for gene function prediction (in
S. cerevisiae), Proc Natl Acad Sci USA, Vol. 100, No. 14, pp. 8348-8353
N. E. M. Khalifa, M. H. N. Taha, D. E. Ali, A. Slowik, A. E. Hassanien, 2020, Artificial
intelligence technique for gene expression by tumor RNA-Seq data: a novel optimized
deep learning approach, IEEE Access, Vol. 8, pp. 22874-22883
H. S. Shon, K. O. Kim, E. J. Cha, K. A. Kim, 2020, Classification of Kidney Cancer
Data based on Feature Extraction Methods, The Transactions of the Korean Institute
of Electrical Engineers, Vol. 69, No. 7, pp. 1061-1066
H. S. Shon, E. Batbaatar, E. J. Cha, T. G. Kang, S. G. Choi, K. A. Kim, 2022, Deep
Autoencoder based Classification for Clinical Prediction of Kidney Cancer, The Transactions
of the Korean Institute of Electrical Engineers, Vol. 71, No. 10, pp. 1393-1404
H. S. Shon, E. Batbaatar, K. O. Kim, E. J. Cha, K. A. Kim, 2020, Classification of
kidney cancer data using cost-sensitive hybrid deep learning approach, Symmetry, Vol.
12, No. 1, pp. 1-21
Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014, Deep generative stochastic networks
trainable by backprop, Proceeding of the 31st International Conference on Machine
Learning, Vol. 32, pp. 226-234
B. Kalaiselvi, M. Thangamani, 2020, An efficient Pearson correlation based improved
random forest classification for protein structure prediction techniques, Measurement,
Vol. 162
I. Jain, V. K. Jain, R. Jain, 2018, Correlation feature selection based improved-binary
particle swarm optimization for gene selection and cancer classification, Applied
Soft Computing, Vol. 62, pp. 203-215
Z. M. Hira, D. F. Gillies, 2015, A review of feature selection and feature extraction
methods applied on microarray data, Adv Bioinformatics, Vol. 2015
R. Tibshirani, 1996, Regression shrinkage and selection via the lasso, Journal of
the Royal Statistical B, Vol. 58, No. 1, pp. 267-288
E. R. Girden, 1992, ANOVA: Repeated Measures, Sage
I. Guyon, M. Nikravesh, S. Gunn, L. A. Zadeh, 2008, Feature extraction: foundations
and applications, Springer
A. Nakra, M. Duhan, 2020, Feature Extraction and Dimensionality Reduction Techniques with Their Advantages and Disadvantages for EEG-Based BCI System: A Review, IUP Journal of Computer Sciences, Vol. 14
X. Zhang, W. Yang, X. Tang, J. Liu, 2018, A fast learning method for accurate and robust lane detection using two-stage feature extraction with YOLO v3, Sensors, Vol. 18, No. 12, Art. 4308
M. Oravec, 2014, Feature extraction and classification by machine learning methods
for biometric recognition of face and iris, Proceedings of ELMAR-2014
P. Baldi, 2011, Autoencoders, unsupervised learning, and deep architectures, Proceedings of Machine Learning Research, Vol. 27
M. Sewak, S. K. Sahay, H. Rathore, 2020, An overview of deep learning architecture
of deep neural networks and autoencoders, Journal of Computational and Theoretical
Nanoscience, Vol. 17, No. 1, pp. 182-188
L. Weng, From Autoencoder to Beta-VAE, Available from: https://lilianweng.github.io/posts/2018-08-12-vae/
D. P. Kingma, M. Welling, 2013, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114
S. M. A. Elrahman, A. Abraham, 2013, A review of class imbalance problem, Journal
of Network and Innovative Computing, Vol. 1, pp. 332-340
K. M. Hasib, M. Iqbal, F. M. Shah, J. A. Mahmud, M. H. Popel, M. Showrov, S. Ahmed, O. Rahman, 2020, A survey of methods for managing the classification and solution of data imbalance problem, arXiv preprint arXiv:2012.11870
D. Li, C. Liu, S. C. Hu, 2010, A learning method for the class imbalance problem with
medical data sets, Computers in biology and medicine, Vol. 40, No. 5, pp. 509-518
N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, 2002, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357
D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, M. Klein, 2002, Logistic regression,
Springer
W. S. Noble, 2006, What is a support vector machine?, Nature biotechnology, Vol. 24,
No. 12, pp. 1565-1567
B. Charbuty, A. Abdulazeez, 2021, Classification based on decision tree algorithm
for machine learning, Journal of Applied Science and Technology Trends, Vol. 2, No.
1, pp. 20-28
Y. Qi, 2012, Random forest for bioinformatics. Ensemble machine learning: Methods
and applications, Springer, pp. 307-323
N. S. Altman, 1992, An introduction to kernel and nearest-neighbor nonparametric regression,
The American Statistician, Vol. 46, No. 3, pp. 175-185
K. P. Murphy, 2006, Naive bayes classifiers, University of British Columbia, Vol.
18, No. 60, pp. 1-8
T. Hastie, S. Rosset, J. Zhu, H. Zou, 2009, Multi-class adaboost, Statistics and its
Interface, Vol. 2, No. 3, pp. 349-360
T. Chen, C. Guestrin, 2016, XGBoost: A scalable tree boosting system, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794
S. Ruder, 2016, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747
2020, Performance evaluation of supervised machine learning algorithms in prediction of heart disease, 2020 IEEE International Conference for Innovation in Technology, IEEE
M. Viscaino, J. T. Bustos, P. Muñoz, C. A. Cheein, F. A. Cheein, 2021, Artificial
intelligence for the early detection of colorectal cancer: A comprehensive review
of its advantages and misconceptions, World Journal of Gastroenterology, Vol. 27,
No. 38, pp. 6399-
A. N. Richter, T. M. Khoshgoftaar, 2018, A review of statistical and machine learning methods for modeling cancer risk using structured clinical data, Artificial Intelligence in Medicine, Vol. 90, pp. 1-14
S. R. Stahlschmidt, B. Ulfenborg, J. Synnergren, 2022, Multimodal deep learning for biomedical data fusion: a review, Briefings in Bioinformatics, Vol. 23, No. 2, Art. bbab569
Author Biographies
Ho Sun Shon
2010 : Ph.D. in Computer Science, Chungbuk National University, Korea.
2012 to present : Visiting professor, Medical Research Institute, School of Medicine, Chungbuk National University, Korea.
Kong Vungsovanreach
2022 to present : Ph.D. student in Big Data, Chungbuk National University, Korea.
Research interests : AI-driven techniques in object detection, recognition, segmentation, classification, and data analytics.
Seok Joong Yun
2004 : Ph.D. in Medicine, Chungbuk National University, Korea.
2005 to present : Professor, Department of Urology, College of Medicine, Chungbuk National University, Korea.
2008-2009 : Visiting Professor, Department of Cancer Biology, MD Anderson Cancer Center, Houston, Texas.
Jin Woo Oh
2020 to present : Graduate student, Department of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.
Tae Gun Kang
2000 : Ph.D. in Industrial Engineering, Dongguk University, Korea.
2021 to present : Research professor, Institute for Trauma Research, College of Medicine, Korea University, Korea.
Kyung Ah Kim
2001 : Ph.D. in Biomedical Engineering, Chungbuk National University, Korea.
2005 to present : Professor, Department of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.