1. Introduction
As Artificial Intelligence (AI) demonstrates excellent performance, various industries are developing AI models and applying them in the field. The electric power industry also applies AI to various areas based on its accumulated big data, and the AI-based algorithms show excellent performance. Not only statistical techniques such as ARIMA but also LSTM-based deep learning models are used to predict the maximum demand of electricity and maintain stability in power supply and demand. AI is also used to predict the SMP (System Marginal Price) and the maximum amount of supply, as well as a health index that predicts the remaining life of equipment based on facility sensor data. Beyond models for tabular data, AI is also being applied to natural language processing (NLP) tasks based on text data and to computer vision tasks that deal with images. A QA (Question & Answer) system for the power generation field and a legal chatbot have been developed based on various text data in the field. In computer vision, an insulator detection model based on YOLO (1) and a safety helmet recognition model for construction sites based on Mask R-CNN (2) have been developed.
Even though tremendous amounts of data have been accumulated and infrastructure for data analysis has been prepared, AI experts who can analyze real data and develop models are lacking. In the electric power field, good models can be developed only with sufficient domain knowledge due to the characteristics of the field; it is difficult for an analyst without domain knowledge to create an excellent algorithm that can be applied to the real world. Conversely, it is not easy for domain experts to develop models with a deep understanding of AI beyond simply using machine learning models. As a result, AI experts in the electric power industry are insufficient while there are many areas where AI should be applied, which causes an imbalance in the demand and supply of AI specialists. In addition, data analysis and model development take a lot of time, including the feature engineering step to find the best features, the algorithm selection step, and the hyper-parameter tuning step, making it even more difficult to apply AI to the electric power industry with a limited number of specialists.
Automated machine learning can be the solution for the electric power industry to apply the excellence of AI to many fields despite the shortage of professionals. Automated machine learning is a technology that automatically performs a series of processes of feature engineering, model selection, and hyper-parameter tuning to provide optimal models simply by entering data and defining the problem. Automated machine learning can relieve the shortage of AI experts in the electric power industry, so that optimal algorithms can be applied to many areas in a short time. Recently, Microsoft AutoML (3), DataRobot (4), and H2O (5)-(7) have shown excellent performance in the field of automated machine learning. However, due to the specificity of the data, they do not perform well in the electric power industry. Therefore, we develop an automated machine learning pipeline system specialized for the electric power industry and confirm its excellence.
2. Related Work
2.1 Automated Feature Engineering
The process of creating new features and keeping only good ones has a profound impact on the performance of a prediction model. Through various methods such as NaN processing, outlier detection, one-hot encoding, and logarithm transformation, data is modified so that its characteristics are reflected in the prediction model. Using more features than the optimal number causes most machine learning algorithms to lose performance (8), and unconditional overuse of features can degrade the model due to irrelevant variables. It is therefore very important to extract the right subset of the entire feature set through the feature selection step. Feature selection methods select important features based on the feature importance scores generated by a prediction model (8)(9), feature selection indices such as information gain (10), correlations among features (11), and recursive feature elimination (12)(13).
In an attempt to automate EDA (Exploratory Data Analysis)-based feature engineering, in which data analysts invest most of their time, various studies have been carried out, including featuretools (14), ExploreKit (15), AutoLearn (16), and autofeat (17). Although these works propose different methodologies, they are similar in that they automatically generate new features through pre-defined transformations and keep only excellent features through feature selection techniques. autofeat (17), for example, automatically generates features with pre-defined functions such as log(x), 1/x, sin, and cos. It then uses an L1-regularized linear model to extract only significant features based on the size of the coefficient of each variable: only variables whose model coefficients have a large absolute value are included in the optimal feature set. In this study, some modules of our system for feature generation and feature selection are implemented using autofeat. The proposed system adds various feature preprocessing techniques that are not provided by the existing studies, such as NaN processing, outlier removal, skewness processing, categorical variable processing, categorization of continuous variables, and text embedding. In addition, a Boruta module and a domain-knowledge-based rule insertion module are added to compensate for the insufficient feature selection functions.
2.2 Automated Algorithm Search & Hyper-parameter Tuning
Most automated algorithm search methods use heuristic approaches such as genetic algorithms and Bayesian approaches to select the optimal algorithm from the list of countless applicable algorithms. AutoSklearn (18), H2O (5)-(7), and TPOT (19) are representative packages for finding the optimal algorithm. In the case of TPOT, a genetic algorithm is used to efficiently explore the algorithm search space, and the size of the search space can be adjusted by the user (19). In this study, a subset of the machine learning algorithm search is implemented using TPOT.
In the hyper-parameter tuning step, the hyper-parameters of the selected model are optimized through various methods such as grid search, Bayesian search, and random search. By applying optimal hyper-parameters, the performance of the model can be improved.
Recently, Neural Architecture Search (NAS) has been in the spotlight in the AutoML field as a method that finds an optimal deep learning architecture from input data alone. Since NAS needs more computation resources than other AutoML fields, much research has been conducted on distributed processing and efficient search techniques to reduce computation (20)-(22). One of the most popular packages is Auto-Keras (23). It covers various areas such as image classification, image regression, text classification, structured data classification, and structured data regression based on Keras (24), and provides additional functions such as setting a limited search space in consideration of the computation time of jobs. The proposed system also includes a light-weight NAS that uses the search-space-setting function of Auto-Keras to apply common properties of existing AI models in the electric power industry to NAS and to reduce computation time.
Fig. 1. Architecture of the proposed system
3. The Proposed System
3.1 Architecture
The system is mainly composed of three parts: the feature engineering module, the model selection & optimization module, and the beam search module. In order to reduce the computation cost of searching for the optimal model, the system finds the best model using beam search (29). The beam search process is separated into two steps: the feature engineering step and the model selection & optimization step. The k most promising datasets are generated via beam-search-based optimization, and the best models are selected using those k datasets. That is, the model selection & optimization step is performed only on the k promising datasets that result from the feature engineering step.
3.2 Beam Search Module
The system optimizes the machine learning pipeline based on the beam search strategy, a kind of best-first search. Beam search explores the search space by expanding the most promising nodes in a limited set. Only a predetermined number of best partial solutions are kept as candidates, which reduces memory requirements. Beam search is thus suitable for cases that require a lot of computation resources and time, such as an automated machine learning pipeline.
In the proposed automated machine learning pipeline, each level is composed of different types of processing steps with the same purpose, and the pipeline search space is composed of a set of levels with different work attributes. The order in which the levels are applied is determined by heuristic rules from the domain knowledge database, considering the source of the data, the amount of data, the data attributes, and the type of work.

At each level, we select the k steps with the highest cross-validation score as candidate inputs of the next level. Once the steps at each level are executed, the system selects the k optimal steps over the levels performed so far. After exploring all levels based on the beam search strategy, the best machine learning pipeline, consisting of feature engineering steps and the optimal model, is created.
After the original dataset is uploaded, the search space of the tree structure is determined based on the heuristic rules of the domain knowledge database. The search begins with the steps in the first level. After all steps of the first level are done, the k steps with the highest validation score are selected and inserted into a priority queue. The k steps in the priority queue are used as the input of the steps of the next level. After the steps at the next level are completed, the priority queue is updated to hold only the k optimal steps of the levels performed so far; that is, all steps except the best k are popped.
Algorithm 1 Beam Search in the Proposed System
INPUT: Ordered Levels /* the order of levels and the steps in each level are determined in advance by the heuristic rule of the domain knowledge database */
OUTPUT: Best machine learning pipeline
/* Best_pq: priority queue that stores the k best steps */
k ← 2 /* number of steps to keep after each level */
Insert the original data into Best_pq
for Level in Ordered Levels
    for previous_best in Best_pq
        Run all steps in Level using previous_best as input
        Insert {steps, results} into Best_pq
    Pop all steps from Best_pq except the best k cases
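A minimal Python sketch of Algorithm 1 is shown below; the names ordered_levels, step, and evaluate are hypothetical stand-ins for the system's level definitions and cross-validation scorer, and real steps would also carry metadata for logging.

import heapq

def beam_search(original_data, ordered_levels, evaluate, k=2):
    # Each beam entry is (score, applied_steps, dataset).
    beam = [(evaluate(original_data), [], original_data)]
    for level in ordered_levels:
        candidates = []
        for score, steps, data in beam:
            for step in level:  # run all steps of the level
                new_data = step(data)
                candidates.append((evaluate(new_data), steps + [step], new_data))
        # keep only the k best cases of the levels performed so far
        beam = heapq.nlargest(k, beam + candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])  # the best machine learning pipeline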
Fig. 2. Beam search strategy for finding the optimal machine learning pipeline
For example, figure 2 illustrates the beam search strategy with k = 2. After level 1 is finished, the two best steps of level 1 are inserted into the priority queue, and data processed with these two steps are used as the inputs of the steps of level 2. Suppose that, after level 2 is done, the dataset with the two steps applied sequentially and the dataset with only one level-1 step applied have the best scores; then only these two lists of steps are kept in the priority queue. In this way, two optimal lists of steps are maintained for each level as the search proceeds. When the last level is finished, the list of steps with the highest score in the priority queue is adopted as the final best pipeline. The parallel processing of beam search quickly prunes unproductive searches and instead devotes resources to where the most progress is being made. All logs of the search process for finding the optimal pipeline are stored in the domain knowledge database and used to improve the heuristic rule that determines the order of the levels composing the search space of the machine learning pipeline.
3.3 Feature Engineering Module
The feature engineering module consists of a feature generation module, a feature selection module, and a scoring module. The feature generation and feature selection modules are classified by their functional characteristics, but they are executed without distinction when searching for the best dataset. The module automatically infers the type of each variable and performs a feature engineering procedure suitable for each type. Through the scoring module, the performance score of the dataset produced by each step is calculated, and the optimal dataset is searched for based on the beam search strategy using that score.
The feature generation module consists of a custom module and a non-linear feature generation module. The custom module provides feature engineering techniques that are not provided by existing packages such as AutoSklearn (18), TPOT (19), Auto-Keras (23), and autofeat (17). The above packages will not run if NaN values are present. In addition, automatic feature generation packages including autofeat (17) only generate features through pre-defined data transformations, and their transformation targets must be numeric, which limits their usage. Unlike these packages, the proposed system provides various feature generation functions such as embedding text data, categorizing continuous variables, and creating embedding variables in various ways such as clustering. The custom module applies more than 30 data preprocessing strategies of 10 types to create new dataset candidates that can bring out the best performance. In addition, it refines the dataset to reduce manual steps and operates robustly through exception handling.
The non-linear feature generation module provides functions to generate non-linear features such as log(x), √x, 1/x, x², x³, |x|, exp(x), 2^x, sin(x), and cos(x), and to apply arithmetic operators to pairs of features. The module uses autofeat to create new non-linear features exponentially (17). It helps find features with hidden insight that capture the non-linearity of the data.
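As a rough illustration of this kind of pre-defined transformation (a sketch in the spirit of the module, not the autofeat internals), the following snippet expands every numeric column with a few of the listed functions:

import numpy as np
import pandas as pd

def nonlinear_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in df.select_dtypes(include=np.number).columns:
        x = df[col]
        out[f"log({col})"] = np.log(x.where(x > 0))    # defined only for x > 0
        out[f"sqrt({col})"] = np.sqrt(x.clip(lower=0))
        out[f"{col}^2"] = x ** 2
        out[f"abs({col})"] = x.abs()
        out[f"sin({col})"] = np.sin(x)
    return out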
Table 1. Custom module of feature generation

NaN processing
- Fill with median
- Fill with mean
- Fill with a specific value (e.g. 0)
- Propagate last valid observation
- Fill with next valid observation
- Drop rows with NaN

Outlier processing
- Boxplot
- Isolation forest with considering class
- Isolation forest without considering class
- KNN
- DBSCAN
- Z-score

Lagging processing
- Generate k lagged data

Skewness processing
- Box-Cox
- Root transformation
- Logarithm transformation

Encoding of categorical variables
- One-hot encoding
- Ordinal encoding

Categorization of numeric variables
- Discretize into buckets of equal size
- Bin variables into predefined buckets

Class imbalance processing
- Variations of SMOTE
- Random under-sampling
- Random over-sampling
- Adaptive synthetic over-sampling
- Under-sampling with Tomek's links

Clustering based feature generation
- K-means
- DBSCAN
- HDBSCAN

Date type processing
- Extract year, month, day, day of week, weekday, weekend

Text embedding
- Bag-of-words vectorization
- TF-IDF vectorization
- Pre-trained fastText (30) + simple sentence2vec
- Pre-trained fastText + SIF (Smooth Inverse Frequency) (31)
- Hidden latent embedding of KoBERT (32)
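Each row of Table 1 corresponds to a family of interchangeable steps that the beam search can try. As a minimal sketch (the names are illustrative, not the system's actual interface), the NaN-processing strategies could be exposed as candidate steps like this:

import pandas as pd

# one candidate step per NaN-processing strategy in Table 1
NAN_STEPS = {
    "fill_median": lambda df: df.fillna(df.median(numeric_only=True)),
    "fill_mean": lambda df: df.fillna(df.mean(numeric_only=True)),
    "fill_zero": lambda df: df.fillna(0),
    "ffill": lambda df: df.ffill(),       # propagate last valid observation
    "bfill": lambda df: df.bfill(),       # fill with next valid observation
    "drop_rows": lambda df: df.dropna(),  # drop rows with NaN
}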
The feature selection module consists of a custom module, a Boruta module (33)-(35), and an autofeat module. The feature selection steps are applied in the order of the custom module, the Boruta module, and the autofeat module, interleaved with the feature generation steps. By applying selection between feature generation steps, out-of-memory problems can be avoided, and performance can be further improved by removing newly generated features that do not contribute to performance.
The custom module applies feature selection rules created from domain expertise through analysis of existing AI models and systems. When data from the same source or a similar system is used, the preprocessing rules of previously developed systems are applied. The more logs accumulate in the log database of the system, the more rules the custom module will have, which may reduce the time for feature engineering and improve model performance.
The Boruta module is executed before the non-linear feature generation module to remove noise features and free memory. It trains a random forest model on both the original features and noise features, which are randomly drawn from a normal distribution or made by shuffling the originals. The module keeps only the original features whose importance in the model is larger than the largest importance of the noise features (33)-(35).
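A minimal sketch of this step with the BorutaPy package, assuming a numeric feature matrix X and labels y as NumPy arrays:

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators='auto', random_state=42)
selector.fit(X, y)                    # compares real features against shadow (noise) features
X_selected = X[:, selector.support_]  # keep only the confirmed features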
The autofeat module uses an L1-regularized linear model to extract only meaningful features based on the size of the coefficient of each variable (17). The larger the absolute value of a variable's coefficient, the greater its influence on the prediction, so the module selects features whose coefficients exceed a certain threshold. The reason for using an L1-regularized linear model for feature selection is that it drives the coefficients of unhelpful features to 0, and a feature with a coefficient of 0 does not affect the prediction no matter what its value is. This module is normally executed in the last step, after the non-linear feature generation module, to generate the final promising dataset candidates.
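A minimal sketch of L1-based selection with scikit-learn, assuming a numeric matrix X and target y; the alpha value and threshold are illustrative:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)  # put coefficients on a comparable scale
lasso = Lasso(alpha=0.01).fit(X_std, y)
keep = np.abs(lasso.coef_) > 1e-6          # L1 drives unhelpful coefficients to 0
X_selected = X[:, keep]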
3.4 Model Selection & Optimization Module
The model selection & optimization module is responsible for finding and training the optimal algorithm for the k best preprocessed datasets. The module consists of a customized TPOT (19), a customized Auto-Keras (23), and an ensemble module.
The customized TPOT module adds not only recent ensemble learners such as CatBoost and LightGBM to the original TPOT package, but also excellent algorithms previously developed in the fields of transmission, distribution, and power generation to the optimal algorithm search space.
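TPOT's search space can be extended through its config_dict argument; a sketch of registering LightGBM and CatBoost follows (the hyper-parameter grids are illustrative, and any in-house estimator with a scikit-learn interface could be added the same way):

from tpot import TPOTRegressor

custom_config = {
    'lightgbm.LGBMRegressor': {
        'n_estimators': [100, 500],
        'learning_rate': [0.01, 0.05, 0.1],
    },
    'catboost.CatBoostRegressor': {
        'depth': [4, 6, 8],
        'verbose': [0],
    },
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [100, 300],
    },
}
tpot = TPOTRegressor(generations=5, population_size=20,
                     config_dict=custom_config, n_jobs=-1)
tpot.fit(X_train, y_train)  # X_train, y_train: a preprocessed dataset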
The customized Auto-Keras module basically performs neural architecture search based on the StructuredDataClassifier and StructuredDataRegressor of the original AutoKeras package. When the data is related to specific fields or systems, the module applies customized rules and limits the search space using the AutoModel class, based on the domain knowledge in the database.
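A sketch of restricting the search space with AutoModel (the block choice and trial budget are illustrative assumptions, not the system's actual rules):

import autokeras as ak

input_node = ak.StructuredDataInput()
hidden = ak.DenseBlock()(input_node)    # limit the search to dense blocks
output_node = ak.RegressionHead()(hidden)
model = ak.AutoModel(inputs=input_node, outputs=output_node,
                     max_trials=10, overwrite=True)
model.fit(x_train, y_train, epochs=20)  # x_train, y_train: tabular data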
The ensemble module combines the several best outcomes of the previous steps, including the customized TPOT and customized Auto-Keras steps. Various ensemble algorithms are applied, such as voting, stacking, and bagging. After the ensemble module is completed, the final best model is selected to provide the optimal machine learning pipeline.
3.5 Infrastructure
To support the efficiency and scalability of the proposed system, its infrastructure is constructed on virtual machines of the OpenStack Train release, as shown in figure 3. A virtual server is allocated for the web server and WAS, where users can upload data and check results. A message queue is responsible for scheduling jobs requested by users and allocating them to multiple virtual machines for computing. Virtual machines with GPUs are also provided depending on the characteristics of the assigned jobs.
Fig. 3. Cloud-based infrastructure of the proposed system
Logs of each step are stored in the database together with the attributes of the data and the job requests. For example, in the case of power demand forecasting, the number of rows of data, the types of columns, the basic statistics, and the applied steps and results of jobs are stored. These logs are used to reduce the search space of the automated machine learning system and to create better-performing models that reflect domain characteristics. In future research, the heuristic approach will be replaced with a Bayesian approach or reinforcement learning based on the accumulated logs.
4. Experiments and Results
We show the excellence of our system by applying the automated pipeline to three real-world problems covering regression, classification, and text classification: prediction of the maximum demand of electricity (regression), detection of illegal use of electricity for Bitcoin mining (classification), and classification of question intents for the chatbot of a legal expert system (text classification).
4.1 Prediction of Maximum Demand of Electricity
Forecasting the demand of electricity is an important issue directly related to facility investment, stability of supply and demand, and the cost of purchasing electricity, and it also has a great impact on the national economy. In the short term, if power demand is overestimated, the price will rise and the cost of demand management will increase. Conversely, if power demand is underestimated, the power supply will become unstable and additional costs related to power generation in the power market will rise. Accordingly, KEPCO (Korea Electric Power Corporation) and KPX (Korea Power Exchange) are applying various AI models to predict the hourly maximum power demand more accurately.

The Existing Model: LSTM-based models are currently used to predict peak power demand at KEPCO. Experts correct the values predicted by the model to derive the final predicted values.
Dataset: Data from various sources are used as independent variables: statistics of hourly power usage by industry/contract type, hourly power usage by customers, statistics of loads by power plant, hourly loads by headquarters, one-minute weather observations at 675 points and 3-hour weather forecasts at 3600 points provided by the Korea Meteorological Administration, 766 statistical indicators provided by the National Statistical Office, and 101 economic statistics provided by the Bank of Korea. Data from January 1, 2016 to December 31, 2019 are used as training data, and data from January 1, 2020 to March 28, 2020 are used as the test dataset.
Evaluation Metric: RMSE (Root Mean Square Error). With a dependent variable observed over $N$ times, the RMSE of the predicted values $\hat{y}_i$ against the observed values $y_i$ is the square root of the mean of the squared errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$
The final machine learning pipeline: 7 feature generation steps, 3 feature selection steps, and a voting & stacking ensemble algorithm are included in the pipeline; a sketch of the final ensemble step follows the step list below.
{Step1: feature selection based on the heuristic rule}
{Step2: generated 20 lagged data}
{Step3: fill NA with last valid observation}
{Step4: fill NA with next valid observation}
{Step5: date type processing}
{Step6: one hot encoding}
{Step7: discretize continuous variables into buckets with equal size}
{Step8: feature selection with boruta module}
{Step9: non-linear feature generation}
{Step10: feature selection with autofeat module}
{Step11: voting(stacking(LinearSVR + LassoLarsCV + XGBoost), LightGBM)}
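A minimal scikit-learn sketch of Step 11, with hyper-parameters omitted (in the system they would be set by the optimization module):

from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.svm import LinearSVR
from sklearn.linear_model import LassoLarsCV
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

stack = StackingRegressor(estimators=[
    ('svr', LinearSVR()),
    ('lars', LassoLarsCV()),
    ('xgb', XGBRegressor()),
])
model = VotingRegressor(estimators=[
    ('stack', stack),
    ('lgbm', LGBMRegressor()),
])
model.fit(X_train, y_train)  # X_train, y_train: the dataset after Step 10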
Results: The RMSE of our machine learning pipeline is 0.01443, a 3.42% performance improvement over the existing model. Figure 4 shows the average hourly RMSE per day. On most days, the error of our pipeline is smaller than that of the existing model.
Table 2. Comparison of performances between models

Metric | Existing model (LSTM) | Ours
RMSE   | 0.04557               | 0.01296
Fig. 4. Daily RMSE of the existing model and ours
4.2 Detection of Illegal Use of Electricity for Bitcoin Mining
When a customer enters into an electricity use contract with KEPCO, the contract is always designated for a predetermined use, called the contract type. The contract type is a classification in which the electricity rate differs depending on the main economic activity of the contracted customer, so electricity rates vary widely by contract type. For example, electricity charges are quite cheap for residential street lighting, agricultural use, and general education compared to the charges for residential use.

If a customer who contracts with KEPCO for agricultural use illegally uses electricity for residential purposes, KEPCO inevitably suffers a loss equal to the difference between the residential and agricultural charges. If this happens nationwide, KEPCO suffers huge economic losses every year.
Bitcoin mining consumes a tremendous amount of power because the computation is done on many GPUs, and it produces virtual currency rather than tangible assets, unlike other manufacturing industries. This is why Bitcoin mining belongs to the general contract type without any discount on electricity charges. However, if customers apply for an industrial or agricultural contract type and mine Bitcoin, KEPCO suffers a loss equal to the difference between the charges. Because of the enormous electricity consumption of Bitcoin mining, this illegal use causes a bigger loss to KEPCO than any other case. Therefore, several algorithms have been developed to automatically detect sites that illegally mine Bitcoin, because there are not enough workers to check all the sites.
The Existing Model: An ensemble model based on random forest was recently developed and is planned to be applied in practice.
Dataset: Since there were only 279 cases in which actual checks for illegal Bitcoin mining were conducted, pseudo-labeled data based on business insight are used for training the model. By analyzing the electricity usage patterns of customers contracted for agricultural use in Busan from 2016 to 2019, the data were pseudo-labeled as normal or abnormal by judging whether electricity was used as contracted with KEPCO. Boxplot and clustering algorithms were used to distinguish customers with normal patterns from those with abnormal patterns. The 279 actual cases, all conducted in Busan, are used as the test dataset.
Evaluation Metric: We evaluated the performance of the model through several evaluation metrics: accuracy, F1 score, ROC-AUC score, precision, and recall.
The final machine learning pipeline: 4 feature generation steps, 1 feature selection step, and the random forest algorithm are included in the pipeline; a sketch of the oversampling step follows the list below.
{Step1: oversampling with SMOTE and SVM}
{Step2: discretize continuous variables into buckets with equal size}
{Step3: one hot encoding}
{Step4: feature selection with boruta module}
{Step5: non-linear feature generation}
{Step6: random forest}
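A minimal sketch of Step 1, assuming the SVM variant of SMOTE from imbalanced-learn and a feature matrix X with binary labels y:

from imblearn.over_sampling import SVMSMOTE

# oversample the minority (abnormal) class using SVM-guided SMOTE
X_resampled, y_resampled = SVMSMOTE(random_state=42).fit_resample(X, y)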
Results: Our pipeline outperformed the existing algorithm on all evaluation metrics. In particular, the F1 score of ours is improved by 3.23% compared to the existing model.
Table 3. Comparison of performances between models

Metric    | Existing model (Random Forest) | Ours
Accuracy  | 0.96057                        | 0.97849
F1 score  | 0.93023                        | 0.96026
ROC-AUC   | 0.91988                        | 0.96876
Precision | 0.94150                        | 0.95223
Recall    | 0.91988                        | 0.96876
4.3 Classification of Question Intents of the Chatbot
KEPCO develops and runs a Korean-language chatbot service that informs employees of company regulations and legal information as part of a legal expert system. The chatbot uses a commercial engine that operates mainly on a rule-based method. In order to replace the existing rule-based engine, a new algorithm is needed to identify the intent of users' questions and match them with appropriate answers. Moreover, the new algorithm should solve the out-of-vocabulary (OOV) problem of the existing engine, in which new words that are not in the training data cannot be handled.
The Existing Model: The chatbot uses a commercial chatbot engine that operates mainly on a rule-based method.
Dataset: A dataset with a {question, answer} structure was built from a list of questions and answers about in-house precedents prepared by the in-house legal department. Through crowdsourcing, the dataset was augmented by creating question sentences with different wording but the same meaning as the existing questions. Examples are {"Please explain the relocation work of the distribution line", Answer 1}, {"When is it necessary to relocate the distribution line?", Answer 1}, and {"Who will pay for the relocation work of the distribution line?", Answer 2}. It is therefore necessary to apply a classification algorithm after embedding the text data. The total dataset consists of 35,788 questions with 332 intents as classes; that is, about 100 questions per class. The test dataset is 20% of the shuffled entire dataset.
Evaluation Metric: We evaluated the performance of the model through several evaluation metrics: accuracy, F1 score, ROC-AUC score, precision, and recall.
The final machine learning pipeline: 1 tokenization step, 2 sentence embedding steps, and a soft voting ensemble algorithm are included in the pipeline; a sketch follows the list below.
{Step1: tokenization with Okt of KoNLPy (36)}
{Step2: TF-IDF embedding}
{Step3: sentence embedding with pre-trained Korean fastText module}
{Step4: soft voting(step2, step3)}
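A minimal sketch of the pipeline, assuming lists of raw question strings and intent labels; the base classifier (logistic regression) and the fastText model file are illustrative assumptions, and the train/test split is omitted for brevity:

import numpy as np
import fasttext
from konlpy.tag import Okt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

okt = Okt()
tokenized = [" ".join(okt.morphs(q)) for q in questions]  # Step 1

X_tfidf = TfidfVectorizer().fit_transform(tokenized)      # Step 2
ft = fasttext.load_model("cc.ko.300.bin")                 # pre-trained Korean fastText
X_ft = np.stack([ft.get_sentence_vector(q) for q in questions])  # Step 3

clf_tfidf = LogisticRegression(max_iter=1000).fit(X_tfidf, labels)
clf_ft = LogisticRegression(max_iter=1000).fit(X_ft, labels)

# Step 4: soft voting, i.e. averaging the class probabilities of the two models
proba = (clf_tfidf.predict_proba(X_tfidf) + clf_ft.predict_proba(X_ft)) / 2
predictions = clf_tfidf.classes_[proba.argmax(axis=1)]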
Results: Using the soft voting ensemble method, we combined the sentence embedding based on the pre-trained fastText model, which solves the OOV problem, with the high-performing TF-IDF embedding. The voting ensemble model has an accuracy of 95.226%, which shows the possibility of replacing the existing commercial engine.
Table 4. Comparison of performances between models

Metric    | TF-IDF  | fastText | TF-IDF + fastText
Accuracy  | 0.94191 | 0.89434  | 0.95226
F1 score  | 0.93503 | 0.88978  | 0.93494
ROC-AUC   | 0.97075 | 0.94881  | 0.97073
Precision | 0.93473 | 0.89043  | 0.93473
Recall    | 0.93885 | 0.89524  | 0.93880
|