1. Introduction
As Artificial Intelligence (AI) demonstrates excellent performance, various industries are developing AI models and applying them in the field. The electric power industry also applies AI to various areas based on its accumulated big data, and the AI-based algorithms show excellent performance. Not only statistical techniques such as ARIMA but also LSTM-based deep learning models are used to predict the maximum demand of electricity and maintain stability in power supply and demand. AI is also used to predict the SMP (System Marginal Price) and the maximum amount of supply, as well as a health index that predicts the remaining life of equipment based on facility sensor data. Beyond models for tabular data, AI is also being applied to natural language processing (NLP) tasks based on text data and to computer vision tasks that deal with images. A QA (Question & Answer) system for the power generation field and a legal chatbot have been developed based on various text data in the field. In computer vision, an insulator detection model based on YOLO (1) and a safety helmet recognition model for construction sites based on Mask R-CNN (2) have been developed.
Even though tremendous amounts of data have been accumulated and infrastructure for data analysis has been prepared, AI experts who can analyze real data and develop models are lacking. In the electric power field, good models can be developed only with sufficient domain knowledge due to the characteristics of the field; it is difficult for an analyst without domain knowledge to create an excellent algorithm that can be applied to the real world. Conversely, it is not easy for domain experts to develop models with a deep understanding of AI beyond simply using machine learning models. As a result, AI experts in the electric power industry are insufficient while there are many areas where AI should be applied, which causes an imbalance in the demand and supply of AI specialists. In addition, data analysis and model development take a lot of time, including the feature engineering step to find the best features, the algorithm selection step, and the hyper-parameter tuning step, making it even more difficult to apply AI to the electric power industry with a limited number of specialists.
Automated machine learning can be the solution for the electric power industry to apply the excellence of AI to many fields despite the shortage of professionals. Automated machine learning is a technology that automatically performs a series of processes of feature engineering, model selection, and hyper-parameter tuning to provide optimal models simply by entering data and defining the problem. Automated machine learning can relieve the shortage of AI experts in the electric power industry, so that optimal algorithms can be applied to many areas in a short time. Recently, Microsoft AutoML (3), DataRobot (4), and H2O (5)-(7) have shown excellent performance in the field of automated machine learning. However, due to the specificity of the data, they do not perform well in the electric power industry. Therefore, we develop an automated machine learning pipeline system specialized for the electric power industry and confirm its excellence.
2. Related Work
2.1 Automated Feature Engineering
The process of creating new features and keeping only good ones has a profound impact on the performance of a prediction model. Through various methods such as NaN processing, outlier detection, one-hot encoding, and logarithm transformation, data is modified so that its characteristics are reflected in the prediction model. Using more features than the optimal number causes most machine learning algorithms to lose performance (8), and unconditional overuse of features can degrade the model due to irrelevant variables. It is therefore very important to extract the right subset of the entire feature set through the feature selection step. Feature selection methods select important features based on the feature importance scores generated by a prediction model (8)(9), feature selection indices such as information gain (10), correlations among features (11), and recursive feature elimination (12)(13).
In an attempt to automate EDA (Exploratory Data Analysis)-based feature engineering, in which data analysts invest most of their time, various studies have been carried out, including featuretools (14), ExploreKit (15), AutoLearn (16), and autofeat (17). Although these works propose different methodologies, they are similar in that they automatically generate new features through pre-defined transformations and keep only excellent features through feature selection techniques. autofeat (17), for example, automatically generates features with pre-defined functions such as log(x), 1/x, sin, and cos. It then uses an L1-regularized linear model to extract only significant features based on the size of the coefficient of each variable: only variables whose model coefficients have a large absolute value are included in the optimal feature set. In this study, some modules of our system for feature generation and feature selection are implemented using autofeat. The proposed system adds various feature preprocessing techniques that are not provided by the existing studies, such as NaN processing, outlier removal, skewness processing, categorical variable processing, categorization of continuous variables, and text embedding. In addition, a Boruta module and a domain-knowledge-based rule insertion module are added to compensate for the insufficient feature selection functions.
2.2 Automated Algorithm Search & Hyper-parameter Tuning
Most automated algorithm search methods use heuristic approaches such as genetic algorithms and Bayesian approaches to select the optimal algorithm from the list of countless applicable algorithms. AutoSklearn (18), H2O (5)-(7), and TPOT (19) are representative packages for finding the optimal algorithm. In the case of TPOT, a genetic algorithm is used to efficiently explore the algorithm search space, and the size of the search space can be adjusted by the user (19). In this study, a subset of the machine learning algorithm search is implemented using TPOT.
In the hyper-parameter tuning step, the hyper-parameters of the selected model are optimized through various methods such as grid search, Bayesian search, and random search. By applying optimal hyper-parameters, the performance of the model can be improved.
Recently, Neural Architecture Search (NAS) has been in the spotlight in the AutoML field as a method that finds an optimal deep learning architecture from input data alone. Since NAS needs more computation resources than other AutoML fields, much research has been conducted on distributed processing and efficient search techniques to reduce computation (20)-(22). One of the most popular packages is Auto-Keras (23). It covers various areas such as image classification, image regression, text classification, structured data classification, and structured data regression based on Keras (24), and provides additional functions such as setting a limited search space in consideration of the computation time of jobs. The proposed system also includes a light-weight NAS that uses the search-space-setting function of Auto-Keras to apply common properties of existing AI models in the electric power industry to NAS and to reduce computation time.
Fig. 1. Architecture of the proposed system
3. The Proposed System
3.1 Architecture
The system is mainly composed of three parts: the feature engineering module, the model selection & optimization module, and the beam search module. In order to reduce the computation cost of searching for the optimal model, the system finds the best model using beam search (29). The beam search process is separated into two steps: the feature engineering step and the model selection & optimization step. The k most promising datasets are generated via beam-search-based optimization, and the best models are selected using those k datasets. That is, the model selection & optimization step is performed only on the k promising datasets that result from the feature engineering step.
3.2 Beam Search Module
The system optimizes the machine learning pipeline based on the beam search strategy, a kind of best-first search. Beam search explores the search space by expanding the most promising nodes in a limited set. Only a predetermined number of best partial solutions are kept as candidates, which reduces memory requirements. Beam search is thus suitable for cases that require a lot of computation resources and time, such as an automated machine learning pipeline.
In the proposed automated machine learning pipeline, each level is composed of different types of processing steps with the same purpose, and the pipeline search space is composed of a set of levels with different work attributes. The order in which the levels are applied is determined by heuristic rules from the domain knowledge database, considering the source of the data, the amount of data, the data attributes, and the type of work.

At each level, we select the k steps with the highest cross-validation score as candidate inputs of the next level. Once the steps at each level are executed, the system selects the k optimal steps over the levels performed so far. After exploring all levels based on the beam search strategy, the best machine learning pipeline, consisting of feature engineering steps and the optimal model, is created.
After the original dataset is uploaded, the search space of the tree structure is determined based on the heuristic rules of the domain knowledge database. The search begins with the steps in the first level. After all steps of the first level are done, the k steps with the highest validation score are selected and inserted into a priority queue. The k steps in the priority queue are used as the input of the steps of the next level. After the steps at the next level are completed, the priority queue is updated to hold only the k optimal steps of the levels performed so far; that is, all steps except the best k are popped.
Algorithm 1 Beam Search in the Proposed System
INPUT: Ordered Levels /* the order of levels and the steps in each level are determined in advance by the heuristic rule of the domain knowledge database */
OUTPUT: Best machine learning pipeline
/* Best_pq: priority queue that stores the k best steps */
k ← 2 /* number of steps to keep after each level */
Insert the original data into Best_pq
for Level in Ordered Levels
    for previous_best in Best_pq
        Run all steps in Level using previous_best as input
        Insert {steps, results} into Best_pq
    Pop all steps from Best_pq except the best k cases
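A minimal Python sketch of Algorithm 1 is shown below; the names ordered_levels, step, and evaluate are hypothetical stand-ins for the system's level definitions and cross-validation scorer, and real steps would also carry metadata for logging.

import heapq

def beam_search(original_data, ordered_levels, evaluate, k=2):
    # Each beam entry is (score, applied_steps, dataset).
    beam = [(evaluate(original_data), [], original_data)]
    for level in ordered_levels:
        candidates = []
        for score, steps, data in beam:
            for step in level:  # run all steps of the level
                new_data = step(data)
                candidates.append((evaluate(new_data), steps + [step], new_data))
        # keep only the k best cases of the levels performed so far
        beam = heapq.nlargest(k, beam + candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])  # the best machine learning pipeline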
Fig. 2. Beam search strategy for finding the optimal machine learning pipeline
For example, figure 2 illustrates the beam search strategy with k = 2. After level 1 is finished, the two best steps of level 1 are inserted into the priority queue, and data processed with these two steps are used as the inputs of the steps of level 2. Suppose that, after level 2 is done, the dataset with the two steps applied sequentially and the dataset with only one level-1 step applied have the best scores; then only these two lists of steps are kept in the priority queue. In this way, two optimal lists of steps are maintained for each level as the search proceeds. When the last level is finished, the list of steps with the highest score in the priority queue is adopted as the final best pipeline. The parallel processing of beam search quickly prunes unproductive searches and instead devotes resources to where the most progress is being made. All logs of the search process for finding the optimal pipeline are stored in the domain knowledge database and used to improve the heuristic rule that determines the order of the levels composing the search space of the machine learning pipeline.
3.3 Feature Engineering Module
The feature engineering module consists of a feature generation module, a feature selection module, and a scoring module. The feature generation and feature selection modules are classified by their functional characteristics, but they are executed without distinction when searching for the best dataset. The module automatically infers the type of each variable and performs a feature engineering procedure suitable for each type. Through the scoring module, the performance score of the dataset produced by each step is calculated, and the optimal dataset is searched for based on the beam search strategy using that score.
The feature generation module consists of a custom module and a non-linear feature generation module. The custom module provides feature engineering techniques that are not provided by existing packages such as AutoSklearn (18), TPOT (19), Auto-Keras (23), and autofeat (17). The above packages will not run if NaN values are present. In addition, automatic feature generation packages including autofeat (17) only generate features through pre-defined data transformations, and their transformation targets must be numeric, which limits their usage. Unlike these packages, the proposed system provides various feature generation functions such as embedding text data, categorizing continuous variables, and creating embedding variables in various ways such as clustering. The custom module applies more than 30 data preprocessing strategies of 10 types to create new dataset candidates that can bring out the best performance. In addition, it refines the dataset to reduce manual steps and operates robustly through exception handling.
The non-linear feature generation module provides functions to generate non-linear features such as log(x), √x, 1/x, x², x³, |x|, exp(x), 2^x, sin(x), and cos(x), and to apply arithmetic operators to pairs of features. The module uses autofeat to create new non-linear features exponentially (17). It helps find features with hidden insight that capture the non-linearity of the data.
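As a rough illustration of this kind of pre-defined transformation (a sketch in the spirit of the module, not the autofeat internals), the following snippet expands every numeric column with a few of the listed functions:

import numpy as np
import pandas as pd

def nonlinear_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in df.select_dtypes(include=np.number).columns:
        x = df[col]
        out[f"log({col})"] = np.log(x.where(x > 0))    # defined only for x > 0
        out[f"sqrt({col})"] = np.sqrt(x.clip(lower=0))
        out[f"{col}^2"] = x ** 2
        out[f"abs({col})"] = x.abs()
        out[f"sin({col})"] = np.sin(x)
    return out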
Table 1. Custom module of feature generation

NaN processing
- Fill with median
- Fill with mean
- Fill with a specific value (e.g. 0)
- Propagate last valid observation
- Fill with next valid observation
- Drop rows with NaN

Outlier processing
- Boxplot
- Isolation forest with considering class
- Isolation forest without considering class
- KNN
- DBSCAN
- Z-score

Lagging processing
- Generate k lagged data

Skewness processing
- Box-Cox
- Root transformation
- Logarithm transformation

Encoding of categorical variables
- One-hot encoding
- Ordinal encoding

Categorization of numeric variables
- Discretize into buckets of equal size
- Bin variables into predefined buckets

Class imbalance processing
- Variations of SMOTE
- Random under-sampling
- Random over-sampling
- Adaptive synthetic over-sampling
- Under-sampling with Tomek's links

Clustering based feature generation
- K-means
- DBSCAN
- HDBSCAN

Date type processing
- Extract year, month, day, day of week, weekday, weekend

Text embedding
- Bag-of-words vectorization
- TF-IDF vectorization
- Pre-trained fastText (30) + simple sentence2vec
- Pre-trained fastText + SIF (Smooth Inverse Frequency) (31)
- Hidden latent embedding of KoBERT (32)
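Each row of Table 1 corresponds to a family of interchangeable steps that the beam search can try. As a minimal sketch (the names are illustrative, not the system's actual interface), the NaN-processing strategies could be exposed as candidate steps like this:

import pandas as pd

# one candidate step per NaN-processing strategy in Table 1
NAN_STEPS = {
    "fill_median": lambda df: df.fillna(df.median(numeric_only=True)),
    "fill_mean": lambda df: df.fillna(df.mean(numeric_only=True)),
    "fill_zero": lambda df: df.fillna(0),
    "ffill": lambda df: df.ffill(),       # propagate last valid observation
    "bfill": lambda df: df.bfill(),       # fill with next valid observation
    "drop_rows": lambda df: df.dropna(),  # drop rows with NaN
}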
The feature selection module consists of a custom module, a Boruta module (33)-(35), and an autofeat module. The feature selection steps are applied in the order of the custom module, the Boruta module, and the autofeat module, interleaved with the feature generation steps. By applying selection between feature generation steps, out-of-memory problems can be avoided, and performance can be further improved by removing newly generated features that do not contribute to performance.
The custom module applies feature selection rules created from domain expertise through analysis of existing AI models and systems. When data from the same source or a similar system is used, the preprocessing rules of previously developed systems are applied. The more logs accumulate in the log database of the system, the more rules the custom module will have, which may reduce the time for feature engineering and improve model performance.
The Boruta module is executed before the non-linear feature generation module to remove noise features and free memory. It trains a random forest model on both the original features and noise features, which are randomly drawn from a normal distribution or made by shuffling the originals. The module keeps only the original features whose importance in the model is larger than the largest importance of the noise features (33)-(35).
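A minimal sketch of this step with the BorutaPy package, assuming a numeric feature matrix X and labels y as NumPy arrays:

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators='auto', random_state=42)
selector.fit(X, y)                    # compares real features against shadow (noise) features
X_selected = X[:, selector.support_]  # keep only the confirmed features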
The autofeat module uses an L1-regularized linear model to extract only meaningful features based on the size of the coefficient of each variable (17). The larger the absolute value of a variable's coefficient, the greater its influence on the prediction, so the module selects features whose coefficients exceed a certain threshold. The reason for using an L1-regularized linear model for feature selection is that it drives the coefficients of unhelpful features to 0, and a feature with a coefficient of 0 does not affect the prediction no matter what its value is. This module is normally executed in the last step, after the non-linear feature generation module, to generate the final promising dataset candidates.
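A minimal sketch of L1-based selection with scikit-learn, assuming a numeric matrix X and target y; the alpha value and threshold are illustrative:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)  # put coefficients on a comparable scale
lasso = Lasso(alpha=0.01).fit(X_std, y)
keep = np.abs(lasso.coef_) > 1e-6          # L1 drives unhelpful coefficients to 0
X_selected = X[:, keep]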
3.4 Model Selection & Optimization Module
The model selection & optimization module is responsible for finding and training the optimal algorithm for the k best preprocessed datasets. The module consists of a customized TPOT (19), a customized Auto-Keras (23), and an ensemble module.
The customized TPOT module adds not only recent ensemble learners such as CatBoost and LightGBM to the original TPOT package, but also excellent algorithms previously developed in the fields of transmission, distribution, and power generation to the optimal algorithm search space.
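TPOT's search space can be extended through its config_dict argument; a sketch of registering LightGBM and CatBoost follows (the hyper-parameter grids are illustrative, and any in-house estimator with a scikit-learn interface could be added the same way):

from tpot import TPOTRegressor

custom_config = {
    'lightgbm.LGBMRegressor': {
        'n_estimators': [100, 500],
        'learning_rate': [0.01, 0.05, 0.1],
    },
    'catboost.CatBoostRegressor': {
        'depth': [4, 6, 8],
        'verbose': [0],
    },
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [100, 300],
    },
}
tpot = TPOTRegressor(generations=5, population_size=20,
                     config_dict=custom_config, n_jobs=-1)
tpot.fit(X_train, y_train)  # X_train, y_train: a preprocessed dataset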
The customized Auto-Keras module basically performs neural architecture search based on the StructuredDataClassifier and StructuredDataRegressor of the original AutoKeras package. When the data is related to specific fields or systems, the module applies customized rules and limits the search space using the AutoModel class, based on the domain knowledge in the database.
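A sketch of restricting the search space with AutoModel (the block choice and trial budget are illustrative assumptions, not the system's actual rules):

import autokeras as ak

input_node = ak.StructuredDataInput()
hidden = ak.DenseBlock()(input_node)    # limit the search to dense blocks
output_node = ak.RegressionHead()(hidden)
model = ak.AutoModel(inputs=input_node, outputs=output_node,
                     max_trials=10, overwrite=True)
model.fit(x_train, y_train, epochs=20)  # x_train, y_train: tabular data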
The ensemble module combines the several best outcomes of the previous steps, including the customized TPOT and customized Auto-Keras steps. Various ensemble algorithms are applied, such as voting, stacking, and bagging. After the ensemble module is completed, the final best model is selected to provide the optimal machine learning pipeline.
3.5 Infrastructure
To support the efficiency and scalability of the proposed system, its infrastructure is constructed on virtual machines of the OpenStack Train release, as shown in figure 3. A virtual server is allocated for the web server and WAS, where users can upload data and check results. A message queue is responsible for scheduling jobs requested by users and allocating them to multiple virtual machines for computing. Virtual machines with GPUs are also provided depending on the characteristics of the assigned jobs.
Fig. 3. Cloud-based infrastructure of the proposed system
Logs of each step are stored in the database together with the attributes of the data and the job requests. For example, in the case of power demand forecasting, the number of rows of data, the types of columns, the basic statistics, and the applied steps and results of jobs are stored. These logs are used to reduce the search space of the automated machine learning system and to create better-performing models that reflect domain characteristics. In future research, the heuristic approach will be replaced with a Bayesian approach or reinforcement learning based on the accumulated logs.
4. Experiments and Results
We show the excellence of our system by applying the automated pipeline to three real-world problems covering regression, classification, and text classification: prediction of the maximum demand of electricity (regression), detection of illegal use of electricity for Bitcoin mining (classification), and classification of question intents for the chatbot of a legal expert system (text classification).
4.1 Prediction of Maximum Demand of Electricity
Forecasting the demand of electricity is an important issue directly related to facility investment, stability of supply and demand, and the cost of purchasing electricity, and it also has a great impact on the national economy. In the short term, if power demand is overestimated, the price will rise and the cost of demand management will increase. Conversely, if power demand is underestimated, the power supply will become unstable and additional costs related to power generation in the power market will rise. Accordingly, KEPCO (Korea Electric Power Corporation) and KPX (Korea Power Exchange) are applying various AI models to predict the hourly maximum power demand more accurately.

The Existing Model: LSTM-based models are currently used to predict peak power demand at KEPCO. Experts correct the values predicted by the model to derive the final predicted values.
Dataset: Data from various sources are used as independent variables: statistics of hourly power usage by industry/contract type, hourly power usage by customers, statistics of loads by power plant, hourly loads by headquarters, one-minute weather observations at 675 points and 3-hour weather forecasts at 3600 points provided by the Korea Meteorological Administration, 766 statistical indicators provided by the National Statistical Office, and 101 economic statistics provided by the Bank of Korea. Data from January 1, 2016 to December 31, 2019 are used as training data, and data from January 1, 2020 to March 28, 2020 are used as the test dataset.
Evaluation Metric: RMSE (Root Mean Square Error). With a dependent variable observed over $N$ times, the RMSE of the predicted values $\hat{y}_i$ against the observed values $y_i$ is the square root of the mean of the squared errors:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$
The final machine learning pipeline: 7 feature generation steps, 3 feature selection steps, and a voting & stacking ensemble algorithm are included in the pipeline; a sketch of the final ensemble step follows the step list below.
{Step1: feature selection based on the heuristic rule}
{Step2: generated 20 lagged data}
{Step3: fill NA with last valid observation}
{Step4: fill NA with next valid observation}
{Step5: date type processing}
{Step6: one hot encoding}
{Step7: discretize continuous variables into buckets with equal size}
{Step8: feature selection with boruta module}
{Step9: non-linear feature generation}
{Step10: feature selection with autofeat module}
{Step11: voting(stacking(LinearSVR + LassoLarsCV + XGBoost), LightGBM)}
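A minimal scikit-learn sketch of Step 11, with hyper-parameters omitted (in the system they would be set by the optimization module):

from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.svm import LinearSVR
from sklearn.linear_model import LassoLarsCV
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

stack = StackingRegressor(estimators=[
    ('svr', LinearSVR()),
    ('lars', LassoLarsCV()),
    ('xgb', XGBRegressor()),
])
model = VotingRegressor(estimators=[
    ('stack', stack),
    ('lgbm', LGBMRegressor()),
])
model.fit(X_train, y_train)  # X_train, y_train: the dataset after Step 10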
Results: The RMSE of our machine learning pipeline is 0.01443, a 3.42% performance improvement over the existing model. Figure 4 shows the average hourly RMSE per day. On most days, the error of our pipeline is smaller than that of the existing model.
Table 2. Comparison of performances between models

Metric | Existing model (LSTM) | Ours
RMSE   | 0.04557               | 0.01296
Fig. 4. Daily RMSE of the existing model and ours
4.2 Detection of Illegal Use of Electricity for Bitcoin Mining
When a customer enters into an electricity use contract with KEPCO, the contract is always designated for a predetermined use, called the contract type. The contract type is a classification in which the electricity rate differs depending on the main economic activity of the contracted customer, so electricity rates vary widely by contract type. For example, electricity charges are quite cheap for residential street lighting, agricultural use, and general education compared to the charges for residential use.

If a customer who contracts with KEPCO for agricultural use illegally uses electricity for residential purposes, KEPCO inevitably suffers a loss equal to the difference between the residential and agricultural charges. If this happens nationwide, KEPCO suffers huge economic losses every year.
Bitcoin mining consumes a tremendous amount of power because the computation is done on many GPUs, and it produces virtual currency rather than tangible assets, unlike other manufacturing industries. This is why Bitcoin mining belongs to the general contract type without any discount on electricity charges. However, if customers apply for an industrial or agricultural contract type and mine Bitcoin, KEPCO suffers a loss equal to the difference between the charges. Because of the enormous electricity consumption of Bitcoin mining, this illegal use causes a bigger loss to KEPCO than any other case. Therefore, several algorithms have been developed to automatically detect sites that illegally mine Bitcoin, because there are not enough workers to check all the sites.
The Existing Model: An ensemble model based on random forest was recently developed and is planned to be applied in practice.
Dataset: Since there were only 279 cases in which actual checks for illegal Bitcoin mining were conducted, pseudo-labeled data based on business insight are used for training the model. By analyzing the electricity usage patterns of customers contracted for agricultural use in Busan from 2016 to 2019, the data were pseudo-labeled as normal or abnormal by judging whether electricity was used as contracted with KEPCO. Boxplot and clustering algorithms were used to distinguish customers with normal patterns from those with abnormal patterns. The 279 actual cases, all conducted in Busan, are used as the test dataset.
Evaluation Metric: We evaluated the performance of the model through several evaluation metrics: accuracy, F1 score, ROC-AUC score, precision, and recall.
The final machine learning pipeline: 4 feature generation steps, 1 feature selection step, and the random forest algorithm are included in the pipeline; a sketch of the oversampling step follows the list below.
{Step1: oversampling with SMOTE and SVM}
{Step2: discretize continuous variables into buckets with equal size}
{Step3: one hot encoding}
{Step4: feature selection with boruta module}
{Step5: non-linear feature generation}
{Step6: random forest}
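A minimal sketch of Step 1, assuming the SVM variant of SMOTE from imbalanced-learn and a feature matrix X with binary labels y:

from imblearn.over_sampling import SVMSMOTE

# oversample the minority (abnormal) class using SVM-guided SMOTE
X_resampled, y_resampled = SVMSMOTE(random_state=42).fit_resample(X, y)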
Results: Our pipeline outperformed the existing algorithm on all evaluation metrics. In particular, the F1 score of ours is improved by 3.23% compared to the existing model.
Table 3. Comparison of performances between models

Metric    | Existing model (Random Forest) | Ours
Accuracy  | 0.96057                        | 0.97849
F1 score  | 0.93023                        | 0.96026
ROC-AUC   | 0.91988                        | 0.96876
Precision | 0.94150                        | 0.95223
Recall    | 0.91988                        | 0.96876
4.3 Classification of Question Intents of the Chatbot
KEPCO develops and runs a Korean-language chatbot service that informs employees of company regulations and legal information as part of a legal expert system. The chatbot uses a commercial engine that operates mainly on a rule-based method. In order to replace the existing rule-based engine, a new algorithm is needed to identify the intent of users' questions and match them with appropriate answers. Moreover, the new algorithm should solve the out-of-vocabulary (OOV) problem of the existing engine, in which new words that are not in the training data cannot be handled.
The Existing Model: The chatbot uses a commercial chatbot engine that operates mainly on a rule-based method.
Dataset: A dataset with a {question, answer} structure was built from a list of questions and answers about in-house precedents prepared by the in-house legal department. Through crowdsourcing, the dataset was augmented by creating question sentences with different wording but the same meaning as the existing questions. Examples are {"Please explain the relocation work of the distribution line", Answer 1}, {"When is it necessary to relocate the distribution line?", Answer 1}, and {"Who will pay for the relocation work of the distribution line?", Answer 2}. It is therefore necessary to apply a classification algorithm after embedding the text data. The total dataset consists of 35,788 questions with 332 intents as classes; that is, about 100 questions per class. The test dataset is 20% of the shuffled entire dataset.
Evaluation Metric: We evaluated the performance of the model through several evaluation metrics: accuracy, F1 score, ROC-AUC score, precision, and recall.
The final machine learning pipeline: 1 tokenization step, 2 sentence embedding steps, and a soft voting ensemble algorithm are included in the pipeline; a sketch follows the list below.
{Step1: tokenization with Okt of KoNLPy (36)}
{Step2: TF-IDF embedding}
{Step3: sentence embedding with pre-trained Korean fastText module}
{Step4: soft voting(step2, step3)}
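A minimal sketch of the pipeline, assuming lists of raw question strings and intent labels; the base classifier (logistic regression) and the fastText model file are illustrative assumptions, and the train/test split is omitted for brevity:

import numpy as np
import fasttext
from konlpy.tag import Okt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

okt = Okt()
tokenized = [" ".join(okt.morphs(q)) for q in questions]  # Step 1

X_tfidf = TfidfVectorizer().fit_transform(tokenized)      # Step 2
ft = fasttext.load_model("cc.ko.300.bin")                 # pre-trained Korean fastText
X_ft = np.stack([ft.get_sentence_vector(q) for q in questions])  # Step 3

clf_tfidf = LogisticRegression(max_iter=1000).fit(X_tfidf, labels)
clf_ft = LogisticRegression(max_iter=1000).fit(X_ft, labels)

# Step 4: soft voting, i.e. averaging the class probabilities of the two models
proba = (clf_tfidf.predict_proba(X_tfidf) + clf_ft.predict_proba(X_ft)) / 2
predictions = clf_tfidf.classes_[proba.argmax(axis=1)]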
Results: Using the soft voting ensemble method, we combined the sentence embedding based on the pre-trained fastText model, which solves the OOV problem, with the high-performing TF-IDF embedding. The voting ensemble model has an accuracy of 95.226%, which shows the possibility of replacing the existing commercial engine.
Table 4. Comparison of performances between models

Metric    | TF-IDF  | fastText | TF-IDF + fastText
Accuracy  | 0.94191 | 0.89434  | 0.95226
F1 score  | 0.93503 | 0.88978  | 0.93494
ROC-AUC   | 0.97075 | 0.94881  | 0.97073
Precision | 0.93473 | 0.89043  | 0.93473
Recall    | 0.93885 | 0.89524  | 0.93880
|