Title |
Text Encoding and Language Identification via N-Gram Feature and UTF-8 Encoding Detection |
Authors |
홍채희(Chaehui Hong) ; 조현지(Hyunji Cho) ; 유훈(Hoon Yoo) |
DOI |
https://doi.org/10.5370/KIEE.2024.73.9.1574 |
Keywords |
Language Identification; Text Data Encoding; N-gram; Machine Learning; UTF-8; Code Page |
Abstract |
This paper presents a method for automatically identifying the encoding and language of documents. Online, encoding and language identification helps users reach the information they need and makes data processing and analysis more efficient. The proposed method first determines whether a document is encoded in UTF-8 by analyzing the bit patterns of its bytes. If the document is UTF-8, its language is identified by computing the percentage of each language in the document through Unicode range analysis. If the document is not UTF-8, it is treated as a code page document, and the languages it contains are identified by a machine learning technique that uses N-gram features. Experimental results indicate that the proposed method improves encoding and language identification performance compared with existing methods. |
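
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`is_utf8`, `script_shares`, `byte_ngrams`), the `SCRIPT_RANGES` table, and the choice of byte bigrams are all assumptions made for illustration. The Unicode ranges identify scripts rather than languages, so they only approximate the per-language percentages the abstract mentions, and the classifier that would consume the N-gram features is not shown because the abstract does not specify it.

```python
from collections import Counter

def is_utf8(data: bytes) -> bool:
    """Check UTF-8 well-formedness from lead and continuation bit patterns."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b & 0x80 == 0x00:      # 0xxxxxxx: single byte
            need, cp, low = 0, b, 0x00
        elif b & 0xE0 == 0xC0:    # 110xxxxx: 2-byte sequence
            need, cp, low = 1, b & 0x1F, 0x80
        elif b & 0xF0 == 0xE0:    # 1110xxxx: 3-byte sequence
            need, cp, low = 2, b & 0x0F, 0x800
        elif b & 0xF8 == 0xF0:    # 11110xxx: 4-byte sequence
            need, cp, low = 3, b & 0x07, 0x10000
        else:                     # stray continuation byte or invalid lead
            return False
        if i + need >= n:         # sequence is truncated
            return False
        for k in range(1, need + 1):
            c = data[i + k]
            if c & 0xC0 != 0x80:  # continuation bytes must be 10xxxxxx
                return False
            cp = (cp << 6) | (c & 0x3F)
        # Reject overlong forms, surrogates, and out-of-range code points.
        if cp < low or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            return False
        i += need + 1
    return True

# Hypothetical, deliberately incomplete script table (an assumption).
SCRIPT_RANGES = {
    "Hangul": [(0x1100, 0x11FF), (0x3130, 0x318F), (0xAC00, 0xD7A3)],
    "Kana": [(0x3040, 0x30FF)],
    "CJK": [(0x4E00, 0x9FFF)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}

def script_shares(text: str) -> dict:
    """Percentage of recognized characters that fall in each script range."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for script, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[script] += 1
                break
    total = sum(counts.values())
    return {s: 100.0 * c / total for s, c in counts.items()} if total else {}

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte N-grams, a possible feature for code page text."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))
```

In such a sketch, a document that passes `is_utf8` would be decoded and passed to `script_shares`, while any other document would be converted to `byte_ngrams` counts and handed to a trained classifier. In practice, Python's built-in strict decoder (`data.decode("utf-8")`) performs the same well-formedness check; the manual version is shown only to make the bit patterns explicit.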