
Machine learning for Asian language text classification

Fuchun Peng (Yahoo! Inc., Sunnyvale, California, USA)
Xiangji Huang (School of Information Technology, York University, Toronto, Canada)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 May 2007


Abstract

Purpose

This research compares several machine learning techniques for text classification in Asian languages such as Chinese and Japanese, whose written text carries no word boundary information. The paper advocates a simple approach based on language modeling for this task.

Design/methodology/approach

Naïve Bayes, maximum entropy, support vector machine, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, several word segmentation approaches were applied to Chinese text, and the segmentation‐based approach was compared with the non‐segmentation‐based approach.
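
As a rough illustration of a non‐segmentation‐based language modeling classifier, the sketch below trains one character bigram model per class and assigns a document to the class whose model gives it the highest log probability. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `CharNgramLMClassifier`, the add-one smoothing, the bigram order, and the toy training data are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return one character n-gram per character, padding the start with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class CharNgramLMClassifier:
    """One add-one-smoothed character n-gram model per class (illustrative only)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.doc_counts = Counter()                 # class -> number of training documents
        self.vocab = set()                          # characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[label][gram] + 1
            den = self.context_counts[label][gram[:-1]] + v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda c: self.log_score(text, c))


# Toy data, invented for this example; no word segmentation is performed.
train_texts = ["股票市场上涨", "股票价格下跌", "足球比赛开始", "足球队赢得比赛"]
train_labels = ["finance", "finance", "sports", "sports"]
clf = CharNgramLMClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("股票上涨"))
```

Because the model works directly on characters, it needs no segmenter, which is what makes this family of approaches attractive for Chinese and Japanese text.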

Findings

There were two findings. First, the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features. Second, classification with word-level features normally yields better performance, but classification performance is not monotonically related to segmentation accuracy: it may initially improve as segmentation accuracy increases, but beyond a certain level of segmentation accuracy it stops improving and can even decrease.
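
To make the distinction between word-level and character-level features concrete, the sketch below extracts both kinds from the same string. The abstract does not say which segmenters were used, so the greedy forward maximum matching segmenter, the function names, and the toy lexicon here are assumptions chosen only for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; unknown characters become one-character words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop always advances.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def char_bigrams(text):
    """Overlapping character bigrams, usable as features without any segmentation."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical lexicon; a real segmenter would use a far larger dictionary or a trained model.
lexicon = {"机器", "学习", "机器学习", "文本", "分类"}
print(forward_max_match("机器学习文本分类", lexicon))
print(char_bigrams("文本分类"))
```

A segmenter of this kind makes errors whenever the lexicon is incomplete or a string is ambiguous, which is one way segmentation accuracy enters the classification pipeline.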

Practical implications

Applying the findings to real web text classification is ongoing work.

Originality/value

The paper is relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.

Citation

Peng, F. and Huang, X. (2007), "Machine learning for Asian language text classification", Journal of Documentation, Vol. 63 No. 3, pp. 378-397. https://doi.org/10.1108/00220410710743306

Publisher

Emerald Group Publishing Limited

Copyright © 2007, Emerald Group Publishing Limited
