n-Gram-Based Text Compression





Computational Intelligence and Neuroscience, Volume 2016 (2016), Article ID 9483646, 11 pages

Research Article

Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam

Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic

Received 21 May 2016; Revised 2 August 2016; Accepted 25 September 2016

Academic Editor: Geun S. Jo

Copyright © 2016 Vu H. Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. On the same dataset, it achieves a significantly better compression ratio than state-of-the-art methods. Given a text, the proposed method first splits it into n-grams and then encodes them using the n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
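The encoding idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes whitespace tokenization, a greedy longest-match rule (try the five-gram window first, then shrink), and toy in-memory dictionaries built from a tiny sample. The function names `build_dictionaries`, `encode`, and `decode` are hypothetical, and the two-to-four-byte packing of codes is left out; each code is simply a pair `(n, index)`.

```python
from typing import Dict, List, Tuple

MAX_N = 5  # largest window size named in the abstract (five-grams)

Gram = Tuple[str, ...]
Code = Tuple[int, object]  # (n, dictionary index) or (0, literal token)


def build_dictionaries(corpus_tokens: List[str], max_n: int = MAX_N) -> Dict[int, Dict[Gram, int]]:
    """Build one toy dictionary per n, mapping each n-gram to an integer index."""
    dicts: Dict[int, Dict[Gram, int]] = {n: {} for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(corpus_tokens) - n + 1):
            gram = tuple(corpus_tokens[i:i + n])
            # Assign the next free index the first time a gram is seen.
            dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(tokens: List[str], dicts: Dict[int, Dict[Gram, int]], max_n: int = MAX_N) -> List[Code]:
    """Greedy longest-match encoding (an assumption, not the paper's exact rule)."""
    out: List[Code] = []
    i = 0
    while i < len(tokens):
        # Try the widest window that still fits, then shrink down to a unigram.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            idx = dicts[n].get(tuple(tokens[i:i + n]))
            if idx is not None:
                out.append((n, idx))
                i += n
                break
        else:
            # No dictionary contains this token: emit it as a literal.
            out.append((0, tokens[i]))
            i += 1
    return out


def decode(codes: List[Code], dicts: Dict[int, Dict[Gram, int]]) -> List[str]:
    """Invert encode() by looking each index up in the reversed dictionaries."""
    reverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    tokens: List[str] = []
    for n, value in codes:
        if n == 0:
            tokens.append(value)  # literal token
        else:
            tokens.extend(reverse[n][value])
    return tokens


dicts = build_dictionaries("a b c d e f".split())
text = "a b c d e x f".split()
codes = encode(text, dicts)
restored = decode(codes, dicts)
```

In this toy run the first five tokens collapse into a single five-gram code, the unknown token `x` is kept as a literal, and decoding restores the original token list. A real encoder would also need a compact byte layout for the codes and a policy for out-of-dictionary text, neither of which the abstract specifies.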





Authors: Vu H. Nguyen, Hien T. Nguyen, Hieu N. Duong, and Vaclav Snasel

Source: https://www.hindawi.com/






