Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books - Computer Science > Digital Libraries





Abstract: Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to identify automatically a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
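
To make the idea of "frequent occurrences" concrete, here is a minimal Python sketch. It is not the authors' algorithm, only an illustration under simplifying assumptions: a line counts as boilerplate if its normalized form appears in at least `min_docs` documents, and only leading and trailing runs of such lines are removed. The function names (`normalize`, `frequent_lines`, `strip_boilerplate`) and the `min_docs` threshold are hypothetical and chosen for this example.

```python
from collections import Counter


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(line.lower().split())


def frequent_lines(docs: list[str], min_docs: int) -> set[str]:
    """Return normalized lines that occur in at least `min_docs` documents."""
    counts: Counter[str] = Counter()
    for text in docs:
        # Count each distinct line once per document (document frequency).
        seen = {normalize(line) for line in text.splitlines()}
        seen.discard("")
        counts.update(seen)
    return {line for line, n in counts.items() if n >= min_docs}


def strip_boilerplate(text: str, frequent: set[str]) -> str:
    """Drop the leading and trailing runs of blank or frequent lines."""
    lines = text.splitlines()

    def is_boilerplate(line: str) -> bool:
        norm = normalize(line)
        return norm == "" or norm in frequent

    start = 0
    while start < len(lines) and is_boilerplate(lines[start]):
        start += 1
    end = len(lines)
    while end > start and is_boilerplate(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])


# Toy corpus (invented for illustration, not real Project Gutenberg headers).
docs = [
    "The Project Gutenberg EBook\n\nCall me Ishmael.\nSome years ago.\nEnd of Project Gutenberg",
    "The Project Gutenberg EBook\nIt was the best of times.\nEnd of Project Gutenberg",
    "the project gutenberg  ebook\nAlice was beginning.\nEnd of Project Gutenberg",
]
freq = frequent_lines(docs, min_docs=2)
body = strip_boilerplate(docs[0], freq)
```

A purely frequency-based filter like this one would miss hand-typed variants of a header, which is consistent with the abstract's remark that some documents require knowledge of English.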



Authors: Owen Kaser, Daniel Lemire

Source: https://arxiv.org/






