Efficient construction of metadata-enhanced web corpora

* Corresponding author. Affiliations: 1 Berlin-Brandenburg Academy of Sciences; 2 OeAW - Austrian Academy of Sciences

Abstract: Metadata extraction is known to be a problem in general-purpose Web corpora, and so is extensive crawling with little yield. The contributions of this paper are threefold: a method to find and download large numbers of WordPress pages; a targeted extraction of content featuring much-needed metadata; and an analysis of the documents in the corpus with insights into actual blog uses. The study focuses on the publishing software WordPress, which allows for reliable extraction of structural elements such as metadata, posts, and comments. The download of about 9 million documents in the course of two experiments yields, after processing, 2.7 billion tokens with usable metadata. This comparatively high yield is a step towards more efficiency with respect to machine power and "Hi-Fi" web corpora. The resulting corpus complies with formal requirements on metadata-enhanced corpora and on weblogs considered as a series of dated entries. However, existing typologies of Web texts have to be revised in the light of this hybrid genre.

Keywords: Web for Corpus; Corpus Linguistics; Web Corpus Construction; Focused Crawling

Author: Adrien Barbaresi

Source: https://hal.archives-ouvertes.fr/

