Collection, Description, and Visualization of the German Reddit Corpus





1 OeAW - Austrian Academy of Sciences 2 Berlin-Brandenburg Academy of Sciences

Abstract: Reddit is a major social bookmarking and microblogging platform. An extensive dataset of Reddit comments has recently been made publicly available. I use a two-tiered filter to single out comments in German in order to build a linguistic corpus, which is then tokenized and annotated. This article offers first insights into both the nature and the quality of the data at the lexical level. Additionally, a visualization gives an impression of the possible geographical distribution of German users of the platform.

Keywords: Computer-mediated Communication, Web corpus construction, Information Visualization, Language Identification





Author: Adrien Barbaresi

Source: https://hal.archives-ouvertes.fr/






