Abstract: Focused web harvesting is deployed to build automated and comprehensive index databases as an alternative approach to virtual topical data integration. The web harvesting is implemented and extended by not only specifying the targeted URLs, but also predefining human-edited harvesting parameters to improve speed and accuracy. The harvesting parameter set comprises three main components. First, the depth scale: the depth of the final pages containing the desired information, counted from the first page at the targeted URL. Second, the focus-point number, which determines the exact box containing the relevant information. Lastly, the combination of keywords used to recognize hyperlinks to relevant images or full texts embedded in those final pages. All parameters are accessible and fully customizable for each target by the administrators of participating institutions through an integrated web interface. A real implementation, the Indonesian Scientific Index, which covers all scientific information across Indonesia, is also briefly introduced.
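
The three-part parameter set described in the abstract could be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names HarvestParams, FocusBoxLinks and relevant_links are hypothetical, the "box" is assumed to be the n-th <div> element in document order, and a hyperlink is assumed to be relevant when it contains every keyword. The depth value is stored only; the crawl from the start page down to the final pages is omitted.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class HarvestParams:
    """Hypothetical per-target parameter set (field names are illustrative)."""
    depth: int        # link levels from the first page to the final pages
    focus_point: int  # assumed: 1-based index of the <div> holding the content
    keywords: tuple   # assumed: every keyword must occur in a link to keep it


class FocusBoxLinks(HTMLParser):
    """Collect href/src values found inside the n-th <div> of a page."""

    def __init__(self, focus_point):
        super().__init__()
        self.focus_point = focus_point
        self.seen = 0        # number of <div> openings seen so far
        self.open_depth = 0  # <div> nesting inside the focus box (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.seen += 1
            if self.open_depth:
                self.open_depth += 1   # nested <div> inside the focus box
            elif self.seen == self.focus_point:
                self.open_depth = 1    # entering the focus box
        elif self.open_depth and tag in ("a", "img"):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_depth:
            self.open_depth -= 1


def relevant_links(html, params):
    """Return links in the focus box that contain all configured keywords."""
    parser = FocusBoxLinks(params.focus_point)
    parser.feed(html)
    return [link for link in parser.links
            if all(k.lower() in link.lower() for k in params.keywords)]
```

For a final page whose second <div> holds the record, a call such as relevant_links(page_html, HarvestParams(depth=2, focus_point=2, keywords=("fulltext",))) would keep only the links in that box whose address contains "fulltext". Restricting the search to one box is what lets the keyword match stay simple, since navigation and advertisement links elsewhere on the page are never examined.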

Author: Z. Akbar, L.T. Handoko

Source: https://arxiv.org/

