Yuliana, Oviliani Yenty and Chang, Chia-Hui (2016) AFIS: Annotation-Free Induction of Full Schema for Detail-Pages. In: The 2016 Conference on Technologies and Applications of Artificial Intelligence, 27-11-2016 - 27-11-2016, Hsinchu - Taiwan.
Abstract
Web data extraction is an essential task for web data integration. Most research focuses on data extraction from list-pages by detecting data-rich sections and segmenting record boundaries. In that setting, aligning data within records is a small-scale problem, since only a couple of data attributes need to be aligned. However, for detail-pages, which contain all-inclusive product information in each page, the number of data attributes that need to be aligned is much larger. In this paper, we formulate the data extraction problem as the alignment of leaf nodes from DOM trees. We propose AFIS, an Annotation-Free Induction of full Schema for detail-pages. AFIS applies Divide-and-Conquer and Longest Increasing Sequence (LIS) algorithms to mine landmarks from the input. The experiments show that AFIS outperforms RoadRunner, FivaTech and TEX (with precision 0.994, recall 0.987, and F1 0.990) in terms of selected (data) columns. For full schema evaluation (all data columns), AFIS also achieves the highest average performance (with precision 0.946, recall 0.930, and F1 0.937) compared with TEX and RoadRunner.
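The abstract names LIS as one ingredient of landmark mining but does not spell out the procedure. The sketch below is only an illustration of the general idea, not the AFIS algorithm itself. It assumes that "LIS" can be read as the classic longest increasing subsequence problem, that each page is reduced to a list of leaf-node texts, and that candidate landmarks are texts occurring exactly once in each page. The function names `longest_increasing_subsequence` and `candidate_landmarks` and the sample pages are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq in O(n log n)."""
    tail_idx = []   # tail_idx[k]: index in seq of the smallest tail of a run of length k+1
    tail_val = []   # tail_val[k]: the value at that index (kept sorted for bisect)
    prev = [-1] * len(seq)  # back-pointers used to rebuild the subsequence
    for i, x in enumerate(seq):
        k = bisect_left(tail_val, x)
        if k == len(tail_val):
            tail_idx.append(i)
            tail_val.append(x)
        else:
            tail_idx[k] = i
            tail_val[k] = x
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    out = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        out.append(seq[i])
        i = prev[i]
    return out[::-1]


def candidate_landmarks(page_a, page_b):
    """Texts unique to both pages whose relative order agrees in both (assumed criterion)."""
    unique_a = {t for t in page_a if page_a.count(t) == 1}
    pos_b = {t: j for j, t in enumerate(page_b) if page_b.count(t) == 1}
    shared = [t for t in page_a if t in unique_a and t in pos_b]
    # Positions in page_b, listed in page_a order; an increasing run is order-consistent.
    keep = set(longest_increasing_subsequence([pos_b[t] for t in shared]))
    return [t for t in shared if pos_b[t] in keep]


# Hypothetical leaf-node texts from two detail-pages of the same site.
page_a = ["Title:", "Dune", "Price:", "$9", "Author:", "Herbert"]
page_b = ["Title:", "Emma", "Author:", "Austen", "Price:", "$7"]
landmarks = candidate_landmarks(page_a, page_b)
```

In this toy input the labels "Price:" and "Author:" appear in different orders on the two pages, so only an order-consistent subset of the shared labels is kept. A divide-and-conquer scheme could then align the leaf nodes between consecutive landmarks separately, though how AFIS does this is not described in the abstract.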
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Additional Information: | The Turnitin check was carried out only after the paper was published |
Uncontrolled Keywords: | Web data extraction, Semi-structured data, Detail-pages alignment, Divide-conquer alignment, Landmark equivalence class |
Subjects: | Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4450 Databases; Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources; Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science |
Divisions: | Graduate Program > Economic Management |
Depositing User: | Admin |
Date Deposited: | 12 Nov 2023 00:33 |
Last Modified: | 12 Dec 2023 21:13 |
URI: | https://repository.petra.ac.id/id/eprint/20670 |