Large-Scale Probabilistic Information Integration from Web Tables
01.03.2007 - 30.11.2009
Research funding project
The goal of this thesis is to advance the state of the art in probabilistic information integration from web tables. The research builds upon our previous work, a state-of-the-art web table recognition system, and aims at the following contributions:
(1) Table data model: A principal characteristic of an automated large-scale information extraction system is that it makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. This approach demands a new data model for tables that allows multiple interpretations of the contained data to exist simultaneously. Our aim is to develop such a table data model, one that separates the data itself from alterable schemata.
(2) Probabilistic table integration: A recent shift from machine-learning and logic- or rule-based web information extraction systems to probabilistic approaches can be observed. Most of these approaches focus on creating statistical relational models from unstructured text. As we have at our disposal a table recognition system with previously unattained accuracy, our aim, in contrast, is to develop probabilistic methods for information extraction from a large number of web tables.
(3) Optimizing the value of human interaction: Most automatic and semi-automatic information extraction approaches involve humans at the beginning of the process, either to define rules or to train the system prior to the extraction step. Completely automatic approaches that use bootstrapping methods to remove the human from the loop altogether pay the high cost of poor coverage in order to attain an acceptable level of precision. Our aim is to develop a framework and an appropriate method that show how to leverage the limited and valuable resource of human interaction in the most efficient way and thereby combine the high recall of scalable automatic approaches with the high precision that human interaction provides.
People
Project leader
Georg Gottlob
(E184)
Project staff
Wolfgang Karl Gatterbauer
(E184)
Institute
E184 - Institut für Informationssysteme
Funding
FFG - Österreichische Forschungsförderungsgesellschaft mbH (National)
Research focus areas
Information and Communication Technology
Keywords
web information extraction
web tables
probabilistic information integration