Large-Scale Probabilistic Information Integration from Web Tables

01.03.2007 - 30.11.2009
The goal of this thesis is to advance the state of the art in probabilistic information integration from web tables. The research builds upon our previous work, a state-of-the-art web table recognition system, and aims at the following contributions:

(1) Table data model: A principal characteristic of an automated large-scale information extraction system is that it makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. This approach demands a new data model for tables that allows multiple interpretations of the contained data to exist simultaneously. Our aim is to develop such a table data model, one that separates the data itself from alterable schemata.

(2) Probabilistic table integration: A recent shift from machine-learning and logic- or rule-based web information extraction systems to probabilistic approaches can be observed. Most of these approaches focus on creating statistical relational models from unstructured text. As we have at our disposal a table recognition system with previously unattained accuracy, our aim, in contrast, is to develop probabilistic methods for information extraction from a large number of web tables.

(3) Optimizing the value of human interaction: Most automatic and semi-automatic information extraction approaches involve humans at the beginning of the process, either to define rules or to train the system prior to the extraction step. Completely automatic approaches that use bootstrapping methods to remove the human from the loop altogether pay the high cost of poor coverage in order to attain an acceptable level of precision. Our aim is to develop a framework and an appropriate method that show how to leverage the limited and valuable resource of human interaction in the most efficient way and thereby combine the high recall of scalable automatic approaches with the high precision gained through human interaction.
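
As a rough illustration of the table data model aimed at in contribution (1), the following sketch stores the cell grid once and attaches several alternative schemata to it, each with a confidence value. All class names, field names, and probabilities here are hypothetical and chosen only for illustration; the project description does not specify the actual model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Schema:
    """One hypothetical interpretation of a table (assumed structure)."""
    columns: tuple[str, ...]  # attribute names assigned to the columns
    header_rows: int          # leading rows treated as header, not data
    probability: float        # assumed confidence in this interpretation


@dataclass
class WebTable:
    """Raw cells are stored once; schemata are kept separate and alterable."""
    cells: list[list[str]]
    schemata: list[Schema] = field(default_factory=list)

    def tuples(self, schema: Schema) -> list[dict[str, str]]:
        """Read the same cells as relational tuples under one interpretation."""
        data_rows = self.cells[schema.header_rows:]
        return [dict(zip(schema.columns, row)) for row in data_rows]

    def weighted_tuples(self):
        """Yield (probability, tuple) pairs across all interpretations."""
        for schema in self.schemata:
            for t in self.tuples(schema):
                yield schema.probability, t


# Toy data: the first row may or may not be a header.
table = WebTable(
    cells=[["Name", "Value"], ["a", "1"], ["b", "2"]],
    schemata=[
        Schema(("name", "value"), header_rows=1, probability=0.9),
        Schema(("col0", "col1"), header_rows=0, probability=0.1),
    ],
)
```

Because the cells are never rewritten, a schema can be added, dropped, or reweighted later without touching the extracted data.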






  • FFG - Österreichische Forschungsförderungsgesellschaft mbH (Austrian Research Promotion Agency), national; funder type: research funding institution


  • Information and Communication Technology


web information extraction
web tables
probabilistic information integration