NEXTWRAP Next Generation Web Wrapper Technologies
01.01.2005 - 31.12.2007
Research funding project
This project aims at significant scientific and technological improvements of Web information extraction and annotation technology. Current systems for automated Web information extraction allow an application designer to visually specify extraction patterns on sample HTML documents. Pattern instances are then automatically extracted from production documents and translated into XML. In this project we want to pave the way to a next generation extraction technology by performing basic and experimental research towards the following goals:
- Enabling visual data extraction from poorly structured sources such as plain character documents, IBM 3270 screen images, and PDF documents. Based on general and specific document structure ontologies, algorithms and methods will be developed for imposing a tree structure on such source documents, thus making them accessible to visual wrapping methods.
- Enabling a visual information extraction system to deliver information into RDF repositories and other ontological knowledge bases through a tight coupling of the system's pattern hierarchy with an ontological mapping mechanism. This would result in the first extraction technology able to deliver knowledge content directly.
- Enabling the automated correction of tree-based wrappers in case of major changes in the structure of the input document(s). So far, techniques for automated adaptation and repair of wrappers have been considered for text-based grammatical wrappers only. A major research effort, and an improved understanding of "change ontologies", is necessary before repair techniques can be developed for the more powerful tree-based wrappers.
- Investigating new interface paradigms for wrapper generation. Currently, the specification of nontrivial data extraction is very hard for non-experts. We want to investigate novel simplified interfaces to facilitate wrapper construction by lay users.
At the same time, we want to facilitate the creation and maintenance of community-based ontologies. All tasks are centred on tree-based wrapping and have as their common denominator the use of ontologies. We will develop a strong competence as a team of researchers in establishing a common ontological framework that we believe will form the basis of next generation extraction technology.
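To make the notion of a tree-based wrapper concrete, here is a minimal sketch in Python of the workflow described above: patterns are specified against a sample document, and pattern instances are extracted and translated into XML. It is an illustration under stated assumptions, not the project's or any partner's implementation. The sample page, the pattern names (`RECORD_PATTERN`, `FIELD_PATTERNS`), the `wrap` function, and the output element names are all hypothetical, and the input is assumed to be well-formed XHTML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (assumed well-formed XHTML).
PAGE = """
<html><body>
  <table>
    <tr class="head"><td>Title</td><td>Price</td></tr>
    <tr class="item"><td>Book A</td><td>12.50</td></tr>
    <tr class="item"><td>Book B</td><td>8.00</td></tr>
  </table>
</body></html>
"""

# Hypothetical two-level pattern hierarchy: the parent pattern selects
# record nodes in the document tree, and each child pattern selects a
# field relative to its record node.
RECORD_PATTERN = ".//tr[@class='item']"
FIELD_PATTERNS = {"title": "td[1]", "price": "td[2]"}


def wrap(page: str) -> ET.Element:
    """Extract all pattern instances from the page and return them as XML."""
    tree = ET.fromstring(page.strip())
    out = ET.Element("records")
    for node in tree.findall(RECORD_PATTERN):
        record = ET.SubElement(out, "record")
        for name, path in FIELD_PATTERNS.items():
            field = node.find(path)
            if field is not None:
                ET.SubElement(record, name).text = (field.text or "").strip()
    return out


result = wrap(PAGE)
xml_text = ET.tostring(result, encoding="unicode")
print(xml_text)
```

Because the patterns address positions in the document tree rather than character offsets, a structural change to the page (for example, an added column) would break `td[2]`, which is the kind of failure the wrapper repair goal is concerned with.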
People
Project leader
Reinhard Pichler
(E184)
Project staff
Robert Baumgartner
(E184)
Robert Julian Chandradoss
(E184)
Tamir Hassan
(E184)
Bibiana Kristoficova
(E184)
Daniil Kurushin
(E184)
Peter Szinek
(E184)
Institute
E184 - Institut für Informationssysteme
Funding
Österreichische Forschungsförderungsgesellschaft mbH (FFG) (National)
Research focus areas
Information and Communication Technology
Keywords
Web information extraction
Unstructured documents
External partners
Technische Universität Graz, Institut für Softwaretechnologie
Lixto Software GmbH