GraphWrap - Graph-Based Wrapping from PDF Documents

01.05.2008 - 28.02.2010
This project aims to investigate a new method of supervised information extraction from unstructured documents such as PDF files. It builds on the results and knowledge gained in the NextWrap project, where we devised a basic graph structure to represent the physical objects on the page. In GraphWrap we plan to investigate the use of graph matching techniques to wrap data directly from this graph structure, instead of from an intermediate representation. This brings several tangible benefits. First, it enables both the purely geometric and the logical structure to be used for locating instances of data to be wrapped. Second, it is far more intuitive for the user to interact with, giving the impression of "wrapping directly on the document". Third, it is less rigid, enabling the document understanding process to be partly influenced during wrapping. In this way, we plan to overcome the greatest limitations of current PDF wrapping approaches, which use an intermediate representation.

The main contributions of this work are as follows:

  • A suitable graph representation of a PDF file, enhanced from our current representation to include logical as well as geometric relations between nodes.
  • A suitable error-tolerant graph matching algorithm, which can locate the desired instances on the document in reasonable time and is invariant to common changes in document structure.
  • An intuitive prototype user interface, using both rendered and graph views of the page, where the user can select desired example instances, fine-tune wrapper parameters and receive immediate feedback on the result, allowing user interaction with the system to be studied.






  • Österreichische Forschungsförderungsgesellschaft mbH (FFG)


  • Information and Communication Technology


graph matching
Information Extraction