AUTOMATING DATA EXTRACTION FROM TEXT USING XML TAGGING

CURRY, Gordon B., Earth Sciences, Centre for Geosciences, Univ of Glasgow, Gregory Building, Lilybank Gardens, Glasgow, G12 8QQ and CONNOR, Richard, Computer and Information Sciences, Univ of Strathclyde, Livingstone Tower, Richmond Street, Glasgow, G1 1XH, United Kingdom, g.curry@earthsci.gla.ac.uk

Despite the advent of the age of Information Technology, much valuable Earth Science information is not available digitally. The main bottleneck in acquiring such digital information is the enormous and often unrewarding effort needed to enter data manually into computer databases. However, recent advances in computing technology open up the possibility of automating or semi-automating the digitisation of significant sections of this extensive and valuable information. The process exploits stylistic and organisational conventions in text documents, allowing the preparation of dedicated software that automatically scans the text and generates XML tags around discrete subsections of the information being presented. Such tags allow complex queries to be run across the information, because computers can use them to find records that fulfil multiple search criteria. The ability to execute complex queries is the main justification for the existence of databases; this new approach to data extraction, storage, and retrieval represents an intermediate stage in what should be seen as a spectrum of data processing techniques. The main advantages are speed (no manual data entry), fidelity (no recoding of information), completeness (none of the original information is lost) and flexibility (information of any nature can be tagged irrespective of whether or not it is fully consistent with a standardised format). Automatic tagging of text makes digitisation practicable for a wide range of information that is currently not available in this format. It can be applied to formal taxonomic descriptions of species (a rich source of important biodiversity information) or to any description (rocks, minerals, etc.) that follows a standardised format. The talk will demonstrate automatically tagged taxonomic descriptions from paleontological monographs, showing how complex queries involving morphology, stratigraphy, and biogeography can be run across the tagged text. The tagged information can be readily displayed using web browsers, and can be made available via the World Wide Web to greatly increase the range of users.
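As a rough illustration of the idea, and not the authors' software, the following Python sketch tags labelled fields in a description and then runs a query with multiple criteria across the tagged records. It assumes a hypothetical convention in which each field starts on its own line with a label and a colon; the field labels, the taxon names, the sample text, and the helper functions tag_description and query are all invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical field labels; a real monograph would follow its own conventions.
FIELD_LABELS = ("Diagnosis", "Stratigraphy", "Distribution")

# A known label at the start of a line, a colon, then the field text.
FIELD_RE = re.compile(
    r"^(%s):\s*(.+)$" % "|".join(FIELD_LABELS), re.MULTILINE
)


def tag_description(name, text):
    """Wrap each labelled field of one description in an XML element."""
    record = ET.Element("description", {"taxon": name})
    for label, body in FIELD_RE.findall(text):
        field = ET.SubElement(record, label.lower())
        field.text = body.strip()  # the original wording is kept as written
    return record


def query(records, **criteria):
    """Return taxa whose tagged fields contain every requested term."""
    hits = []
    for rec in records:
        if all(
            term.lower() in (rec.findtext(tag) or "").lower()
            for tag, term in criteria.items()
        ):
            hits.append(rec.get("taxon"))
    return hits


# Invented sample descriptions, for illustration only.
SAMPLES = {
    "Exemplaria ficta": (
        "Diagnosis: Shell biconvex, ribbed.\n"
        "Stratigraphy: Silurian.\n"
        "Distribution: Scotland."
    ),
    "Exemplaria altera": (
        "Diagnosis: Shell smooth, biconvex.\n"
        "Stratigraphy: Devonian.\n"
        "Distribution: Scotland."
    ),
}

records = [tag_description(name, text) for name, text in SAMPLES.items()]

# Show the tagged form of the first record, then run a two-criterion query.
print(ET.tostring(records[0], encoding="unicode"))
print(query(records, diagnosis="ribbed", distribution="Scotland"))
```

Because the tags wrap the original wording instead of recoding it, nothing is lost in the tagged form, and a field that does not fit the assumed labels is simply left untagged.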

Session No. 113

T112. Geologic Time and CHRONOS: Databases, Tools, Outreach, Education, and the Geoinformatics Revolution II

Monday, 8 November 2004: 1:30 PM-5:30 PM

Geological Society of America Abstracts with Programs. Vol. 36, No. 5, p.272

© Copyright 2004 The Geological Society of America (GSA), all rights reserved. Permission is hereby granted to the author(s) of this abstract to reproduce and distribute it freely, for noncommercial purposes. Permission is hereby granted to any individual scientist to download a single copy of this electronic file and reproduce up to 20 paper copies for noncommercial purposes advancing science and education, including classroom use, providing all reproductions include the complete content shown here, including the author information. All other forms of reproduction and/or transmittal are prohibited without written permission from GSA Copyright Permissions.

2004 Denver Annual Meeting (November 7–10, 2004)