2004 Denver Annual Meeting (November 7–10, 2004)

Paper No. 4
Presentation Time: 2:15 PM

AUTOMATING DATA EXTRACTION FROM TEXT USING XML TAGGING


CURRY, Gordon B., Earth Sciences, Centre for Geosciences, Univ of Glasgow, Gregory Building, Lilybank Gardens, Glasgow, G12 8QQ and CONNOR, Richard, Computer and Information Sciences, Univ of Strathclyde, Livingstone Tower, Richmond Street, Glasgow, G1 1XH, United Kingdom, g.curry@earthsci.gla.ac.uk

Despite the advent of the age of Information Technology, much valuable Earth Science information is not available digitally. The main bottleneck in acquiring such digital information is the enormous and often unrewarding effort needed to manually enter data into computer databases. However, recent advances in computing technology open up the possibility of automating or semi-automating the digitisation of significant sections of this extensive and valuable information. The process exploits stylistic and organisational conventions in text documents, allowing the preparation of dedicated software that automatically scans the text and generates XML tags around discrete subsections of the information being presented. Such tags allow complex queries to be run across the information, because computers can use them to find records that fulfil multiple search criteria. The ability to execute complex queries is the main justification for the existence of databases; this new approach to data extraction, storage, and retrieval represents an intermediate stage in what should be seen as a spectrum of data processing techniques. The main advantages are speed (no manual data entry), fidelity (no recoding of information), completeness (none of the original information is lost) and flexibility (information of any nature can be tagged irrespective of whether or not it is fully consistent with a standardised format). Automatic tagging of text makes digitisation practicable for a wide range of information that is currently not available in this format. It can be applied to formal taxonomic descriptions of species (a rich source of important biodiversity information) or to any description (rocks, minerals, etc.) that follows a standardised format. The talk will demonstrate automatically tagged taxonomic descriptions from paleontological monographs, showing how complex queries involving morphology, stratigraphy, and biogeography can be run across the tagged text.
The tagged information can be readily displayed using web browsers, and can be made available via the web, greatly increasing the range of potential users.
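
The tagging and querying idea described above can be illustrated with a minimal sketch. It is not the authors' software: the "Label: value." layout, the sample records, the element names, and the helper functions tag_descriptions and query are all assumptions made up for illustration, and real monograph text would need far more elaborate rules.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample records in an assumed "Label: value." layout.
# They are illustrative only and are not taken from any monograph.
RAW = """\
Genus: Terebratulina. Species: retusa. Shell: biconvex, finely ribbed. Horizon: Recent. Locality: Firth of Clyde, Scotland.
Genus: Lingula. Species: anatina. Shell: elongate, smooth. Horizon: Recent. Locality: Japan.
Genus: Atrypa. Species: reticularis. Shell: biconvex, coarsely ribbed. Horizon: Silurian. Locality: Gotland, Sweden.
"""

# A field is a label, a colon, and a value that ends at a full stop
# followed by either the next label or the end of the line.
FIELD = re.compile(r"(\w+):\s*(.+?)\.(?=\s+\w+:|\s*$)")


def tag_descriptions(text):
    """Wrap each labelled field of each description in an XML element."""
    root = ET.Element("descriptions")
    for line in text.strip().splitlines():
        taxon = ET.SubElement(root, "taxon")
        for label, value in FIELD.findall(line):
            # The original wording is kept verbatim inside the tag.
            ET.SubElement(taxon, label.lower()).text = value.strip()
    return root


def query(root, **criteria):
    """Return names of taxa whose tagged fields contain every search term."""
    hits = []
    for taxon in root.iter("taxon"):
        if all(
            needle.lower() in (taxon.findtext(tag) or "").lower()
            for tag, needle in criteria.items()
        ):
            hits.append(f"{taxon.findtext('genus')} {taxon.findtext('species')}")
    return hits


root = tag_descriptions(RAW)
# A query combining a morphological and a stratigraphic criterion.
print(query(root, shell="biconvex", horizon="Recent"))
# The tagged text itself, ready to be stored or served.
print(ET.tostring(root, encoding="unicode"))
```

Because the tags wrap the original wording rather than recoding it, the source text is retained in full, and a query is simply a set of per-tag search terms that must all match.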