TDWG 2016 has ended
Tuesday, December 6 • 09:00 - 09:15
Geographic entities extraction from biological textual sources


This work explores and applies entity extraction techniques to identify and encode the geographical locations mentioned in the geographic distribution sections of botanical documents, such as the manual of plant species of Costa Rica. Several technologies must be combined to achieve this objective. One is Natural Language Processing (NLP), which supports entity extraction; an example is the ANNIE module in the GATE framework, which uses gazetteers. Another is rule-based processing (regular expressions, deterministic automata, context-free grammars), of which Freeling is an example.
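As a rough illustration of the gazetteer idea mentioned above, the following Python sketch matches place names from a small lookup list using a regular expression. It is not the GATE/ANNIE or Freeling implementation; the gazetteer entries, the sample sentence, and the function names `build_pattern` and `extract_places` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical mini-gazetteer (assumption); a real one would be loaded
# from an authoritative source rather than hard-coded.
GAZETTEER = {"Costa Rica", "Panama", "Nicaragua", "Cordillera de Talamanca"}


def build_pattern(names):
    """Compile one alternation over all gazetteer names."""
    # Longest names first, so multi-word entries win over shorter overlaps.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternation + r")\b")


def extract_places(text, names=GAZETTEER):
    """Return (place name, character offset) pairs in order of appearance."""
    pattern = build_pattern(names)
    return [(m.group(1), m.start()) for m in pattern.finditer(text)]


# Made-up distribution sentence, for illustration only.
sample = "Nicaragua to Panama; in Costa Rica, Cordillera de Talamanca."
for place, offset in extract_places(sample):
    print(offset, place)
```

Exact string matching like this misses spelling variants and ambiguous names, which is one reason a full pipeline combines gazetteers with linguistic analysis and rules.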
In addition to identification and encoding, it is very important to bind the geocoding to authoritative sources such as GeoNames. Furthermore, this work identifies additional information in the paragraphs where the distribution is defined and uses it to enrich the entry text. An algorithm using Freeling 3.1 and Solr 5.5 is presented. The values of interest for this work include Holdridge life zones, world distribution, Costa Rica distribution, elevation, and flowering months. Once those values are identified, the information is structured so that it can be processed and used by diverse applications, such as geographic information systems. Other research projects might be interested in the results of this project.
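To make the structuring step more concrete, here is a minimal Python sketch that pulls an elevation range and flowering months out of a distribution sentence and returns them as a dictionary. It does not reproduce the Freeling and Solr algorithm presented in the talk; the input format, the English month abbreviations, the field names, and the function `structure_entry` are all assumptions made for this example.

```python
import re

# Assumed pattern: an elevation range such as "500-1200 m" (hyphen or en dash).
ELEVATION = re.compile(r"(\d+)\s*[-–]\s*(\d+)\s*m\b")

# Assumed English month abbreviations; the source manual may use other forms.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH = re.compile(r"\b(" + "|".join(MONTHS) + r")\b", re.IGNORECASE)


def structure_entry(text):
    """Return a dict with the elevation range and flowering months found in text."""
    record = {"elevation_m": None, "flowering_months": []}
    match = ELEVATION.search(text)
    if match:
        record["elevation_m"] = (int(match.group(1)), int(match.group(2)))
    record["flowering_months"] = [m.lower() for m in MONTH.findall(text)]
    return record


# Made-up sentence, for illustration only.
sample = "Wet forest, 500-1200 m; flowering Jan., Feb., Mar."
print(structure_entry(sample))
```

A record in this shape can be serialized and passed on to downstream tools such as a geographic information system.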
The results were evaluated by manually judging a randomly selected sample to establish whether the algorithm yielded useful data. The judgment consisted of assigning one of three values (GOOD, BAD, UNKNOWN) to each entity extracted and geocoded from the world distribution and the Costa Rica distribution, using the source's context. The goal is the lowest possible BAD percentage. The algorithm performs relatively well at geocoding and binding the world distribution and life zones. More work needs to be done for the distribution in Costa Rica.
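The evaluation described above amounts to tallying manual labels and reporting their shares. The Python sketch below shows one way to compute those percentages; the function name `judgment_percentages` is hypothetical and the sample labels are invented for illustration, not results from this work.

```python
from collections import Counter

LABELS = ("GOOD", "BAD", "UNKNOWN")


def judgment_percentages(judgments):
    """Return the percentage of each label in a list of manual judgments."""
    if not judgments:
        raise ValueError("no judgments to summarize")
    unexpected = set(judgments) - set(LABELS)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    counts = Counter(judgments)
    total = len(judgments)
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Invented labels for ten sampled entities; NOT the study's data.
sample_judgments = ["GOOD"] * 7 + ["BAD"] * 2 + ["UNKNOWN"]
print(judgment_percentages(sample_judgments))
```

Computing the shares separately for world distribution and Costa Rica distribution would allow the two to be compared directly.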

Tuesday December 6, 2016 09:00 - 09:15 CST
Auditorium CTEC
