TDWG 2016: Full Schedule

16:00 CST

Clustering botanical collections data with a minimised set of features drawn from aggregated specimen data

[Current state of play] Numerous digitisation and data aggregation efforts are mobilising botanical specimen data. Although digitisation is not yet complete, it is likely that we now have a critical mass of data available from which we can determine patterns.
[Problem] We know that many duplicate specimens exist, shared between separate botanical collections: these are digitised and transcribed in different herbaria and are yet to be comprehensively linked. Parallel digitisation efforts mean that the transcription of label data also happens in parallel; as a result, some critical data fields (such as collector name) are too variable to be easily used to resolve duplicates. Although not explicitly managed, we have the concept of a collecting trip (a sequence of collections from a particular individual or team). This research aims to uncover this implicit trip data from the aggregated whole. Once we have identified a collecting trip, we should be able to resolve duplicates more easily by cross-linking on the trip identifier, along with the record number and date, i.e. avoiding the transcription variations that we often see in the collector field.
[Method and input data] This talk will show the output of a clustering analysis run in Python using the machine learning library scikit-learn. The data analysed were drawn from aggregated botanical specimen data accessed via the GBIF portal. Input to the analysis was optimised to use numeric features wherever possible (collection date and record number) along with minimal textual features extracted from the collector team.
[Results] The outputs of this clustering analysis will be used in a research context - to identify different kinds of collector trip - but also have immediate practical applications in data management: to identify duplicate specimens between herbaria, and to identify outliers and label transcription errors. Examples of each of these kinds of outliers will be shown. Numbers of geo-references which can be shared between institutions will also be included. Other applications of this clustering technique within problem domains relevant to biodiversity informatics (e.g. bibliographic reference management) will also be discussed.
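
The trip-then-duplicate idea in this abstract can be illustrated with a small sketch. It is not the authors' scikit-learn analysis: it swaps in a simple gap-based grouping in plain Python, and the record keys (`collector`, `date`, `number`) and both gap thresholds are assumptions chosen for illustration only.

```python
from collections import defaultdict
from datetime import date


def assign_trips(records, max_day_gap=14, max_number_gap=200):
    """Assign a putative trip id to each record.

    Records are dicts with hypothetical keys: 'collector' (a normalised
    surname token), 'date' (datetime.date) and 'number' (int record number).
    A new trip starts when the collector changes or when the gap in date or
    record number to the previous record exceeds the assumed thresholds.
    """
    order = sorted(
        range(len(records)),
        key=lambda i: (records[i]["collector"], records[i]["date"], records[i]["number"]),
    )
    trip_ids = [None] * len(records)
    trip = -1
    prev = None
    for i in order:
        rec = records[i]
        starts_new_trip = (
            prev is None
            or rec["collector"] != prev["collector"]
            or (rec["date"] - prev["date"]).days > max_day_gap
            or abs(rec["number"] - prev["number"]) > max_number_gap
        )
        if starts_new_trip:
            trip += 1
        trip_ids[i] = trip
        prev = rec
    return trip_ids


def candidate_duplicates(records, trip_ids):
    """Group record indexes that share trip id, record number and date."""
    groups = defaultdict(list)
    for idx, (rec, trip) in enumerate(zip(records, trip_ids)):
        groups[(trip, rec["number"], rec["date"])].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]


# Illustrative records only; the last one mimics a duplicate held elsewhere.
records = [
    {"collector": "smith", "date": date(1985, 3, 1), "number": 1201},
    {"collector": "smith", "date": date(1985, 3, 4), "number": 1215},
    {"collector": "smith", "date": date(1990, 7, 9), "number": 3050},
    {"collector": "jones", "date": date(1985, 3, 2), "number": 88},
    {"collector": "smith", "date": date(1985, 3, 4), "number": 1215},
]
trips = assign_trips(records)
duplicates = candidate_duplicates(records, trips)
```

Once a trip id exists, duplicates are matched on trip, number and date, so the full collector string is never compared directly.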

Speakers

Nicky Nicolson

Allan Tucker

Wednesday December 7, 2016 16:00 - 16:15 CST
Auditorium CTEC

Symposium 05, Big Data Analysis Methods and Techniques as Applied to Biocollections

16:30 CST

Large-scale Evaluation of Multimedia Analysis Techniques for the Monitoring of Biodiversity

Computer-assisted identification of living organisms is considered one of the most promising solutions to help bridge the taxonomic gap and build accurate knowledge of the geographic distribution and evolution of species. LifeCLEF (www.lifeclef.org) is a world-scale research forum dedicated to the evaluation of multimedia-oriented identification systems. Its principle is to measure and boost the performance of the state of the art by sharing large-scale experimental data covering thousands of species. Each year, hundreds of research groups specialized in computer vision, audio processing, machine learning or data management register for the proposed challenges. Tens of them succeed in processing the whole dataset and submit technical papers describing their running systems. Results are then synthesized and further analysed in joint research papers. The LifeCLEF research platform is organized around 3 tasks related to multimedia information retrieval and fine-grained classification problems in 3 subdomains. Each task is based on large, collaboratively revised data, and the measured challenges are defined in collaboration with biologists and environmental stakeholders in order to reflect realistic usage scenarios.
The first task deals with image-based plant identification and has been organized since 2011. It is based on a growing collaborative data collection produced by tens of thousands of members of a French social network of amateur and expert botanists. In 2015, this dataset contained 113,205 pictures of herb, tree and fern specimens belonging to 1,000 species (living in France and neighbouring countries). The second task deals with audio-based bird identification and is based on the audio recordings collected by a very active network of nature watchers called Xeno-canto (http://www.xeno-canto.org/). This web-oriented community of bird sound recordists has about 2,000 contributors who have already collected more than 180,000 recordings of about 9,000 species. The dataset used for the BirdCLEF task focuses on more than 20,000 audio recordings belonging to the 1,000 bird species represented in the South American region. The last task deals with the identification of sea organisms in general, from fish to whales to dolphins to sea beds to corals.
In this talk, we will report the main outcomes of the 2016 edition of LifeCLEF, including a comprehensive description of the best-performing methods. We will then discuss perspectives for future developments in light of the growing available datasets and the interest of the scientific community in this lab.

Speakers

Pierre Bonnet

Hervé Goeau

Alexis Joly

Sponsors

French Government

Wednesday December 7, 2016 16:30 - 16:45 CST
Auditorium CTEC

Symposium 05, Big Data Analysis Methods and Techniques as Applied to Biocollections

16:45 CST

GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data

Managing research data has always been challenging but the recent availability of multi-gigabyte and larger datasets from major aggregators has created new problems, especially for individual and small institution researchers. A recent collaboration between the Integrated Digitized Biocollections (iDigBio) and the Encyclopedia of Life (EOL) called Global Unified Open Data Access (GUODA) aims to bring new techniques and resources for working with large biodiversity datasets to the widest community of researchers possible.
GUODA is both a computing infrastructure built and hosted by iDigBio and a community for collaboration in using the infrastructure. Our collaboration focuses on developing tools and workflows using Apache Spark for highly parallelized data analysis, a repository of pre-formatted and ready to use biodiversity datasets, and a resource management system capable of exposing these resources to the full skill range of software developers and data analysts.
This presentation will outline the software and hardware used in GUODA, describe the process and formats for transforming common biodiversity data from sources such as the Global Biodiversity Information Facility (GBIF), iDigBio, and the Biodiversity Heritage Library (BHL) into computable data structures, and demonstrate the Jupyter Notebook interface to GUODA, which is designed for researchers to interact with directly.

Speakers

Matthew Collins

Jennifer Hammock

Jorrit Poelen

Alexander Thompson

Sponsors

EOL

iDigBio

Wednesday December 7, 2016 16:45 - 17:00 CST
Auditorium CTEC

Symposium 05, Big Data Analysis Methods and Techniques as Applied to Biocollections

17:00 CST

Data Quality at Scale: Bridging the Gap between Datum and Data

This talk will provide a practical look at implementing high-throughput, high-volume data quality processing to tackle the task of providing efficient and effective feedback on data quality at the scale of an aggregator with tens of millions of records. Topics covered will include the tradeoffs between coverage and accuracy, using the Apache Spark processing framework to rapidly iterate on data quality workflows across large volumes of data, and methods for effectively capturing the results of large-scale data quality work for distribution back to data providers. The examples given in this talk are drawn from work the iDigBio team has done on implementing data quality workflows across all of the data we have collected, as well as from comparing and contrasting our methods with those of other projects.
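
As a rough illustration of capturing data quality results for return to providers, the sketch below flags individual records and then summarises the flags per provider. It is plain Python rather than Apache Spark, and it does not reproduce iDigBio's workflows: the flag names and the `provider` key are hypothetical, while `scientificName`, `decimalLatitude` and `decimalLongitude` follow Darwin Core term names.

```python
from collections import Counter, defaultdict


def check_record(rec):
    """Return a list of quality flags for one record (flag names are hypothetical)."""
    flags = []
    lat = rec.get("decimalLatitude")
    lon = rec.get("decimalLongitude")
    if lat is None or lon is None:
        flags.append("coordinates_missing")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("coordinates_out_of_range")
    if not rec.get("scientificName"):
        flags.append("scientific_name_missing")
    return flags


def summarise_by_provider(records):
    """Count each flag per provider so the result can be sent back as feedback."""
    summary = defaultdict(Counter)
    for rec in records:
        for flag in check_record(rec):
            summary[rec["provider"]][flag] += 1
    return {provider: dict(counts) for provider, counts in summary.items()}


# Illustrative records only.
records = [
    {"provider": "A", "scientificName": "Quercus alba",
     "decimalLatitude": 29.6, "decimalLongitude": -82.3},
    {"provider": "A", "scientificName": "",
     "decimalLatitude": 95.0, "decimalLongitude": 10.0},
    {"provider": "B", "scientificName": "Pinus taeda",
     "decimalLatitude": None, "decimalLongitude": None},
]
report = summarise_by_provider(records)
```

Because each check looks at a single record, the same per-record function could in principle be mapped over a distributed dataset; that step is left out here.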

Speakers

Matthew Collins

Alexander Thompson

Wednesday December 7, 2016 17:00 - 17:15 CST
Auditorium CTEC

Symposium 05, Big Data Analysis Methods and Techniques as Applied to Biocollections

17:15 CST

Fresh Data: what's new and what's interesting?

This talk describes a use case for Big Data analysis for fostering transparency and communication among several communities interested in biodiversity data.
Fresh Data is a suite of services for monitoring new biodiversity data matching specific queries across multiple biodiversity data sources, and notifying data providers when their data has been requested. Our goal is to connect time-sensitive data consumers (researchers, primarily) with data producers (wildlife observers) in a meaningful but unobtrusive way. The community we seek to serve is non-professional observers on platforms such as http://citsci.org/ and http://www.inaturalist.org/, who would not otherwise know they were documenting scientifically relevant data.
For these contacts to be useful, they must be fast. A subscribed researcher with a saved query should learn of a relevant new data point within a few days of the observation, and an observer should learn as quickly as possible that they have reported something that was needed by a researcher; this will encourage timely reactions (additional reports by the observer, recruiting of other observers, and direct communication from the researcher if desired).
To attract researchers to the monitoring tool, its search must be comprehensive, including the data sources they already rely on. Thus, the search index includes both GBIF and iDigBio data, as well as orphan data sources not yet aggregated.
Each data source is updated individually, and schedules are set appropriately for each; priority communities with short internal lag times (e.g. iNaturalist) are updated the most frequently. Whole aggregator datasets (GBIF, iDigBio) are refreshed as frequently as capacity permits. Some of the priority communities have their data hosted at GBIF; their datasets are also indexed separately to allow faster update schedules, and records are deduplicated by occurrence ID.
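
Deduplication by occurrence ID across overlapping sources might look like the minimal sketch below. The abstract does not say which copy is kept when a record arrives from two sources; preferring the earlier entry in a caller-supplied priority list is an assumption here, as are the `source` key and the source names. `occurrenceID` follows the Darwin Core term name.

```python
def deduplicate(records, source_priority):
    """Keep one record per occurrence ID.

    When the same occurrence ID arrives from several sources, the record whose
    source appears earliest in `source_priority` is kept (an assumed rule).
    """
    rank = {name: position for position, name in enumerate(source_priority)}
    best = {}
    for rec in records:
        key = rec["occurrenceID"]
        current = best.get(key)
        if current is None or rank[rec["source"]] < rank[current["source"]]:
            best[key] = rec
    return list(best.values())


# Illustrative records only: 'obs-1' is indexed both via the aggregator
# and via the separately indexed community dataset.
records = [
    {"occurrenceID": "obs-1", "source": "gbif"},
    {"occurrenceID": "obs-1", "source": "inaturalist"},
    {"occurrenceID": "obs-2", "source": "gbif"},
]
unique = deduplicate(records, source_priority=["inaturalist", "gbif"])
```

Ranking the separately indexed community source first reflects the faster update schedule described above, but either ordering would remove the duplicate.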
Services are documented at https://github.com/gimmefreshdata/freshdata/wiki/api . Services available include:
- all occurrence records, filtered by taxonomic and geographic parameters, occurrence date and date added to Fresh Data (supports data monitors for interested researchers)
- monitored occurrence records only, filtered by the same parameters, and also data source (supports data usage reports per interested data source)
- query parameters for all monitors, filtered by all the above parameters, and also occurrence ID (supports query dissemination, e.g. your Urania Swallowtail report was sent to a researcher interested in Lepidoptera in the Caribbean)

Speakers

Jennifer Hammock

Jorrit Poelen

Sponsors

National Science Foundation
