Sunday, December 4
 

17:30 CST

Buses leave for Arenal Manoa Hotel
Buses begin departing from conference hotels and the Central Square in La Fortuna. Each bus will start from a conference hotel then stop and pick up those waiting in the Central Square (near the Church) in La Fortuna.

Sunday December 4, 2016 17:30 - 17:30 CST
Hotels Recommended Hotels Area

18:00 CST

Sunday Registration
Registration will be available on Sunday, December 4th at the Arenal Manoa Hotel in conjunction with the Welcome Reception.

Sunday December 4, 2016 18:00 - 20:00 CST
Reception Hotel Arenal Manoa

18:00 CST

Welcome Reception
The Welcome Reception will be held Sunday evening before the official start of the conference at the Arenal Manoa Hotel, La Fortuna, Costa Rica.

Sunday December 4, 2016 18:00 - 20:00 CST
Reception Hotel Arenal Manoa

20:00 CST

Buses depart from Arenal Manoa for conference hotels
Sunday December 4, 2016 20:00 - 20:00 CST
Entrance EcoTermales
 
Monday, December 5
 

08:45 CST

Monday Registration
Please make every effort to pick up your registration materials on Sunday during the Welcome Reception. Registration will only be available during breaks.

Monday December 5, 2016 08:45 - 09:00 CST
Lobby CTEC

09:00 CST

Opening Session
Welcome and announcements in the Auditorium of CTEC. Featured speakers: Dr. Cynthia Parr, Chair of TDWG; Dr. Edgardo Vargas, Director of TEC; Dr. Julio Calvo, President of TEC, or Dr. Paola Vega, Vice President for Research and Extension; and a welcome video by naturalist Alvaro Cuberto.

Monday December 5, 2016 09:00 - 09:30 CST
Auditorium CTEC

09:30 CST

Keynote: Dr. Rodrigo Gámez Lobo
Dr. Rodrigo Gámez Lobo was founder and former Director General and President of the National Biodiversity Institute (INBio). "Our real goal is to make the society come to the understanding that, because of being something that directly affects quality of life, materially, intellectually and spiritually, we must preserve at all costs the rich biodiversity of the country", he says in his book "On Biodiversity, People and Utopias" (1999). Dr. Erick Mata Montero will introduce Dr. Gámez.

Monday December 5, 2016 09:30 - 10:30 CST
Auditorium CTEC

10:30 CST

Monday AM Break & Registration
Lobby of CTEC

Monday December 5, 2016 10:30 - 11:00 CST
Lobby CTEC

10:30 CST

Poster Setup
Please set up your poster at this time. Posters should remain available for view throughout the week.

Monday December 5, 2016 10:30 - 11:00 CST
Lobby CTEC

11:00 CST

A Standards Architecture for Integrating Information in Biodiversity Science
In this presentation, we will identify what we believe are the essential elements in a standards architecture for how we represent, share, and use biodiversity data.  Our shared vision should include enabling human users and machines to find all of the information and to traverse all of the data connections that a knowledgeable researcher can see in the biodiversity literature, collections and other resources. We should be able to start from any point in the biodiversity data graph and find the meaningful links to associated data objects. From specimen to taxon concept to taxon name to publication; from sequence to associated sequences to taxon concepts to species occurrences; etc.
This means that our data architecture needs to pay attention to the following matters (quite independently of the challenges of delivering the infrastructures that underpin their successful implementation):
  1. Agreement on the set of core data classes within the biodiversity domain that we consider important enough to standardise (specimen, collection, taxon name, taxon concept, sequence, gene, publication, taxon trait, or whatever we all agree on).
  2. Agreement on the set of core relationships between instances of these classes that we consider important enough to standardise (specimen identifiedAs taxon concept, taxon name publishedIn publication, etc.).
  3. Making sure that our data publishing mechanisms (cores, extensions, etc.) align accurately with these classes and support these relationships. This mainly means reworking the current confused interplay between cores, DwC classes, use of dcterms:type and use of basisOfRecord: every record should be clearly identified as an instance of a class (or a view of several linked class instances), and, for the core data classes, this should form the basis for inference and interpretation.
  4. An ongoing process of defining, for each core class, which properties are mandatory (maybe only: id, class), highly desirable (depending on the class, things like: decimal coordinates, scientific name, identifiedAs, publishedIn), generally agreed (many other properties for which we have working vocabularies and do not want unnecessary multiplication, e.g. waterbody, maximumDepthInMeters), or optional/bespoke (anything else that any data publisher wishes to include). In other words, allow any properties to be shared but ensure that the contours of the data are clear to standard tools.
  5. A set of good examples of datasets mapped into this model, using various serialisations.
While accommodating plain text and URIs in the same fields enables data publishing from the widest possible range of sources, it leaves problems for data aggregators and users.
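
As a purely illustrative sketch of such a data graph, the Python fragment below (using the rdflib library) links a specimen, a taxon concept, a taxon name, and a publication; all of the URIs and property names are hypothetical, not ratified TDWG terms:

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/biodiv/")  # hypothetical vocabulary
    g = Graph()

    # Core class instances
    g.add((EX.specimen1, RDF.type, EX.Specimen))
    g.add((EX.concept1, RDF.type, EX.TaxonConcept))
    g.add((EX.name1, RDF.type, EX.TaxonName))
    g.add((EX.pub1, RDF.type, EX.Publication))

    # Core relationships between instances of those classes
    g.add((EX.specimen1, EX.identifiedAs, EX.concept1))
    g.add((EX.concept1, EX.hasName, EX.name1))
    g.add((EX.name1, EX.publishedIn, EX.pub1))

    # From any starting point in the graph, meaningful links can be traversed
    for s, p, o in g.triples((EX.specimen1, None, None)):
        print(s, p, o)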


Monday December 5, 2016 11:00 - 11:15 CST
Auditorium CTEC

11:15 CST

Biodiversity Data Integration from an Aggregator’s Perspective
GBIF’s fundamental charge is to make all of the world’s biodiversity data (as much as people are willing to share) behave as though it were managed in a single consistent database, with linkages to any other similar resources in biological and earth sciences. The ability to query and summarize data, with answers that are as complete and accurate as possible, is made much more difficult by the fact that people record and publish data so differently.
We will summarize GBIF data ingestion and integration operations, and highlight how standards, particularly vocabulary standards, could simplify the integration effort and vastly improve the quantity and quality of data that are represented consistently.
GBIF harvests more than 32,000 data resources from over 800 providers. At the first level, resources follow Darwin Core, ABCD, and various extensions, which standardize the larger concepts; at the value level, however, contents are still very heterogeneous.
The key concepts that GBIF standardizes include: Decimal-Latitude, Decimal-Longitude, Country, Taxon-Name (ranks of the taxonomic hierarchy?), Collecting-Date (and Time?).  In addition to Specimen, Observation, and Taxon-Name, what are the key classes that we need to standardize?
The process of standardizing content has been expensive, and fields that remain unstandardized impede the production of complete and accurate results.
What are the concepts that are most important to address with content vocabularies?
How else can vocabulary standards improve the quantity and quality of biodiversity data?
Will internationalization of vocabularies be required?
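
To make the value-level heterogeneity described above concrete, here is a minimal Python sketch of the kind of mapping an aggregator applies when standardizing a single concept (country); the vocabulary table is invented for illustration:

    # Free-text country values mapped to ISO 3166-1 alpha-2 codes.
    COUNTRY_VOCAB = {
        "costa rica": "CR",
        "república de costa rica": "CR",
        "u.s.a.": "US",
        "united states": "US",
        "usa": "US",
    }

    def normalize_country(raw):
        """Return a standardized code, or None for unmatched values."""
        return COUNTRY_VOCAB.get((raw or "").strip().lower())

    for value in ["Costa Rica", "U.S.A.", "Kosta Rika"]:
        print(value, "->", normalize_country(value))  # last one stays None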




Monday December 5, 2016 11:15 - 11:30 CST
Auditorium CTEC

11:30 CST

A High-altitude View of TDWG Standards: Machine Processing, Graphs, and the Vocabulary Development Process
 
For the past ten years, TDWG has envisioned a system that would facilitate automated machine processing to enable aggregation of data about similar types of resources, linking of differing types of resources, and inference of entailed data that are not explicitly stated by providers. Despite the attractiveness of this vision, progress towards achieving it has been very slow. This presentation will take a very broad view of what we can expect to achieve through machine processing, the challenges TDWG has faced and will face in moving toward a system that enables machine processing, and how the goal of enabling machine processing must influence the vocabulary development process. The presentation will lay out the issues in terms of a graph model, which is central to understanding the issues surrounding machine processing, and on which standards such as the Resource Description Framework (RDF) are based. However, the presentation will not dwell on the details of RDF.
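
As a toy illustration of entailment (deriving data that no provider stated explicitly), the following Python fragment applies a single subclass rule to a set of triples; the class names are hypothetical:

    triples = {
        ("specimen1", "type", "PreservedSpecimen"),
        ("PreservedSpecimen", "subClassOf", "Specimen"),
        ("Specimen", "subClassOf", "MaterialEntity"),
    }

    def entail_types(triples):
        """Apply one rule: if x type C and C subClassOf D, then x type D."""
        inferred = set(triples)
        changed = True
        while changed:
            changed = False
            new = {(x, "type", d)
                   for (x, p1, c) in inferred if p1 == "type"
                   for (c2, p2, d) in inferred
                   if p2 == "subClassOf" and c2 == c}
            if not new <= inferred:
                inferred |= new
                changed = True
        return inferred

    # "specimen1 type Specimen" and "specimen1 type MaterialEntity" are
    # entailed even though no provider stated them explicitly.
    print(entail_types(triples) - triples)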


Monday December 5, 2016 11:30 - 11:55 CST
Auditorium CTEC

11:55 CST

GitHub for TDWG standards and Interest Groups
GitHub is an online platform (https://github.com) to manage source code. It offers the distributed version control and source code management of git, as well as a number of features that greatly facilitate source code collaboration, especially for open source projects. It has become the largest host of source code in the world and supports projects ranging from traditional software management to scientific research and open data. In 2014 TDWG adopted GitHub to host, version and collaborate around its biodiversity information standards (https://github.com/tdwg) and is increasingly using it for executive and interest group activities. In this talk I will explain 1) how to contribute to TDWG standards and activities using GitHub, covering features such as version control, editing documents, and submitting pull requests and issues, as well as 2) how to manage a GitHub repository, including features such as the wiki, issue management, inviting collaborators and creating releases. It should provide you with enough knowledge to feel comfortable diving into GitHub, be it for TDWG or otherwise.
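
Alongside the web interface covered in the talk, the same contributions can be scripted against GitHub's REST API. The Python sketch below opens an issue with the requests library; the repository, issue text, and token are placeholders:

    import requests

    TOKEN = "YOUR_PERSONAL_ACCESS_TOKEN"   # never commit real tokens
    REPO = "tdwg/dwc"                      # example TDWG repository

    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues",
        headers={"Authorization": f"token {TOKEN}",
                 "Accept": "application/vnd.github+json"},
        json={"title": "Clarify definition of term X",
              "body": "Proposed wording change; see discussion at ..."},
    )
    resp.raise_for_status()
    print("Created issue:", resp.json()["html_url"])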


Monday December 5, 2016 11:55 - 12:10 CST
Auditorium CTEC

12:30 CST

Monday Lunch
For Monday only, please allow those attending the 1PM Newcomer's Chat priority in the lunch line. Lunch is served in the Cafeteria (also known as "Soda").
 

Monday December 5, 2016 12:30 - 14:00 CST
Cafeteria Cafeteria

13:00 CST

Newcomer's Chat
Computer Science 1

Monday December 5, 2016 13:00 - 13:55 CST
Computer Science 15 Computer Science (next to lunch place)

14:00 CST

Real use cases for Semantic Information from the Mining Biodiversity project
The Mining Biodiversity project explores the transformation of the Biodiversity Heritage Library (BHL) into a next-generation social digital library resource to facilitate the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community and to raise awareness of the changes in biodiversity over time in the general public.
The resulting digital resource would provide fully interlinked and indexed access to the full content of OCR text library documents, via semantically enhanced and interactive browsing and searching capabilities, allowing users to locate precisely the information of interest to them in an easy and efficient manner.
In discussions with colleagues involved in efforts to enhance the accessibility of biodiversity data through well-defined, semantics-rich representations, it was recommended that we focus first on the real questions users had; however, through initial consultations with potential users of this semantically enhanced information, it became apparent that the functionality would not be required for their daily use of data.
A further close collaboration with actual final users of the corpus of information allowed the team to discover a valuable set of real use cases for Semantic Information from the Mining Biodiversity project.  This talk will demonstrate current real information needs that interoperable semantics for biodiversity data and knowledge can support. It will also highlight areas challenged by the lack of machine-interpretable semantics that could help address them and suggest potential solutions.

Speakers

William Ulate

Sr. Project Manager, Missouri Botanical Garden
Currently working at the Center for Biodiversity Informatics, Missouri Botanical Garden. William does research in Semantics, Systematics (Taxonomy) and Data Mining. His current project is 'World Flora Online'.


Monday December 5, 2016 14:00 - 14:15 CST
Auditorium CTEC

14:15 CST

Towards the next-generation ABCD
The TDWG standard ABCD (Access to Biological Collection Data) was developed between 2001 and 2006. It was aimed at harmonising terminologies used for modelling biological collection information. Furthermore, it is used as a comprehensive data format for transferring collection and observational data between software components, facilitating searches across a large number of distributed and heterogeneous data providers.
From the beginning, ABCD used XML Schema as a mechanism for structuring data elements and their repeatability, for setting data types, and for capturing the specification of concepts in the form of semi-structured XML annotations. ABCD elements are referenced by their XPath from the root node. This approach proved effective for integrating and harmonising the different collection data models used in the collection community. However, it lacks the technical prerequisites for machine-readable semantics of ABCD elements and their integration into the growing number of semantics-aware biodiversity informatics applications.
The DFG-funded project ABCD 3.0 (A community platform for the development and documentation of the ABCD standard for natural history collections) addresses the transformation of ABCD into a semantic-web compliant ontology by deconstructing the XML schema into individually addressable RDF resources published via the TDWG Terms Wiki. In a second step, informal properties and concept relations described by the original ABCD schema will be transformed into a machine-readable ontology and revised. An important aspect will be the deployment of a Semantic MediaWiki based platform for future editorial processes. This platform aims at facilitating the annotation process of elements by domain experts by presenting only required semantic features in a user-friendly way. Apart from the new possibility to integrate ABCD 3.0 concepts into semantic applications and inference processes, we will derive tailored (XML) application profiles serving data exchange requirements of specific biological communities.
The described approach shall spark a broader discussion on how to proceed with the transformation process of XML-based TDWG standards towards the Semantic Web. In their new form, these standards will need to meet the requirements of the emerging Semantic Web, while preserving the existing treasure of domain knowledge and fostering the continuous engagement of domain experts at the same time.


Monday December 5, 2016 14:15 - 14:30 CST
Auditorium CTEC

14:30 CST

The Open Biodiversity Knowledge Management System: A Semantic Suite Running on top of the Biodiversity Knowledge Graph
The Open Biodiversity Knowledge Management System (OBKMS) is a suite of semantic applications and services running on top of a graph database storing biodiversity and biodiversity-related information, known as the Biodiversity Knowledge Graph. The main purpose of OBKMS is to provide a unified system for interlinking and integrating diverse biodiversity data such as taxon names, taxon concepts, treatments, specimens, occurrences, gene sequences, bibliographic information, and others. The graph is serialized as Resource Description Framework (RDF) quadruples, extracted primarily from biodiversity publications. Options for expressing Darwin Core encoded data as RDF for insertion into the graph are explored.

Biodiversity publications provide a rich source of high quality data. In order to be able to convert such data into RDF, we have developed a general semantic model in support of information extraction from prospectively published and legacy taxonomic literature. We chose a number of ontologies from the publishing and biological domains to incorporate into our model. In addition to utilizing the Darwin Core Filtered Push ontology (http://filteredpush.org/ontologies/FP/2.0/dwcFP.owl), we have, together with Plazi, extended the Treatment Ontology for knowledge representation of current and legacy biodiversity publications. We understand a treatment to contain the informational value of a taxonomic concept and designed the ontology as such. Furthermore, the semantic model allows the expression of relationships between taxonomic concepts using the Region Connection Calculus RCC-5 (https://en.wikipedia.org/wiki/Region_connection_calculus). These relationships (congruence, inclusion, overlap, exclusion) are not usually found in old biodiversity publications, where only nomenclatural relationships are allowed. However, for new publications, we are in the process of modifying Pensoft’s ARPHA Writing Tool (AWT) and the XML schemas to allow authors to enter such information during the authoring process.

The system is currently in prototype stage and incorporates information extracted from Plazi’s TreatmentBank, as well as from the archives of ZooKeys and Biodiversity Data Journal. The system is designed also as a source for generating nanopublications.
The system is intended for different groups of users. Biodiversity scientists can use it, for example, to retrieve all taxonomic information associated with a name. Ecologists can use geographic search to locate taxon information associated with a region on a map. Collection managers can track if and where their specimen data have been published. Data aggregators can use the system to extend their stores. Biomedical scientists can make use of the linking of taxon and gene information.
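
As a purely illustrative sketch, the RCC-5 relationships mentioned above can be expressed as statements between taxon concepts; the Python fragment below uses invented concept labels, and the symbols are merely mnemonic:

    from enum import Enum

    class RCC5(Enum):
        CONGRUENT = "=="   # same circumscription
        INCLUDES = ">"     # first concept properly includes the second
        INCLUDED_IN = "<"  # first concept properly included in the second
        OVERLAPS = "><"    # circumscriptions intersect; neither contains the other
        EXCLUDES = "|"     # disjoint circumscriptions

    # A taxon-concept alignment as (concept, relation, concept) statements
    alignment = [
        ("Aus bus sec. Smith 1990", RCC5.INCLUDED_IN, "Aus bus sec. Jones 2005"),
        ("Aus cus sec. Smith 1990", RCC5.EXCLUDES, "Aus bus sec. Jones 2005"),
    ]
    for subject, rel, obj in alignment:
        print(f"{subject} {rel.value} {obj}")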




Monday December 5, 2016 14:30 - 14:45 CST
Auditorium CTEC

14:45 CST

Semantic Annotation for Tabular Data
Tabular data, expressed as spreadsheets and tab- or comma-delimited files, are a convenient and common method for storing and transmitting biodiversity data. However, tabular data are all too often “dark” data, lacking context and consistency, with little clarity about exactly what is being referred to in the data: for example, whether a set of fields in a “row” refers to a curated specimen, a living individual that is being tracked on an ongoing basis, or an observation. Common difficulties in working with dark data include values with no units, identifiers that are local in scope or missing entirely, and especially a lack of context for the relationships that exist between data values in columns. These issues are a true impediment to sharing and integrating data from distributed data sources. While this topic has received a lot of attention in recent years, implementations that offer usable solutions for helping users improve semantic clarity and create instance identifiers have lagged. This talk will explore a method, useful for biologists and data managers, for validating and classifying instance data based on project management rules expressed in an XML (extensible markup language) configuration file. Beginning with a look at the necessary steps of project configuration and data validation, we will follow a sample input file from the National Phenology Network (NPN) as it is loaded into the Biocode Field Information Management System (http://biscicol.org/), finishing with a look at the resulting triples and a discussion of implications and future directions.
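
A much-simplified Python sketch of this approach follows, assuming a hypothetical rule format rather than the actual Biocode FIMS configuration schema:

    import csv, io
    import xml.etree.ElementTree as ET

    CONFIG = """<config>
      <rule column="decimalLatitude" type="range" min="-90" max="90"/>
      <rule column="scientificName" type="required"/>
    </config>"""

    DATA = "scientificName,decimalLatitude\nPuma concolor,9.7\n,123.4\n"

    rules = ET.fromstring(CONFIG).findall("rule")
    for i, row in enumerate(csv.DictReader(io.StringIO(DATA)), start=1):
        for r in rules:
            col, val = r.get("column"), row.get(r.get("column"), "")
            if r.get("type") == "required" and not val:
                print(f"row {i}: {col} is required")
            elif r.get("type") == "range" and val:
                if not float(r.get("min")) <= float(val) <= float(r.get("max")):
                    print(f"row {i}: {col}={val} out of range")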


Monday December 5, 2016 14:45 - 15:00 CST
Auditorium CTEC

15:00 CST

Some Challenges in Working with Biodiversity Ontologies
We have faced a number of recurring ontology-related challenges in our ongoing work aimed at extracting structured data from taxonomic treatments and other sources; representing these data in the Resource Description Framework (RDF); providing and curating ontological infrastructure for reasoning over these data; and integrating them into an open biodiversity knowledge graph. We will describe some of the challenges that we have overcome, and some that we continue to struggle with. These include issues with representing and integrating data about phenotypes, habitats, phenology, and establishment means (native vs. introduced). We will also describe the structure of our biodiversity knowledge graph, and invite collaboration in its continued construction.


Monday December 5, 2016 15:00 - 15:15 CST
Auditorium CTEC

15:15 CST

Bottom-up Phenotype Ontology Building from Character Descriptions
Phenotypic characters are described in published works mostly using human languages ("natural languages"). They are valuable knowledge but not amenable to computational analyses. Text mining algorithms have been developed to extract useful information from the text, but the extracted information needs to be ontologized to ensure “apples are compared to apples”.  Highly trained biology researchers are taking on the role of biocurators to convert characters expressed in human languages into formal statements, for example, EQ (Entity-Quality) statements, using ontologies. During the process, terms encountered in the descriptions are added to ontologies. Grounding phenotypic data in the rich literature brings the knowledge accumulated in the past several hundred years to life. However, biologists in general are not involved in the ontology building process, and natural language descriptions are continuously being published. In this talk, we will discuss the issues encountered in building and using existing ontologies for phenotypic data curation and present our progress in creating ontology building tools biologists can use. We will discuss ontology building issues that contribute to inter-curator variations and ontology design patterns for phenotypic characters that have the potential of reducing the variations.
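
For readers unfamiliar with EQ statements, here is a toy Python representation; the ontology identifiers are illustrative and may not be the exact PO/PATO terms:

    from dataclasses import dataclass

    @dataclass
    class EQStatement:
        entity: str         # anatomy/structure term, e.g. from PO or UBERON
        entity_label: str
        quality: str        # quality term, e.g. from PATO
        quality_label: str

    # "leaf margin serrate" as an EQ statement (term IDs for illustration)
    eq = EQStatement(
        entity="PO:0020128", entity_label="leaf margin",
        quality="PATO:0001206", quality_label="serrate",
    )
    print(f"{eq.entity_label} ({eq.entity}): {eq.quality_label} ({eq.quality})")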


Monday December 5, 2016 15:15 - 15:30 CST
Auditorium CTEC

15:30 CST

Monday PM Break, Poster Viewing, & Registration
CTEC Lobby

Monday December 5, 2016 15:30 - 16:00 CST
Lobby CTEC

16:00 CST

TDWG-GBIF Data Quality Interest Group
Biodiversity data come from many different sources and occur in a range of different forms and formats – museum specimens, observations by amateurs or professionals, data from static or mobile recording devices, photographic or videographic images, audio recordings, vegetation transects, bioblitzes, laboratory measurements, DNA sequences, character traits, statistics, models and more.
As more data becomes readily available, the number of uses to which the data may be put is also increasing rapidly. For the data to be useful however (i.e. fit for use for a particular purpose), it needs documented quality characteristics. Some potential uses may have legal ramifications, and as such, may need to be defensible in a court of law. But unless we have standard ways to test and document the data, we have no way of defending the data.
It is for this reason that the TDWG-GBIF Data Quality Interest Group was floated in 2013 and formally adopted in 2014. Since that time, a lot of thought and work has gone into how standardization assists in the use of biodiversity data. We have submitted a Framework for publication, developed a core set of tests and assertions based on the Darwin Core fields, and begun the formalization of a core set of Use Cases.
If we get these standards adopted universally by data custodians, data publishers and by users themselves, we will have made the use of biodiversity data more efficient. This is our next challenge, and we hope to use this meeting to gain your support to achieve that end.
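
As a hypothetical sketch of such standardized, record-level tests and assertions over Darwin Core fields (the test names and response labels here are invented, not the Interest Group's actual set):

    def test_coordinates_in_range(record):
        try:
            lat = float(record.get("decimalLatitude", ""))
            lon = float(record.get("decimalLongitude", ""))
        except ValueError:
            return ("COORDINATES_IN_RANGE", "INTERNAL_PREREQUISITES_NOT_MET")
        ok = -90 <= lat <= 90 and -180 <= lon <= 180
        return ("COORDINATES_IN_RANGE", "COMPLIANT" if ok else "NOT_COMPLIANT")

    def test_country_code_known(record, known={"CR", "US", "AU"}):
        code = record.get("countryCode", "").upper()
        return ("COUNTRYCODE_KNOWN",
                "COMPLIANT" if code in known else "NOT_COMPLIANT")

    record = {"decimalLatitude": "10.5", "decimalLongitude": "-84.7",
              "countryCode": "CR"}
    for test in (test_coordinates_in_range, test_country_code_known):
        print(test(record))   # each test yields an assertion, not a change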


Monday December 5, 2016 16:00 - 16:15 CST
Auditorium CTEC

16:15 CST

Conceptual Framework for Assessment and Management of Fitness for Use - A Contextualization in Biodiversity Informatics Scenario
A consistent approach to assess and manage Data Quality (DQ) is currently critical for biodiversity data users. However, achieving this goal has been particularly difficult because of the idiosyncrasies inherent to the concept of quality.

DQ research shows that DQ assessment and management are founded, mostly, on DQ dimensions (or their opposite, i.e., DQ problems), which are used to define the meaning of DQ in a given context and to measure it. With an appropriate DQ measurement, data users or automated systems are able to select a subset of data that has a suitable level of quality for a determined purpose, and data managers can direct effort toward the improvement of the subset of data that is not fit for use.

A conceptual framework has been developed to provide a context for describing biodiversity data quality, allowing users to make an informed assessment of DQ and its subsequent fitness for their use as well as allowing institutional assessment of DQ for management and improvement.

This proposed framework defines nine concepts organized into three classes: DQ Needs, DQ Solutions and DQ Report, which can formalize human thinking into well-defined components to make it possible to share and reuse concepts of DQ needs, solutions and reports in a common way among data user communities. With this framework, we establish a common ground for the collaborative development of needs and solutions for DQ assessment and management based on data fitness for use principles.

In future work, we will use the presented framework to engage the biodiversity informatics community to formalize and share DQ profiles related to a number of data usages.


Monday December 5, 2016 16:15 - 16:30 CST
Auditorium CTEC

16:30 CST

Data Quality Task Group 2: Tools, Services and Workflows
Data quality comes up in most presentations I give to research audiences about the Atlas of Living Australia. The usual comment is “How do I remove the amateur observations from the data in the Atlas?” My response is that many amateur observations are of better quality than scientific ones, and that fitness for use is a far better term than data quality, because a record that one scientist filters out may be appropriate in other circumstances.
It took me around a year to ensure that the Atlas of Living Australia made all submitted records public. My argument was that scientists wouldn’t appreciate data being hidden from them. How, then, do Data Publishers address this issue? I say that we can provide the most value by applying a standard suite of automated tests to all records, with failed tests flagged in the record. Such tests cannot detect and address all issues, but they are easy to apply and are effective in determining fitness for use.
If such tests are comprehensive, adequate for filtering purposes, and representative in addressing known issues, then why not standardize them? A standard suite of tests makes life easier through consistency across Data Publishers.


Monday December 5, 2016 16:30 - 16:45 CST
Auditorium CTEC

16:45 CST

Data Quality Workflows using Akka
Data cleaning has the potential to improve the chances for people and computers to find and use relevant data. This is true for researchers as well as for large-scale data aggregators. In the biodiversity realm, Darwin Core provides a convenient scope and framework for data cleaning tools and vocabularies.
One way to address data cleaning tasks is to use workflows that act on a combination of original data, controlled vocabularies, algorithms, and services to detect inconsistencies and errors, recommend changes, and augment the original data with improvements and additions. There are advantages from the perspective of flexibility to construct such workflows from specialized, reusable "actors" -- building blocks that do specific tasks, such as provide a list of distinct values of a field in a data set.
The Kurator project uses Akka, a Java-based framework, to construct workflows with actors written in a variety, and even a combination, of programming languages. In this presentation, we will explore the process of building actors and combining them in Akka workflows that perform a variety of data cleaning and reporting tasks inspired by the VertNet process of mobilizing data from institutional data sets for large-scale aggregators such as VertNet, iDigBio, and the Global Biodiversity Information Facility. Ultimately, the goal of this work might be, given a biodiversity data set, to provide an improved version of that data set in the form of a Darwin Core archive that includes a data quality extension (not yet developed) to report what was found, what was done to it, and what could still be done to further improve it.
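
Kurator's actors run on Akka, which is JVM-based; purely as a language-neutral illustration of the actor idea, the Python sketch below chains generator-based "actors", including one that reports the distinct values of a field:

    def read_records(rows):
        """Source actor: emit one record (dict) at a time."""
        for row in rows:
            yield dict(row)

    def distinct_values(records, field):
        """Actor that reports each new distinct value of a field."""
        seen = set()
        for rec in records:
            value = rec.get(field)
            if value not in seen:
                seen.add(value)
                print(f"new distinct {field}: {value}")
            yield rec

    rows = [{"country": "Costa Rica"}, {"country": "CR"},
            {"country": "Costa Rica"}]
    pipeline = distinct_values(read_records(rows), "country")
    for _ in pipeline:   # drain the pipeline
        pass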


Monday December 5, 2016 16:45 - 17:00 CST
Auditorium CTEC

17:00 CST

IDQ: Integrating Data Quality into Biodiversity Workflows
This talk will provide an overview of the Integrated Data Quality (IDQ) software package, its design philosophy, some of the history behind it at iDigBio, and its future as a reference implementation candidate for some of the ongoing harmonization work on data quality in TDWG and GBIF task groups. IDQ is a software package for building data quality processes that maximizes their ability to integrate into diverse workflows. The ultimate goal of the project is to provide a robust set of pre-packaged test, assertion and correction tools that can be utilized by users of all skill levels across a wide variety of biodiversity data. It grew out of work done on data quality at the iDigBio project and is being spun out of the main code base in order to open up the tools and methods used to a broader community of users. The base library provides tools and interfaces for easily constructing efficient data quality workflows, and separate modules build upon the core to provide the actual library of tests and assertions. It is intended to be usable at all scales, from working on individual records to aggregator-sized data processing pipelines and all the steps in between. The code is hosted in iDigBio’s GitHub organization at https://github.com/iDigBio/idq.
 


Monday December 5, 2016 17:00 - 17:15 CST
Auditorium CTEC

17:15 CST

Improving quality while preserving quantity using OBIS automated QA/QC procedures
The Ocean Biogeographic Information System (OBIS) was established as the data repository and dissemination system for the Census of Marine Life, and OBIS is now building on that success by working to include marine observations from other projects around the world. OBIS uses Darwin Core to integrate species occurrence records from across the world so that marine biogeographic data are freely available to address today’s global concerns for coastal areas and oceans. OBIS is a distributed data system and consists of member nodes, each with a regional or thematic focus. Data are assembled for OBIS through the OBIS nodes, with each node performing the initial quality check of incoming data. Once data make their way to OBIS, two sets of automated quality control and quality assurance procedures are run to verify that incoming data include the seven required Darwin Core fields and are taxonomically and geographically rigorous. The first product of these QA/QC procedures is presented back to the OBIS nodes as an HTML data harvest report, allowing each node to see the results of the procedures and devise remedies for errors. The second product, which is not available in the standard OBIS data download but can be accessed via R or the API, generates QA/QC flags on the data at the record level. OBIS uses a subset of those QA/QC flags on the dataset pages (http://www.iobis.org/explore/#/dataset/3963), making it easier for users to determine the level of quality of the data they are accessing. When users have access to information about data quality, they are able to assess the fitness for use of the data for their projects and analyses. Over time OBIS expects this will lead to a sense of trust about the data, increased usage of the data, and the conversion of data to knowledge. Overall, the OBIS quality assurance and control procedures help to ensure the data in OBIS are robust, accurate, and trustworthy.
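
In the spirit of the required-fields check described above, here is a minimal Python sketch; the field list below is an assumption for illustration, not the authoritative OBIS list of seven:

    REQUIRED = ["scientificName", "scientificNameID", "eventDate",
                "decimalLatitude", "decimalLongitude", "occurrenceStatus",
                "basisOfRecord"]   # illustrative list only

    def missing_required(record):
        """Return record-level flags for absent or empty required fields."""
        return [f"MISSING_{field.upper()}"
                for field in REQUIRED if not record.get(field)]

    record = {"scientificName": "Carcharodon carcharias",
              "decimalLatitude": "9.98", "decimalLongitude": "-85.65"}
    print(missing_required(record))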


Monday December 5, 2016 17:15 - 17:30 CST
Auditorium CTEC
 
Tuesday, December 6
 

08:45 CST

Tuesday Registration
CTEC Lobby

Tuesday December 6, 2016 08:45 - 09:00 CST
Lobby CTEC

09:00 CST

Geographic entity extraction from biological textual sources
This work focuses on the exploration and application of entity extraction techniques for the identification and codification of geographical locations present in the geographic distribution sections of botanical documents, such as the plant species manual of Costa Rica. Several technologies must be combined to achieve this objective. Among them is Natural Language Processing (NLP), which helps in the extraction of entities; an example is the ANNIE module in the GATE framework, which uses gazetteers. Another is the use of rules (regular expressions, deterministic automata, context-free grammars); FreeLing is an example of this approach.
In addition to the identification and codification, it is very important to bind the geocoding to authorized sources such as GeoNames. Furthermore, this work identifies and enriches the entry text with extra information extracted from the paragraphs where the distribution is defined. An algorithm using FreeLing 3.1 and Solr 5.5 is presented. Some values of interest for this work are: Holdridge life zones, world distribution, Costa Rica distribution, elevation, and flowering months of the year. After those values are identified, the information is structured so that it can be processed and become useful for diverse applications, such as geographic information systems. Other research projects might be interested in the results of this project.
The results obtained were evaluated by manually judging a randomly selected sample to establish whether or not the algorithm yielded useful data. The judgment consisted of assigning one of three possible values (GOOD, BAD, UNKNOWN) to the entities extracted and geocoded from the world distribution and Costa Rica distribution using the source’s context. The ideal is to have the lowest BAD percentage. The algorithm is relatively good at geocoding and binding the world distribution and life zones. More work needs to be done for distribution within Costa Rica.
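
A toy Python sketch of the gazetteer-matching and GeoNames-binding step described above (the gazetteer entries and identifiers are illustrative placeholders):

    import re

    GAZETTEER = {
        "guanacaste": "http://sws.geonames.org/0000000/",  # placeholder id
        "costa rica": "http://sws.geonames.org/3624060/",
        "nicaragua": "http://sws.geonames.org/3617476/",
    }

    def extract_places(text):
        """Match gazetteer names in text and bind them to GeoNames URIs."""
        found = []
        for name, uri in GAZETTEER.items():
            if re.search(r"\b" + re.escape(name) + r"\b", text.lower()):
                found.append((name, uri))
        return found

    paragraph = ("Distribucion: bosque seco de Guanacaste; "
                 "tambien en Nicaragua, 0-800 m.")
    print(extract_places(paragraph))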


Tuesday December 6, 2016 09:00 - 09:15 CST
Auditorium CTEC

09:00 CST

Biodiversity informatics and the agricultural data management landscape
Historically, ecological and biodiversity researchers have focused on the basic patterns and processes of populations, communities, and ecosystems, with minimal attention paid to the role of humans. Human impacts have instead been addressed in the more applied sciences of conservation, medicine, and agriculture. In recent years, however, boundaries between applied and basic sciences have blurred. There is general recognition that our future is best served by science that seeks to understand systems in their true, full contexts. Societies cannot live sustainably without an understanding of the biosphere and how humans and their behavior and management practices might impact it. Data infrastructure (e.g., data management systems, metadata standards, ontologies) must therefore accommodate use cases that span managed and "pristine" systems. In this talk we describe the challenges faced by agricultural research communities that share some domain-specific data needs with basic biodiversity and ecology research communities, but that also share needs with the social science and biomedical communities. Big data in agriculture involves both real-time environmental data and high-throughput genomics and phenomics. Long-term data include social science surveys, repeated crop rotation experiments, and basic monitoring of soil, water, and weather conditions. Battling emerging pests or adapting cropping or ranching activities to climate change requires an understanding of wild relatives and microbial ecology. We sketch out a landscape of loosely coupled data and analysis infrastructures and policies that are being developed to address these challenges, with special focus on the United States. Some parts of this landscape are centered at the US National Agricultural Library (NAL), e.g., the Ag Data Commons, i5K workspace, and Life Cycle Assessment Commons. Other parts are found elsewhere in the US Department of Agriculture (e.g., the Long Term Agroecosystem Research initiative and the National Institute of Food and Agriculture's data science program). Other government agencies, universities, and private organizations all play critical roles. Some parts of the landscape are already familiar to the biodiversity informatics community, but agricultural use cases can help all of us work together on best practices and interoperable systems. Collectively, we can identify and address gaps in standards and services for machine-readable data dictionaries, thesauri, and ontologies. We can strengthen the use of Globally Unique Identifiers and ride public access mandates and advances in high performance computing to promote text and data mining and modeling. We can build a living knowledge landscape that serves and promotes both basic and applied research.


Tuesday December 6, 2016 09:00 - 09:15 CST
Computer Science 3 Computer Science

09:00 CST

Biodiversity Through Deep Time: Data standards and best practices for paleobiology
The Paleobiology Interest Group (Paleo) was established in 2015 to broaden the application of existing standards such as Darwin Core to accommodate paleobiological data. This will represent the inaugural meeting of the interest group at a TDWG conference.
The group seeks to extend existing standards to meet the needs of paleobiology and to foster greater integration between neontology and paleontology in the study of biodiversity across space and deep time.  Understanding long-term temporal patterns in biodiversity provides the context for interpreting modern changes in biodiversity and understanding the process responsible for these changes.
Employing biodiversity information standards such as Darwin Core for paleobiology data requires addressing a broader range of metadata requirements and adapting best practices. The role of the Paleo group is to provide guidelines for deploying existing biodiversity standards in paleobiology and to propose extensions and modifications to existing standards to make them more amenable (and generalizable) to paleobiology.
The primary goals for the 2016 meeting of the Paleobiology Interest Group are: 1) to update TDWG members on the group's progress, 2) to establish a reliable schedule of meetings and communication outside of the annual meetings, 3) to generate a list of the most common paleobiological use cases for deploying existing standards, such as Darwin Core, in paleobiology, 4) to discuss and produce examples of Darwin Core entries for those use cases, and in the process 5) to identify challenges that arise in applying Darwin Core to paleobiology data.
The annual meetings offer the best opportunity for bringing biodiversity information specialists together with disciplinary specialists in paleobiology to resolve questions and make concrete advances in broadening the application of biodiversity information standards in a new disciplinary domain.


Tuesday December 6, 2016 09:00 - 10:30 CST
TecnoAula 1 CTEC

09:15 CST

Semi-automatic classification and structuring of text fragments from biological documents
An enormous body of information required for effective biodiversity conservation is stored in books and papers. Unfortunately, that makes it harder to synthesize knowledge from it. In this talk a tool is presented to help users semi-automatically extract and structure knowledge from scientific literature about the flora of Costa Rica. At this point the tool is still being developed and is not yet integrated with the tools that extract morphological characters from taxonomic descriptions and extract geographic entities.
As its first goal, the tool allows users to mark fragments of text from a botanical document (Flora de Costa Rica, Árboles de Costa Rica) and assign them one of several semantically meaningful categories that describe their content. Among these categories are: morphological descriptions, distributions, dichotomous keys, and diagnostic descriptions. Depending on the information available, the process will range from totally manual to, in the future, completely automatic. There will be four levels of processing:
  1. manual mark up and manual assignment of categories
  2. manual mark up and automatic suggestions of categories
  3. manual mark up and automatic assignment of categories
  4. automatic mark up and automatic assignment of categories
We have already implemented the first two levels; the next two are under development (a minimal sketch of level-2 category suggestion appears at the end of this abstract).
The second goal of the tool is to allow users to invoke specialized tools to extract structured information from a subset of fragments according to their categories. We are in the process of integrating the tools that have been developed to:
  • semi-automatically structure the morphological descriptions to extract characters
  • semi-automatically structure the distribution descriptions to extract geographical entities
As the final goal, some modules are being developed to take advantage of the structured information available:
  • query the morphological descriptions
  • query the distribution descriptions
  • select a set of taxa and generate input for taxon-character matrix software (like Lucid)
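
A minimal keyword-based sketch of the level-2 automatic suggestion of categories, with invented keyword lists (a real tool would derive these from marked-up examples):

    KEYWORDS = {
        "morphological description": ["hojas", "flores", "frutos", "corteza"],
        "distribution": ["distribucion", "elevacion", "bosque", "endemica"],
        "dichotomous key": ["1.", "1'.", "versus"],
    }

    def suggest_category(fragment):
        """Score each category by keyword hits; return the best, or None."""
        text = fragment.lower()
        scores = {cat: sum(text.count(k) for k in kws)
                  for cat, kws in KEYWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    print(suggest_category("Hojas alternas, flores blancas, frutos ovoides."))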


Tuesday December 6, 2016 09:15 - 09:30 CST
Auditorium CTEC

09:15 CST

Progress in Standardizing Sampling-Event Data
Scientists can now share sampling data on GBIF.org, making it available for other researchers while showing a commitment to open access and reproducibility, which are integral to scientific inquiry.
GBIF.org is the world's largest source of species occurrence data, providing free and open access to more than 600 million occurrences from more than 29,000 datasets published by over 800 institutions. Its near real-time infrastructure is now widely used, supporting more than one substantive use in peer-reviewed research per day.
Over the past two years, the GBIF Secretariat has been working with European Biodiversity Observation Network (EU BON) partners and the wider biodiversity informatics community to enable sharing of “sampling-event datasets”. These data are derived from environmental, ecological and natural resource investigations that follow standardized protocols for measuring and observing biodiversity. Because the sampling methodology and sampling units are precisely described, the resulting data is comparable and thus better suited for measuring trends in habitat change and climate change. Previously GBIF.org did not support this type of data because of the complexity of encoding the underlying protocols in consistent ways.
In March 2015, TDWG ratified changes to the Darwin Core (DwC) standard to enable the mobilization of sampling-event data, particularly species abundance. In September 2015 GBIF released a new version of the Integrated Publishing Toolkit (IPT), its free, open-source data publishing software, allowing publication of sampling-event datasets in connection with updates to GBIF.org that enhanced indexing and discovery of these datasets.
Early adopters began publishing the first round of sampling-event datasets in late 2015. Based on feedback collected from these publishers, four additional DwC terms were proposed in June 2016 in order to more faithfully represent a wider range of sampling protocols. Their input also helped guide the development of documentation to support publishers interested in sharing sampling-event data.
This presentation will highlight recent improvements GBIF has made to support the publication of sampling event datasets. The presentation will also reveal how upcoming changes to GBIF.org may improve the discovery and reuse of this type of dataset. Drawing on some exemplar datasets, the presentation also aims to promote this new data standard, demonstrating, for example, how it can truly represent vegetation plot data.
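
To make the encoding concrete, here is an invented example of a sampling event and a linked occurrence expressed as Python dicts of Darwin Core terms, including the quantity terms added for sampling-event data; all values are illustrative:

    event = {
        "eventID": "urn:example:event:plot7:2016-06-01",
        "samplingProtocol": "point count, 10 min",
        "sampleSizeValue": "1",
        "sampleSizeUnit": "hour",
        "eventDate": "2016-06-01",
    }
    occurrence = {
        "eventID": event["eventID"],      # links the occurrence to its event
        "scientificName": "Ara macao",
        "organismQuantity": "4",
        "organismQuantityType": "individuals",
    }
    print(occurrence["scientificName"], "at", event["eventID"])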


Tuesday December 6, 2016 09:15 - 09:30 CST
Computer Science 3 Computer Science

09:30 CST

Semi-Automatic Extraction of Plant Morphological Characters from Taxonomic Descriptions
Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most taxonomic information is available in scientific publications in text format. The amount of publications generated is very large; therefore, processing it manually is a complex and very expensive activity. The Biodiversity Heritage Library (BHL) estimates that more than 120 million pages have been published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of them current).
It is necessary to develop standards and software tools to extract, integrate, and publish this knowledge into existing free and open access repositories to support science, education, and biodiversity conservation.
In this talk, an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish is presented. The algorithm is based on the work of Dr. Hong Cui (University of Arizona) and uses semantic analysis, ontologies, and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the book Trees of Costa Rica Volume III and to a subset of descriptions from the Manual of Plants of Costa Rica with very competitive results (average performance above 94.1%). The system receives the morphological descriptions in tabular format and generates XML documents according to the schema proposed by Dr. Cui (available at https://github.com/biosemantics/schemas/blob/master/semanticMarkupOutput.xsd). The schema allows documenting structures, characters, and relations between characters and structures. Each extracted object is documented with attributes like name, value, modifiers, restrictions, and ontology term id, among others.
The implemented tool is free software, was developed using Java, and integrates existing technologies such as FreeLing, the Plant Ontology (PO), the Ontology Term Organizer (OTO), and the Flora Mesoamericana English-Spanish Glossary.
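
A much-simplified Python sketch of the kind of XML output described (element names loosely follow the cited schema, and the ontology identifier is illustrative):

    import xml.etree.ElementTree as ET

    # One description sentence decomposed into a structure with characters
    statement = ET.Element("statement", text="Hojas alternas, 5-10 cm")
    leaf = ET.SubElement(statement, "structure",
                         id="o1", name="hoja", ontologyid="PO:0025034")
    ET.SubElement(leaf, "character", name="arrangement", value="alternas")
    # "from" is a Python keyword, so attributes are passed via a dict
    ET.SubElement(leaf, "character", name="length",
                  **{"from": "5", "to": "10", "unit": "cm"})

    print(ET.tostring(statement, encoding="unicode"))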


Tuesday December 6, 2016 09:30 - 09:45 CST
Auditorium CTEC

09:30 CST

Recognizing the Data Gap of Arthropods in Agricultural Biodiversity
The Final Report of the Task Group on GBIF Data Fitness for Use in Agrobiodiversity, released in February 2016, concentrated primarily on crop plants and their wild relatives, calling for a Darwin Core (DwC) germplasm extension and the integration of the Multi-Crop Passport Data standard, which has been in use and widely accepted for decades by agricultural gene banks. While addressing an important data gap, this interpretation of agricultural biodiversity does not address the enormous fauna that forms a dynamic part of the agroecosystem, and which is often the subject of controlled multi-year, multi-location experiments conducted within the crop and at various scales in the larger landscape.
Traditional agricultural scientists are as much to blame for the lack of appreciation of the extended value of their data in the context of biodiversity as anyone: they do not contribute, nor do they see the value of contributing, their data to the same repositories as those conducting or documenting “traditional” biodiversity inventories. Yet they have a wealth of data, often collected over years in field experiments designed to show differences in how cropping systems and their management may affect the resident and visiting fauna (herbivores, vectors of plant pathogens, pollinators, predators, parasites, and parasitoids) and their phenology through time. This represents a large untapped reservoir of raw data from a community that generally publishes only distilled and analysed specimen observations, and only deposits vouchers of observed species in museums. But to make use of those data for purposes other than their original use would require not only certain modifications of systems to easily accommodate them, but a real change in attitude on the part of the researchers, and credit for making such efforts. What standards might need to be added? And how can we either mine existing datasets or persuade agricultural scientists to contribute their data to the global conversation? This is a gap in the animal (particularly invertebrate) area, but integrating this with various crops, cropping systems, and cultural practices could well provide insights for pollination services, predator/prey dynamics, epidemiology of plant viruses, invasive species, and long-distance movement of pests.

Speakers

Gail Kampmeier

Affiliate, Illinois Natural History Survey, Prairie Research Institute, Univ of Illinois
Entomologist; Code of Conduct Committee member; TDWG Program Co-Chair; Editor-in-Chief, Biodiversity Information Science and Standards (BISS)


Tuesday December 6, 2016 09:30 - 09:45 CST
Computer Science 3 Computer Science

09:45 CST

Understanding mass flowering of dipterocarps through semantic occurrence information extraction
Forest restoration and rehabilitation is a challenge in biodiversity conservation that requires the understanding of data collected over long-term periods from large-scale geographic areas, given the complex and long reproductive cycles of forest trees. In the Philippines, the lowland tropical forests composed primarily of dipterocarp species are one of the most threatened ecosystems in the world. Dipterocarps, belonging to the family Dipterocarpaceae, are economically and ecologically important due to their timber value as well as their contribution to wildlife habitat, climatic balance and stronghold on water releases. They exhibit supra-annual mass flowering events that occur in irregular intervals of two to ten years, possibly synchronously across Asia. In order to understand the mass flowering of dipterocarps within the context of their effective natural regeneration and reforestation, we propose to exploit the enormous amount of text-form biodiversity records in taxonomic literature, scholarly articles, books and agency reports. We aim to develop and employ information extraction methods to augment structured observation data with occurrence information captured from the literature. To this end, we have developed a schema for the semantic annotation of taxon names, geographic locations, dates, habitat descriptions, authorities, and names of herbaria (in the case of collected specimens) to aid in determining the distribution of dipterocarps. Our proposed schema, furthermore, captures the species’ reproductive state to enable the derivation of phenological patterns and the identification of factors that trigger mass flowering. In this way, we enable the generation of more comprehensive time series occurrence data that include information on reproductive maturity and habitat conditions of dipterocarps. This will facilitate further knowledge discovery tasks focused on restoration of dipterocarp forests.
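
Purely as an illustration, a record following the spirit of the proposed annotation schema might look like the following Python dict; the field names are invented, not the schema itself:

    annotation = {
        "taxon": "Shorea contorta",
        "location": "Luzon, Philippines",
        "date": "1987-04",
        "habitat": "lowland dipterocarp forest",
        "reproductive_state": "mass flowering",
        "evidence": "herbarium specimen",
        "source": "agency report",
    }
    print(f"{annotation['taxon']} recorded {annotation['reproductive_state']}"
          f" at {annotation['location']} ({annotation['date']})")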




Tuesday December 6, 2016 09:45 - 10:00 CST
Auditorium CTEC

09:45 CST

LTAR Research: Aspiring to meet production and conservation objectives on the USDA-ARS Central Plains Experimental Range, Nunn, Colorado, USA
The Long-Term Agroecosystem Research (LTAR) Network consists of 18 sites across the continental United States (US) sponsored by the US Department of Agriculture, Agricultural Research Service, universities and non-governmental organizations. LTAR scientists seek to determine ways to ensure sustainability and enhance food production (and quality) and ecosystem services at broad regional scales. They are conducting common experiments across the LTAR network to compare traditional production strategies (“business as usual” or BAU) with aspirational strategies, which include novel technologies and collaborations with farmers and ranchers. Within- and cross-site network success towards achieving the desired outcomes of enhancing quality food production and reducing environmental impact requires that LTAR scientists and collaborators have well-timed access to various data. We are striving to provide them data and metadata in usable, well-documented and consistent formats.
Scientists at the Central Plains Experimental Range, in collaboration with scientists from Texas A&M University, Colorado State University and the University of California-Davis and local ranchers, designed a novel, co-production grazing study, the Adaptive Grazing Management (AGM) experiment (https://www.ars.usda.gov/plains-area/cheyenne-wy/rangeland-resources-research/docs/adaptive-grazing-management/research/) in 2012. The AGM investigates how rangeland management strategies can be implemented to achieve livestock, vegetation and wildlife objectives for both production and conservation goals in a manner that responds to changing weather/climatic and rangeland conditions, incorporates active learning, and makes decisions based on quantitative, repeatable measurements collected at multiple spatial and temporal scales.
Multiple large (big) data sets are produced in this study, including: vegetation production, composition, and structure; soil water, carbon and nitrogen; livestock diet composition, foraging behaviour, energetics, and weight gains; grassland bird numbers and distribution; carbon/energy/water fluxes (from eddy covariance towers); vegetation phenology (from Phenocams); vegetation greenness (from Normalized Difference Vegetation Index (NDVI) sensors); and precipitation inputs from numerous rain gauges.
Today, data and information are served as static PDF files on a project website, within PowerPoint slides, and as journal articles and reports. But these static documents are limited in showing the extent of the information. As a result, we are investigating the use of a Geospatial Portal for Scientific Research (GPSR), which uses an ESRI geospatial database on the back end to drive an online interface that visualizes data and communicates information in more dynamic ways.


Tuesday December 6, 2016 09:45 - 10:00 CST
Computer Science 3 Computer Science

10:00 CST

Plant Specimen Contextual Data Consensus
Plant specimen contextual data provides information about the plant material that is being analysed in a molecular assay. This information layer is distinct from the investigation layer, which specifies the investigation purpose and its contributors, and from the experiment layer, which provides details of the experimental design. This suggests that a common set of descriptors can be used for reporting the contextual information about a plant sample that is associated with any dataset. The Compliance and Interoperability Working Group of the Genomic Standards Consortium (GSC) facilitates expert-community building and development of recommendations for description of genomic data and associated information. This short presentation will describe a recent effort by the working group to harmonize reporting of contextual data of plant specimens associated with genomic data. The consensus uses a number of concepts from the GSC’s Minimum Information about any (x) Sequence (MIxS) standard and is available at http://gensc.org/the-plant-specimen-contextual-data-consensus/ . The consensus represents a shift in MIxS away from the original core+environmental package model of specifying standards toward more purpose-driven collections of metadata terms sometimes referred to as application profiles. The use of categories for terms within the consensus (organism, sample, treatment, growth medium) can aid in metadata collection. Although this consensus was developed with plant molecular assays in mind, the contextual metadata list can serve well for other types of assays such as phenotyping observations.


Tuesday December 6, 2016 10:00 - 10:05 CST
Computer Science 3 Computer Science

10:00 CST

Enhancing semantic search through the automatic construction of a Biodiversity Terminological Inventory
The increasing growth of literature in biodiversity presents challenges to users who need to discover pertinent information in an efficient and timely manner. In response, text mining techniques offer solutions by facilitating the automated discovery of knowledge from large textual data. An important step in text mining is the recognition of concepts via their linguistic realisation, i.e., terms. However, a given concept may be referred to in text using various synonyms or term variants, making search systems likely to overlook documents mentioning less known variants, which are albeit relevant to a query term. Domain-specific terminological resources which include term variants, synonyms and related terms, are thus important in supporting semantic search over large textual archives. We describe the use of text mining methods for the automatic construction of a large-scale biodiversity term inventory. The inventory consists of names of species, amongst which naming variations are prevalent. We apply a number of distributional semantic techniques on all of the documents in the Biodiversity Heritage Library, to compute semantic similarity between terms and support the automated construction of the resource.
With the construction of our biodiversity term inventory, we demonstrate that distributional semantic models are able to identify semantically similar terms that are not yet recorded in existing taxonomies. Such methods can thus be used to update existing taxonomies semi-automatically by deriving semantically related terms from a text corpus and allowing expert curators to validate them. We propose our inventory as a resource that enables automatic query expansion, which in turn facilitates improved semantic search. Specifically, we developed a visual search interface that suggests semantically related terms available in our inventory but not in other repositories, to incorporate into the search query. An assessment of the interface by domain experts reveals that query expansion based on related terms is useful for increasing the number of relevant documents retrieved. Its exploitation can benefit both users and developers of search engines and text mining applications.
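A minimal sketch of the similarity computation underlying this approach, assuming term vectors have already been learned from a corpus (the vectors below are toy values, not derived from the Biodiversity Heritage Library):

import numpy as np

vectors = {
    "Pinus thunbergii": np.array([0.90, 0.10, 0.30]),
    "Pinus thunbergiana": np.array([0.88, 0.12, 0.28]),  # a likely variant
    "Quercus robur": np.array([0.10, 0.80, 0.40]),
}

def cosine(a, b):
    # Cosine similarity between two term vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "Pinus thunbergii"
ranked = sorted(
    ((name, cosine(vectors[query], vec))
     for name, vec in vectors.items() if name != query),
    key=lambda pair: pair[1], reverse=True,
)
for name, score in ranked:
    print(f"{name}\t{score:.3f}")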


Tuesday December 6, 2016 10:00 - 10:15 CST
Auditorium CTEC

10:05 CST

S09: Panel Discussion
Authors of all of the session's presentations will be available to answer questions and participate in further discussion of the topic of agricultural biodiversity standards and semantics.

Speakers
Gail Kampmeier

Affiliate, Illinois Natural History Survey, Prairie Research Institute, Univ of Illinois
Entomologist; Code of Conduct Committee member; TDWG Program Co-Chair; Editor-in-Chief, Biodiversity Information Science and Standards (BISS)


Tuesday December 6, 2016 10:05 - 10:30 CST
Computer Science 3 Computer Science

10:30 CST

Tuesday AM Break, Poster Viewing, & Registration
CTEC Lobby

Tuesday December 6, 2016 10:30 - 11:00 CST
Lobby CTEC

11:00 CST

Worldwide Engagement for Digitizing Biocollections (WeDigBio)—Our Biocollections Community's Citizen Science Space on the Calendar
Digitization of biocollections is an ongoing and critical task that has been galvanized by technological advances and new resources, including innovations in crowdsourcing and citizen science. Involving citizen scientists in this process increases their awareness of the number, kinds, and value of biodiversity specimens in collections, advances STEM literacy, increases support for biocollections, and builds sustainability for digitization activities. In turn, growing digital biocollections databases have direct implications for the global community who make use of those data for research and education. To build support for biocollections and their digitization activities and to increase digitization rates, we organized the annual Worldwide Engagement for Digitizing Biocollections (WeDigBio) Event. In the two years of the event, dozens of museums and classrooms have hosted onsite digitization events where participants transcribed specimen labels using one of five online platforms (DigiVol, Les Herbonautes, Notes from Nature, Smithsonian Institution’s Transcription Center, and Symbiota). Thousands of additional citizen scientists from more than one hundred fifty countries also contributed, completing tens of thousands of transcription tasks.
Planning and executing WeDigBio events required us to find efficient ways to integrate disparate transcription and participant data across platforms and projects. For example, to accurately tally completed transcription tasks among platforms with different workflows, we developed a method that counts each pass of a record by a volunteer as a single unit, or a fraction thereof, as sketched below. To quantify participation, such as the number of times an individual visited a site, and to estimate participant locations, we relied on tools such as Google Analytics. We used surveys to evaluate event host experiences and participant enjoyment. Here, we present information on the process of organizing an international citizen science event, an analysis of the event’s effectiveness (e.g., transcription rates before, during, and after the event), lessons learned, and future directions.
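A minimal sketch of the tallying idea, under the assumption that a platform requiring N passes per record credits each pass as 1/N of a completed task (the platform settings and events below are illustrative, not actual WeDigBio data):

# Passes required before a record counts as fully transcribed, per platform.
passes_per_record = {"PlatformA": 1, "PlatformB": 3}

# (platform, record_id, volunteer_id) tuples, one per transcription pass.
events = [
    ("PlatformA", "rec-1", "vol-1"),
    ("PlatformB", "rec-2", "vol-1"),
    ("PlatformB", "rec-2", "vol-2"),
]

completed_units = sum(1 / passes_per_record[platform] for platform, _, _ in events)
print(f"completed transcription units: {completed_units:.2f}")  # 1.67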


Tuesday December 6, 2016 11:00 - 11:15 CST
Computer Science 3 Computer Science

11:15 CST

What's in a name? Sense and reference in digital biodiversity information
"That which we call a rose by any other name would smell as sweet.” Shakespeare has Juliet tell her Romeo that a name is just a convention without meaning, what counts is the reference, the 'thing itself', to which the property of smelling sweet pertains alone. Frege in his classical paper “Über Sinn und Bedeutung” was not so sure, he assumed names can be inherently meaningful, even without a known reference. And Wittgenstein later in Philosophical Investigations (PI) seems to deny the sheer arbitrariness of names and reject looking for meaning out of context, by pointing to our inability to just utter some random sounds and by that really implying e.g. the door. The word cannot simply be separated from its meaning, in the same way as the money from the cow that could be bought for them (PI 120). Scientific names of biota, in particular, are often descriptive of properties pertaining to the organism or species itself. On the other hand,  in semantic web technology and Linked Open Data (LOD) there is an overall effort to replace names by  their references, in the form of web links or Uniform Resource Identifiers (URIs). “Things, not strings” is the motto. But, even in view of the many "challenges with using names to link digital biodiversity information" that were extensively described in a recent paper, would it at all be possible or even desirable to replace scientific names of biota with URIs? Or would it be sufficient to just identify equivalence relationships between different variants of names of the same biota, having the same reference, and then just link them to the same “thing”, by means of a property sameAs(URI)?  The Global Names Architecture (GNA) has a resolver of scientific names that is already doing that kind of work, linking names of biota such as Pinus thunbergii to global identifiers and URIs from other data sources, such as Encyclopedia of Life (EOL) and uBio Namebank. But there may be other challenges with going from a “natural language”, even from a not entirely coherent system of scientific names, to a semantic web ontology, a solution to some of which have been proposed recently by means of so called 'lexical bridges'.

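A minimal sketch of the sameAs linking discussed above, using Python's rdflib; both identifiers are illustrative, not authoritative:

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
eol_uri = URIRef("http://eol.org/pages/0000000")           # hypothetical EOL identifier
ubio_uri = URIRef("http://www.ubio.org/NamebankObject/0")  # hypothetical uBio identifier

# Assert that the two identifiers refer to the same "thing".
g.add((eol_uri, OWL.sameAs, ubio_uri))

# Data attached to either URI can now be merged by sameAs closure or a reasoner.
print(g.serialize(format="turtle"))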

Tuesday December 6, 2016 11:15 - 11:30 CST
Auditorium CTEC

11:15 CST

Biospex—A Basecamp for Launching, Advertising, and Managing Biodiversity Specimen Digitization Expeditions
Public participation (i.e., crowdsourcing, citizen science) in the digitization of biodiversity specimens is an appealing strategy for biocollection curators, since it enables them to simultaneously advance digitization, outreach, and sustainability goals.  Several engaging sites for public participation now exist, such as those discussed elsewhere in this symposium.  However, the curator community is still figuring out how to efficiently piece together this constellation of new resources with existing workflows and specimen data management systems.
Biospex (https://biospex.org/) emerged from a series of workshops and hackathons at iDigBio (https://www.idigbio.org/) focused on these new public participation resources.  Biospex is designed to lower barriers to the creation and management of public participation projects, make data flow more easily among relevant actors, build capacity for recruiting and engaging public participants, and enable what is sometimes called "co-created" citizen science.
Most recently, we have been working to enable the packaging of specimens from Symbiota (http://symbiota.org/) into sets with compelling themes ("expeditions"). Expeditions can then be launched on Zooniverse's Notes from Nature (https://www.notesfromnature.org/) and returned to Symbiota, with Biospex moving the specimen and provenance data between platforms. Skeletal records from Symbiota are ingested by Biospex as a Darwin Core Archive, allowing those using comparable specimen data management systems to also use Biospex and Notes from Nature. Further, Biospex now provides a public dashboard for projects that might launch many expeditions (e.g., the WeDigFLPlants partnership between biocollections and the Florida Native Plant Society at https://biospex.org/project/wedigflplants).
We will provide an introduction to these recent activities and an overview of steps that could be taken to move the constellation of new public participation resources into a more coordinated software ecosystem.


Tuesday December 6, 2016 11:15 - 11:30 CST
Computer Science 3 Computer Science

11:30 CST

Creating computable definitions for clades using the Web Ontology Language (OWL)
We present the concept and current state of implementation of phyloreferences, an informatics tool for integrating taxon-linked data by the use of clade definitions with fully computable semantics based on ancestral relationships. Phyloreferences can be represented in the Web Ontology Language (OWL) as a set of logical constraints using concepts defined in the Comparative Data Analysis Ontology (CDAO) as well as those from a new ontology we are developing. CDAO can also be used to represent an entire phylogeny. Any reasoner capable of classification using the OWL2 Direct Semantics (OWL-DL) profile may then be used to generate a list of nodes and branches included by the phyloreference in that particular phylogeny. This methodology may be applied to any phyloreference and any phylogeny that can be represented in OWL, providing a great deal of flexibility in their definition and use.
To continuously validate the correctness and expressive power of our approach, we are building a test suite for a collection of phyloreferences that correspond to phylogenetic clade definitions actually used in the wild. We welcome contributions to this collection, including in the form of taxonomic names whose ambiguity makes existing data integration difficult. In this talk, we will describe our methodology and the current roadmap for implementation. Ultimately we aim for phyloreferences to become a community standard, and we therefore show how the community can participate early on.
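A minimal sketch (in Python, not the authors' OWL implementation) of the core semantics: resolving a node-based clade definition as the most recent common ancestor (MRCA) of two specifiers plus all of its descendants, on a toy tree given as child -> parent links:

parent = {
    "Homo sapiens": "Hominini",
    "Pan troglodytes": "Hominini",
    "Hominini": "Hominidae",
    "Pongo pygmaeus": "Hominidae",
}

def ancestors(node):
    # The node itself plus every ancestor up to the root.
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def mrca(a, b):
    lineage_a = set(ancestors(a))
    return next(n for n in ancestors(b) if n in lineage_a)

def resolve_clade(specifier1, specifier2):
    # All nodes descending from (and including) the specifiers' MRCA.
    members = {mrca(specifier1, specifier2)}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in members and child not in members:
                members.add(child)
                changed = True
    return members

print(resolve_clade("Homo sapiens", "Pan troglodytes"))
# {'Hominini', 'Homo sapiens', 'Pan troglodytes'}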


Tuesday December 6, 2016 11:30 - 11:45 CST
Auditorium CTEC

11:30 CST

SERNEC collaborative georeferencing: leveraging the interoperability between GEOLocate and Symbiota for a large-scale digitization project
The Southeast Regional Network of Expertise and Collections (SERNEC) is a Thematic Collections Network (TCN) focused on the digitization of over 4 million herbarium specimens from the southeastern United States. In order to meet SERNEC’s primary goal of generating a research-ready dataset for public use, all metadata records will be georeferenced. Given the large number of specimens to be digitized and the size of the herbarium network, GEOLocate’s collaborative georeferencing tool is well suited to the network’s needs. Symbiota is a platform for creating specimen-based biodiversity information communities online. SERNEC is utilizing both of these tools in order to meet its goals. Newly built functionality between these two software packages will provide a stable and efficient platform for georeferencing specimens. Specifically, interoperability between GEOLocate and Symbiota was extended by adding a set of web services to their API libraries that allow project managers to coordinate real-time data augmentation through the two applications. Several user interfaces were created within Symbiota that give collection managers the ability to push data packages of non-georeferenced occurrences from a Symbiota portal into a targeted GEOLocate community expedition. Globally unique identifiers and Darwin Core Archive transfer protocols are implemented to enable coordinated record flow between the two applications. The collaborative georeferencing tools also allow SERNEC to engage its wider group of stakeholders to crowdsource the georeferencing tasks within online communities built within the system. We will give an overview of these newly built functionalities and present how they create an efficient workflow for georeferencing in a collaborative environment. We will also describe how data provenance is handled in the system and how other projects can leverage these new tools.


Tuesday December 6, 2016 11:30 - 11:45 CST
Computer Science 3 Computer Science

11:45 CST

Creature Features: A semantic toolkit for biodiversity trait data
For vast areas of the globe and large parts of the tree of life, data on trait diversity are grossly incomplete. When fully assembled, these trait data form the links between the evolutionary history of organisms, their assembly into communities, and the nature and functioning of ecosystems. Recent efforts to close data gaps have focused on collating trait-by-species databases that provide species-level aggregated values or ranges and almost always lack the direct observations on which those ranges are based. Digitized biocollections records collectively contain an under-utilized trove of trait data measured directly from individuals, but this content remains hidden and highly heterogeneous, impeding discoverability and use. We developed a successful proof-of-concept that targeted body length and mass data found in digitized records published by VertNet, a thematic biocollections publishing platform, demonstrating that extraction, harmonization, and re-provisioning of specimen-level trait data are possible. We also characterized all the other trait contents in VertNet and attempted to align traits broadly to known ontologies. We report on the outcomes of these efforts in this talk. We also discuss critical ways to extend this proof of concept to gather other trait data from multiple taxa and to develop a more complete workflow for effective use of these data in research. We refer to our semantically based toolkit as Creature Features. Creature Features will be a toolkit for assembling trait data from digitized specimen data, with ontologies and semantic tools at its foundation. It is meant to leverage existing efforts in the model organism community, will be based on a semantic model, and will be powered by extensible parsers, a backend graph database, and an API. A key aspect of Creature Features will be the ability to collect, store, aggregate, and share data at the individual or specimen level and at higher levels without loss of information. We discuss the research potential of such a toolkit and how to develop it to most effectively leverage popular portals (e.g., VertNet, iDigBio) and software such as R, in order to make data broadly accessible to scientists in biodiversity and other biological domains.
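A minimal sketch of the kind of specimen-level trait extraction described here, pulling body length and mass out of free-text fields; the pattern is illustrative, and real verbatim records are far more heterogeneous:

import re

PATTERN = re.compile(
    r"(?P<trait>total length|mass)[:\s]+(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|g)\b",
    re.IGNORECASE,
)

def extract_traits(verbatim):
    # Return one harmonized measurement per pattern match.
    return [
        {"trait": m["trait"].lower(), "value": float(m["value"]), "unit": m["unit"]}
        for m in PATTERN.finditer(verbatim)
    ]

print(extract_traits("total length: 95 mm; mass: 12.5 g; tail damaged"))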


Tuesday December 6, 2016 11:45 - 12:00 CST
Auditorium CTEC

11:45 CST

Widening a label transcription website community and activity
Since it started in 2012, the LesHerbonautes website (http://lesherbonautes.mnhn.fr) has demonstrated its ability to enroll enthusiastic volunteers in the label transcription of herbarium specimens. Thanks to scientific communication on biodiversity, gamification, and discussion between members, a community of 2,500 citizen scientists has shared more than 2 million contributions on 160,000 specimens. Redundancy, training and communication among members ensure a good overall quality of transcription.
While operating this crowdsourcing platform, we have seen several ways in which we can broaden the activity.
Firstly, as the proposed tasks might attract people interested in zoology as well as paleontology, the parameterization of the questions was improved so as to address those fields and propose relevant specimen images.
In the herbarium field, volunteers can be involved in image quality control during the digitization phase preceding label transcription. The Recolnat project was an opportunity to experiment with this.
Since herbarium labels are often multilingual, it matters that we reach an audience not limited to the French language. After partial success with mixing different languages on the same website for the flora of Argentina, it seems that separate websites, like the Die Herbonauten pilot, run independently with their own community management on the same open source code base, are a more effective solution. Extending the federation beyond sites sharing common principles requires establishing standards and protocols for interoperability.
Different projects are hosted by the same website, and we saw that community involvement is higher when there is a visible link between the transcription and a specific research project. This is an argument in favour of proposing different missions with more creative workflows that focus on specific label information or even specimen observations.


Tuesday December 6, 2016 11:45 - 12:00 CST
Computer Science 3 Computer Science

12:00 CST

Logic that embraces systematic progress and persistent conflict - an update on taxonomic concept reasoning
During the Taxonomic Databases Working Group (TDWG) 2013 meeting in Florence, Italy, results of an early-stage logic reasoning toolkit were presented. The toolkit is called "Euler/X" and is openly available at https://github.com/EulerProject/EulerX. Euler/X uses the syntax and semantics of the TDWG-ratified Taxonomic Concept Transfer Schema (https://github.com/tdwg/tcs). It is a multi-hierarchy alignment reasoning tool that represents all intra-hierarchy components as taxonomic concepts with parent/child relationships, and the inter-hierarchy relationships as set constraints (Region Connection Calculus [RCC-5]: congruence, [inverse] inclusion, overlap, and exclusion). These five relations are illustrated in the sketch below. The use case presented at TDWG 2013 is now published at http://dx.doi.org/10.1371/journal.pone.0118247.
Over the past three years we have worked broadly and deeply at the intersection of TCS syntax and semantics, logic reasoning with Euler/X, and the heterogeneous primary systematic literature. This work strongly indicates that we are achieving a novel, widely compatible way to represent and reconcile systematic progress and persistent conflict in ways that are more powerful than the Linnaean system yet entirely compatible with it, and are readily interpretable by humans and actionable by machines.
Progress with use cases - spanning angiosperm floras, primate classifications, insect revisions, and bird phylogenies - will be reviewed, showing that the foundations for linking systematic advancement to logic reasoning are well established. Former scalability bottlenecks are being resolved increasingly through custom reasoning solutions. At present, reconciling two input hierarchies, each with 1,500 taxonomic concept labels, is feasible on a desktop.
The TCS approach and related reasoning tools have profound implications for the development of biodiversity informatics. We will argue that this approach is ready now for direct integration with voucher-based biodiversity data portals, allowing systematic expert contributors to regain recognition for critical contributions to quality data packages, and users to control for the robustness of data-driven analyses while recognizing that taxonomic or phylogenetic structure is almost never a "constant", but instead a variable that we can logically represent and control for.
Because the TCS offers critical services to this field that the Darwin Core (DwC) standard alone is not designed to offer, our results also point to the need to eventually augment or replace DwC with a more suitable standard that achieves the semantics we need to reliably integrate biodiversity data at scale.
Reference:
Franz, N.M., N.M. Pier, D.M. Reeder, M. Chen, S. Yu, P. Kianmajd, S. Bowers & B. Ludäscher. 2016. Two influential primate classifications logically aligned. Systematic Biology 65(4): 561–582. doi:10.1093/sysbio/syw023
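A minimal sketch of the five RCC-5 articulations applied to toy taxonomic-concept extensions (an illustration of the constraint vocabulary, not of Euler/X itself):

from enum import Enum

class RCC5(Enum):
    CONGRUENT = "=="
    INCLUDES = ">"
    INCLUDED_IN = "<"
    OVERLAPS = "><"
    DISJOINT = "!"

def rcc5(a: set, b: set) -> RCC5:
    # Classify the set relation between two concept extensions.
    if a == b:
        return RCC5.CONGRUENT
    if a > b:
        return RCC5.INCLUDES
    if a < b:
        return RCC5.INCLUDED_IN
    if a & b:
        return RCC5.OVERLAPS
    return RCC5.DISJOINT

concept_sec_2001 = {"sp. A", "sp. B", "sp. C"}   # a concept per one classification
concept_sec_2014 = {"sp. B", "sp. C", "sp. D"}   # the "same" name per another
print(rcc5(concept_sec_2001, concept_sec_2014))  # RCC5.OVERLAPS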


Tuesday December 6, 2016 12:00 - 12:15 CST
Auditorium CTEC

12:00 CST

Notes from Nature 2.0: Standardization for citizen science at scale
Notes from Nature (http://www.notesfromnature.org; NFN) is a citizen science tool focused on public engagement and label transcription of natural history specimens. The project was developed collaboratively by biodiversity scientists, curators, informatics experts and experts in citizen science, within the well-established Zooniverse platform. Notes from Nature launched in April 2013 and has been successful by any measure, with over 9,190 registered participants providing 1,340,000 transcriptions. While successful, NFN has been difficult to scale up for broadest community use, both for natural history collections providers and for citizen scientists. This talk introduces the newly re-launched Notes from Nature, which leverages new tools produced by both the natural history collections community and the Zooniverse to revolutionize how new collections are brought online. The key innovations are two tools, Biospex and the Zooniverse Project Builder, that together dramatically simplify and automate the creation of new expeditions. The new Notes from Nature also has streamlined and enhanced transcription tools, provider dashboards, and soon-to-be-upgraded user profile pages. These tools work via standardization, and we are further developing means to standardize outputs. In particular, work is ongoing to build services that return "best transcripts" along with data quality assessments to our providers. We also discuss engagement efforts and overall standardization and interoperability with other biodiversity informatics tools. Such improvements help cement Notes from Nature's place as a critical component of the ecosystem of tools needed to unlock vast legacy biodiversity data for the broad public good.
 


Tuesday December 6, 2016 12:00 - 12:15 CST
Computer Science 3 Computer Science

12:30 CST

Tuesday Lunch
Lunch is served in the Cafeteria (also known as "Soda").
 

Tuesday December 6, 2016 12:30 - 14:00 CST
Cafeteria Cafeteria

14:00 CST

Semantics to standardise the interpretation of flower-visiting data
In previous work we implemented a prototype of an ontology-based semantic enrichment and mediation system for flower-visiting data digitized from the labels of flower-visiting insect specimens in natural history collections. This system transformed database records documenting physical specimens into enriched records of ecological events, e.g. a FlowerUtilizingEvent or a FlowerProductUtilizingEvent. In subsequent work we created a probabilistic model (Bayesian Network) of the causal knowledge that an expert implicitly uses to interpret individual specimen records, e.g. to assert that an insect was probably foraging for pollen or probably foraging for nectar.
The objective of the present work is to link interpretations of individual organisms' behaviour (i.e. output from the Bayesian Network) to aggregated records of behavioral interactions. These aggregations are population samples, which allow the data to be interpreted at a higher level of abstraction (population and community level), i.e. in terms of ecological relationships between population samples of different species. Using the widely adopted modelling construct of the Interaction Network (represented by an ontology), we modelled an ecological community as a network of interacting populations of different species. Each node in the interaction network is a population sample of a different species. Nodes are connected by edges representing ecological interactions, of which there are several types.
We envisage a system that will allow a user to filter database records by spatio-temporal extent so as to realistically model a network of co-existing populations. The user will then be able to adjust the level of precision of the visualised ecological interactions (e.g. ecologicalInteraction > foragingEcologicalInteraction > nectarForagingEcologicalInteraction).
Explicit semantics could bring a degree of standardization to the construction of interaction networks and the interpretation of flower-visiting data.
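A minimal sketch of such an interaction network using the networkx library; the interaction-type hierarchy is illustrative, not the authors' ontology:

import networkx as nx

# Illustrative child -> parent links in the interaction-type hierarchy.
PARENT_TYPE = {
    "nectarForagingEcologicalInteraction": "foragingEcologicalInteraction",
    "foragingEcologicalInteraction": "ecologicalInteraction",
}

def generalize(interaction_type, levels):
    # Walk up the hierarchy to view an edge at coarser precision.
    for _ in range(levels):
        interaction_type = PARENT_TYPE.get(interaction_type, interaction_type)
    return interaction_type

G = nx.MultiGraph()
G.add_node("Apis mellifera (sample 12)", kind="population sample")
G.add_node("Aloe ferox (sample 7)", kind="population sample")
G.add_edge("Apis mellifera (sample 12)", "Aloe ferox (sample 7)",
           interaction="nectarForagingEcologicalInteraction")

for u, v, data in G.edges(data=True):
    print(u, "--", generalize(data["interaction"], 2), "--", v)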


Tuesday December 6, 2016 14:00 - 14:15 CST
Auditorium CTEC

14:00 CST

How do managers and scientists decide if citizen science data are trustworthy? Modeling data quality and trust together
Citizen science projects offer innovative approaches for addressing data deficiencies in biodiversity studies. However, doubts remain about the quality of citizen science data because of the novelty of the methods, the lack of formal training for citizen participants, and other issues related to the engagement of citizens. When there are issues about data quality, trust becomes a key factor for project managers, government officials and scientists in publishing and using citizen science data. Whether or not to trust citizen science teams and the data they produce is a complex behavioral choice, because trust itself is a complicated concept. Research in many academic disciplines, including political science, psychology, business management and computer science, has contributed to our understanding of trust. We present a general model for confidence in using citizen science data that combines aspects of data quality and trust. We demonstrate how this model can be useful in real projects such as eBird and iNaturalist by simulating situations of trust deficit.


Tuesday December 6, 2016 14:00 - 14:15 CST
Computer Science 3 Computer Science

14:00 CST

Annotations Interest Group Meeting
The Annotations Interest Group will hold a working session to examine advances in annotation technologies over the last year, including the use of annotations in the wild in the biodiversity community (e.g. AnnoSys-GBIF integration, FilteredPush-Symbiota integration) and annotation-related activities in the World Wide Web Consortium (W3C). Of particular interest are annotations as sets of assertions concerning the quality of biodiversity data.
Agenda:
  1. Reports on data annotation systems in production use in the biodiversity community.
  2. Discussion of activities in the W3C Web Annotation Working Group https://www.w3.org/annotation/.
  3. Updates on a Task Group for an applicability statement concerning the W3C activities.
  4. Expressing data quality assertions in the Framework for Data Quality as annotations.

New participants in the Annotations Interest Group are encouraged to join.


Tuesday December 6, 2016 14:00 - 15:30 CST
TecnoAula 1 CTEC

14:00 CST

How to standardize a dataset to Darwin Core with OpenRefine
Whether you are a biodiversity data publisher or user, you have probably encountered messy data: variations of the same value, inconsistent date formats, incomplete geospatial information, etc. As a nontechnical person, how do you explore, let alone clean and standardize, such data?
In this workshop, we will teach you how. With the free, open source tool OpenRefine (formerly Google Refine) you will learn how to 1) import a dataset, 2) explore it with facets, 3) clean and standardize it to Darwin Core by clustering and splitting, and 4) export it back as simple Darwin Core... all in easy, repeatable steps. We will also show you how to link your data to the GBIF taxonomic backbone or the Encyclopedia of Life by using external services and crosslinking. And we will try to find decimal coordinates using the Google or MapQuest web services. Intrigued? Join us and we are sure that you will become an OpenRefine adept! Note: this workshop contains a theoretical and a hands-on session, so bring your own computer and data.
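For those who prefer a scripted workflow, the pandas sketch below mirrors the same four steps (the workshop itself uses OpenRefine; column names and file names are illustrative):

import pandas as pd

df = pd.read_csv("messy_occurrences.csv")            # 1) import

print(df["country"].value_counts())                  # 2) explore (a text facet)

# 3) clean and standardize toward Darwin Core: a simple mapping stands in
# for OpenRefine's clustering, and a combined field is split in two.
df["country"] = df["country"].replace(
    {"costa rica": "Costa Rica", "COSTA RICA": "Costa Rica"})
df[["genus", "specificEpithet"]] = df["scientificName"].str.split(" ", n=1, expand=True)

df.to_csv("occurrences_dwc.csv", index=False)        # 4) export simple Darwin Core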


Tuesday December 6, 2016 14:00 - 15:30 CST
TecnoAula 2 CTEC

14:15 CST

Standardising and integrating metadata associated with remote underwater video recordings
Several South African research institutes operate equipment to record marine underwater video footage for the ecological monitoring of, e.g., coastal fish stocks. We designed a local process to upload fish length and count data output from a Stereo Baited Remote Underwater Video (Stereo-BRUV) camera. Data were uploaded to a Specify database using the Specify Workbench, which required us to design a standard Workbench Template to be used by participating institutes. Working more broadly, we mapped the fields in the Specify database to terms in the Darwin Core and Audubon Core schemas to describe the data, with a view to publishing the standardised data. We evaluated the richness of the standardised metadata obtained through the use of these vocabularies. Specific terms to describe BRUV data are unavailable; the development of such terms will aid the discovery, integration and interpretation of BRUV data.
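A minimal sketch of such a field mapping; the local field names and the chosen Darwin Core / Audubon Core terms are illustrative, not the project's actual Workbench template:

# Local Stereo-BRUV output fields -> standard terms (illustrative).
FIELD_MAP = {
    "species_name": "dwc:scientificName",
    "fish_count": "dwc:individualCount",
    "fork_length_mm": "dwc:measurementValue",  # with a measurementType of "fork length"
    "deployment_lat": "dwc:decimalLatitude",
    "deployment_lon": "dwc:decimalLongitude",
    "video_file": "ac:accessURI",
}

def to_standard(record):
    # Keep only mappable fields, renamed to their standard terms.
    return {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}

print(to_standard({"species_name": "Chrysoblephus laticeps",
                   "fish_count": 3, "video_file": "bruv_0042.mp4"}))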


Tuesday December 6, 2016 14:15 - 14:30 CST
Auditorium CTEC

14:15 CST

Citizen science and expert community interactions using Wikwio, a weed knowledge portal, focused on southern Africa
The Wikwio agricultural project aims to build a community of stakeholders, including extension agents, students, farmers and scientists, with a focus on weed species of food and cash crop systems in the southern African region. Interaction among the citizen and scientific communities, facilitated by a technology platform, is a promising approach. After almost three years in existence and steady growth, the platform hosts information on about 420 weed species with a community of around 650 members from 19 countries.
A citizen science module on the Wikwio web platform enables people to contribute their observations about weeds. The module is interconnected with expert-curated species pages and with spatial and document modules that foster interactions between the citizen and scientific communities. These interactions improve the aggregation of scientifically validated data. For example, the citizen science module can inform scientists about the nature and level of occurrence of different weeds in different crop systems, and their seasonality and distribution across the southern African region. In turn, the species pages module can inform extension agents and farmers about appropriate control measures for each species. Platform developers have built custom fields into the citizen science module to allow specific surveys to be commissioned by the scientific community. The comment history in the same module enables dialogue between the citizen and scientific communities, allowing people to pose questions directly to experts, to which experts can respond with additional details and clarifications. Active, wide and large-scale participation from diverse stakeholders in agricultural research and practice is building a vibrant Wikwio community around a facilitating technology platform and a shared scientific practice.
 


Tuesday December 6, 2016 14:15 - 14:30 CST
Computer Science 3 Computer Science

14:30 CST

BiGAEOn: an ontology for biogeographic areas
In the current context of biodiversity loss and climate change, it is more necessary than ever to adapt and develop our scientific practices to face these present-day global issues. Scientists, particularly biologists, have to define new protocols to make the most of the tremendous amount of new data being generated and to analyse them.
Monitoring biodiversity is a complex problem because of its multiple facets and cross-domain links. The creation and use of ontologies to conceptualize these different aspects of biodiversity is an efficient means for key stakeholders and policy makers to promote the consistency and reliability of systems.
For this purpose, the Environment Ontology (ENVO; http://www.environmentontology.org) is a community ontology for the concise and controlled description of environments. It interoperates with other domain ontologies closely linked to the representation of biodiversity, in order to better interface with efforts such as Darwin Core and initiatives to promote the achievement of the United Nations’ Sustainable Development Goals (SDGs).
As part of the ENVO consortium, BiGAEOn is an ontology specifically for biogeographic areas. Biogeographic areas are the basic units used in Comparative Biogeography to produce classifications of biogeographic areas, i.e., bioregionalisation. The BiGAEOn model describes and harmonizes biogeographic entities (e.g. areas of endemism, endemic areas…) as well as their relationships. Hence, it provides a rigorous and simple framework that improves biogeographic analyses and interoperability between systems.
In particular, BiGAEOn integrates formal descriptions of WWF ecoregions (http://www.worldwildlife.org). In this presentation, we will illustrate how our ontology fits current debates with a case study on Australia, since it is currently the scene of the bioregionalisation revival.


Tuesday December 6, 2016 14:30 - 14:45 CST
Auditorium CTEC

14:30 CST

Potential of mobile search logs in citizen science context for biodiversity monitoring
Pl@ntNet is a web platform started in 2009 and dedicated to identifying, exploring and sharing observations of plants using pictures. It is organized by location into different databases for the floras of Europe and of tropical regions including the Indian Ocean, French Guiana and North Africa. The platform uses crowdsourcing approaches and machine learning tools. It supports a computational infrastructure for a mobile plant identification service based on automated image analysis. This service, freely available on iPhone and Android platforms and the web (http://identify.plantnet-project.org/), was initially set up for a fraction of the European flora (800 species at the beginning). Currently there are 6,000 species of the European flora in the database and more than 20,000 users per day. With more than two million downloads in more than three years, this infrastructure is able to produce a large volume of botanical observations (more than two million occurrence records, with a growth rate of more than 200% per year) contributed by people from many backgrounds and interests.
The volume of data produced daily by this initiative (5,600 occurrence records per day in the summer of 2016) has a potentially huge impact for biodiversity studies. Appropriate use of this huge volume of noisy data (in terms of identification and geolocation precision) will only be possible with the resolution of specific scientific challenges (related to large-scale collaborative data revision and enrichment), which will be presented and discussed with the TDWG community.
Development of Pl@ntNet continues. Based on user feedback, we are considering different development directions for educational users at the K-12 and university levels, and for expansion to new geographical regions including the Caribbean, North America, and the tropical Andes.
 


Tuesday December 6, 2016 14:30 - 14:45 CST
Computer Science 3 Computer Science

14:45 CST

A Conceptual Framework Developed to Integrate Scientific Tacit Knowledge into OntoBio
Biodiversity data are complex, abundantly available, and spread out over a multitude of repositories. These data can be classified as semi-structured and are organized differently, depending on the elicitor or the expert who generated the knowledge. This constitutes the problem of biodiversity data interoperability. To mitigate such problems and to improve knowledge acquisition, OntoBio was developed.
The methodology adopted for the development of OntoBio uses explicit knowledge to define the ontological schema of the domain. Thus, the tacit knowledge of the domain is not considered during modeling; yet much more could be inferred, and the scope of the modeled schema amplified, if tacit knowledge were considered during formalization. Incorporating tacit knowledge into ontological schemas has the purpose of increasing the expressiveness of ontologies. This purpose has guided the development of a conceptual framework to incorporate semantics into formal ontologies through tacit knowledge.
The conceptual framework consists of the following steps: (1) knowledge elicitation; (2) knowledge formalization; (3) ontology matching; (4) recommendations for ontology evolution; and (5) analysis of the recommendations for ontology evolution. The application of the framework to OntoBio has produced two main outputs: (a) recommendations for the evolution of the underlying ontology from the domain Expert Mental Model (EMM). In this research, EMMs refer to more specific fact situations, rather than more general phenomena. For each newly elicited and formalized EMM, new recommendations for change become available, and the ontology becomes a dynamic instrument of knowledge representation; and (b) for each EMM applied to the framework, a Progressive Formalization Schema (the knowledge of a domain may be presented at different levels of formalization, from text documents to explicit rules) is generated, allowing ontology engineers to revisit the elicited and formalized knowledge for further use. It also allows access to knowledge at different levels of granularity and minimizes the semantic losses that may occur at different levels of knowledge representation.
Steps (3) and (4) are under development, and the tests carried out so far have covered the ichthyology domain. The next phase of the research includes the design of an experiment to elicit scientific knowledge from strategic research groups at the Instituto Nacional de Pesquisas da Amazônia (INPA), for example in ornithology, and to disseminate the EMMs. This implies that any new EMM should be mapped to OntoBio, resulting in improvements to the ontology.


Tuesday December 6, 2016 14:45 - 15:00 CST
Auditorium CTEC

14:45 CST

Identifying biodiversity using citizen science and computer vision: Introducing Visipedia
Accurate species identification forms the foundation for our knowledge of the natural world. It is a prerequisite to citizen science, conservation, and public engagement in the natural world, but most people can name only a tiny fraction of the species around them. For a novice, classifying species to the correct family or genus can be daunting and some species are difficult for experts to identify. Even in a popular taxon such as birds, the availability of experts does not scale across broad geographic regions to engage broader communities of users. To overcome these limitations, we are developing Visipedia, which engages citizen scientists to gather media that are then identified by experts. These expert-identified and scientifically curated media are used to build computer vision and machine learning models to identify species in images.
Methods. Our goal is to engage citizen scientists in building image datasets of various taxonomic groups beginning with birds. We are developing a system to collect data from citizen scientists in many formats, including images, audio recordings, videos, and observations for all taxa along with critical metadata. This material is archived in the Macaulay Library at the Cornell Lab of Ornithology. Citizen scientists are providing annotations to assist the computer learning. Experts create verified testing and training datasets.
Results. On-demand computing infrastructure currently being tested provides real-time detection and classification algorithms across taxa. Classification services are delivered to the community, meeting the identification needs of end-users. These services support front-end interfaces such as Merlin Bird Photo ID, which helps the community identify 400 species of birds in images. In the process, we are growing and modernizing the Macaulay Library, an archive of more than 1.2 million scientifically curated images, audio recordings and videos of wildlife that have been collected since 1929. The user community for Merlin is also huge – to date more than 1 million people have used Merlin on their smartphones.
Conclusions. Visipedia classification services will become a way to verify the identification of photos submitted to citizen science projects or images taken by remote sensors. By employing machine learning and computer vision we are developing a novel monitoring tool, capable of working in locations in which expert reviewers are unavailable.


Tuesday December 6, 2016 14:45 - 15:00 CST
Computer Science 3 Computer Science

15:00 CST

An Introduction to the Plant Phenology Ontology
Plant phenology — the timing of plant life-cycle events, such as flowering or leafing-out — has cascading effects on multiple levels of biological organization, from individuals to ecosystems, and is crucial for understanding the links between climate and biological communities. Today, thanks to data digitization and aggregation initiatives, phenology monitoring networks, and the efforts of citizen scientists, more phenologically relevant data are available than ever before. Unfortunately, combining these data in large-scale analyses remains prohibitively difficult, mostly because the organizations producing phenological data use non-standardized terminologies and metrics during data collection and data processing. The Plant Phenology Ontology (PPO) is a collaborative effort to help solve this problem by developing the standardized terminology, definitions, and term relationships that are needed for large-scale data integration. In this talk, I will give an overview of the PPO, including the high-level design of the ontology, examples with real phenological data, and future development efforts.


Tuesday December 6, 2016 15:00 - 15:15 CST
Auditorium CTEC

15:00 CST

Flickr biodiversity data quality: Can citizen scientists identify swallowtail butterflies?
Biodiversity occurrence records are the foundation for research in applied and theoretical biodiversity studies. Even though aggregators like GBIF serve such data in large quantities, major gaps and biases exist. To address these gaps through a citizen science approach, we investigate butterfly images on Flickr, contributed by the broader community, as a rich potential source of data. Specifically, we investigate the quantity of existing records and the quality of identifications for swallowtail butterflies. We queried for 569 species of Papilionidae butterflies on Flickr and found more than 60,000 records of 355 species, as compared to close to 250,000 records available on GBIF. We explored these data by developing a website and presenting photographs with associated metadata to selected experts, to check the accuracy of the identifications provided on Flickr. This method may be utilized for flora and fauna that can be identified using photographs and are interesting enough for citizens to photograph and share on social networking sites.
 


Tuesday December 6, 2016 15:00 - 15:15 CST
Computer Science 3 Computer Science

15:15 CST

Bridging discrepancies across North American butterfly naming authorities: supporting citizen science data integration
Citizen science monitoring programs have collected data on the presence, abundance and distribution of butterfly species (Papilionoidea and Hesperioidea) across North America on a regular basis, some for up to forty years. These observations are fundamental for analyzing trends in butterfly populations over time and space in response to environmental changes, and are all the more valuable because butterflies, with differing requirements for larval and adult stages, are particularly good bioindicators of ecosystem disturbance. In the last 15 years, citizen scientist enthusiasm to participate in butterfly monitoring has sparked a tremendous expansion of new survey projects that would otherwise not be feasible. The North American Butterfly Monitoring Network (http://www.thebutterflynetwork.org/) tracks 30+ independent programs ranging in scope from local to continental, all of which recruit local citizen scientists to collect field observations. No global naming authority yet exists for butterflies, so each program operates using a taxon list of its own choosing, which may or may not derive from a published, authoritative list. To look at broad-scale patterns, data must be integrated across independent monitoring projects, but species- and subspecies-level name conflicts significantly complicate the integration of data among projects, and in some cases make integration impossible. To resolve as many issues as possible, we developed a data structure to act as a bridge in interpreting nomenclatural discrepancies, sketched below. We aligned a cumulative total of 3,201 species names and 3,282 subspecies names from the three most recently published North American butterfly taxonomic checklists (NABA 2nd edition (2001), Opler and Warren (2003), and Pelham (2014)), as well as North American species from the global Integrated Taxonomic Information System (ITIS), to characterize and resolve all discrepancies. Pair-wise agreement between these base lists ranged from 78% to 95%. We worked with 10 programs to identify which of these base taxonomies their list most closely resembled and to record any name deviations from that base. The project taxon lists ranged in size from 80 to 244 taxa. None of the lists matched a base list exactly; the highest number of deviations between a project list and its base list was 22. Most deviations were due to genus-level disagreements. Our alignment unambiguously relates any name from any participating survey group to the equivalently defined taxonomic entity on any other participating group's list. Defining comparable relationships between authorities allows essential cross-talk among the multitude of established and newly developing research agendas monitoring the dynamic occurrences of North American butterflies.
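A minimal sketch of the bridging idea: every (list, name) pair resolves to a canonical taxon key, so a name from one survey group can be translated into its equivalent on another group's list (all names and keys below are made up):

# (source list, name) -> canonical taxon key
BRIDGE = {
    ("NABA2", "Pterourus glaucus"): "taxon:0001",
    ("Pelham", "Papilio glaucus"): "taxon:0001",
}
# (canonical taxon key, target list) -> name on that list
NAME_ON_LIST = {
    ("taxon:0001", "NABA2"): "Pterourus glaucus",
    ("taxon:0001", "Pelham"): "Papilio glaucus",
}

def translate(name, source_list, target_list):
    key = BRIDGE[(source_list, name)]
    return NAME_ON_LIST[(key, target_list)]

print(translate("Pterourus glaucus", "NABA2", "Pelham"))  # Papilio glaucus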


Tuesday December 6, 2016 15:15 - 15:30 CST
Computer Science 3 Computer Science

15:30 CST

Tuesday PM Break, Poster Viewing, & Registration
CTEC Lobby

Tuesday December 6, 2016 15:30 - 16:00 CST
Lobby CTEC

16:00 CST

Defining dataset specifications to communicate data quality characteristics
The Darwin Core standard provides a list of community-ratified terms for sharing biodiversity information. Although some terms have strict definitions, most allow users a certain level of freedom in how to interpret them. This degree of freedom has enabled a wide range of biodiversity data to be mapped to Darwin Core, but it complicates automated data aggregation and processing. One way to resolve this is community-specific guidelines describing how data should be mapped, but few have been created or adopted. Moreover, these are intended for humans only.
Inspired by existing data validation specifications in other fields, we propose the use of a specification file describing the constraints with which the data should comply. Its syntax is human- and machine-readable, so it can be used to communicate expected data quality/conformity and to validate data automatically. The scope of the set of rules can be specific to a dataset, publisher or community, which allows bottom-up and top-down adoption.
In this talk, we will present a prototype format for these specifications, in which the rules are defined at the level of individual terms and expressed as a YAML file, as sketched below. We also present prototype software to validate data against these specifications. We hope it will trigger a discussion on how to express data specifications and mapping guidelines.
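A minimal sketch of the idea, with a guessed rule syntax (not the prototype format presented in this talk):

import yaml  # pip install pyyaml

SPEC = yaml.safe_load("""
basisOfRecord:
  allowed: [HumanObservation, PreservedSpecimen]
individualCount:
  min: 1
""")

def validate(record):
    # Check one record against every per-term rule in the specification.
    errors = []
    for term, rules in SPEC.items():
        value = record.get(term)
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{term}: {value!r} not in {rules['allowed']}")
        if "min" in rules and (value is None or value < rules["min"]):
            errors.append(f"{term}: {value!r} below minimum {rules['min']}")
    return errors

print(validate({"basisOfRecord": "HumanObservation", "individualCount": 0}))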


Tuesday December 6, 2016 16:00 - 16:15 CST
Auditorium CTEC

16:00 CST

Citizen Science in Biodiversity Research: Defining infrastructure needs and standards to increase global monitoring
Although most citizen science monitoring projects are less than 50 years old, they are already making major contributions to biodiversity monitoring on a global scale. Some programs are very successful in sharing their data, while a number are not yet using the infrastructure developed by TDWG and GBIF. This workshop will address two fundamental questions: 1) What are the barriers for citizen science projects to share their data with GBIF? and 2) What new standards, such as common names, survey methods, and data quality metrics, will enhance the flow and quality of biodiversity data for citizen science programs? Short presentations will be solicited from among the conference attendees, and smaller working groups will be formed. From these activities we will update the citizen science interest group charter at TDWG, make plans for presentations at citizen science meetings around the globe, and develop working groups to produce white papers that address the special needs of citizen science biodiversity research.


Tuesday December 6, 2016 16:00 - 17:30 CST
Computer Science 3 Computer Science

16:15 CST

AnnoSys – Improving Data Quality by Annotating Virtual Specimens
AnnoSys is an online annotation management system and repository for published specimen data records. Traditionally, experts improved data quality by placing annotations directly with the specimens; AnnoSys fulfils the same function for published data. Linked Data, REST and SPARQL web services provide public access to annotations and the respective copy of the original record. BGBM Berlin is committed to sustaining AnnoSys beyond the financed project phase.
AnnoSys is now employed in a dozen specimen portals, including GBIF. Annotation data are stored together with the original data in the AnnoSys repository, so annotations may be processed by those in charge of the collection now, later, or not at all.
Specimen data are often accessible through multiple portals. For example, the same botanical type specimen from the Berlin herbarium is published through JSTOR Global Plants, GBIF, GGBN, BioCASE, the German Virtual Herbaria, JACQ, Europeana, and the BGBM’s local portal. All these portals have or will have an AnnoSys annotation link and (if there is an annotation of the respective record in the repository) a link that provides access to the annotations. Moreover, all records with the same identifier are accessible, independent of the portal originally used as a starting point. (This accentuates the need for a globally accepted system of unique identifiers.)
Independent of that, users of specimen data have access to the annotated record and can thus profit from that shared (often expert) knowledge when using the data in their research. Users can query the data in the AnnoSys portal or subscribe to annotations using criteria referring to the data record. A specialist for a certain family of organisms, working on a flora or fauna of a certain country, may subscribe to that taxon name and country. Another may be interested in the records pertaining to a certain collector, or subscribe to annotations made by a certain specialist. Subscribers are notified by email about any annotations that fulfil their criteria. For curators, a special curatorial workflow supports the handling of annotations.
Implementing this idea has taken a few years of conceptual and programming work. The current system functions with some impediments, but these technical obstacles will be overcome in the near future. With the system in place, we have finally reached the point where the concept can be promoted and user acceptance tested.
 


Tuesday December 6, 2016 16:15 - 16:30 CST
Auditorium CTEC

16:30 CST

New scientific name finding, parsing and resolution tools from Global Names
We are able to investigate biology on grander scales by integrating biological data from multiple sources. The use of scientific names of organisms allows the aggregation of information on the same taxa from many different places. There are impediments to such aggregation, because there is often more than one name for a taxon, or one name may apply to more than one taxon. Names are often spelled with variations: sometimes misspelled, abbreviated, or annotated. Author information often varies dramatically.
To be able to deal with biodiversity information we need tools that disambiguate spelling variants, find names in spite of misspellings, and find synonyms and currently accepted names for a taxon. To mobilize information from scientific literature in general, and from the Biodiversity Heritage Library in particular, we need fast and reliable name finding, reconciliation and resolution tools. With advances in DNA and RNA sequencing there is a dramatic increase in the usage of "surrogate" names that do not follow established rules of nomenclature. These mandate new nomenclatural systems to map legacy literature to molecular knowledge systems.
As part of the Global Names Architecture project we are developing a new generation of high-quality tools with an emphasis on scalability and speed. Our current goal is to create name-parsing, name-resolving and name-finding programs that are able to process the whole corpus of biological literature in a few days, and to re-index the Biodiversity Heritage Library system in 1–2 days. Such speeds will improve the quality of scientific name services globally.
In this presentation we introduce the second generation of the Global Names Parser and the Global Names Resolver tools. Both projects are heavily based on the Scala language and offer orders-of-magnitude increases in throughput over previous tools. For example, the parser is able to process 30 million names/hour per CPU thread with 99% accuracy, and the name-string resolver Application Programming Interface (API) is able to match 1,000 name-strings/second per request. We will also discuss our approach to a global name-string finding effort, which is planned for release by the middle of 2017. We strive to rescan the Biodiversity Heritage Library routinely every time we release an update of our name-string handling algorithms, and as a result continuously increase the quality of biodiversity bibliographic indexes and mappings.
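A minimal sketch of calling a Global Names resolver service from Python; the endpoint and response fields follow the GN resolver as documented around this time, and both should be treated as assumptions to verify against the current API documentation:

import requests

resp = requests.get(
    "http://resolver.globalnames.org/name_resolvers.json",  # assumed endpoint
    params={"names": "Pinus thunbergii|Pinus thunbergiana"},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("data", []):
    print(item.get("supplied_name_string"))
    for result in item.get("results", [])[:3]:
        print("  ->", result.get("name_string"), result.get("data_source_title"))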
 


Tuesday December 6, 2016 16:30 - 16:45 CST
Auditorium CTEC

16:45 CST

Checking scientific plant names in European germplasm holdings as documented in EURISCO
Keywords: ECPGR, EURISCO, plant genetic resources, information system, taxonomic names
 
The European Search Catalogue for Plant Genetic Resources, EURISCO, provides information about 1.8 million accessions of crop plants and their wild relatives, preserved ex situ by almost 400 institutes in Europe and beyond. EURISCO is maintained on behalf of the European Cooperative Programme for Plant Genetic Resources [1]. It is based on a network of National Inventories of 43 member countries and represents an important effort for the preservation of the world’s agrobiological diversity by providing information about the large genetic diversity kept by the collaborating institutions.
 
The germplasm accessions documented in EURISCO presently comprise 6,233 genera and 41,649 species, including synonyms, name variants and misspellings. The dataset is updated regularly, and the correctness of scientific plant names poses one of the most important challenges [2]. In order to distinguish between accepted names and synonyms, or to detect errors such as typos, new or updated information needs to be checked against controlled vocabularies.
 
An overview of the taxonomic composition of European germplasm holdings, based on EURISCO, will be given. A pipeline for checking the consistency of taxonomic plant names against GRIN (the U.S. Germplasm Resources Information Network) Taxonomy and the Catalogue of Life will be presented, along the lines of the sketch below.
 
[1]   S. Weise, M. Oppermann, L. Maggioni, T. van Hintum and H. Knüpffer (2016). EURISCO: The European Search Catalogue for Plant Genetic Resources. Nucleic Acids Res. DOI: 10.1093/nar/gkw755.
[2]   T. van Hintum and H. Knüpffer (2010). Current taxonomic composition of European genebank material documented in EURISCO. Plant Genet. Resour., 8, 182–188.
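A minimal sketch of the kind of consistency check such a pipeline performs, flagging likely typos against a controlled vocabulary (the reference list below is a toy stand-in for GRIN Taxonomy or the Catalogue of Life):

import difflib

ACCEPTED = ["Triticum aestivum", "Hordeum vulgare", "Zea mays"]

def check_name(name, cutoff=0.9):
    # Exact hit, close match (possible typo), or no match at all.
    if name in ACCEPTED:
        return ("accepted", name)
    match = difflib.get_close_matches(name, ACCEPTED, n=1, cutoff=cutoff)
    return ("possible typo", match[0]) if match else ("unmatched", None)

for name in ["Hordeum vulgare", "Triticum aestivun", "Aegilops tauschii"]:
    print(name, "->", check_name(name))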


Tuesday December 6, 2016 16:45 - 17:00 CST
Auditorium CTEC

17:00 CST

Why you must clean your big data
Aggregated big biodiversity databases are prone to numerous data errors and biases. Improving the quality of biodiversity research, in some measure, is based on improving users-level data cleaning tools and skills. Adopting a more comprehensive approach for incorporating data cleaning as part of data analysis will not only improve the quality of biodiversity data, but will impose a more appropriate usage of such data. We estimated the effect of user-level data cleaning on species distribution model (SDM) performance, and exemplified the value of more intensive and case-specific data cleaning, which are rarely conducted by biodiversity big-data users. We implemented several relatively simple and easy-to-execute data cleaning procedures, and tested SDM performance improvement, using GBIF occurrence data of Australian mammals, in six different spatial scales.
Occurrence data for all Australian mammals (1,041,941 records, 297 species) were downloaded from the Australian GBIF node. In parallel, 24 raster layers of environmental variables in Australia (elevation, land use, NDVI, and 21 climatic variables) were compiled at a spatial resolution of 1km2. A Maximum Entropy Model (MaxEnt) was performed for each species in each grid cell, based on data before- and after user-level data cleaning, respectively. We compared model performance before- and after cleaning using one-tailed paired Z-test. The cleaning procedures used in this research improved SDM performance significantly, across all scales and for all performance measures. This finding showcase the value of user-level data cleaning for big data, regardless of spatial scale.
In typical research, data are very expensive, and filtering out or removing a large proportion of the data is inconceivable. In contrast, in the big-data world, data are plentiful and relatively inexpensive, and it is sometimes worthwhile to discard large volumes of data for the sake of data quality. Here, for example, we discarded half a million records, 50% of the database, in order to increase data quality. Thus, tools for easy yet advanced querying of the data are as important as tools for detecting and correcting errors. The results of our study stress the need for data validation and cleaning tools that incorporate customizable techniques. We plan to develop an R package to facilitate comprehensive, structured and reproducible data cleaning.
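As an illustration of what "relatively simple and easy-to-execute" user-level cleaning can look like in practice, here is a minimal pandas sketch over a GBIF occurrence download; the file name, thresholds, and exact filter set are illustrative assumptions, not the procedures used in the study:

import pandas as pd

df = pd.read_csv("gbif_australian_mammals.csv", sep="\t")  # hypothetical GBIF download

before = len(df)
df = df.dropna(subset=["decimalLatitude", "decimalLongitude", "species"])
df = df[(df.decimalLatitude != 0) | (df.decimalLongitude != 0)]  # drop (0, 0) points
df = df[df.coordinateUncertaintyInMeters.fillna(0) <= 10_000]    # drop very coarse records
df = df[df.year >= 1950]                                         # drop very old records
df = df.drop_duplicates(subset=["species", "decimalLatitude",
                                "decimalLongitude", "eventDate"])
print(f"kept {len(df)} of {before} records after cleaning")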


Tuesday December 6, 2016 17:00 - 17:15 CST
Auditorium CTEC

17:15 CST

Quasi-F – An Infrastructure for the Quality Assurance of Citizen Science Data in Germany
In Germany there is a long-standing tradition of citizen science, and thus a large citizen science community, particularly in the life sciences. A large number of people are highly engaged in collecting biodiversity data on their favorite taxon groups and in their favorite environments. This community spans a wide range of knowledge (from amateurs up to experts) about species and environmental parameters, but also about scientific methods for collecting the data. Furthermore, today's increased usage of mobile apps makes data gathering much easier than decades before. However, to date only a few professionals train a large number of interested amateurs. As a consequence, they have to cope with all the quality assurance of an increasing number of data records on their own in order to provide reliable datasets to higher offices like the Federal Agency for Nature Conservation or data portals like GBIF. Moreover, there are many differences in applied methods and a general lack of national standards for quality assurance, although these are needed in order to gain full benefit from the data in biodiversity research and governmental nature conservation.
To address this impediment, the biodiversity informatics department at the Museum für Naturkunde Berlin and other collaborating German partners submitted a three-year project proposal. The project aims to establish a national web-based quality assurance infrastructure with a comprehensive web API that can be easily integrated into the various biodiversity citizen science projects. It will (1) standardize the methods of quality checks and document them transparently for data re-use; (2) establish in situ quality feedback for end users, enabling them to cross-check and verify their observations directly in the field, correct their data, or gather additional data; and (3) develop and investigate a measure for data quality assessment, to be proposed as a new standard for the re-use of citizen science data in governmental nature conservation and biodiversity research in Germany.
The presentation will illustrate the general ideas of the submitted project, related preliminary studies and preparatory work. Furthermore, potential issues and challenges will be discussed.


Tuesday December 6, 2016 17:15 - 17:30 CST
Auditorium CTEC
 
Wednesday, December 7
 

08:45 CST

Wednesday Registration
Lobby of CTEC

Wednesday December 7, 2016 08:45 - 09:00 CST
Lobby CTEC

09:00 CST

TDWG Then and Now
The Taxonomic Databases Working Group (TDWG, now Biodiversity Information Standards) started out in Geneva in 1985, and this meeting will mark its 25th anniversary. TDWG has evolved from a relatively close-knit group within the plant sciences into the current encompassing standards organization underpinning biodiversity publication efforts across the entire Linnean tree, around the whole globe, and at the cutting edge of information technology.
What drove TDWG through the turn-of-the-century's changes in biodiversity science? How has TDWG's scope adapted or evolved to changing challenges? What has been its turnout rate? How have interests shifted over time? Can it be considered a stable, healthy network ready to continue its work, boldly going where no taxonomists have gone before?
Through a network analysis, and representation and visualization of samples of TDWG's themes and interests across its annual meetings, we will picture TDWG's role, change, and adaptation to the flow of biodiversity research, and we will explore what new frontiers may lie ahead to be trodden by TDWG participants between now and BIS/TDWG's 26th anniversary.


Wednesday December 7, 2016 09:00 - 09:15 CST
Auditorium CTEC

09:00 CST

Workshop of TDWG-GBIF Data Quality Interest Group
The Data Quality Interest Group was proposed in 2013 and formally adopted by the TDWG Executive in 2014. A Symposium and Workshop were held at TDWG 2014 in Jönköping, Sweden. The Interest Group was combined with the GBIF discussion group on Data Quality to form the TDWG-GBIF Data Quality Interest Group. Approximately 100 members have expressed an interest in working with the Interest Group since its early stages.
Three Task Groups were established after the Jönköping meeting – viz.
  1. Task Group 1: A Framework on Data Quality
  2. Task Group 2: Tools, Services and Workflows
  3. Task Group 3: Use Case Library
The three Task Groups have made significant progress to date, which will be reported to the workshop. Task Group 1 has submitted a paper to PLOS ONE, which we hope will be published prior to the workshop; Task Group 2 has concentrated on coordinating the many data quality tests being used by data managers around the world, linking these to individual Darwin Core fields, and has prepared a spreadsheet setting out those tests along with the assertions arising from them; and Task Group 3 has prepared an entry form and corresponding spreadsheet for documenting Use Cases.
The Interest Group, along with two of GBIF's Working Groups on Fitness for Use, met in March 2016 in São Paulo, Brazil, formally worked through the Framework, the Tests and Assertions, and the Use Cases, and discussed how these may be applied to Species Distribution Modeling and Agrobiodiversity. A second meeting is being held in Melbourne, Australia in October to discuss some of the next steps and to liaise with the GBIF Fitness for Use Working Group on Alien Invasive Species.
The Workshop at TDWG 2016 will review the work accomplished to date and discuss the next steps, focusing on how the work of the Interest Group – the framework, tests and assertions, and use cases – may be disseminated and how we can encourage their universal adoption by data custodians, data publishers and users. The participation of current members and other interested parties in this Workshop is very important to obtain a broad representation of the biodiversity informatics community.
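To make the "tests and assertions" idea concrete, here is a minimal Python sketch of one record-level test of the kind the Task Group 2 spreadsheet coordinates; the test names, structure, and wording of the assertions are illustrative assumptions, not the group's agreed formulations:

def validate_coordinates(record):
    # Test: decimalLatitude/decimalLongitude must parse and be in range.
    assertions = []
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        assertions.append({"test": "COORDINATES_NOT_EMPTY", "status": "FAILED",
                           "comment": "coordinates missing or unparseable"})
        return assertions
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        assertions.append({"test": "COORDINATES_IN_RANGE", "status": "FAILED",
                           "comment": f"({lat}, {lon}) outside valid range"})
    return assertions

print(validate_coordinates({"decimalLatitude": "95.2", "decimalLongitude": "10.0"}))

Each failed test yields an assertion that can travel with the record, which is what lets aggregators report quality back to data custodians rather than silently discarding records.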


Wednesday December 7, 2016 09:00 - 10:30 CST
TecnoAula 2 CTEC

09:00 CST

Towards Best Practices for the Implementation and Documentation of Biodiversity Informatics Services
One of the important objectives of the TDWG Biodiversity Services and Clients Interest Group (BSC, http://www.tdwg.org/activities/biodiversity-services-clients/charter/) is to promote common service API design, documentation, and registration principles, which would greatly improve the interoperability of biodiversity web services and their applicability in workflow systems. To this end, the interest group compiles existing best practice documents and recommendations and assesses their applicability in the field of biodiversity informatics. This workshop will review and discuss the findings so far, identify missing components, and discuss a roadmap towards a TDWG Applicability Statement.


Wednesday December 7, 2016 09:00 - 10:30 CST
Computer Science 3 Computer Science

09:15 CST

Nanopublications for biodiversity: concept, formats and implementation
The concept of the "nanopublication" was developed by the Concept Web Alliance (http://www.nanopub.org) and is defined as "the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author." A nanopublication includes three key components, or named graphs: (1) the Assertion, a statement linking two concepts (subject and object) via a third concept (predicate); (2) the Provenance, metadata providing context for the assertion; and (3) Publication/citation metadata about the nanopublication itself. A similar machine-readable formalization of knowledge is the "micropublication", which may also include the evidence underlying claims and arguments supporting the assertions.
Nanopublications are proposed as a complement to traditional scholarly articles, allowing the underlying data to be attributed and cited, and providing an incentive for researchers to make their data available in machine-readable formats, thus supporting large-scale integration and interoperability while making it possible to track the provenance of every contribution.
Nanopublications can be derived from research or data papers or their supplementary materials, or can be composed de novo as independent publications used to disseminate various kinds of data that may not warrant publication as a paper. For example, one possibility is to allow export of a nanopublication in the form of a "nanoabstract" by developing a mapping from article XML to nanopublication RDF. This could be facilitated either by mapping tools or via a specially designed user interface where authors can express the most important findings of their articles as assertions in nanopublications.
Nanopublications may potentially play a highly useful role in the challenging process of community curation of biodiversity databases, such as GBIF (see the iPhylo post "Annotating GBIF: from datasets to nanopublications", http://iphylo.blogspot.ie/2015/01/annotating-gbif-from-datasets-to.html), the Catalogue of Life, or taxon name registries. The credit and recognition provided by nanopublications may serve as an incentive for experts and citizen scientists to annotate and amend data for community use.
As machine-readable, RDF-based formalizations of knowledge, nanopublications can be consumed into the Biodiversity Knowledge Graph and will be an essential component of the RDF-based Open Biodiversity Knowledge Management System (OBKMS) developed by Pensoft and Plazi. Nanopublications for some classes of biodiversity data will be implemented first in the Biodiversity Data Journal and TreatmentBank. Nanopublication formats currently under development are: (1) descriptions of new taxa; (2) renaming of taxa and synonymies (nomenclatural acts); (3) new occurrence records; (4) new traits or other biological information about a taxon.
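To illustrate the three-graph structure, here is a minimal rdflib sketch of a nanopublication whose assertion is a single occurrence statement; the example.org URIs, the placeholder ORCID, and the use of Darwin Core in the assertion are illustrative assumptions, not the formats under development:

from rdflib import Dataset, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

NP = Namespace("http://www.nanopub.org/nschema#")
DWC = Namespace("http://rs.tdwg.org/dwc/terms/")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/np1/")

ds = Dataset()
head = ds.graph(EX.head)  # links the nanopublication to its three parts
head.add((EX.np, RDF.type, NP.Nanopublication))
head.add((EX.np, NP.hasAssertion, EX.assertion))
head.add((EX.np, NP.hasProvenance, EX.provenance))
head.add((EX.np, NP.hasPublicationInfo, EX.pubinfo))

assertion = ds.graph(EX.assertion)    # (1) the statement itself
assertion.add((EX.occ1, DWC.scientificName, Literal("Aus bus")))

provenance = ds.graph(EX.provenance)  # (2) context for the assertion
provenance.add((EX.assertion, PROV.wasAttributedTo,
                URIRef("https://orcid.org/0000-0000-0000-0000")))  # placeholder

pubinfo = ds.graph(EX.pubinfo)        # (3) metadata about the nanopublication
pubinfo.add((EX.np, URIRef("http://purl.org/dc/terms/created"),
             Literal("2016-12-07", datatype=XSD.date)))

print(ds.serialize(format="trig"))

Serialized as TriG, each named graph keeps the assertion, its provenance, and the publication metadata separately addressable and citable.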




Wednesday December 7, 2016 09:15 - 09:30 CST
Auditorium CTEC

09:30 CST

COPIS: A Computer Operated Photogrammetric Imaging System
Technological advancements over the past two decades have made information about types and other specimens housed in natural history collections available online in digital form, primarily for research purposes. In the past few years, more emphasis has been placed on digital imaging of specimens, in effect bringing the specimens out of their cabinets and increasingly into public view globally via the World Wide Web. This presentation will introduce full-color, 3-dimensional imaging of external anatomy using photogrammetry, and will describe an architecture known as COPIS (Computer Operated Photogrammetric Imaging System) developed for rapid multi-camera, multi-view image acquisition. In addition to 3D imaging, the outputs of COPIS may also be used in traditional 2-dimensional image analysis.


Wednesday December 7, 2016 09:30 - 09:45 CST
Auditorium CTEC

09:45 CST

The Digital Object Lifecycle of Paleo Data: Concepts of Digital Curation in a Natural History Context
Paleontological data present many challenges: it can often be difficult to maintain best practices, follow established standards and methodologies, and ensure data quality over time. At the Smithsonian National Museum of Natural History (NMNH) Department of Paleobiology, we are developing a comprehensive program for understanding and managing the full digital object lifecycle of our collections and research data. Following the tools and resources developed by the digital curation field, we are able to complete a comprehensive analysis of our digital data and all of its characteristics. This analysis follows a digital object, or in this case a paleontological collections record, from the point of creation, whether through transformation from analog or as born digital, through submission to repository and preservation systems, and then as an output of interoperable information disseminated for consumption by a variety of audiences through many access points. An added complexity is the inherently cyclical nature of biodiversity data, requiring additional consideration for continuous distribution, analysis and enhancement, resubmission, and redistribution over time. By defining the actions, roles, characteristics, and standards needed at each step in the lifecycle, we build the capacity to fully comprehend our data, thereby increasing our ability to enhance standards, workflows, policies, and ultimately long-term data quality and data management. This comprehensive definition of paleontological data also enables more in-depth discussion at the global biodiversity informatics level, contributing to conversations about current standards, data needs, and data usage, and to needed discussions about ways of improving the comprehensiveness and interoperability of paleontological data across institutions and data sources. This talk will cover the efforts and progress made thus far by the NMNH Department of Paleobiology and invite discussion from others undertaking similar studies or interested in collaborating on the global applications of these concepts.


Wednesday December 7, 2016 09:45 - 10:00 CST
Auditorium CTEC

10:00 CST

Building Linked Open Data for Zooarchaeological Specimens and Their Context
Zooarchaeological collections data present special challenges for mobilization into global biodiversity networks, given the critical importance that human site context plays in interpretation. At the same time, faunal remains are biological samples that can be represented using existing standards. Here we present a means to use a linked open data framework to connect cultural context and specimen data in order to support integrated global change research. We demonstrate this approach using a subset of the zooarchaeological holdings of the Florida Museum of Natural History as a case study. We show how these datasets can be expressed using Darwin Core, especially information relating to excavation, chronology, and cultural provenience. We have also developed means to share context information with Open Context, an archaeoinformatics project that is well established in the community. We discuss the importance of linked open data frameworks in representing zooarchaeological data, and the importance of future development of Darwin Core extensions to capture this content appropriately rather than relegating it to container fields such as dynamicProperties. Many of the concepts required to share zooarchaeological data overlap conceptually with paleontological data, and we argue it is timely and necessary to connect biological, paleontological, and archaeological data fully and efficiently for broad scientific use.


Wednesday December 7, 2016 10:00 - 10:15 CST
Auditorium CTEC

10:15 CST

Demonstrating the Prototype of the Open Biodiversity Knowledge Management System
The Open Biodiversity Knowledge Management System (OBKMS) is a suite of semantic applications and services running on top of a graph database storing biodiversity and biodiversity-related information, known as a biodiversity knowledge graph (http://rio.pensoft.net/articles.php?id=8767). A biodiversity knowledge graph is a data structure of interconnected nodes (e.g., specimens, taxa, sequences), compatible with the use of standards for Linked Open Data, and able to be merged with other similar management systems to ultimately form a grand Biodiversity Knowledge Graph of all biodiversity and biodiversity-related information.
The main purpose of OBKMS is to provide a unified system for interlinking and integrating diverse biodiversity data, e.g., taxon names, taxon concepts, taxonomic treatments, specimens, occurrences, gene sequences, and bibliographic information. At this stage, the graph is serialized as Resource Description Framework (RDF) quadruples, extracted primarily from biodiversity publications; database interlinks will follow. Options for expressing Darwin Core-encoded data as RDF for insertion into the graph are being explored.
You are encouraged to listen to the talk in Symposium S01, “The Open Biodiversity Knowledge Management System: A Semantic Suite Running on top of the Biodiversity Knowledge Graph,” to get a grasp of the theoretical underpinnings of OBKMS.
In this computer demo we will demonstrate the prototype of OBKMS that is already running at the Bulgarian Academy of Sciences. We will mostly run SPARQL (SPARQL Protocol and RDF Query Language) queries and showcase some insights learned from the data (sources: Pensoft, Plazi).
We also want the session to be very interactive, gathering users' perspectives on what OBKMS should become after its prototype stage. In particular, after the database stage we want to explore the different services and applications that OBKMS could offer.
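As a flavour of what such queries look like from a client, here is a minimal SPARQLWrapper sketch; the endpoint URL and the use of Darwin Core predicates in the graph are assumptions for illustration, since the prototype's actual endpoint and vocabulary are not given here:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/obkms/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
    SELECT ?treatment ?name WHERE {
        ?treatment dwc:scientificName ?name .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["treatment"]["value"], row["name"]["value"])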
 


Wednesday December 7, 2016 10:15 - 10:30 CST
Auditorium CTEC

10:30 CST

Wednesday AM Break, Poster Viewing, & Registration
CTEC Lobby

Wednesday December 7, 2016 10:30 - 11:00 CST
Lobby CTEC

11:00 CST

BHL - 10 Years and More!
The Biodiversity Heritage Library (BHL) is an international consortium working together to make biodiversity literature openly available to the world as part of a global biodiversity community.  Through its extensive network of Members and Affiliates, over 45 million pages of biodiversity literature are now available through the BHL portal.  Eight Global BHL Nodes work with institutions in their regions of the world to build capacity for digitization and to promote open access to the biodiversity literature.

The Biodiversity Heritage Library has grown to be an important part of the biodiversity infrastructure. In an attempt to solve the literature component of the taxonomic impediment, the BHL continues to provide access to legacy print publications and to make these data widely available for reuse in collections support systems. Recognizing the importance of archival materials, specifically field notes, the BHL has moved to increase coverage of these materials through ongoing projects. Additionally, the BHL has actively worked on a variety of social media platforms.

This talk will focus on a general update of BHL activities, content aggregation, partnerships, technical development and related activities. The talk will also provide an introduction and context for the remainder of the symposium.


Wednesday December 7, 2016 11:00 - 11:15 CST
Auditorium CTEC

11:00 CST

Darwin Core Documentation: More (would be) Better
Data publishers and aggregators, such as VertNet and GBIF, share biodiversity data using the Darwin Core standard. Darwin Core provides definitions of terms and makes recommendations about how to populate the corresponding fields. These recommendations, however, are often not followed by data publishers, which results in highly heterogeneous content. Mapping fields from data sources to Darwin Core can be problematic for a variety of reasons, including misunderstanding and inexact correspondence of concepts. Thus, content meant for one or more Darwin Core fields might be left out or placed incorrectly. Even if mapping is correct, the content of the fields can show great variability in the absence of, or for lack of adherence to, controlled vocabularies. The combination of these problems renders data less discoverable and less readily usable than they could be. In order to close the gaps between data availability and discoverability, it is useful, first, to measure the extent to which Darwin Core suggestions are followed by the community. Second, documentation gaps need to be identified and remedied to improve the use of the standard and ultimately favor data usage. In order to address these needs, we have investigated the heterogeneity of the data shared in Darwin Core fields in VertNet and provide evidence of the necessity for better Darwin Core documentation. We have examined the current state of fields that contain information of high value for ecological and evolutionary research, such as taxonomy, sex, and life stage. We also examined fields used by the community to capture a great variety of types of information: dynamicProperties, occurrenceRemarks, and fieldNotes. We will present a panorama of the current content of these fields and give examples of how data on particular specimen traits (e.g., length and mass) are shared, the degree of their heterogeneity, and how they can be extracted to enhance their discoverability and usability. Finally, we will urge the community to join efforts to improve Darwin Core documentation and provide recommendations to achieve this goal.
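As a small illustration of both the heterogeneity and the extraction problem, here is a Python sketch that recovers a mass measurement from dynamicProperties whether the field holds JSON or free text; the patterns cover only a couple of the many variants found in practice:

import json
import re

MASS = re.compile(r'(?:weight|mass)["\s:=]+([\d.]+)\s*(g|kg)?', re.I)

def extract_mass(dynamic_properties):
    # Return mass in grams, whether the field holds JSON or free text.
    try:  # some providers publish well-formed JSON ...
        data = json.loads(dynamic_properties)
        for key in ("weight", "mass", "weightInGrams"):
            if key in data:
                return float(data[key])
    except (json.JSONDecodeError, TypeError, ValueError):
        pass
    m = MASS.search(dynamic_properties or "")  # ... others free text
    if m:
        value = float(m.group(1))
        return value * 1000 if (m.group(2) or "").lower() == "kg" else value
    return None

print(extract_mass('{"weightInGrams": 12.5}'))          # 12.5
print(extract_mass("mass=0.9 kg; total length=310mm"))  # 900.0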


Wednesday December 7, 2016 11:00 - 11:30 CST
Computer Science 3 Computer Science

11:00 CST

Workshop of TDWG-GBIF Data Quality Interest Group
The Data Quality Interest Group was proposed in 2013 and formally adopted by the TDWG Executive in 2014. A Symposium and Workshop were held at TDWG 2014 in Jönköping, Sweden. The Interest Group was combined with the GBIF discussion group on Data Quality to form the TDWG-GBIF Data Quality Interest Group. Approximately 100 members have expressed an interest in working with the Interest Group since its early stages.
Three Task Groups were established after the Jönköping meeting – viz.
  1. Task Group 1: A Framework on Data Quality
  2. Task Group 2: Tools, Services and Workflows
  3. Task Group 3: Use Case Library
The three Task Groups have made significant progress to date, which will be reported to the workshop. Task Group 1 has submitted a paper to PLOS ONE, which we hope will be published prior to the workshop; Task Group 2 has concentrated on coordinating the many data quality tests being used by data managers around the world, linking these to individual Darwin Core fields, and has prepared a spreadsheet setting out those tests along with the assertions arising from them; and Task Group 3 has prepared an entry form and corresponding spreadsheet for documenting Use Cases.
The Interest Group, along with two of GBIF's Working Groups on Fitness for Use, met in March 2016 in São Paulo, Brazil, formally worked through the Framework, the Tests and Assertions, and the Use Cases, and discussed how these may be applied to Species Distribution Modeling and Agrobiodiversity. A second meeting is being held in Melbourne, Australia in October to discuss some of the next steps and to liaise with the GBIF Fitness for Use Working Group on Alien Invasive Species.
The Workshop at TDWG 2016 will review the work accomplished to date and discuss the next steps, focusing on how the work of the Interest Group – the framework, tests and assertions, and use cases – may be disseminated and how we can encourage their universal adoption by data custodians, data publishers and users. The participation of current members and other interested parties in this Workshop is very important to obtain a broad representation of the biodiversity informatics community.
This represents the second session for this Interest Group and will bring in related task and interest groups.


Wednesday December 7, 2016 11:00 - 12:30 CST
TecnoAula 2 CTEC

11:15 CST

BHL: Grants and Growth
Recently, BHL has embarked on a variety of new initiatives designed to further facilitate research and discovery. Earlier research and development projects developed an algorithm to identify images in the BHL corpus, enhanced metadata for illustrations, and tested games to crowdsource the improvement of Optical Character Recognition (OCR) and citizen-science-driven transcription. Current initiatives support consortia and collection growth and expanded global partnerships. I will report on the results of the completed Purposeful Gaming IMLS grant (transcription crowdsourcing) and the next steps for this work, as well as describe how the Expanding Access to Biodiversity Literature grant is leading to new partnerships and additional content. Membership and affiliate partnerships continue to grow with the addition of institutions in Canada, Europe and Asia as well as the US. Partnering with GBIF and CETAF enhances collections and contributes to cross-linking and additional services.




Wednesday December 7, 2016 11:15 - 11:30 CST
Auditorium CTEC

11:30 CST

BHL-SciELO Network
The Biodiversity Heritage Library-Scientific Electronic Library Online (BHL-SciELO) Network aims to contribute to strengthening the Brazilian information infrastructure on biodiversity through active participation in the BHL Global Network. It was initially proposed by a group of leading biodiversity researchers during the COP 8 meeting held in Curitiba in March 2006. Its regular operation started in 2010, and its development is led by a partnership between the São Paulo Research Foundation (FAPESP) and the Secretariat of Biodiversity and Forests of the Ministry of Environment. The BHL-SciELO Network involves the main biodiversity-related institutions of Brazil. The digitizing operation is carried out in a decentralized way, overseen by a SciELO unit responsible for quality control and transfer to the BHL global repository. By September 2016 the contribution to the BHL collection totalled 300 documents comprising about 200,000 digitized pages, plus the indexing of 23,434 open access articles published by SciELO journals. Regular operation in the coming years is expected to add over 1,200 newly digitized documents and more open access articles from biodiversity-related SciELO journals.


Wednesday December 7, 2016 11:30 - 11:45 CST
Auditorium CTEC

11:30 CST

Darwin Core: Tutorials; best practices; current issues; future directions
We continue our recent tradition of hosting a Darwin Core workshop combining tutorials with facilitated discussion of current issues and future directions for the standard. This year, we will have two sessions. The first will begin with an overview of Darwin Core and its extensions, and of different ways (e.g. spreadsheets; RDF) to represent Darwin Core data. This will be followed by a presentation examining how Darwin Core is used in practice, which will illustrate the variability in the way the standard is interpreted, and also suggest the most important documentation gaps that need to be filled. We will then give an overview of the Apple Core guidelines for using Darwin Core in the context of herbarium records. There will be ample time for questions, and for discussion regarding how to move forward on documentation.
The second session will include a presentation on recently proposed changes to Darwin Core aimed at improving the flow of alien species data, followed by discussion on how best to move forward on these proposals. Time permitting, there will also be an opportunity to discuss other issues in the tracker or on people's minds, such as the W3C's emerging guidelines for tabular data, interoperability between Darwin Core and other types of data, and next steps in DwC/RDF.


Wednesday December 7, 2016 11:30 - 12:30 CST
Computer Science 3 Computer Science

11:45 CST

Towards extracting occurrence data from biodiversity literature
Dmitry Schigel, Carolyn Sheffield, Martin Kalfatovic
The Biodiversity Heritage Library has been successful in detecting scientific names in biodiversity literature, enabling linkages to external infrastructures such as GBIF.org of the Global Biodiversity Information Facility. The very same pages contain geographical information, varying from the names of the regions covered by a flora to detailed locality information for a single specimen, together with the sampling information. Beginning at TDWG 2014, BHL and GBIF have continued to explore opportunities to expand the range of BHL services to deliver organism names and associated location information together. The state of this exploration is presented at TDWG 2016 for discussion.


Wednesday December 7, 2016 11:45 - 12:00 CST
Auditorium CTEC

12:00 CST

Biodiversity Heritage Library as official electronic information source for Biodiversity and Taxonomy studies curriculum
The Biodiversity Heritage Library is currently the largest free global electronic source for historical (and other) biodiversity and taxonomy literature. The BHL platform is connected to key infrastructures such as the Global Biodiversity Information Facility (GBIF) and the Encyclopedia of Life (EOL). This digital library, with its specific services, user tools, and global coverage, is continuously expanding in content and partnerships and has a solid sustainability model. It is a strong candidate to be officially accepted as a standard framework in biodiversity and taxonomy studies. The question is: what requirements must be considered, and what actions must be taken, to make BHL an official standard electronic information source for education? Education and training, especially in the fields of taxonomy and biodiversity, is one of the goals of the Consortium of European Taxonomic Facilities (CETAF). If we want to focus on Standards Supporting Innovation in Biodiversity Research and Conservation, we also need to focus on educating future generations so that they are able to follow our path and understand biodiversity. Standards and sources are among the key elements that are needed.


Wednesday December 7, 2016 12:00 - 12:15 CST
Auditorium CTEC

12:15 CST

Questions: BHL - 10 years of innovation & growth
Time is reserved at the end of this symposium for questions about earlier talks. If time permits, we will also solicit feedback to guide the future of BHL and discuss the needs and wants of the important TDWG audience.




Wednesday December 7, 2016 12:15 - 12:30 CST
Auditorium CTEC

12:30 CST

Wednesday Lunch
Lunch is served in the Cafeteria (also known as "Soda").  Please allow TDWG Exec members priority in the serving line today so that they can attend an Exec meeting at 1 PM.
 

Wednesday December 7, 2016 12:30 - 14:00 CST
Cafeteria Cafeteria

13:00 CST

TDWG Exec Meeting in TecnoAula 1
Wednesday December 7, 2016 13:00 - 13:55 CST
TecnoAula 1 CTEC

14:00 CST

Next Generation Publishing for Biodiversity using Pensoft's Arpha Writing Tool and Publishing System
ARPHA stands for Authoring, Reviewing, Publishing, Hosting, and Archiving, and is the first journal publishing platform to provide the full life cycle of a manuscript, from authoring, through peer review, to publishing and dissemination. At the core of the system is the ARPHA Writing Tool (AWT), which offers biodiversity scientists a number of exciting features. For example, AWT is tightly integrated with the information technology (IT) infrastructure of the International Union for Conservation of Nature (IUCN), streamlining the Red List evaluations of endangered or invasive species. These evaluations can be published as Species Conservation Profiles (SCP) or Alien Species Profiles (ASP), respectively, in the Biodiversity Data Journal. To data collectors and scientists, AWT offers tools to create data paper manuscripts from metadata expressed in Ecological Metadata Language (EML), directly from their computing environment. The data papers then serve as human-readable extensions to metadata and provide a point of citation and reference for a data set. To taxonomists, AWT offers import/export capabilities for occurrence or specimen records based on an Application Programming Interface (API). A number of interchange formats and databases (GBIF, BOLD Systems*, iDigBio, PlutoF) are supported, including the widely adopted Darwin Core (DwC).
ARPHA allows for the upfront markup of published texts, which is the first step towards publication of atomized content and data that can be easily mined, extracted, and re-used. ARPHA will serve as the gateway to the forthcoming Open Biodiversity Knowledge Management System (OBKMS), based on extraction of information in the form of Linked Open Data, stored as nanopublications in a biodiversity knowledge graph. Note that OBKMS is a semantic suite of databases and applications interconnecting biodiversity data. You are encouraged to visit Symposium 01, where a presentation on OBKMS will be held.
This workshop aims to bring to light some of these brand new features, as well as to provide users with solid knowledge of how to use AWT in general. The bulk of the workshop will be spent on computer demonstrations and questions and answers, with a short introductory talk at the beginning.
* BOLD Systems = Barcode of Life Data Systems


Wednesday December 7, 2016 14:00 - 14:15 CST
Auditorium CTEC

14:00 CST

Meeting of the Joint RDA/TDWG Interest Group on Metadata Standards for attribution of physical and digital collections stewardship
The mission of this joint Research Data Alliance (RDA)/TDWG group is to enhance existing standards, and create new ones, for giving attribution for the maintenance, curation, and digitization of physical and digital objects, with a special emphasis on biodiversity collections.
Within the 18-month timeline stipulated by RDA for working groups, we will analyse use cases from a variety of disciplines and review existing schemas and vocabularies in order to create the final deliverables: a metadata schema and set of metadata standards to support attribution for curatorial activities. These new standards will be suggested for ratification by TDWG as a community standard via an extension to Darwin Core.
The deliverables of this working group will benefit institutions that maintain collections and individuals who curate them and will lead to:
  • Improved recognition of the immense effort required for maintaining, curating, and sharing collections, which is likely to lead to increased funds for these activities
  • Increased efficiency in knowledge generation from collections through the proper documentation of corrections and analyses performed
  • Increased viability of crowdsourcing as a model for building collaborative research resources
  • Increased relevance of existing e-infrastructure that is being stifled by the expert annotation bottleneck
The session at TDWG 2016 will introduce the topic to TDWG members, but will be primarily a working meeting that will focus on:
  1. Reviewing the biodiversity use cases and surfacing any additional scenarios
  2. Initial brainstorming for schema design and adoption or extension of controlled vocabularies
  3. Discussing linking to the Darwin Core standard


Wednesday December 7, 2016 14:00 - 15:30 CST
TecnoAula 2 CTEC

14:00 CST

Proposed changes to Darwin Core to improve the flow of alien species data
Preventative action towards the introduction of invasive alien species (IAS) and a fast response to already introduced IAS can reduce the cost of remediation considerably. Managers and decision makers need to be alerted rapidly, with useful information and with few false alarms. Yet the data needed to achieve this are fragmented taxonomically and spatially, data flows from original observations to actions are slow and unreliable, and data conversions and data aggregation can result in a reduction in data quality and resolution. Improving this situation is far from simple, requiring changes to culture, technology, funding, methods and standards. Here we focus on one small element of this issue: the data standards for IAS research and reporting. To be useful, standards need to be sufficiently sophisticated, but they also need to be simple enough to be usable by non-specialists. In this regard we propose three changes to Darwin Core (DwC). Two of these changes are to recommend controlled vocabularies for the existing terms occurrenceStatus and establishmentMeans. These vocabularies are not new, but are already used by the Convention on Biological Diversity (CBD) and the International Union for Conservation of Nature (IUCN). The third change is to add the new term 'origin', with a recommended controlled vocabulary also used by the IUCN. The result of these changes is to clarify three important pieces of information about an organism's presence at a location: firstly, whether the organism is considered introduced or native; secondly, whether it is extant at the location; and thirdly, how the organism came to occur at the location. While two of these terms already exist in DwC, their use is not clear; with the addition of the 'origin' term, the concept of whether something is native or introduced will be separated from whether it still exists at the location and how it got there. These relatively simple changes use vocabularies that are already familiar and accepted by biodiversity researchers and can be used for the rapid generation of checklists of native and alien species. This would be a significant step forward and a milestone towards automation of processes such as horizon scanning, which tries to predict likely new IAS that are emerging but are not yet present in an area. We therefore hope that this proposal will be ratified and adopted by the TDWG community.
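A minimal sketch of how an occurrence record might look under the proposal, with the three pieces of information separated; the vocabulary values shown are illustrative guesses at the CBD/IUCN-style terms, not the ratified lists:

record = {
    "scientificName": "Harmonia axyridis",
    "country": "Belgium",
    "origin": "introduced",           # proposed new term: native vs. introduced
    "occurrenceStatus": "present",    # is the organism extant at the location?
    "establishmentMeans": "escaped",  # how the organism came to occur there (assumed value)
}

# With agreed vocabularies, an alien-species checklist becomes a simple filter:
occurrences = [record]
alien_checklist = sorted({r["scientificName"] for r in occurrences
                          if r["origin"] == "introduced"
                          and r["occurrenceStatus"] == "present"})
print(alien_checklist)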


Wednesday December 7, 2016 14:00 - 15:30 CST
Computer Science 3 Computer Science

14:15 CST

Novel article formats in the ARPHA* Writing Tool and Biodiversity Data Journal
The ARPHA Writing Tool (AWT) is a versatile online authoring system that encapsulates the full authoring and publishing process in one seamless workflow.
In this computer demonstration, Pensoft will showcase the different capabilities AWT has for importing and exporting many types of data into and out of a manuscript. The two most important novel article formats that we want to address are the Species Conservation Profile (SCP) and the Alien Species Profile (ASP). These allow seamless integration of ARPHA articles with the International Union for Conservation of Nature (IUCN) Red List. We will also review the other article formats that AWT supports, such as taxonomic papers and data papers.
After the presentation, users should be familiar with not only starting and writing a paper in AWT, but also with linking the data that are part of the manuscript to their own data processing environment.
*ARPHA = Authoring Reviewing Publishing Hosting Archiving


Wednesday December 7, 2016 14:15 - 14:30 CST
Auditorium CTEC

14:30 CST

Online Import of Occurrence Records into Manuscripts from Taxonomic Databases using Pensoft’s ARPHA* Writing Tool
Pensoft's ARPHA Writing Tool (AWT) is a multi-purpose scholarly authoring platform. However, it places particular value on streamlining the authoring of taxonomic manuscripts. Especially important for this process is the creation of material citations, as taxonomic practice dictates that authors cite the occurrences on which their analysis is based (i.e., materials). It is possible to import specimen records from all the major biodiversity database portals, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data (BOLD) Systems, iDigBio, and PlutoF. Manually entering occurrence records into a taxonomic paper is error-prone and time-consuming. This is why we developed an API (Application Programming Interface)-based material import for Pensoft's ARPHA Writing Tool, with consequent submission, peer review, and publication in the Biodiversity Data Journal.
In this workshop we will showcase how to carry out this process and answer questions about the usage and tracking of occurrence identifiers. We invite the attendees of the workshop to visit the talk "Streamlining the Flow of Taxon Occurrence Data Between a Manuscript and Biological Databases," where the rationale and motivation behind this workflow are discussed in detail.
*ARPHA = Authoring Reviewing Publishing Hosting Archiving


Wednesday December 7, 2016 14:30 - 14:45 CST
Auditorium CTEC

14:45 CST

Creation of Data Paper Manuscripts from Ecological Metadata Language (EML)
Data papers are scholarly articles describing the contents, provenance, and other details of a given dataset or datasets. As such, data papers can be viewed as extended metadata descriptors rendered in a human-readable form. They are important for the discoverability, dissemination, citation, and crediting of efforts to gather scientific data and build scientific databases.
In a previous effort by the Global Biodiversity Information Facility (GBIF) and Pensoft Publishers, a workflow was created that converts metadata expressed in EML in the GBIF Integrated Publishing Toolkit (IPT) into Rich Text Format (RTF) manuscripts that can then be submitted to any journal publishing data papers.
In this computer demonstration, we will present a brand new workflow that not only converts metadata into data paper manuscripts but also imports these into Pensoft's ARPHA* Writing Tool (AWT). AWT is a versatile online collaborative system that encapsulates the full authoring and publishing process in one seamless workflow.
The workflow handles EML files downloadable from GBIF's Integrated Publishing Toolkit (IPT), or directly from GBIF, DataONE, and the Long Term Ecological Research (LTER) Network. After importing the EML file, authors can use the rich editing functionality of the ARPHA Writing Tool to complete their manuscripts and submit them to the Biodiversity Data Journal at the click of a button.
* ARPHA = Authoring Reviewing Publishing Hosting Archiving http://arphahub.com/
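A minimal Python sketch of the first step of such a workflow, pulling the fields a data paper manuscript needs out of an EML file with the standard library; element paths follow the common EML layout, and the file name is an assumption:

import xml.etree.ElementTree as ET

# In typical EML documents only the root element is namespace-prefixed,
# so the inner elements can be addressed without a namespace.
dataset = ET.parse("eml.xml").getroot().find("dataset")

manuscript = {
    "title": dataset.findtext("title"),
    # Crude: gathers every <para> under <dataset>, not just the abstract.
    "abstract": " ".join(p.text or "" for p in dataset.iter("para")),
    "authors": [f"{c.findtext('individualName/givenName')} "
                f"{c.findtext('individualName/surName')}"
                for c in dataset.findall("creator")],
    "keywords": [k.text for k in dataset.iter("keyword")],
}
print(manuscript["title"], "-", ", ".join(manuscript["authors"]))

A real converter would then render these fields into the manuscript template; this sketch only shows the metadata extraction.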


Wednesday December 7, 2016 14:45 - 15:00 CST
Auditorium CTEC

15:30 CST

Wednesday PM Break, Poster Viewing, & Registration
CTEC Lobby

Wednesday December 7, 2016 15:30 - 16:00 CST
Lobby CTEC

16:00 CST

Clustering botanical collections data with a minimised set of features drawn from aggregated specimen data
[Current state of play] Numerous digitisation and data aggregation efforts are mobilising botanical specimen data. Although digitisation is not yet complete, it is likely that we now have a critical mass of data available from which we can determine patterns.
[Problem] We know that many duplicate specimens exist, shared between separate botanical collections: these are digitised and transcribed in different herbaria and are yet to be comprehensively linked. Parallel digitisation efforts mean that the transcription of label data also happens in parallel; this results in some critical data fields (such as collector name) being much too variable to be easily used to resolve duplicates. Although not explicitly managed, we have the concept of a collecting trip (a sequence of collections from a particular individual or team). This research aims to uncover this implicit trip data from the aggregated whole. Once we have identified a collecting trip, we should be able to resolve duplicates more easily by cross-linking on the trip identifier along with the record number and date, i.e. avoiding the transcription variations that we often see in the collector field.
[Method and input data] This talk will show the output of a clustering analysis run in Python using the machine learning library scikit-learn. The data analysed were drawn from aggregated botanical specimen data accessed via the GBIF portal. Input to the analysis was optimised to use numeric features wherever possible (collection date and record number) along with minimal textual features extracted from the collector team.
[Results] The outputs of this clustering analysis will be used in a research context – to identify different kinds of collecting trip – but also have immediate practical applications in data management: to identify duplicate specimens between herbaria, and to identify outliers and label transcription errors. Examples of each of these kinds of outliers will be shown. Numbers of geo-references that can be shared between institutions will also be included. Other applications of this clustering technique within problem domains relevant to biodiversity informatics (e.g. bibliographic reference management) will also be discussed.
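A minimal sketch of the clustering step described, using the numeric features named above; the feature engineering, the omission of the textual collector features, and the DBSCAN parameters are illustrative, not the settings used in the study:

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

specimens = pd.read_csv("specimens.csv")  # hypothetical GBIF-derived extract

dates = pd.to_datetime(specimens["eventDate"], errors="coerce")
features = pd.DataFrame({
    "day": dates.map(lambda t: t.toordinal() if pd.notna(t) else None),
    "recordNumber": pd.to_numeric(specimens["recordNumber"], errors="coerce"),
}).dropna()

# Collections close together in date and in the collector's number sequence
# fall into one density cluster: a candidate collecting trip. In practice one
# would first group records by (normalised) collector before clustering.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(
    StandardScaler().fit_transform(features))
specimens.loc[features.index, "trip"] = labels  # -1 marks outliers
print(specimens["trip"].value_counts())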


Wednesday December 7, 2016 16:00 - 16:15 CST
Auditorium CTEC

16:00 CST

Literature Interest Group
This meeting will explore reinvigorating the Literature Interest Group, reworking the charge for the IG (http://www.tdwg.org/activities/literature/), and expanding its scope to include discussion of methods, actions and standards for working with literature in the context of TDWG standards. Examples of areas of discussion include DOIs (Digital Object Identifiers) as applied to literature, for use in citations of literature in other taxonomic databases; and bibliometrics for biodiversity literature, to assist authors in documenting the use of research for impact assessment.


Wednesday December 7, 2016 16:00 - 17:30 CST
TecnoAula 1 CTEC

16:00 CST

W03A: Darwin Core Invasive Species Extension Hackathon: Six schemas for GitHub
TDWG's Invasive Species Interest Group, under the auspices of the Global Invasive Species Information Network (GISIN), has facilitated the creation of a Darwin Core Invasive Species Extension, parts of which have been called the GISIN Invasive Species Status and put online here: http://tools.gbif.org/dwca-validator/extension.do?id=http://www.gisin.org/IASProfile/SpeciesStatus
This instance needs updating and expansion to include (at a minimum) the terms in the additional GISIN data models called speciesStatus and resourceURL. Because TDWG has determined that GitHub provides the best platform for our standards work, the information associated with the schemas of the GISIN protocol at http://www.gisin.org will be reviewed and posted on GitHub during this hands-on exercise.
This move of our annotated schemas to GitHub provides the opportunity to better promote our work in support of leading-edge international invasive species data aggregation efforts. It is also hoped that session participants will consider and discuss the possibility of developing a true invasive species ontology.


Wednesday December 7, 2016 16:00 - 17:30 CST
TecnoAula 2 CTEC

16:15 CST

Taking a Big Data Approach to Estimating Species Abundance
Knowledge of species distributions is the foundation of biodiversity conservation. While estimates of occurrence are typical of distribution estimates, incorporating abundance can provide critical information about the status and trends of a species.
Many species have dynamic distribution patterns that change seasonally, and distribution visualizations attempt to account for this. Additionally, the sampling of species distributions is rarely uniform across a species' range, and other data sources are often necessary to generate smooth distribution surfaces.
Here we describe a big data workflow using data from the citizen science project eBird, which we combine with NASA earth imagery to estimate patterns of seasonal bird abundance across a species' entire life history. Methods. Our goal is to train a model that learns associations between eBird observations and environmental features estimated from NASA earth imagery. eBird is an online database of bird observations in which participants enter when, where, and how they went birding, then fill out a checklist of all the birds seen and heard during the outing. eBird provides various options for data gathering, including point counts, transects, and area searches. NASA MODIS provides information on local-scale ecological processes (i.e., habitat) at comprehensive global coverage, allowing us to make full use of the eBird data. We interpolate predicted patterns of abundance across geographic regions based on the environmental conditions within those regions. The model creates weekly abundance estimates in a 3x3 km grid across the Western Hemisphere. Results. We have generated models estimating patterns of abundance and habitat preference across a species' full life cycle. This allows us to explore dynamic patterns of habitat preference; estimate regional and seasonal abundance; and calculate the percentage of the annual cycle a species occurs anywhere across its full annual distribution, based on weekly estimates of relative abundance. Conclusions. Our modeling approach is the first to document the distributional dynamics of migratory birds across the annual cycle and, in doing so, highlights the strong spatiotemporal variation that has been obscured by traditional range and distribution maps. The combination of fine scale and broad extent in the NASA data makes it possible to develop comprehensive biodiversity visualizations that can be integrated across a range of spatial and temporal scales. Our success is based on open access to data and on developing data interoperability techniques that link heterogeneous data. The outcome is species distributions that are temporally explicit across broad spatial areas at high resolution.
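The core modelling idea can be sketched in a few lines of scikit-learn; a random forest stands in here for the authors' actual model, and the file and column names are assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training table: one row per eBird checklist, already joined
# to MODIS-derived environmental values at the checklist location.
train = pd.read_csv("ebird_checklists_with_modis.csv")
features = ["week", "elevation", "ndvi", "land_use"]  # assumed column names

model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
model.fit(train[features], train["count"])  # reported count of the focal species

# Predict a weekly abundance surface over a 3x3 km grid of the same features.
grid = pd.read_csv("grid_3km_modis.csv")
grid["expected_abundance"] = model.predict(grid[features])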


Wednesday December 7, 2016 16:15 - 16:30 CST
Auditorium CTEC

16:30 CST

Large-scale Evaluation of Multimedia Analysis Techniques for the Monitoring of Biodiversity
Computer-assisted identification of living organisms is considered one of the most promising solutions to help bridge the taxonomic gap and build accurate knowledge of the geographic distribution and evolution of species. LifeCLEF (www.lifeclef.org) is a world-scale research forum dedicated to the evaluation of multimedia-oriented identification systems. Its principle is to measure and boost the performance of the state of the art by sharing large-scale experimental data covering thousands of species. Each year, hundreds of research groups specialized in computer vision, audio processing, machine learning or data management register for the proposed challenges. Tens of them succeed in processing the whole data and submit technical papers describing their running systems. Results are then synthesized and further analysed in joint research papers. The LifeCLEF research platform is organized around 3 tasks related to multimedia information retrieval and fine-grained classification problems in 3 subdomains. Each task is based on large, collaboratively revised data, and the measured challenges are defined in collaboration with biologists and environmental stakeholders in order to reflect realistic usage scenarios.
The first task deals with image-based plant identification and has been organized since 2011. It is based on a growing collaborative data collection produced by tens of thousands of members of a French social network of amateur and expert botanists. In 2015, this dataset contained 113,205 pictures of herb, tree and fern specimens belonging to 1,000 species (living in France and neighbouring countries). The second task deals with audio-based bird identification and is based on audio recordings collected by a very active nature watchers' network called Xeno-canto (http://www.xeno-canto.org/). This web-oriented community of bird sound recordists counts about 2,000 contributors who have already collected more than 180,000 recordings of about 9,000 species. The dataset used for the BirdCLEF task focuses on more than 20,000 audio recordings belonging to the 1,000 bird species represented in the South American region. The last task deals with the identification of sea organisms in general, from fish to whales to dolphins to sea beds to corals.
In this talk, we will report the main outcomes of the 2016 edition of LifeCLEF, including a comprehensive description of the best-performing methods. We will then discuss perspectives for future developments given the growing available datasets and the scientific community's interest in this lab.


Wednesday December 7, 2016 16:30 - 16:45 CST
Auditorium CTEC

16:45 CST

GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data
Managing research data has always been challenging, but the recent availability of multi-gigabyte and larger datasets from major aggregators has created new problems, especially for individual researchers and those at small institutions. A recent collaboration between Integrated Digitized Biocollections (iDigBio) and the Encyclopedia of Life (EOL), called Global Unified Open Data Access (GUODA), aims to bring new techniques and resources for working with large biodiversity datasets to the widest community of researchers possible.
GUODA is both a computing infrastructure built and hosted by iDigBio and a community for collaboration in using that infrastructure. Our collaboration focuses on developing tools and workflows using Apache Spark for highly parallelized data analysis, a repository of pre-formatted, ready-to-use biodiversity datasets, and a resource management system capable of exposing these resources to software developers and data analysts across the full skill range.
This presentation will outline the software and hardware used in GUODA and the processes and formats for transforming common biodiversity data sources, such as the Global Biodiversity Information Facility (GBIF), iDigBio, and the Biodiversity Heritage Library (BHL), into computable data structures, and will demonstrate the Jupyter Notebook interface to GUODA designed for researchers to interact with directly.


Wednesday December 7, 2016 16:45 - 17:00 CST
Auditorium CTEC

17:00 CST

Data Quality at Scale: Bridging the Gap between Datum and Data
This talk will provide a practical look at implementing high-throughput, high-volume data quality processing to tackle the task of providing efficient and effective feedback on data quality at the scale of an aggregator with tens of millions of records. Topics covered will include the tradeoffs between coverage and accuracy, using the Apache Spark processing framework to rapidly iterate on data quality workflows across large volumes of data, and methods for effectively capturing the results of large-scale data quality work for distribution back to data providers. The examples given in this talk are drawn from work the iDigBio team has done on implementing data quality workflows across all of the data we have collected, as well as from comparing and contrasting our methods with those of other projects.
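A minimal PySpark sketch of the pattern described: cheap, column-level checks applied in parallel across the whole corpus, with failures summarised per provider for feedback; the paths, column names, and the specific flags are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-flags").getOrCreate()
occ = spark.read.parquet("occurrences.parquet")  # tens of millions of rows

flagged = occ.withColumn(
    "flag_bad_coords",
    ~F.col("decimalLatitude").between(-90, 90)
    | ~F.col("decimalLongitude").between(-180, 180)
).withColumn(
    "flag_no_date", F.col("eventDate").isNull()
)

# Summarise failures per data provider for feedback reports.
flagged.groupBy("institutionCode").agg(
    F.sum(F.col("flag_bad_coords").cast("int")).alias("bad_coords"),
    F.sum(F.col("flag_no_date").cast("int")).alias("missing_dates"),
).show()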


Wednesday December 7, 2016 17:00 - 17:15 CST
Auditorium CTEC

17:15 CST

Fresh Data: what's new and what's interesting?
This talk describes a use case for Big Data analysis for fostering transparency and communication among several communities interested in biodiversity data.
Fresh Data is a suite of services for monitoring new biodiversity data matching specific queries across multiple biodiversity data sources, and for notifying data providers when their data have been requested. Our goal is to connect time-sensitive data consumers (primarily researchers) with data producers (wildlife observers) in a meaningful but unobtrusive way. The community we seek to serve is non-professional observers on platforms such as http://citsci.org/ and http://www.inaturalist.org/, who would not otherwise know they were documenting scientifically relevant data.
For these contacts to be useful, they must be fast. A subscribed researcher with a saved query should learn of a relevant new data point within a few days of the observation, and an observer should learn as quickly as possible that they have reported something needed by a researcher; this will encourage timely reactions (additional reports by the observer or recruiting of other observers, and direct communication from the researcher if desired).
To attract researchers to the monitoring tool, its search must be comprehensive, including the data sources they already rely on. Thus, the search index includes both GBIF and iDigBio data, as well as orphan data sources not yet aggregated.
Each data source is updated individually, and schedules are set appropriately for each; priority communities with short internal lag times (e.g., iNaturalist) are updated the most frequently. Whole aggregator datasets (GBIF, iDigBio) are refreshed as frequently as capacity permits. Some of the priority communities have their data hosted at GBIF; their datasets are indexed separately as well, in order to allow faster update schedules, and records are deduplicated by occurrence ID.
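A minimal sketch of that deduplication step: when the same record arrives both directly from a priority community and via an aggregator, keep one copy per occurrence ID, preferring the fresher feed (the source priorities here are assumptions):

# Lower value = fresher feed; priorities are illustrative assumptions.
PRIORITY = {"inaturalist": 0, "gbif": 1, "idigbio": 1}

def deduplicate(records):
    # records: iterable of dicts with 'occurrenceID' and 'source' keys.
    best = {}
    for rec in records:
        oid = rec["occurrenceID"]
        rank = PRIORITY.get(rec["source"], 9)
        if oid not in best or rank < PRIORITY.get(best[oid]["source"], 9):
            best[oid] = rec
    return list(best.values())

merged = deduplicate([
    {"occurrenceID": "abc-1", "source": "gbif"},
    {"occurrenceID": "abc-1", "source": "inaturalist"},  # same record, fresher feed
])
print(merged)  # one record: the iNaturalist copy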
Services are documented at https://github.com/gimmefreshdata/freshdata/wiki/api . Services available include:
-all occurrence records, filtered by taxonomic and geographic parameters, occurrence date and date added to Fresh Data (supports data monitors for interested researchers)
-monitored occurrence records only, filtered by the same parameters, and also data source (supports data usage reports per interested data source)
-query parameters for all monitors, filtered by all the above parameters, and also occurrence ID (supports query dissemination, e.g., your Urania Swallowtail report was sent to a researcher interested in Lepidoptera in the Caribbean)
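The sketch below shows what polling the occurrence service might look like from Python. The host and parameter names are illustrative assumptions; consult the API wiki linked above for the actual contract.

```python
# A hedged sketch of querying the Fresh Data occurrence service;
# the host, endpoint, parameter names, and response shape are assumptions.
import requests

BASE = "https://api.gimmefreshdata.example/occurrences"  # hypothetical host

params = {
    "taxon": "Lepidoptera",        # taxonomic filter
    "geometry": "POLYGON((...))",  # geographic filter (placeholder WKT)
    "addedAfter": "2016-11-01",    # only records newly added to Fresh Data
}
response = requests.get(BASE, params=params)
response.raise_for_status()
for record in response.json().get("results", []):
    print(record.get("scientificName"), record.get("eventDate"))
```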
 


Wednesday December 7, 2016 17:15 - 17:30 CST
Auditorium CTEC
 
Thursday, December 8
 

08:45 CST

Thursday Registration
Lobby of CTEC

Thursday December 8, 2016 08:45 - 09:00 CST
Lobby CTEC

09:00 CST

Applying TDWG Standards to assess the Ecosystem Value of UNESCO Man and Biosphere Reserves in Africa.
The UNESCO Man and Biosphere program was established in 1971 to provide a scientific basis for improving relationships between people and their environments. Its recently published strategy for 2015–2025 includes the Lima strategic plan, which lists various desired outcomes to achieve and performance indicators to assess progress. Among those indicators are the Aichi Targets (https://www.cbd.int/sp/targets/) for the outcome on the added value of biodiversity in the biosphere reserves. In the framework of the European Union (EU) project EU BON, in collaboration with GBIF and GEO BON, a sample-based extension to Darwin Core was developed and the GBIF IPT adapted to accommodate data coming from monitoring sites such as those of the I-LTER network, so as to provide the Essential Biodiversity Variables (EBVs) needed to meet the expectations set out by the Aichi Targets.
Meanwhile, GBIF via its BID programme has financed 23 African projects of various sizes, including data coming from the African Man and Biosphere (AfriMAB) UNESCO Network. The Belgian Science Policy Office (BELSPO) has an agreement with UNESCO to fund a national project aimed at improving research and management in the African MAB reserves. The Royal Museum for Central Africa and Botanic Garden Meise have long-term experience in the MAB reserves of Luki and Yangambi in the Democratic Republic of the Congo.
These projects have been asked to develop a method to estimate the economic value of the AfriMAB reserves following the Lima performance indicators. In this context, drawing on the experience from the above-mentioned networks and activities, the authors will address how the sample-based extension to Darwin Core and other TDWG standards can be applied to data coming from selected African MAB reserves. Specifically, we plan to analyze in more detail the data from the reserves of Luki and Yangambi in D.R. Congo, the Volcanoes reserve in Rwanda, and the Island of Principe in São Tomé.
A movie illustrating the collection of edible mushrooms in the mountains of Rwanda has been produced and can also be shown in the framework of the TDWG conference.
UNESCO = United Nations Educational, Scientific, and Cultural Organization
EU BON = European Biodiversity Observation Network
GEO BON = Group on Earth Observations Biodiversity Observation Network
IPT = Integrated Publishing Toolkit
I-LTER = International Long Term Ecological Research
BID = Biodiversity Information for Development


Thursday December 8, 2016 09:00 - 09:15 CST
Computer Science 3 Computer Science

09:00 CST

Reviewing data integration and mobilisation using name reconciliation and identifier services
This talk follows on from a TDWG 2015 presentation of an open source toolkit to configure an Open Refine compatible reconciliation service over a tabular file or structured database. Over the past year this kind of service has been used intensively at the Royal Botanic Gardens, Kew to mobilise name identifiers and aid data integration in a number of strategic projects.
A concept map will be used to visualise the kinds of data entity available (which are currently managed in separate systems), highlighting integration points, the organising principles (names and taxonomic concepts) to be used in data integration and the kinds of onward links available from representations of name entities (to representations of people, literature and type specimens). Challenges in names based data integration will be highlighted. We will also explore the potential use of expert opinions from intensive use of data services to further augment rules used to match names data.
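For context, the Open Refine reconciliation protocol works roughly as sketched below: the client sends a JSON "queries" payload and receives scored candidate matches. The endpoint URL is a hypothetical placeholder; the payload and response shapes follow the standard reconciliation API.

```python
# A minimal client-side sketch of the Open Refine reconciliation protocol.
# The endpoint is hypothetical; the request/response shapes follow the
# published reconciliation service specification.
import json
import requests

ENDPOINT = "https://data.example.org/reconcile"  # hypothetical service

queries = {"q0": {"query": "Quercus robur L."}}
resp = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
resp.raise_for_status()

# Each candidate carries an id, a name, a match score, and a match flag.
for candidate in resp.json()["q0"]["result"]:
    print(candidate["id"], candidate["name"],
          candidate["score"], candidate["match"])
```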


Thursday December 8, 2016 09:00 - 09:15 CST
Auditorium CTEC

09:00 CST

Vocabulary Maintenance Task Group
The Vocabulary Maintenance Task Group has completed drafts of a Standards Documentation Specification and a Vocabulary Management Specification (https://github.com/tdwg/vocab). This session will outline the important aspects of the specifications and answer questions about their content and implementation.

Speakers

Thursday December 8, 2016 09:00 - 10:30 CST
TecnoAula 1 CTEC

09:15 CST

A participatory module to curate species lists for generating a dynamically updating taxonomic backbone for the India Biodiversity Portal
An effective biodiversity informatics platform relies upon species name lists and their taxonomic hierarchies to provide the backbone upon which species information and content are organised. However, taxonomy is a dynamic subject, and newer understanding of the relationships between organisms leads to revisions in names and hierarchies. Usage of invalid names, synonyms, misspellings, etc., causes fragmentation of the data associated with a species, and it is therefore essential to keep the namelist updated with the latest revisions.
The India Biodiversity Portal (IBP) is a civil-society initiative that seeks to address the biodiversity information needs of India. The availability of a species name list is crucial for the portal to aggregate and serve information. However, a comprehensive list that captures the current taxonomic standing of all biodiversity in the country is lacking. Innovative solutions are required to bring together taxonomic expertise on various taxa to generate such an output for the flora and fauna of India.
IBP has developed a “Taxon namelist” module that caters to the twin aims of aggregating species namelists for the country and constructing a single underlying taxonomic backbone to organise and provide navigation of species-related content. The module provides a participatory interface that allows examination and editing of names, their attributes (authority, taxonomic status, current accepted name or synonyms) and taxonomic hierarchy. In addition, names are binned within three categorised lists based on their curation status: the ‘raw list’ with uncurated names, the ‘working list’ with names matched with the Catalogue of Life, and a ‘clean list’ curated by experts, with explicitly validated names for the country. Permissions on the interface are allotted per clade on the taxonomic tree: ‘taxon curators’ participate in the raw and working lists, and ‘taxon editors’ work on the clean list.
The portal dynamically synthesises a management hierarchy called the ‘IBP taxonomic hierarchy,’ prioritising the taxon levels in the clean list and ‘snapping’ other names onto it at the nearest matching taxon rank within their original hierarchy. Taxon editors work closely with the portal to keep the clean list updated with revisions. IBP currently has functional clean lists populated by taxon editors for spiders (Araneae), birds (Aves), ants (Formicidae), cicadas (Cicadoidea) and Orthoptera. Continued participation from curators and editors will ensure curation of all names and generate clean lists that reflect the current state of documented biodiversity in India.
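A simplified sketch of the "snapping" idea described above follows: a name whose own clade is absent from the clean list is attached at the nearest ancestor rank the clean list does contain. The data structures are illustrative, not IBP's implementation.

```python
# A toy sketch of snapping a name onto a curated hierarchy at the nearest
# matching ancestor rank; data structures are illustrative assumptions.
def snap(name_hierarchy, clean_list):
    """name_hierarchy: list of (rank, taxon) from kingdom down to the name.
    clean_list: set of curated taxa. Returns the nearest curated ancestor."""
    for rank, taxon in reversed(name_hierarchy):
        if taxon in clean_list:
            return rank, taxon
    return None  # no curated ancestor; name stays in the raw/working list

clean = {"Animalia", "Arthropoda", "Araneae"}
hierarchy = [("kingdom", "Animalia"), ("phylum", "Arthropoda"),
             ("order", "Araneae"), ("family", "Salticidae")]
print(snap(hierarchy, clean))  # -> ('order', 'Araneae')
```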


Thursday December 8, 2016 09:15 - 09:30 CST
Computer Science 3 Computer Science

09:15 CST

Implementing Name Identifiers for the World Flora Online
In its decision X/17, the Convention on Biological Diversity (CBD) adopted a consolidated update of the Global Strategy for Plant Conservation (GSPC) for the decade 2011–2020 at its 10th Conference of the Parties, held in Nagoya, Japan, in October 2010. The updated GSPC includes five objectives and 16 targets to be achieved by 2020. Target 1 aims to complete the ambitious goal of “an online flora of all known plants” by 2020, which is now the World Flora Online (WFO) project. A widely accessible flora of all known plant species is a fundamental requirement for plant conservation and provides a baseline for the achievement and monitoring of other targets of the Strategy. A WFO partnership has been formed with thirty-four participating institutions world-wide, and an information portal is online at http://www.worldfloraonline.org. The structure of the WFO will be a framework capable of accommodating regional floristic information (at national or lower level) as well as monographic information. Merging descriptive records for an estimated 400,000 species from multiple global data sources, aligning them with consistent taxonomic data, and enabling ongoing enhancements presents a significant technical challenge. A global consensus WFO Taxonomic Backbone curated by networks of plant taxonomic experts will provide the alignment. Globally Unique Identifiers (GUIDs) will be utilized for the Taxonomic Backbone, which will include accepted names with synonyms for all vascular plants and bryophytes at ranks from order to subspecies. Partners in the WFO project will contribute taxonomic and descriptive datasets that must be resolved and aligned to the WFO GUIDs. This presentation will discuss the challenges, discoveries and progress toward creating a Taxonomic Backbone and integrating plant taxonomic and digital floristic data using name resolution and digital identifiers.

Speakers

Thursday December 8, 2016 09:15 - 09:30 CST
Auditorium CTEC

09:30 CST

www.gbif.fr : French National Portal of GBIF
There are now over 615 million records available through http://www.gbif.org/ (Global Biodiversity Information Facility). These records represent contributions from data providers linked to the nodes of GBIF. Of these, more than 39 million records come from French institutions.
National portals can provide a tailored view for a country, allowing a node to bring together region-specific information (attribution, species lists and traits, spatial layers) to further enhance the basic occurrence information. They also provide an important mechanism to galvanize data mobilization efforts within a country.
GBIF France (http://www.gbif.fr) launched two main functionalities of its new portal (http://portail.gbif.fr/) based on Atlas of Living Australia (ALA) modules at the end of May 2015: the GBIF France metadata portal (http://metadonnee.gbif.fr) and the GBIF France data portal (http://recherche.gbif.fr/). During the 10th anniversary of GBIF France, on the 10th of June 2016, we launched the first version of our new spatial portal (http://spatial.gbif.fr/), also based on the ALA platform. For now, only the data search engine and the metadata portal have been translated into French. In the coming year, we will also translate the spatial portal into French and add more tools developed by the ALA teams and the community to our new portal.
We have been able to customize this generic infrastructure to our local needs (e.g., links to different GBIF France’s partners) and requests (e.g., adding a map to the result page), styling the portal so that it can be integrated into our national website, translating our data portal into French and integrating pre-existing components developed by the French team.
During this demonstration, we will mostly focus on the data portal in order to show how it works, but it will also give you a concrete example of ALA modules installed and configured for another country. We will also show some tools available through our spatial portal.


Thursday December 8, 2016 09:30 - 09:45 CST
Computer Science 3 Computer Science

09:30 CST

Identifiers for Biodiversity Informatics: The Global Names Approach
Scientific names are perhaps the most persistent global identifiers in biology. They have been used for aggregation and exchange of biodiversity information for 250 years. Their importance is hard to overestimate. Advances in informatics have brought new opportunities and challenges for organizing information. Biology is transitioning fast into the realm of “Big Data”. Connecting information via scientific names is not trivial, because of many spelling variants of the same name, instability in binomial names due to creation of new genus-species combinations, homonyms, name misapplications, etc. The Global Names Architecture (GNA) is designing better global identifiers for biology and mapping scientific names to these identifiers.
We follow certain goals in identifier design. The identifiers must be globally unique so that they can be minted without checking a global registry. They should be optimized for computer-to-computer interaction and should be independent of encoding, resolution or transportation protocols. Identifiers should be used for identification only; other important features, such as addressability or resolution, are achieved by embedding identifiers in currently used formats (e.g., PURL, URI, LSID). There are two kinds of identifiers used in the Global Names Architecture: Name-String Identifiers and Global Names Usage Bank identifiers.
A name-string is a combination of characters that represents a scientific name. Each scientific name can be expressed by many name-strings. We use the UUID version 5 standard to convert name-strings into Name-String Identifiers. Such an identifier can be generated independently in any programming language, and the resulting identifier will be exactly the same for the same name-string, so biological information bound to name-strings from many different sources is easily interconnected. These identifiers are 128 bits long and can be easily managed by existing tools. When printed on paper, the identifier is as unambiguous as its electronic version.
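A minimal sketch of this determinism, using Python's standard uuid module, follows. Deriving the namespace from the "globalnames.org" DNS name is an assumption based on Global Names practice; verify against the GNA documentation before relying on it.

```python
# A minimal sketch of deterministic Name-String Identifiers via UUID v5.
# Any implementation using the same namespace and the same name-string mints
# the identical identifier, with no registry lookup needed. The namespace
# derivation below (from the "globalnames.org" DNS name) is an assumption.
import uuid

GN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "globalnames.org")

def name_string_id(name_string: str) -> uuid.UUID:
    return uuid.uuid5(GN_NAMESPACE, name_string)

# The same input always yields the same 128-bit identifier:
print(name_string_id("Parus major Linnaeus, 1758"))
print(name_string_id("Parus major Linnaeus, 1758"))  # identical output
```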
Global Names Usage Bank identifiers (GNUB IDs) are created at the nomenclatural level. They create a solid foundation for tracking Taxon Name Usages (TNUs) worldwide and give the ability to easily organize the nomenclatural events associated with them. Every scientific name has an “original” Protonym TNU. A UUID is created for every Protonym TNU. Such UUIDs can then be combined to represent a mapping to scientific names. For example, a binomial scientific name is expressed as a binomial combination of the UUIDs generated for the Protonym TNUs of its genus and species ranks. GNUB IDs exclude the possibility of homonyms and greatly simplify finding and organizing species information.


Thursday December 8, 2016 09:30 - 09:45 CST
Auditorium CTEC

09:45 CST

Biodiversity Information Systems in Geospatial Applications for Protected Areas in Bangladesh
The study explores a structure for a Biodiversity Information System (BIS), giving pertinent instructions and guidelines to the Space Research and Remote Sensing Organization (SPARRSO) for surveying the protected areas (PAs) of Bangladesh, particularly the Lawachara National Park in Moulvibazar district. SPARRSO facilitates interdisciplinary research associations at divisional, national, regional and international levels and provides a clearing house mechanism (CHM) to distribute information to affected parties. To date, Bangladesh has no national biodiversity database with clearing house mechanism services as defined by the Aichi targets of the Convention on Biological Diversity (CBD). National bio-networks face many problems in managing the biodiversity data of PAs. This study provides a unique view of the tools being used to support the upcoming development of this national biodiversity database, drawing on observations, interviews, reconnaissance findings, literature reviews and existing laws and policies. The study incorporates collective technological information from stakeholders, e.g., biodiversity specialists, forest officers, ecologists, conservationists, environmentalists, policy-makers, park managers, judges, environmental lawyers, academics, network managers, co-management team leaders and adjacent local village leaders. Almost 64% of respondents agreed that the National Biodiversity Database should be developed to protect the biodiversity of PAs, and 53% of users stated that this BIS is more applicable than traditional systems. The study demonstrates the indispensable connectivity with the World Database on Protected Areas (WDPA) for wide-ranging datasets, data sharing, data indexing, web publishing and electronic reporting to the CBD, with the help of the National Resources Information Management Systems (RIMS) and SPARRSO. Finally, this study suggests future research trajectories using a new collaborative approach to drive the methodological agenda and recommends ways to further incorporate information systems that integrate next-generation biodiversity conservation outlooks.


Thursday December 8, 2016 09:45 - 10:00 CST
Computer Science 3 Computer Science

09:45 CST

The Catalogue of Life Editor's View on Globally Unique Identifiers for Names
The Catalogue of Life (CoL) is a global taxonomic catalogue of valid species across all domains of life. It is built as a curated assembly of expert-based Global Species Databases (GSDs). CoL gives each scientific name record in the database a Globally Unique Identifier (GUID) and delivers Name GUIDs to CoL users.
  • The CoL community of taxonomists emphasizes that global nomenclators (and designated nomenclatural act registers) such as IPNI, MycoBank, ZooBank, etc. should be the starting point for an intelligent "ecosystem" of Name GUIDs. Only curated nomenclators, which document scientific names with published nomenclatural acts and references, can play the role of authoritative Name GUID emitters (that is, really "global" and really "unique"). Taxonomic databases should capture Name GUIDs from nomenclators. GUIDs can then be propagated to aggregators such as CoL, EoL, GBIF, etc.
  • Only taxonomic databases (incl. GSDs) can deploy and curate taxonomic concepts and emit Taxon GUIDs. Biodiversity data aggregators should capture Taxon GUIDs from taxonomic databases.
  • The CoL database contains NameGUID and GSDTaxonGUID fields for storing global identifiers harvested from GSDs. The Species 2000 community of taxonomic databases agreed to these fields in 2010. However, at present, only 556,051 species and infraspecific names have NameGUIDs in the CoL (31% of the CoL).
  • Previous CoL experiments with implementation of Taxon LSIDs (2008-2012, www.catalogueoflife.org/annual-checklist/2009/info_2009_checklist.php) cannot be regarded as successful. There was an incorrect assumption that CoL was capable of handling Taxon GUIDs in its existing model of data assembly and updates.

Speakers

Thursday December 8, 2016 09:45 - 10:00 CST
Auditorium CTEC

10:00 CST

Names and identifiers in the CyVerse cyberinfrastructure
CyVerse, formerly known as the iPlant Collaborative, is a U.S. National Science Foundation-funded initiative “to design, deploy, and expand a national cyberinfrastructure for life sciences research, and to train scientists in its use” (http://www.cyverse.org/about). As part of this mission, CyVerse currently houses over 2 petabytes of data, most of them (we assume) about organisms. CyVerse recently launched the Data Commons, with the goal of providing support for data management throughout the data lifecycle. As part of the Data Commons, users can now publish data to the Data Commons Repository (DCR) with permanent identifiers such as Digital Object Identifiers (DOIs) or Archival Resource Keys (ARKs), or publish to external repositories such as the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA). Our goal in providing these data publication services is not just data preservation, but primarily data discovery and reuse. Therefore, being able to find out what organism or taxon a dataset is about, and being able to discover data for an organism or taxon, is a crucial use case for the DCR. For this, we need a good way to identify organisms and taxa.
CyVerse is not a standards organization, so we rely on and collaborate with community-supported standards for the Data Commons. For SRA, this means that users should supply an NCBI taxon identifier as part of their BioSample submission. For the DCR, which uses the DataCite metadata profile, users can supply a taxon name in the “Subject” field, or, if they are motivated, supply a taxon name or identifier as additional metadata. When users supply names, we currently have no way to link those names to stable identifiers or concepts. We offer the Taxonomic Name Resolution Service (TNRS; http://tnrs.iplantcollaborative.org/) as a means of standardizing names for plants, and we are open to other collaborations for name resolution. CyVerse is actively seeking input from its user communities (including TDWG) on standards and practices (in names, identifiers, or other areas) for use within the Data Commons and related efforts.

Speakers

Thursday December 8, 2016 10:00 - 10:15 CST
Auditorium CTEC

10:15 CST

Utilizing Unique Identifiers for Taxonomic Concepts
Biodiversity research and conservation hinge on accurate and consistent species definitions, especially when those definitions can differ across taxonomies and time. Traditionally, a species is defined by a specific version of a taxonomy and a scientific name. However, taxonomies and taxon concepts change as we learn more about species behavior and evolutionary history. While billions of records attributed to taxonomic names are being used in research and conservation, the concept behind a name may differ from dataset to dataset depending on the originating taxonomy. Our work focuses on birds, and we manage the Clements Checklist of Birds of the World, which presents a global taxonomy of 10,514 species. This checklist provides the scientific and English name of each species, a taxonomic hierarchy, and a description of the worldwide range of each species and subspecies. Here we describe how taxon-based assets are managed through the use of unique taxonomic concept identifiers, how the identifiers are managed within a dynamic taxonomy, and finally, how those identifiers are applied in various projects to consolidate unrelated assets. We use these identifiers to uniquely define a taxonomic concept and the real-world populations that the concept represents. Traditional taxonomic information, such as scientific names, common names, higher-order assignments and relationships between species and subspecies, is layered on these IDs using the Clements checklist. While these IDs are represented by scientific names defined by the current Clements checklist, the ID persists across taxonomies and versions, as long as the real-world populations represented are unchanged.
Every species-based asset (observations, data visualizations, multimedia, life history articles and specimens) is assigned a unique taxon concept ID, and as the taxonomy is revised, the taxon IDs of assets remain largely unchanged. In some cases, taxonomic revisions will require changes to the taxon ID of an asset, for example, where further refinement of the taxon can be determined (such as an elevation of allopatric subspecies to species).
We manage an ever-growing set of assets (350 million observations, over 1 million images and thousands of articles) that are tracked and managed through yearly taxonomy updates, allowing us to accurately maintain these assets and the taxa they represent. Researchers and conservationists combine these various datasets without needing to understand the taxonomic intricacies themselves, and species account projects like Birds of North America and even Wikipedia can automatically integrate these resources into scientifically accurate species monographs.
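An illustrative sketch of the persistence idea follows (not the Clements production system): most concept IDs carry over a taxonomy revision untouched, and only concepts whose underlying populations changed are remapped.

```python
# A toy sketch of taxon concept IDs surviving a taxonomy revision;
# the IDs and revision table are illustrative placeholders.
assets = {
    "obs-001": "tc_0421",   # observation assigned to a taxon concept ID
    "img-887": "tc_0421",
    "obs-002": "tc_0999",
}

# Revision table from a yearly taxonomy update: only changed concepts
# appear; every other ID persists implicitly.
revisions = {"tc_0421": "tc_1312"}  # e.g., allopatric subspecies elevated

assets = {aid: revisions.get(tc, tc) for aid, tc in assets.items()}
print(assets)  # obs-002 is untouched; the other assets are remapped
```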

Speakers

Thursday December 8, 2016 10:15 - 10:30 CST
Auditorium CTEC

10:30 CST

Thursday AM Break, Poster Viewing, & Registration
CTEC Lobby

Thursday December 8, 2016 10:30 - 11:00 CST
Lobby CTEC

11:00 CST

Building Digitization and Promoting Collaboration in GBIF Dark countries
A GBIF dark country is one that provides no or few biodiversity records to GBIF. This limits the development of resources for studying their biodiversity and engagement in the biodiversity sciences by their people. Many GBIF dark countries have no history of digitization even within a single collection, let alone a viable collections network. They often do, however, have good collections. We have established two Symbiota-based (http://Symbiota.org) web sites, OpenHerbarium.org (http://openherbarium.org) and OpenZooMuseum.org (http://OpenZooMuseum.org), to accommodate records from such collections. We have found, however, that increasing digitization requires persuading those close to the collections, from collectors to curators and department heads, of the value that comes from sharing data and of what “high quality data” means in a digital age. We are in the process of developing open access resources for such purposes. They will be made available for use both on- and offline, and as videos and PDF documents, because most such countries suffer from poor and/or intermittent internet access. One reason for developing these resources is to enable offering more effective workshops, ones that end with at least some participants agreeing to collaborate on a project of common interest.


Thursday December 8, 2016 11:00 - 11:15 CST
Computer Science 3 Computer Science

11:00 CST

The Significance of Data Visualizations for Dynamically Occurring Species
Knowledge about species’ distributions is the foundation of biodiversity conservation. While distributions are typically visualized as static range maps, many species have strongly dynamic distribution patterns. Here we describe our approach to visualizing the dynamic patterns of migratory bird populations to more accurately convey distributional information and to study the links between populations and environments across the full annual cycle. We use a variety of data visualization approaches to (1) convey the dynamic nature of hundreds of migratory bird species distributions across the Western Hemisphere, (2) display detailed weekly abundance patterns for each species, (3) present the amount of time and area a species occupies across seasons, and (4) indicate the changes in habitat use across seasons. Methods. We train an ensemble model that learns the associations between observations gathered by the citizen science project eBird and environmental features derived from NASA earth imagery. Next we predict weekly relative abundance at a 3.3 × 3.3 km spatial resolution within the northern portion of the Western Hemisphere (>3 million predictions per week). Models are run on the Azure cloud computing service using R and Hadoop workflows. A team of statisticians, ornithologists, and designers then analyzes model output and develops data visualizations. Results. From the model output we generate weekly maps of distribution and display them as animations. In contrast to the typically static, uniform, and coarsely defined distributions represented by geographic range maps, our results clearly show a species’ annual distributional dynamics. Next, we estimate species’ spatial patterns of duration across the annual life cycle based on the weekly estimates of relative abundance. When displayed on a map, these patterns show striking differences in occupancy and duration across a species’ full annual distribution. Finally, we quantify associations between species’ occurrences and land cover diversity and composition. We then estimate how these associations vary through an entire year and across multiple regions. We then use ‘cake plot’ visualizations to demonstrate seasonal shifts in land-cover associations for migrating birds. Conclusions. Our approach is the first to visualize the distributional dynamics of entire populations of migratory bird species across the full annual cycle, and it highlights the strong spatiotemporal variation that has been obscured by traditional range and distribution visualizations. By visualizing annual distributional dynamics, we provide a robust and objective ecological baseline for assessing the implications of climate and land-use change for migratory bird species and for developing comprehensive conservation strategies.
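The modeling pipeline itself (ensemble models over eBird and NASA data, run on Azure with R and Hadoop) is beyond a snippet, but the final animation step can be sketched. The grids below are random placeholders standing in for weekly relative-abundance predictions.

```python
# A toy sketch of animating weekly abundance grids with matplotlib;
# the random arrays are placeholders for model predictions.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

weeks = [np.random.rand(50, 50) for _ in range(52)]  # placeholder grids

fig, ax = plt.subplots()
im = ax.imshow(weeks[0], vmin=0, vmax=1)
ax.set_title("Week 1")

def update(i):
    # Swap in the next week's grid and update the title.
    im.set_data(weeks[i])
    ax.set_title(f"Week {i + 1}")
    return [im]

anim = FuncAnimation(fig, update, frames=len(weeks), interval=200)
anim.save("abundance.gif", writer="pillow")  # requires the pillow package
```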


Thursday December 8, 2016 11:00 - 11:15 CST
Auditorium CTEC

11:15 CST

Visual Comparison of Biological Taxonomies
The visualization and analysis of biological taxonomies pose numerous challenges. First, taxonomies are large and will become larger: it is estimated that only 20 percent of the planet's species have been identified. Second, taxonomies are dynamic. Research in the field constantly leads to new discoveries and corrections. For example, a taxonomic review can reveal that what was considered for many years to be one species actually corresponds to two or more species. Conversely, two apparently different species may actually be the same. All these situations generate changes at both the topological and nomenclature levels of the taxonomy. Third, taxonomy information is scattered in journals, books, and the private databases of taxonomists and organizations around the world; this makes the conciliation of information particularly complex. Experts might also have different opinions on how to classify species. International initiatives have as a main goal the standardization and integration of worldwide taxonomic databases that come from multiple sources of information. On the other hand, a common understanding of taxonomy information is fundamental to documenting biodiversity, seeking conciliation, and supporting conservation.
We are working on the development of an information visualization software tool for the comparison of biological taxonomies. This tool is expected to be useful to biodiversity scientists in supporting the curation of taxonomic databases. The comparison involves the automatic identification of changes such as splits and merges between the taxonomies. During our research, we identified ten user visualization tasks: 1. Identify congruence; 2. Identify corrections (splits, merges, moves and naming corrections); 3. Identify additions; 4. Overview changes (obtain an overview of different types of changes); 5. Summarize (obtain a numerical understanding of change); 6. Find inconsistencies; 7. Filter; 8. Retrieve details; 9. Focus; and 10. Edit. The challenge is to identify and visualize differences and similarities, as well as to visualize a number of discovered conditions simultaneously in a limited screen space. In our talk, we will discuss the characteristics of the tool as well as how these tasks could be accomplished through different information visualization techniques such as edge drawing, coloring, animation, matrix representations, and agglomeration. We expect to obtain feedback on data visualization alternatives to effectively convey information for taxonomic work.
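A toy sketch of split/merge detection follows: by comparing which new taxa each old taxon's members map to (and vice versa), splits and merges fall out directly. Real tools must also handle moves, renames, and partial overlaps; the data here are illustrative.

```python
# A toy sketch of detecting splits and merges between two classifications
# by following shared members; the records are illustrative placeholders.
from collections import defaultdict

# member specimen/record -> taxon, in the old and new classification
old = {"s1": "Taxon A", "s2": "Taxon A", "s3": "Taxon B", "s4": "Taxon C"}
new = {"s1": "Taxon A1", "s2": "Taxon A2", "s3": "Taxon D", "s4": "Taxon D"}

# An old taxon whose members land in several new taxa was split.
targets = defaultdict(set)
for member, taxon in old.items():
    targets[taxon].add(new[member])
for taxon, news in targets.items():
    if len(news) > 1:
        print(f"split: {taxon} -> {sorted(news)}")

# A new taxon absorbing members of several old taxa is a merge.
sources = defaultdict(set)
for member, taxon in new.items():
    sources[taxon].add(old[member])
for taxon, olds in sources.items():
    if len(olds) > 1:
        print(f"merge: {sorted(olds)} -> {taxon}")
```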


Thursday December 8, 2016 11:15 - 11:30 CST
Auditorium CTEC

11:15 CST

Biodiversity Informatics Curriculum Interest Group – BDI Curriculum IG
Informatics is becoming an ever more important approach within every field of science. Despite several major data sharing and linking initiatives and facilities for biodiversity data, biodiversity informatics (BDI) still lacks recognition as an independent methodological discipline providing a fundamental set of skills for students and researchers in biodiversity research. To establish this status, offering a biodiversity informatics curriculum at all levels of scientific education and related careers becomes essential.
Rather than training people to use different tools, the BDI Curriculum Interest Group should lay the basis for a deeper understanding of the key principles of biodiversity informatics and synthesize the teaching curriculum for different target groups and credit levels in the field. By identifying the target groups and an evolving list of topics, the Interest Group can proceed to define a framework, i.e., standardized curriculum modules at a conceptual level. Such a framework would support developing educational informatics programmes cost-effectively and, if need be, from scratch. It would also provide TDWG with a needed platform for knowledge exchange and for sharing content and ideas within the defined modules.
Task Groups are created within an Interest Group to develop a specific product within a specific time frame. An Interest Group is also responsible for maintaining the products of its past Task Groups.
This interest group session will first give an overview of the current status of BDI training and teaching activities over the past two years as reported by GBIF and TDWG members. Then we will set targets for concrete tasks and task groups that will aim to identify the BDI Curriculum target groups and existing expertise and curricula, and then work collaboratively to formulate standardized modules that could be used when building a framework curriculum for each target group. These tasks will be carried out with online tools by the task group members, and the resulting framework will be reported to TDWG and, it is hoped, presented at the TDWG 2017 meeting.
Later, this Interest Group can act as a coordinating body for sharing resources, seeking funding and finding expertise by surveying the needs of the TDWG and academic communities to further develop the curriculum framework(s). The Interest Group also aims to initiate and maintain discussion of biodiversity informatics curricula from various points of view, e.g., as an academic discipline; in training working professionals; and in career development.

Speakers

Thursday December 8, 2016 11:15 - 12:30 CST
Computer Science 3 Computer Science

11:30 CST

Participatory Three-Dimensional Modeling (P3DM) as a tool for biodiversity mapping: Application of Indigenous Knowledge and GIS Technology
This presentation will describe a participatory process by which indigenous people in the protected forest areas of Kenya translated their spatial memories into a georeferenced three-dimensional model covering their settlement areas. The 3D map-making process proved to be a catalyst in stimulating memory, articulating tacit knowledge and creating visible and tangible representations of the physical, biological and cultural landscapes of the project area. Elaborate elements of the map legend allowed participants in the community to gain greater clarity on meanings of and relationships between natural and cultural features. This model selectively displayed both the tangible and the intangible heritage of the indigenous people. The composition of the legend and the making of the model stimulated collegial learning and community cohesion. The process has been perceived as a milestone for indigenous knowledge in terms of working together towards a common goal, and in realizing the value and potential authority of their spatial knowledge once it was collated, georeferenced, documented and visualized.
GIS facilities have been and still remain a privilege of elite scholars at the community level. GIS applications are difficult to manage and depend strongly on outsiders’ skills and facilities. This presentation focuses on Participatory Three-Dimensional Modeling (P3DM), which may effectively be considered a bridge between the public and GIS. P3DM merges GIS-generated data and people’s knowledge to produce stand-alone relief models. These provide stakeholders with an efficient, user-friendly and relatively accurate tool for spatial research, analysis and decision making, the information from which can be extracted and further elaborated by the GIS. The 3D modeling process and its output (the scaled relief model) are the foundations upon which Public Participation GIS (PGIS) can release its full potential. P3D models provide local stakeholders and official policy makers with a powerful medium for interaction, by easing communication and lowering language barriers.


Thursday December 8, 2016 11:30 - 11:45 CST
Auditorium CTEC

11:45 CST

Results of IndexMed GRAIL Days 2016: How to use standards to build GRAphs and mIne data for environmentaL research
Data produced by biodiversity research projects that evaluate and monitor Good Environmental Status have a high potential for use by stakeholders involved in [marine] environmental management. The lack of specific scientific objectives, poor organizational logic, and a characteristically disorganized collection of information lead to a decentralized data distribution, hampering environmental research. In such a heterogeneous system, spanning different organizations and data formats, it is difficult to harmonize the outputs efficiently, and few tools are available to assist.
The task of the newly created IndexMeed consortium is to index biodiversity data (and to provide an index of qualified existing open datasets) and to make it possible to build graphs that assist in the analysis and development of new ways to mine data. Standards (including TDWG's) and specific protocols can be applied to interconnect databases. Such semantic approaches greatly increase data interoperability.
The aim of this talk is to present the results of the 2016 IndexMed workshop (https://indexmed2016.sciencesconf.org) and recent actions of the consortium (renamed “IndexMeed - Indexing for Mining Ecological and Environmental Data”): new approaches to investigate complex research questions and support the emergence of new scientific hypotheses. With one day of plenary sessions and two days of practical workshops, this event was dedicated to the sharing of experience and expertise and to the acquisition of practical methods to construct graphs and add value to data through metadata and "data papers". Recent developments in graph-based data mining, the potential for important contributions to environmental research (particularly to strategic decision-making), and new ways of organizing data were also discussed at the workshop.
In particular, this workshop promoted decisions on how (i) to analyze heterogeneous distributed data spread in different databases, (ii) to create matches and incorporate some approximations, (iii) to identify statistical relationships between observed data and the emergence of contextual patterns, and (iv) to encourage openness and the sharing of data, in order to value data and their utilization.
The IndexMeed project participants are now exploring the ability of two scientific communities (ecology sensu lato and computer science) to work together. The uses of data from biodiversity research demonstrate the prototype's functionalities and introduce new perspectives for analyzing environmental and societal responses, including decision-making. The output of the seminar lists scientific questions that can be addressed by the new data mining approaches and proposes new ways to investigate heterogeneous environmental data with graph mining.
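A small sketch of the graph-building idea follows: occurrence records become nodes linked to taxa and sites, so heterogeneous datasets can be mined as one graph. The record fields are illustrative, and networkx stands in for whatever graph store the consortium actually uses.

```python
# A toy sketch of turning occurrence records into a graph for mining;
# fields and values are illustrative placeholders.
import networkx as nx

records = [
    {"id": "occ1", "taxon": "Posidonia oceanica", "site": "Calanques"},
    {"id": "occ2", "taxon": "Posidonia oceanica", "site": "Port-Cros"},
    {"id": "occ3", "taxon": "Paracentrotus lividus", "site": "Port-Cros"},
]

g = nx.Graph()
for r in records:
    g.add_node(r["id"], kind="occurrence")
    g.add_node(r["taxon"], kind="taxon")
    g.add_node(r["site"], kind="site")
    g.add_edge(r["id"], r["taxon"])
    g.add_edge(r["id"], r["site"])

# Taxa recorded at a site emerge as two-hop neighbours via occurrences:
taxa = {t for occ in g.neighbors("Port-Cros")
        for t in g.neighbors(occ) if g.nodes[t]["kind"] == "taxon"}
print(taxa)  # -> {'Posidonia oceanica', 'Paracentrotus lividus'}
```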


Thursday December 8, 2016 11:45 - 12:00 CST
Auditorium CTEC

12:00 CST

Towards Recommending Visualization for Biodiversity Data
To address the critical challenges of biodiversity conservation and study its impact on the ecosystem, scientists have been producing a large amount of highly heterogeneous and distributed data. Managing, processing, and visualizing these data requires informatics skills. While many biologists lack these skills, informaticians are limited in their understanding of biological domain requirements and the context of the data. Studies have shown that the potential of visualization has not been fully utilized in scientific journals, due to inappropriate visualization selection with respect to the nature of the data and the message to convey. Inappropriate visualization selection not only impedes analysis but also results in misleading conclusions. To aid scientists in exploring and understanding their data and to provide a solution to this problem, we propose a semi-automated, context-aware visualization recommendation model. To be useful, such suggestions need to be based on the visualization knowledge of domain experts. To gather such knowledge and to understand the requirements of biodiversity scientists, we have designed the survey available at:
http://survey.sogosurvey.com/k/TsSTQTTsQSsWXsPsP
The acquired information and knowledge will be used in the development of a visualization recommendation framework serving the biodiversity research community. In the recommendation model, information will be extracted from data and metadata and annotated with suitable ecological operations (analytical tasks like spatial distribution or relative species abundance). This information will be mapped to the visualization semantics, i.e., for each extracted operation, which variables are involved and how they are visually represented. This helps in deriving the relevant visualizations for that data.
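A toy sketch of the mapping idea follows: the detected analytical task and variable type select candidate chart types. The rules here are illustrative placeholders, not the proposed model itself.

```python
# A toy rule-based sketch of visualization recommendation;
# the rule table is an illustrative placeholder.
RULES = {
    ("spatial_distribution", "geographic"): ["choropleth map", "dot map"],
    ("relative_abundance", "categorical"): ["bar chart", "treemap"],
    ("trend", "temporal"): ["line chart"],
}

def recommend(task: str, variable_type: str) -> list:
    """Return candidate chart types, falling back to a plain table."""
    return RULES.get((task, variable_type), ["table"])

print(recommend("relative_abundance", "categorical"))
```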


Thursday December 8, 2016 12:00 - 12:15 CST
Auditorium CTEC

12:15 CST

Discussion: Application of Data Visualisation for Sustainable Biodiversity: Deriving Useful Knowledge and Insights from Heterogeneous Data
Data visualisation plays a crucial role in communicating complex and diverse information due to its ability to condense large amounts of information into an understandable form. Over the last few years, there have been significant developments in the field of scientific data visualisation and its increased use at different stages of the research data lifecycle. In the recent past, biodiversity scientists have started to exploit the power of data visualisation to gain insights into and comprehend multidimensional biodiversity data. However, due to limited data visualisation skills and awareness, many biodiversity scientists are unable to interact intimately with the data to gain deeper insights and communicate the outcomes effectively. Despite the presence of advanced and sophisticated tools, the smart use of scientific data visualisations to produce informative, engaging and unbiased visualisations illustrating deeper links has received little attention in the field of biodiversity science.
This discussion will focus on understanding the requirements, usage implications and issues faced by biodiversity scientists in the data visualization development process. We are also gathering this information via a survey available at:
http://survey.sogosurvey.com/k/TsSTQTTsQSsWXsPsP


Thursday December 8, 2016 12:15 - 12:30 CST
Auditorium CTEC

12:30 CST

Thursday Lunch
Lunch is served in the Cafeteria (also known as "Soda"). Those leaving for the Soltis Center should pick up a bag lunch to carry on the bus.

Thursday December 8, 2016 12:30 - 13:30 CST
Cafeteria Cafeteria

12:45 CST

Buses leave for Soltis Center & Mini-Bioblitz
Sign up for this event ($20) by Monday 5 December in RegOnline so we may estimate the number of buses and bagged lunches needed.

Thursday December 8, 2016 12:45 - 12:45 CST
Parking Lot CTEC

13:30 CST

Buses leave CTEC for La Fortuna hotels
Those not attending the mini-bioblitz may elect to return to hotels for the afternoon.

Thursday December 8, 2016 13:30 - 13:30 CST
Parking Lot CTEC

16:15 CST

Buses leave Soltis Center for La Fortuna hotels
Thursday December 8, 2016 16:15 - 16:15 CST
Entrance EcoTermales

17:00 CST

First bus leaves hotels for EcoTermales
Participants who did not go to the Soltis Center for the bioblitz need to take this bus. This is a chance to soak in the resort's hot springs http://ecotermalesfortuna.cr/ before attending the Gala Dinner.

Thursday December 8, 2016 17:00 - 17:00 CST
Hotels Recommended Hotels Area

17:30 CST

Enjoy Hot Springs at EcoTermales before the Gala Dinner
Change, enjoy the hot springs, then change back, enjoy the surroundings, and have something at the bar until dinner http://ecotermalesfortuna.cr/

Thursday December 8, 2016 17:30 - 19:30 CST
Hot Springs Pools EcoTermales

18:00 CST

Last bus travels from hotels to EcoTermales
All participants returning from the Soltis Center Bioblitz should be on this bus.

Thursday December 8, 2016 18:00 - 18:00 CST
Hotels Recommended Hotels Area

19:30 CST

Gala Dinner at EcoTermales
A special Gala Dinner at EcoTermales http://ecotermalesfortuna.cr/ is included as part of your registration.

Thursday December 8, 2016 19:30 - 21:30 CST
Restaurant EcoTermales

21:30 CST

Buses return to hotels from Gala Dinner at EcoTermales
Thursday December 8, 2016 21:30 - 21:30 CST
Entrance EcoTermales
 
Friday, December 9
 

08:45 CST

Friday Registration
CTEC lobby

Friday December 9, 2016 08:45 - 09:00 CST
Lobby CTEC

09:00 CST

An open participatory species traits infrastructure to organize and discover species information.
Traits and trait values describe the characteristics of a species and its phenotype. Traits are increasingly finding value in categorising species diversity by function, morphology, physiology, and ecology. The India Biodiversity Portal is developing an infrastructure for species traits. The objective is to model and implement a traits database for species that can evolve and scale with more information. It will aggregate a vocabulary of species attributes for organizing species information and for identifying species by filtering on traits.
Each trait will be associated with a node on the taxonomic hierarchy, and its scope will be the clade of species below that node. Further, traits will be associated with a description field of the species profile model. For example, a leaf type trait will apply to all species of the kingdom Plantae and will be associated with the morphology species field.
Traits are associated with a set of trait values, which can be categorical or continuous. Values defining a species can be single- or multiple-valued. Further, trait values can be defined at the level of a species or aggregated from values associated with individuals of a species. A facts table defines the association of species with trait values. Every species will be represented by a set of traits and values, allowing species to be filtered and explored by traits.
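A minimal sketch of the facts-table idea follows: species filtered by selected trait values. The table contents are illustrative, not the portal's schema.

```python
# A toy facts table: (species, trait, value) triples, with a filter
# in the spirit of the portal's trait-based exploration interface.
facts = [
    ("Mangifera indica", "leaf_type", "simple"),
    ("Azadirachta indica", "leaf_type", "compound"),
    ("Mangifera indica", "habit", "tree"),
]

def species_with(trait, value):
    return {sp for sp, t, v in facts if t == trait and v == value}

print(species_with("leaf_type", "simple"))  # -> {'Mangifera indica'}
```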
A basic and simple traits infrastructure has been implemented on the open source biodiversity informatics platform that powers the India Biodiversity Portal, WIKWIO (Weed Identification and Knowledge in the Western Indian Ocean), and the Bhutan Biodiversity Portal. The system has been populated with traits data used in the IDAO (IDentification Assistée par Ordinateur) species identification system. The system allows trait definitions, trait values and facts tables to be uploaded. Traits can be explored through a page that lists all traits and allows users to interact with them by selecting a set of traits and exploring the species associated with them. Individual trait pages can also be viewed, with their set of values, the type of the data, and the species associated with them. The traits page allows users to add values to a trait and to comment on the trait. Species pages display the set of traits that characterize the species, providing a pictorial representation of a species that is useful for communication, learning and awareness building on species and biodiversity. The trait data can be exported in Darwin Core Archive format.


Friday December 9, 2016 09:00 - 09:15 CST
Computer Science 3 Computer Science

09:00 CST

Streamlining the Flow of Taxon Occurrence Data Between a Manuscript and Biological Databases
Taxonomic practice dictates that authors cite the occurrences their analysis is based on (materials) in the treatment section of a taxonomic paper. Information on species occurrences may be stored in different biodiversity databases, such as GBIF, PlutoF, and iDigBio. Manual entry of occurrence records into a taxonomic paper is error-prone and time-consuming. This is why we developed an API (Application Programming Interface)-based materials import for Pensoft’s ARPHA Writing Tool and consequent submission, peer review and publication in the Biodiversity Data Journal.

Ultimately, for an author of a taxonomic publication there are three use cases: (1) Occurrence records have not been digitized before; in this case, manual entry is always needed. (2) Occurrence data have been deposited at data aggregators and are available online from there; we developed an automated import for this case. (3) The data are available in a structured format, e.g., in a Darwin Core (DwC) compliant Excel spreadsheet; we developed a tool for importing such data directly into a manuscript.

In order to import occurrence information from GBIF and iDigBio, we rely on the DwC format. For systems such as BOLD Systems and PlutoF, whose APIs do not use DwC, we developed a mapping between their terms and DwC.
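The term-mapping step can be sketched as below: source field names are renamed to Darwin Core terms before import. The source field names shown are illustrative placeholders, not the actual BOLD or PlutoF schemas.

```python
# A hedged sketch of renaming non-DwC source fields to Darwin Core terms;
# the source field names are illustrative assumptions.
BOLD_TO_DWC = {
    "taxon_name": "scientificName",
    "collection_date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}

def to_dwc(record: dict, mapping: dict) -> dict:
    """Rename mapped keys; pass unmapped keys through unchanged."""
    return {mapping.get(k, k): v for k, v in record.items()}

print(to_dwc({"taxon_name": "Apis mellifera", "lat": 10.47}, BOLD_TO_DWC))
```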

Tracking the usage of occurrence records in publications is important for authors and collection managers. Where an occurrence ID (a persistent unique identifier of the occurrence) is present, tracking is always possible. However, not all occurrence records have an occurrence ID; in this case a DwC triplet is used. We discuss how well different databases support these two approaches.
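A sketch of the two tracking approaches follows: prefer a persistent occurrenceID, otherwise fall back to the standard Darwin Core triplet of institutionCode, collectionCode and catalogNumber.

```python
# A minimal sketch of deriving a tracking key for an occurrence record:
# occurrenceID when present, else the Darwin Core triplet.
def tracking_key(record: dict) -> str:
    if record.get("occurrenceID"):
        return record["occurrenceID"]
    return "{institutionCode}:{collectionCode}:{catalogNumber}".format(**record)

rec = {"institutionCode": "NHM", "collectionCode": "BOT",
       "catalogNumber": "12345", "occurrenceID": ""}
print(tracking_key(rec))  # -> NHM:BOT:12345
```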

Finally, DNA-based species are often grouped into Operational Taxonomic Units (OTUs). We are able to import all occurrence information from a BOLD OTU identified by a Barcode Index Number (BIN). This streamlines the formal taxonomic description and naming of DNA-based taxa. This is important, as the number of dark taxa is rising and a fast route to taxonomic descriptions of DNA-based taxa is desperately needed.

Our workflows will also act as a curation filter for occurrence data: once data are imported into a manuscript in the publication pipeline, their accuracy is expected to be vetted by authors and reviewers.

Finally, if authors wish to publish complete occurrence datasets as data papers, we have developed an automatic generation of manuscripts from metadata expressed in Ecological Metadata Language (EML).




Friday December 9, 2016 09:00 - 09:15 CST
Auditorium CTEC

09:00 CST

IG09: Natural Collections Descriptions (NCD)
Natural Collections Description (NCD) is a proposed data standard for describing collections of natural history materials at the collection level; one NCD record describes one entire collection.
Collection descriptions are electronic records that document the holdings of an organisation as groups of items, complementing the more traditional item-level records such as those produced for a single specimen or a library book. NCD is tailored to natural history. It lies between general resource discovery standards such as Dublin Core (DC) and rich collection description standards such as the Encoded Archival Description (EAD).
The NCD standard covers all types of natural history collections, such as specimens, original artwork, archives, observations, library materials, datasets, photographs or mixed collections such as those that result from expeditions and voyages of discovery.
https://github.com/tdwg/ncd
Current issues are described at https://github.com/tdwg/ncd/wiki.

Speakers

Friday December 9, 2016 09:00 - 10:30 CST
TecnoAula 2 CTEC

09:15 CST

Plinian Core profiles: Facilitating the documentation and access to biological information of exotic and invasive species
The Plinian Core (PliC; https://github.com/PlinianCore/Documentation/wiki/About) offers a number of concepts for describing a particular taxon and, in the same way, for describing aggregations of taxa that share certain characteristics. As part of a project with a group of specialized researchers working on the identification and categorization of exotic and invasive species, the principal attributes for describing these taxa were documented. The researchers were also involved in improving the standard's definitions and the terms used to describe taxa with potential or existing extinction risk, in order to facilitate the assessment of a national threat status category. As a result, 22 PliC elements were identified as mandatory for documenting information about exotic and invasive species; moreover, for the assessment of species extinction risk, 26 PliC attributes were identified. The outcome was consolidated in two guides (one for threatened and one for invasive species) that will acquaint researchers not only with the standard but also with the guidelines for building taxon pages for threatened and invasive species. Currently these guides are in the process of formal publication as working papers, so that they may be easily consulted in physical format and accessed online. The work carried out to obtain this specific application of the PliC demonstrates collaboration within a community of experts that can be adopted to integrate and validate information on any particular group of taxa. These application profiles are a clear example of the functionality and scope of the Plinian Core.


Friday December 9, 2016 09:15 - 09:30 CST
Computer Science 3 Computer Science

09:15 CST

Molecular and morphological biodiversity: survey of data access, use and publishing
Dmitry Schigel, Gabriele Dröge, Siro Masinde
The Global Biodiversity Information Facility (GBIF) and the Global Genome Biodiversity Network (GGBN) have been seeking expert opinions and the best available experience from biodiversity and genomics scientists, as well as data holders, on how to link morphological and molecular biodiversity data, i.e., physical specimens, derived DNA, sequences, observations and other evidence of species' presence and abundance in space and time. The survey targeted data users and publishers of specimen and sequence information.
GBIF provides a single access point to the largest number of records on species occurrences, comprising mainly specimen data from natural history collections and observational data from research, monitoring and citizen science projects. GGBN brings together the world’s biodiversity biobanks in order to make high-quality, well-documented and vouchered collections that store DNA and/or tissue samples of biodiversity discoverable and accessible. Many institutions are currently developing and implementing various digitisation projects, and it has been suggested that sampling the collections for genomic analyses could be better linked to and included in these workflows to enhance efficiency.
In particular, we aimed to i) better understand the data use and data publishing needs of molecular biodiversity researchers, ii) get an overview of molecular collection holdings, and iii) document existing data management solutions for environmental samples. The survey ran from May to September 2016, and early signals were presented and discussed at the GGBN conference in Berlin in June 2016. Final results summarizing the survey of more than 190 respondents are presented at TDWG 2016 in Costa Rica for further discussion and elaboration of the action plan.

Speakers

Friday December 9, 2016 09:15 - 09:30 CST
Auditorium CTEC

09:30 CST

Mamut: An online web editor to manage biological information of taxa based on the Plinian Core
Plinian Core (PliC, https://github.com/PlinianCore/Documentation/wiki/About) was conceived as a standard for publishing and sharing information about all kinds of properties and traits related to taxa (of any rank), including descriptions, nomenclature, conservation status, management, natural history, etc. The development of the web editor Mamut allows users of the Biodiversity Catalogue of Colombia (http://catalogo.biodiversidad.co) to incorporate bio-ecological information about taxa, following the attributes and definitions set out in the PliC. The creation of this tool involved a close collaboration between the Information Technology and the Content Management teams of the Biodiversity Information System of Colombia (SiB Colombia), which allowed the integration of both the application profile and the online interface so that users can access it to create or edit content about a particular taxon.
This editor is composed of two main parts. The frontend, called Mamut (https://github.com/SIB-Colombia/mamut), was developed using the Angular framework (https://angularjs.org) and provides a series of sections organized according to the different elements of the Plinian Core. Any information sent to Mamut is conveyed to the second component, the backend, a.k.a. Chigui (https://github.com/SIB-Colombia/chigui), developed in NodeJS.
Based on the Plinian Core schema and terms (https://github.com/PlinianCore/Documentation/wiki#--list-of-terms) defined at the 2015 PliC Interest Group meeting in Mexico, the SiB Colombia standard profile now consists of 165 attributes selected from the original set and grouped into almost the same sections proposed then; these are now available for documentation in Mamut. Today, the editor contains some modifications that simplify and adapt the standard to the needs of, mainly, Colombian researchers. Mamut represents an ongoing open access tool that other countries or initiatives can use for inspiration or repurpose when developing their own species catalog editor.


Friday December 9, 2016 09:30 - 09:45 CST
Computer Science 3 Computer Science

09:30 CST

DNA and DOI-based identification of fungi in the built environment
There are many reasons why fungi found indoors, much like fungi at large, are not easy to characterize using molecular methods. There are no reference DNA sequences for the majority of the described fungal species, and the majority of extant fungal species are not described. These complications place high demands on the reference databases and tools used and call for solutions that are not normally found in the molecular ecology toolbox. This presentation will discuss database and analysis solutions to allow high-precision studies of indoor fungi in an accessible and reproducible way.

Speakers

Friday December 9, 2016 09:30 - 09:45 CST
Auditorium CTEC

09:45 CST

TraitMan - an Online Management System for Biological Traits.
The database platform PlutoF has been developed at the Museum of Natural History, University of Tartu, Estonia. It provides various applications in different fields related to biology: ecological and taxonomic research, species monitoring, citizen science, etc. It can be used for observation and specimen data, as well as population, ecosystem and molecular data, data about scientific collections, and so forth. PlutoF provides infrastructure and services enabling data gathering, processing, analysis with various tools, reporting, and publishing. It consists of distinct modules, but the overall structure enables synergetic use of data obtained from different sources. A new module - TraitMan for traits and characters - has been developed recently.
Along with the development of TraitMan, different solutions for creating a universal and dynamic traits/characters system were investigated in terms of data model, technical specification and programming.
TraitMan allows a user to create lists of traits as hierarchical trees based on the user’s own specification, templates provided by the system, or publicly available trait lists from another workgroup or user.
List items can then be flexibly included in forms as data gathering fields for particular research projects. Fields can be customized in a variety of ways - by choosing between different field types; by adding predefined or custom units of measure; and by specifying control algorithms for each field.
For taxonomists, TraitMan enables the creation of taxon descriptions based on trait/character data, and their analysis at different taxonomic levels. The data can be used to create identification keys (e.g. for species identification, citizen science projects, etc.) and to publish them as web-based applications.
TraitMan is fully integrated with the PlutoF platform, enabling data analysis of one project or summaries over multiple projects and workgroups. It also benefits from the advanced system of permissions and data licensing implemented in PlutoF.
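A minimal Python sketch of the kind of data structure such a trait module implies: a hierarchical trait tree whose fields carry optional units of measure and control algorithms. The class and field names are our own illustration, not PlutoF's actual model.

    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class Trait:
        """One node in a hierarchical trait tree."""
        name: str
        unit: Optional[str] = None  # e.g. "mm" for a measurement field
        validate: Optional[Callable[[float], bool]] = None  # control algorithm
        children: List["Trait"] = field(default_factory=list)

    # A user-defined trait hierarchy with a simple range check on one field.
    morphology = Trait("Morphology", children=[
        Trait("Cap diameter", unit="mm", validate=lambda v: 0 < v < 500),
        Trait("Spore colour"),
    ])

    cap = morphology.children[0]
    value = 42.0
    assert cap.validate is None or cap.validate(value), "value fails the field's control algorithm"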


Friday December 9, 2016 09:45 - 10:00 CST
Computer Science 3 Computer Science

09:45 CST

Use of molecular data for aquatic biomonitoring – future potential and challenges from a European perspective
The protection, preservation and restoration of aquatic ecosystems and their functions is of global importance. In order to assess the ecological status of a given water body, aquatic biodiversity data can be collected and compared to a reference water body. The quantified mismatch thus obtained determines the extent of potential management actions. On a European level, standard bioassessments are implemented, for example, in the Marine Strategy Framework Directive (2008/56/EC) and the Water Framework Directive (2000/60/EC). In practice, bioassessments are often based on a few morpho-species or, more typically, higher-order taxa (genus, family), and misidentifications are a common concern. In the past decade, molecular data has proven a) to provide a higher resolution (i.e. detectability of taxa and genetic diversity), and b) to allow for the identification of cryptic or otherwise unidentifiable organisms (e.g. non-diagnostic sexes or juvenile stages). Hence, the implementation of molecular (i.e. DNA barcoding) or ecogenomic data (i.e. DNA metabarcoding) into routine bioassessment strategies seems promising.
However, such a wide-ranging implementation faces several critical conceptual challenges. Species identification is only possible if a validated DNA barcode reference library for target taxa exists. Field, lab and analytical protocols should be standardised and codified in Standard Operating Procedures (SOPs), thus allowing reproducibility and comparability. Also, High-Performance Computing (HPC) and innovative data storage concepts have to be developed so as to deal with the terabytes of ecogenomic biodiversity datasets. Finally, good-practice SOPs have to be legally implemented. As of October 2016, the Co-Operation in Science and Technology (COST) program of the European Commission supports the COST Action CA15219 ('DNAqua-Net'), comprising five working groups (WGs): 'WG1: DNA Barcode References'; 'WG2: Biotic Indices & Metrics'; 'WG3: Lab & Field Protocols'; 'WG4: Data Analysis & Storage' and 'WG5: Implementation Strategies & Legal Issues'. DNAqua-Net will explicitly target the aforementioned challenges and deliver innovative, good-practice SOPs for a standardised aquatic bioassessment strategy in Europe – and beyond.


Friday December 9, 2016 09:45 - 10:00 CST
Auditorium CTEC

10:00 CST

Xper3: a Collaborative Descriptive Data System with Web Services
Xper3 is a knowledge base management system dedicated to managing phenotypes. It includes two web services for identification: a single-access key construction web service, and an interactive multi-access key (Mkey+). Xper3 is also widely used for the formalization, storage and automated comparison of phenotypes. Xper3 is a pioneer in taxonomic management software, and the platform was immediately adopted by a large set of users, proving its originality and efficiency: 1652 users and 2023 databases in July 2016.
The collaborative editor is especially useful for taxonomic research networks. Scientists may share their data on phenotypes (structured descriptions, documented by images, videos, and text including bibliography and external links), compare phenotypes, and import or export partial or total content in various standard formats such as SDD (Structured Descriptive Data), CSV (Comma-Separated Values), and NEXUS for external analyses.
We favor the use of modular and open technologies such as web services, while paying particular attention to the user interface, in order to allow biologists to use our tools with little or no learning time. The flexibility of the platform makes it possible 1) to customize the interface depending on the content (for instance, we are building a custom interface of the interactive key web service for SPIPOLL (Suivi Photographique des Insectes Pollinisateurs), a citizen science project on pollinating insects; see http://www.spipoll.org/), 2) to add new web services, or 3) to use the web services in another platform.
This architecture facilitates improvement and ongoing development, and we plan to connect phenotypes with acoustic and genomic data.
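To make the multi-access key idea concrete, here is a minimal Python sketch of how such a key narrows the candidate taxa as character states are entered in any order. The taxa, characters and states are invented; this illustrates the principle only and is not Xper3's implementation.

    # Invented taxon descriptions: character -> state.
    descriptions = {
        "Taxon A": {"wing colour": "black", "antenna": "clubbed"},
        "Taxon B": {"wing colour": "white", "antenna": "clubbed"},
        "Taxon C": {"wing colour": "black", "antenna": "filiform"},
    }

    def remaining(observations):
        """Keep taxa whose description matches every state observed so far."""
        return [taxon for taxon, desc in descriptions.items()
                if all(desc.get(ch) == st for ch, st in observations.items())]

    print(remaining({"wing colour": "black"}))                        # ['Taxon A', 'Taxon C']
    print(remaining({"wing colour": "black", "antenna": "clubbed"}))  # ['Taxon A']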


Friday December 9, 2016 10:00 - 10:15 CST
Computer Science 3 Computer Science

10:00 CST

High-Throughput Sequencing and Taxon Occurrence Databases
There is a major gap between taxon occurrence data, which are based on specimens (including living ones) or human observations, and DNA sequence data from biological samples such as soil, air and water. These data often accumulate in different databases and are not compatible with each other, partly because of missing knowledge about useful data standards. This is especially evident in taxa where most new occurrence data are produced in High-Throughput Sequencing (HTS) studies. For example, HTS analyses of freshwater, marine and soil samples are generating millions of occurrences which are currently difficult, or mostly impossible, to search through common portals which rely on taxon names (GBIF, GGBN, etc.). This restricts research in many other fields, including nature conservation. Bridging DNA-based taxon occurrence datasets with observational and specimen datasets will be discussed.
GGBN = Global Genome Biodiversity Network
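As one hedged sketch of what such bridging could look like in practice, the Python snippet below expresses a single HTS-derived observation as a Darwin Core style occurrence record, making it discoverable by taxon name. The sample and OTU values are invented; the term names (occurrenceID, basisOfRecord, organismQuantity, etc.) are standard Darwin Core.

    # Invented example data: one OTU from one environmental sample.
    otu = {"otu_id": "OTU_0042", "taxon": "Mortierella elongata", "reads": 1534}
    sample = {"id": "soil-plot-7", "date": "2016-06-01", "lat": 59.83, "lon": 26.72}

    occurrence = {
        "occurrenceID": f"{sample['id']}:{otu['otu_id']}",
        "basisOfRecord": "MaterialSample",
        "scientificName": otu["taxon"],
        "organismQuantity": otu["reads"],
        "organismQuantityType": "DNA sequence reads",
        "eventDate": sample["date"],
        "decimalLatitude": sample["lat"],
        "decimalLongitude": sample["lon"],
    }
    print(occurrence)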

Speakers

Friday December 9, 2016 10:00 - 10:15 CST
Auditorium CTEC

10:30 CST

Friday AM Break, Poster Viewing, & Registration
CTEC Lobby

Friday December 9, 2016 10:30 - 11:00 CST
Lobby CTEC

11:00 CST

DiSSCo: Building the future of Europe’s Natural History Collections
The Distributed System of Scientific Collections (DiSSCo) partners with circa 50 institutions from 18 European countries in an effort to transform Europe’s Natural History collections into an integrated pan-European research infrastructure. By developing a sustained programme of collections digitisation, cataloguing and harmonising activities, DiSSCo will create an interoperable set of data portals that provide common access to European collections; a digitally skilled workforce within participating institutions; and an unparalleled dataset with which to address questions of relevance to science and society. Our aim is to use these data to plan collections development, fill gaps and allow researchers to access collections (both physically and digitally) through a single integrated service. DiSSCo is a proposal being developed for inclusion in the 2018 ESFRI (European Strategy Forum on Research Infrastructures) Roadmap, and is therefore subject to the vagaries of external funding. Nevertheless, it represents a common development trajectory for European Natural History Collections and is arguably the first time that so many European organisations have worked together at an executive level to build a common future for their institutions. DiSSCo provides a conceptual framework around which national and European projects and initiatives can align (e.g. SYNTHESYS, CETAF, and GBIF*), mitigating the risks associated with single high-profile funding applications. We urge non-European institutions to develop similar initiatives at regional levels along potential funding lines, such that these continental pillars may eventually combine to provide a global set of services to improve our understanding and management of the natural world.
*Synthesis of Systematic Resources, www.synthesys.info
Consortium of European Taxonomic Facilities, cetaf.org
Global Biodiversity Information Facility, www.gbif.org


Friday December 9, 2016 11:00 - 11:15 CST
Computer Science 3 Computer Science

11:00 CST

Exploring Data Gaps at the Species Level: Starting with demographic knowledge
When population mortality outweighs fertility, the long-term survival of the population is not sustainable, resulting in species extinction. Given the current extinction crisis, there is an urgent need to maximize the effectiveness of conservation management programs. For many such strategies, well-informed demographic models provide rigorous predictions of population fate, which can be critical for successful conservation. However, the accuracy of these models hinges on the availability of demographic data. Despite the importance of demographic data for informing the management of threatened species, there has been no global assessment of the demographic information available. We standardized the taxonomy and the terms describing traits across 24 databases on demography and/or demographic life-history traits. We developed a Demographic Index of Species Knowledge (DISKo) that shows data availability for fertility and survival, the essential data for understanding population dynamics and therefore demographics. Our results show that demographic data covering both fertility and mortality are shockingly scarce, even for the best-studied tetrapod groups. Data of the highest quality (i.e. sufficient to forecast population fate) are available for only 3.4% of mammal and just 1% of bird species; for amphibians and reptiles the figure is below 1% (0.2% and 0.3%, respectively). Knowledge is also geographically biased, with glaring data gaps in the tropics. The low number of species with demographic data is surprising because major efforts to tackle comparative questions in ecology and evolution have resulted in the development of numerous species-specific trait databases (e.g. PanTHERIA, the Amniote life-history database, AnAge). Our results illustrate the distribution of demographic information across trait type, phylogeny and space, and provide a useful tool for optimizing future research priorities. In addition, they highlight the importance of standardizing terminology and units across parallel database efforts. The next step is to link DISKo with a Genetic Knowledge Index that we are developing.


Friday December 9, 2016 11:00 - 11:15 CST
Auditorium CTEC

11:00 CST

TDWG Species Information Interest Group
The TDWG Species Information Interest Group (TSI IG) aims to be an exchange forum and standard development think tank in all facets related to "species information," covering biological (descriptions, distribution, etc.) and non-biological (legal, management, etc.) aspects. This is a very transversal group with much in common with other TDWG IGs, especially Biological Descriptions, Invasive Species, Taxonomic Names and Concepts.
This year's sessions aim at two objectives: to set the stage in this area, and to coordinate and advance the standardization of species information.
To set the stage, the first 90-minute session will consist of short (10-15 min) presentations. The second 90-minute session will build on this foundation to do actual work: synchronizing initiatives, refining concepts and terms, etc.
The TSI IG has its roots in an international group of people that has been developing the "Plinian Core" (https://github.com/PlinianCore). We expect this group to expand and become a TDWG task group this year, with truly global participation, embedded in TDWG procedures.

Speakers
Sponsors

Friday December 9, 2016 11:00 - 12:30 CST
TecnoAula 2 CTEC

11:15 CST

Concept relations in practical use: Taxonomic checklists in the context of Red Lists of endangered species
Red Lists are taxonomic checklists that report the threatened status of species within a certain region or globally. These checklists are of high practical importance for conservation, depicting the threats to species biodiversity. They are updated regularly and thus serve to keep track of the change in species occurrence and coverage. This allows prediction of tendencies in the species’ distribution and needs for ensuing conservation measures.
Taxonomic concepts (the circumscription of species) may change over time as a result of new taxonomic evidence. This directly affects the usability of the Red Lists. Such changes have to be tracked, assessed with regard to their impact on the degree of threat of the affected species, and – where necessary – the measures taken have to be updated to represent the latest state of knowledge. In an information system, this is handled by concept relations between the versions of the checklist. Concept relations basically describe the relation between two taxa by means of the set relationships used in set theory: congruent, included in, includes, overlapping, excludes.
The EDIT Platform for Cybertaxonomy is such a system. The Platform is based on the Common Data Model (CDM), which covers the entire domain of taxonomic research. The CDM provides an object-oriented model for storing and editing taxonomic data, including support for concepts and concept relations. A programming library offers a service-oriented application program interface (API), and a Drupal-based Data Portal is used to publish data on the Web. We here present a web service-based checklist editor for the EDIT Platform, which specifically addresses the handling of concept relations in a simplified, fit-for-purpose manner. The editor is implemented using Vaadin, a high-performance web-application framework, which allows us to provide a state-of-the-art user interface leading the user step-by-step through the work process and masking the complex background model and operations.
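The five set relationships translate directly into code. Below is a minimal Python sketch that treats each taxon circumscription as a set of lower-level members (specimens or subtaxa); it illustrates the concept only and is not the EDIT Platform's implementation.

    def concept_relation(a, b):
        """Classify the relation between two taxon circumscriptions,
        each given as a set of members (e.g. specimens or subtaxa)."""
        if a == b:
            return "congruent"
        if a < b:   # proper subset
            return "included in"
        if a > b:   # proper superset
            return "includes"
        if a & b:   # non-empty intersection
            return "overlapping"
        return "excludes"

    # A checklist revision splits one species concept into two.
    old_concept = {"pop1", "pop2", "pop3"}
    new_concept = {"pop1", "pop2"}
    print(concept_relation(new_concept, old_concept))  # 'included in'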


Friday December 9, 2016 11:15 - 11:30 CST
Computer Science 3 Computer Science

11:15 CST

Trust Management Approaches Applied to Biodiversity Data
Traditionally, in biodiversity studies, the scientists who collected the data were also the people who analyzed it and published the associated scientific papers. The advent of the DiGIR and Species Analyst software by Vieglais and collaborators, however, spawned a revolution in how biodiversity studies were undertaken. It became possible for scientists to electronically collate data from disparate museum collections, which spurred the growth of niche modeling. This new approach helped create the Global Biodiversity Information Facility (GBIF) and programs such as iDigBio. The GBIF portal now serves as a directory to openly available biodiversity data from museums, government agencies and non-governmental organizations (NGOs) around the world. In addition, it allows for the complete separation of data collection and data analysis. However, this separation of data collector and data analyst has generated a new concern: how can GBIF and data analysts have confidence in the data, especially when it comes from a variety of sources? The traditional answer has been to use data quality measures, but studies in other fields such as social science, psychology, economics and computer science show that trust also plays a role. Here we develop a model that includes elements of both data quality and trust to understand the confidence one might have in biodiversity data. This modeling effort can lead to new policies that increase the perception of data quality and the knowledge shared among data suppliers, data aggregators such as GBIF, and data analysts. The purpose is to find ways to increase the use of biodiversity data from aggregators to meet conservation goals.
DiGIR = Distributed Generic Information Retrieval


Friday December 9, 2016 11:15 - 11:30 CST
Auditorium CTEC

11:30 CST

Online Pollen Catalogue Network (RCPol)
The need for a pollen database to help with the identification of plant species motivated the construction of RCPol (Online Pollen Catalogues Network). Designed in 2009 and created in 2013-2015, RCPol’s main objective is to promote interaction between researchers and the integration of data from their pollen collections, herbaria and bee collections. RCPol’s coordinators and collaborators intend to facilitate the search for information on angiosperm species, their flowers, their pollen, and the interactions between these plants and bees. Its main feature is an interactive species identification key that was developed in collaboration with the Escola Politécnica of USP (Universidade de São Paulo) and is used by researchers at several Brazilian institutions and in other countries in the Americas and Europe. This key was developed to identify species through the morphological description of their flowers and pollen grains. The database is available on the RCPol website (http://www.rcpol.org.br). The network also provides access to plant species webpages that describe the main characteristics of each species, and to the specimen data in the collections. Currently, RCPol’s Melissopalynology and Palynoecology database holds more than 500 plant species. Two other pollen databases are under construction, Palynotaxonomy and Paleopalynology, both planned to be publicly available at the end of 2016.
Eight pollen collections are currently taking part in the network, and fifteen others are expected to join in the next two years. Palynology has been a complementary science, supporting studies on the management and conservation of pollinators, especially bees, in natural ecosystems and agroecosystems. At the beginning, our focus was on identifying the plant species used in bees’ diets, but over time we extended it to other areas of Palynology, such as Palynotaxonomy, Copropalynology, Forensic Palynology, Geopalynology and Paleopalynology. With the spread of the use of pollen as a natural marker, and given the small number of researchers working in Palynology and the few pollen collections in Brazil relative to the existing botanical diversity, RCPol wants to encourage the integration of pollen collections.
Our presentation will include our progress, some technological decisions, integration with data quality tools and standards, and our roadmap. Our goal is to collect feedback from the community to drive the future of the network.

Speakers
Sponsors

Friday December 9, 2016 11:30 - 11:45 CST
Computer Science 3 Computer Science

11:30 CST

A new power balance is needed for trustworthy biodiversity data
Biologists' trust and use of aggregated biodiversity data are suffering because of persistent criticisms of the quality of these data for basic and applied analyses. Individually, one can interpret each criticism as a problem of data quality local to some taxonomic group or geographic region. Indeed, biodiversity aggregators often respond by pointing critics toward correcting errors at their source. We will show, however, that these disputes over data quality are better understood as reflecting systemic flaws in the design of the aggregation process. As a result, fundamental change is needed to effectively address issues of trust in big biodiversity data. In particular, the design change must expand the roles available to researchers as established by data aggregators, such that the interests and views of bottom-up, high-quality content providers are more directly represented. We will outline steps towards alternative, provenance-aware design solutions that promote the formation and maintenance of high-quality biodiversity data packages.
Our discussion focuses on the unitary taxonomic syntheses ("backbones") created by biodiversity data aggregators. We show how the aggregation process can lead to a loss of data unity at the system level when different data sources adhere to conflicting taxonomic perspectives.
Many aggregators follow a design paradigm that requires one taxonomic hierarchy to organize all data at a given time. They achieve this unitary representation of the data using combinations of algorithmic and social practices governed by feasibility constraints rather than principles grounded in taxonomic theory. Eliminating taxonomic conflict between input sources in this manner often results in a hierarchy that no longer corresponds to the view of any particular source – it is a synthesis nobody believes in. Biodiversity data users and contributors frequently regard the quality of these novel classification theories as deficient.
We will show how the Darwin Core (DwC) standard plays a critical role in the design of the aggregation process. We carefully separate causes for poor aggregation that are rooted in failures on the data provision or DwC implementation side, versus systemic DwC flaws in the context of aggregation. For the latter, we outline specific syntactic and semantic solutions - often but not always represented in the Taxonomic Concept Transfer Schema - to achieve suitable aggregation outcomes. We conclude that improved aggregation designs must increase the power allocated to individual (or co-authoring) experts and their heterogeneous views to act as intermediary license-providers for the formation of trusted, big biodiversity data.


Friday December 9, 2016 11:30 - 11:45 CST
Auditorium CTEC

11:45 CST

French national standards for biodiversity: interoperability on an international level
As part of the Biodiversity Information System on Nature and Landscapes (SINP), the French National Natural History Museum has been appointed by the French ministry in charge of ecology to develop a biodiversity data exchange standard. The objective of this standard is to share marine and terrestrial data on French biodiversity at the national level (mainland and overseas regions) to meet national and European requirements (e.g. the European directive establishing an Infrastructure for Spatial Information in the European Community, INSPIRE).
The SINP standard was developed by a dedicated working group, representative of biodiversity stakeholders in France, and underwent a major overhaul this year. The standard focuses on core attributes that characterize taxon occurrences. Interoperability at the international level was achieved by remaining consistent with the Darwin Core standard and the INSPIRE European directive on geographical data.
The presentation will explain how the new version of the taxon occurrence standard allows for more information than just a simple taxon occurrence, and on what grounds. It will show how the standard was tailored to users’ very diverse needs, in accordance with the decisions made regarding data exchange with GBIF for an acceptable standardization effort. It will also show how the standard was mapped to the Event Core of the Darwin Core in order to ensure interoperability as far as possible.
SINP standards are more restrictive than the Darwin Core and INSPIRE formats, and some of the issues we encountered depend on the direction of data flow (toward SINP or toward an external standard). These issues include the absence of a corresponding term, and terms that correspond only partially or only for certain elements. Our presentation will show how mapping was done to ensure correct data transfer.
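As a simplified illustration of such term mapping (not the actual SINP specification; the source attribute names below are placeholders), a one-way translation to Darwin Core that also tracks attributes with no counterpart might look like this in Python:

    # Placeholder mapping table: SINP-style attribute -> Darwin Core term.
    SINP_TO_DWC = {
        "cdNom": "taxonID",
        "nomCite": "scientificName",
        "dateDebut": "eventDate",
    }

    def to_dwc(sinp_record):
        """Return the mapped Darwin Core record plus any unmapped attributes."""
        dwc, unmapped = {}, {}
        for attr, value in sinp_record.items():
            target = SINP_TO_DWC.get(attr)
            (dwc if target else unmapped)[target or attr] = value
        return dwc, unmapped

    record, leftovers = to_dwc({"cdNom": "60585", "nomCite": "Vulpes vulpes", "regionCode": "FR-84"})
    print(record)     # mapped Darwin Core terms
    print(leftovers)  # attributes needing a convention before transfer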


Friday December 9, 2016 11:45 - 12:00 CST
Computer Science 3 Computer Science

11:45 CST

Elicitation Techniques for Acquiring Biodiversity Knowledge
Traditionally, knowledge is kept by individuals and not by institutions. This weakens an institution’s ability to progress and be competitive. Evidence in the literature suggests gaps in the processes of knowledge elicitation and acquisition.
For a complex domain such as biodiversity, new mechanisms are needed to acquire, record and manage knowledge, preferably with a high level of expressiveness, including tacit knowledge. There is academic consensus that tacit knowledge can add semantics to structural instruments of knowledge. In this research, the tacit knowledge considered is scientific. This knowledge is not necessarily formalizable, but must be capable of systematization, associated with a logical process. In this domain, experts may not have the necessary skills to carry out the process of acquiring knowledge without the participation of an analyst.
The problem of knowledge communication and transfer amongst individuals within an organization must be dealt with. The open question is: how can we establish the ideal conditions that allow experts to communicate their knowledge? Much of the power of human expertise is the result of experience, gained over years and represented as heuristics. Often the expertise becomes so ingrained that experts have difficulty describing specific tasks. In other cases, the knowledge is distributed throughout the organization and mostly resides in the minds of experts.
Lack of attention to the differences between experts, and to the level of knowledge they possess, can affect the efficiency of the knowledge elicitation process and the quality of the knowledge acquired. The kind of knowledge that needs to be elicited must be considered too.
To navigate the variety of Knowledge Elicitation Techniques (KETs), it is necessary to identify the most appropriate method for a particular situation. It must be considered that there are different kinds of knowledge, experts and expertise; different ways of representing knowledge, which can help the elicitation, validation and reuse of knowledge; and different ways to use knowledge, so that the elicitation process can be guided by the intended use of the elicited knowledge. KETs should therefore be chosen appropriately to meet these contingencies.
Among the available taxonomies of KETs, we consider only individual tacit knowledge elicitation methods that permit the participation of an analyst. The interview is the method used in the context of this research. The elicited knowledge must be stored and managed for further use; an architecture to register the elicited knowledge is under development.


Friday December 9, 2016 11:45 - 12:00 CST
Auditorium CTEC

12:00 CST

Catalogue of Life, China and Taxonomic Tree Tool
Since 2008, the Species 2000 China Node, with the support of the Biodiversity Committee of the Chinese Academy of Sciences (CAS), has organized scientists to compile and release the Catalogue of Life, China (CoL China) each year. It follows the Standard Data Set of Species 2000’s global Catalogue of Life to collect and release Chinese species data. To meet local requirements, a Chinese formal name and its Pinyin (a romanized form) are appended to species records. The data items include the accepted scientific name, Chinese name, synonyms, common names, latest taxonomic scrutiny, source database, family, classification above family, highest taxon in the database, distribution, and references. A dynamic distribution map can be shown for each species in the checklist. The CoL China 2016 Annual Checklist was released on May 22, 2016, the International Day for Biological Diversity. We developed a platform for species data collection and a Taxonomic Tree Tool (TTT) for data analysis, which integrates animal data with plant and microbial data into annual checklists and maintains the CoL China database system. The groups of species in the 2016 Annual Checklist, with their numbers of accepted species names, are Animalia (35,905), Bacteria (469), Chromista (2,239), Fungi (3,488), Plantae (41,940), Protozoa (1,729) and Viruses (805).
TTT is a web-based platform for managing and comparing taxonomic trees. It allows users to create their own taxonomic trees in any of four ways: inputting them manually, uploading XML, manually selecting taxa from template trees provided by TTT, or automatically selecting taxa from template trees according to a species list. Users can share their trees with registered users and compare them with public trees. TTT provides a comparison tool that highlights the spots where more attention should be paid by taxonomists or informatics scientists. The comparison tool explores the relationships between taxa from two different trees and classifies the differences into types of relationships. It helps to find differences in the taxonomic positions of taxon A and taxon B, and marks them explicitly. Furthermore, it calculates the similarity of branches from the two compared trees to help taxonomists judge whether the chosen taxon groups are the same, and whether it is necessary to continue drilling down the taxonomic trees to explore more differences. TTT can extract the common or differing parts of two compared trees, and the result can be exported for further tree integration research.
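A toy Python sketch of this comparison idea: flag taxa whose taxonomic position differs between two trees, and score branch similarity with the Jaccard index (one plausible measure; TTT's actual algorithms may differ). The trees and names, including the deliberately wrong 'Lynxidae', are invented.

    # Each tree maps a taxon to its parent.
    tree_a = {"Panthera": "Felidae", "Lynx": "Felidae", "Canis": "Canidae"}
    tree_b = {"Panthera": "Felidae", "Lynx": "Lynxidae", "Canis": "Canidae"}

    moved = [t for t in tree_a if t in tree_b and tree_a[t] != tree_b[t]]
    print(moved)  # ['Lynx'] - same taxon, different taxonomic position

    def branch_similarity(children_a, children_b):
        """Jaccard similarity of two branches' child sets (1.0 = identical)."""
        return len(children_a & children_b) / len(children_a | children_b)

    felidae_a = {t for t, p in tree_a.items() if p == "Felidae"}
    felidae_b = {t for t, p in tree_b.items() if p == "Felidae"}
    print(branch_similarity(felidae_a, felidae_b))  # 0.5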

Speakers

Friday December 9, 2016 12:00 - 12:15 CST
Computer Science 3 Computer Science

12:15 CST

The BioCASe Monitor Service 2 – New Features for Monitoring Progress and Quality of Data Provision through Distributed Data Networks
In Europe, the Provider Software of the Biological Collection Access Service (BioCASe; http://biocase.org/) is widely used XML wrapper software for exposing biodiversity data to the Web. Its interface allows for connecting different database management systems in order to map collection data to various XML schemas, like the Access to Biological Collection Data (ABCD) schema and its extensions. Data can be harvested by data portals in the preferred data standard format by communicating with BioCASe’s Web Service over the BioCASe Protocol.
Data providers, data network coordinators and data aggregators occasionally have to deal with a large number of BioCASe data sources and need to keep track of the data sources’ most important metadata, or the compliance of the respective mappings to external conventions. For this reason, the BioCASe Monitor Service (BMS) was developed as a complementary monitoring tool in 2012. It has been called into action in different German and European projects in order to provide a general overview of the project’s data sources, to facilitate their maintenance and improve their mappings. These projects are for example the German Federation for Biological Data (GFBio; http://www.gfbio.org), Opening Up the Natural History Heritage for Europeana (OpenUp! project; http://open-up.eu), Geosciences Collection Access Service (GeoCASe; http://geocase.eu) and the German GBIF node (GBIF-D, http://www.gbif.de).
In particular, the interface of the BMS 'Mapping Checker' is very useful for checking plausibility and compliance of a data source when strict quality guidelines of a data portal need to be met: the BMS checks the mapping against the desired target schema and visualizes potential issues. Thus, data network coordinators and aggregators are supported comprehensively and save time during the process of quality control. Now, the BMS can be configured in an even simpler and more flexible way using the recently developed new user interface and backend. The BMS is freely available under GPL 3.0 license*.
In our presentation we will give an overview of the assets of the BioCASe Monitor Service and its new features (e.g. automatically generated landing pages for datasets and a Web Service Application Program Interface (API) for automation), and will demonstrate some use cases. We will provide an outlook on the next steps of rolling out the tool to the international community. An additional demonstration session will present basic hands-on use and will be open for discussing new use cases and implementations.
* https://www.gnu.org/licenses/gpl-3.0.en.html
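The core of such a mapping check is compact enough to sketch. The Python fragment below compares the concepts a data source has mapped against a portal's mandatory list; the ABCD-style concept paths are abbreviated and the mapped set is invented example data, so this illustrates the idea rather than the BMS implementation.

    # Mandatory concepts a target portal might require (abbreviated ABCD-style paths).
    REQUIRED = {
        "/DataSets/DataSet/Units/Unit/UnitID",
        "/DataSets/DataSet/Units/Unit/SourceInstitutionID",
        "/DataSets/DataSet/Units/Unit/Identifications/Identification/Result/ScientificName",
    }

    # Concepts actually mapped in one (invented) data source.
    mapped = {
        "/DataSets/DataSet/Units/Unit/UnitID",
        "/DataSets/DataSet/Units/Unit/SourceInstitutionID",
    }

    for concept in sorted(REQUIRED - mapped):
        print("missing mandatory concept:", concept)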


Friday December 9, 2016 12:15 - 12:30 CST
Computer Science 3 Computer Science

12:30 CST

Friday Lunch
Lunch is served in the Cafeteria (also known as "Soda").

Friday December 9, 2016 12:30 - 14:00 CST
Cafeteria Cafeteria

14:00 CST

Closing Session
CTEC Auditorium

Friday December 9, 2016 14:00 - 14:30 CST
Auditorium CTEC

14:30 CST

Examining citizen scientist engagement and transcription rates through site content and social media activity on the Notes from Nature platform
Notes from Nature (http://www.notesfromnature.org/; NFN) is a citizen science tool focused on public engagement and label transcription of natural history specimens. The project was developed collaboratively by biodiversity scientists, curators, and experts in citizen science. The project is hosted within the well-established Zooniverse platform (https://www.zooniverse.org/). NFN, first released in April 2013, was re-launched in June 2016. The project has been successful, with over 10,000 registered participants providing 1,350,000 transcriptions. Prior to the relaunch, the site featured large, open-ended sets of images. Since the relaunch of the new NFN platform, the site has hosted smaller expeditions for citizen scientists to work on. These expeditions are sets of images, numbering from a few hundred to a few thousand, that are grouped based on a geographic or taxonomic theme. An example is Pollinator Plants of Virginia, U.S.A., which focused on plants and their animal pollinators and contained close to 5,000 herbarium images. We predicted that the new expedition model would be more engaging and meaningful to our volunteers. Further, these expeditions provide better recruitment opportunities, since they can be aligned with volunteer interests. We compare statistics between the two approaches to see how effective our new system is at engaging volunteers and increasing transcription rates, examining metrics such as the number of transcribers versus the number of expeditions. We also examine the impact of our social media activities on transcription rates. This kind of information will assist managers of citizen science projects in creating efficient ways to engage citizen science volunteers and increase transcription rates.


Friday December 9, 2016 14:30 - 14:35 CST
Auditorium CTEC

14:35 CST

Toward a new data standard for combined marine biological and environmental datasets - expanding OBIS beyond species occurrences.
The Ocean Biogeographic Information System (OBIS) grows with millions of new species observations every year. Contributions come from a network of hundreds of institutions, projects and individuals with common goals: to build a scientific knowledge base that is open to the public for scientific exploration and discovery, and to detect trends and changes that inform society as essential elements in conservation management. Until now, OBIS has focused solely on the collection of biogeographic data (the presence of marine species in space and time) and operated with quality control procedures and data standards specifically targeted to these data. Based on requirements from the growing OBIS community for data archiving and scientific applications, OBIS launched the OBIS-ENV-DATA project to enhance its data standard by accommodating additional data types. The proposed standard allows for the management of sampling methodology, animal tracking and telemetry data, and environmental measurements such as nutrient concentrations, sediment characteristics and other abiotic parameters measured during sampling to characterize the environment from which biogeographic data was collected. The new OBIS data standard builds on the Darwin Core Archive and on practices adopted by the Global Biodiversity Information Facility (GBIF). It consists of an Event Core in combination with an Occurrence Extension and a proposed enhancement to the MeasurementOrFact Extension. This new structure enables the linkage of measurements or facts - quantitative or qualitative properties - to both sampling events and species occurrences and includes additional fields for property standardization. The OBIS standard also embraces the use of the new Darwin Core term parentEventID, enabling a sampling event hierarchy. We believe that the adoption of this new data standard for managing and sharing biological and associated environmental datasets by the international community will be key to improving the effectiveness of the knowledge base, and will enhance integration and management of critical data needed to understand ecological and biological processes in the ocean.
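A minimal sketch of the proposed structure, with invented values: a cruise event, a child sampling event linked via parentEventID, an occurrence tied to the child event, and MeasurementOrFact rows linked to either the event or the occurrence (linking measurements to occurrences is the proposed enhancement).

    # Event core: a sampling hierarchy expressed with parentEventID.
    events = [
        {"eventID": "cruise1",      "parentEventID": None,      "eventDate": "2016-07"},
        {"eventID": "cruise1:stn4", "parentEventID": "cruise1", "eventDate": "2016-07-12"},
    ]

    # Occurrence extension: species observations hang off sampling events.
    occurrences = [
        {"occurrenceID": "occ1", "eventID": "cruise1:stn4",
         "scientificName": "Calanus finmarchicus"},
    ]

    # MeasurementOrFact rows: an occurrenceID column lets a measurement attach
    # to a specific occurrence rather than to the whole sampling event.
    measurements = [
        {"eventID": "cruise1:stn4", "occurrenceID": None,
         "measurementType": "water temperature", "measurementValue": "8.1",
         "measurementUnit": "degC"},
        {"eventID": "cruise1:stn4", "occurrenceID": "occ1",
         "measurementType": "abundance", "measurementValue": "120",
         "measurementUnit": "individuals/m3"},
    ]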


Friday December 9, 2016 14:35 - 14:40 CST
Auditorium CTEC

14:40 CST

Introducing LepNet - the Lepidoptera of North America Network
This lightning talk will announce and introduce "LepNet", the project "Lepidoptera of North America Network: Documenting Diversity in the Largest Clade of Herbivores." The project website is http://lep-net.org/ and the LepNet portal is available at http://symbiota4.acis.ufl.edu/scan/lepnet/portal/index.php. LepNet is a newly funded Thematic Collections Network in North America and also, in an important sense, a continuation and strengthening of SCAN - the Symbiota Collections of Arthropods Network (http://symbiota4.acis.ufl.edu/scan/portal/).
LepNet comprises 26 research collections that will digitize approximately 2 million specimen records and integrate these with over 1 million existing records. LepNet will digitize 43,280 larval vial records with host plant data, making this the first significant digitization of larvae in North American collections. LepNet will produce ca. 82,000 high-quality images of exemplar species covering 60% of North American lepidopteran species. These images will enhance remote identifications and facilitate systematic, ecological, and global change research. In collaboration with Visipedia, LepNet will create LepSnap, a computer vision tool that can provide automated identifications to the species level. Museum volunteers and student researchers equipped with smartphones will contribute >132,000 additional research-quality images through LepSnap. Up to 5,000 lepidopteran species will be elevated to a "research ready" status suitable for complex, data-driven analyses. LepNet will build on the existing data portal (SCAN), consolidating data on Lepidoptera to support research on the evolution of lepidopteran herbivores in North America. Access to these data will be increased through integration with iDigBio. Data will be generated for a broad range of research, including the evolutionary ecology of Lepidoptera and their host plants in the context of global change processes affecting biogeographic distributions. The LepXPLOR! program will spearhead education and outreach efforts for 67 existing programs, engaging a diverse, nationwide workforce of 400+ students and 3,500+ volunteers. Overall, LepNet will generate a sustainable social-research network dedicated to the creation and maintenance of a digital collection of North American Lepidoptera specimens.


Friday December 9, 2016 14:40 - 14:45 CST
Auditorium CTEC

14:45 CST

New developments for the Libraries of Life project and app
Libraries of Life is a collaborative project between the Biodiversity Knowledge Integration Center (BioKIC) at Arizona State University (https://biokic.asu.edu/) and iDigBio (Integrated Digitized Biocollections; https://www.idigbio.org/). The project and related app series are novel and popular learning tools that promote immersive engagement with biological specimens and their data through AR (Augmented Reality) collection cards that interact with 3D specimen models. Following a first, successful release using a wide range of exemplary specimens from more than 15 Thematic Collections Networks created in North America, Libraries of Life now enters a new phase, broadening its reach, adding functions, and developing new sustainability solutions.
More information about Libraries of Life, and links to download Apple and Android apps, are available at http://www.libraries-of-life.org/.
Initial observations on various user populations suggest that the tool supports object-based pedagogies while promoting enhanced understanding of biodiversity and the structural morphologies of diverse species. Our lightning talk will briefly review current and future milestones of this fast-developing collections outreach tool.


Friday December 9, 2016 14:45 - 14:50 CST
Auditorium CTEC

14:50 CST

Updates on multiple Neotropical Symbiota portals - STRI, Flora, and Arthropods
We will present updates on the growth and further development of three Neotropical-themed Symbiota portals:
In an effort to create distributed information networks to support in-country research, Arizona State University (ASU) and the Smithsonian Tropical Research Institute (STRI) are collaborating to establish virtual environments that support distributed, occurrence-based research collection communities. Several biodiversity research data portals have been created using the Symbiota software (see http://bdj.pensoft.net/articles.php?id=1114). These portals have networked biodiversity resources from more than 35 collections and research datasets. The STRI biodiversity portals currently consist of two networked portals that integrate specimen and field observation data from numerous research projects, and feature a newly designed, collaborative glossary module, "TaxaGloss" (http://stricollections.org/portal/glossary/) - an illustrated, multilingual glossary to help biodiversity researchers and conservation practitioners understand the technical terms used to describe marine organisms.
Symbiota is designed to provide effective collection management and networking opportunities particularly for small- and medium-sized collections with very limited informatics resources - a condition that is common at many Neotropical institutions. Symbiota promotes open web access and use of voucher-based data for diverse research and learning applications. Our long-term objective is to create comprehensive Neotropical collection networks that can deliver access to high-quality occurrence data for a region whose biota are singularly diverse and in need of powerful informatics solutions that help communities explore and understand their biodiversity heritage.


Friday December 9, 2016 14:50 - 14:55 CST
Auditorium CTEC

14:55 CST

A New Georeferencing Tool
A common problem when dealing with biodiversity data and information collected over a period of time is a lack of standardisation and precision. One example of this, which we have encountered at the South African National Biodiversity Institute, is the difficulty of associating locality information written on specimens with actual positions on a map. For example, “5 mi SW of Calledon” is difficult and time-consuming to resolve into a decimal degree latitude and longitude point, particularly if the specimen was collected 80 years ago (and the place name has changed), or if the collector has misspelled the name of the town “Caledon”, etc. We have developed a georeferencing web application to manage and speed up this process. In this talk we will give a brief overview of the design and functionality of the web application.
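The geometric core of resolving such a description is a destination-point calculation: offset a known place by the stated distance and bearing. A minimal Python sketch using the standard great-circle formula (the coordinates for Caledon, South Africa are approximate):

    import math

    EARTH_RADIUS_KM = 6371.0

    def offset(lat, lon, distance_km, bearing_deg):
        """Destination point from a start point, a distance and a bearing."""
        lat1, lon1, brng = map(math.radians, (lat, lon, bearing_deg))
        d = distance_km / EARTH_RADIUS_KM  # angular distance
        lat2 = math.asin(math.sin(lat1) * math.cos(d) +
                         math.cos(lat1) * math.sin(d) * math.cos(brng))
        lon2 = lon1 + math.atan2(math.sin(brng) * math.sin(d) * math.cos(lat1),
                                 math.cos(d) - math.sin(lat1) * math.sin(lat2))
        return math.degrees(lat2), math.degrees(lon2)

    # "5 mi SW of Caledon": 5 miles = 8.05 km on a bearing of 225 degrees.
    print(offset(-34.23, 19.43, 5 * 1.60934, 225.0))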


Friday December 9, 2016 14:55 - 15:00 CST
Auditorium CTEC

15:00 CST

Closing Session (continued)
CTEC Auditorium

Friday December 9, 2016 15:00 - 15:30 CST
Auditorium CTEC

15:30 CST

Friday PM Break & Remove Posters
Friday December 9, 2016 15:30 - 16:00 CST
Lobby CTEC
 

