TDWG 2016 has ended
Wednesday, December 7 • 17:00 - 17:15
Data Quality at Scale: Bridging the Gap between Datum and Data

This talk offers a practical look at implementing high-throughput, high-volume data quality processing, with the goal of providing efficient and effective data quality feedback at the scale of an aggregator holding tens of millions of records. Topics include the tradeoffs between coverage and accuracy; using the Apache Spark processing framework to iterate rapidly on data quality workflows across large volumes of data; and methods for effectively capturing the results of large-scale data quality work for distribution back to data providers. The examples are drawn from work the iDigBio team has done implementing data quality workflows across all of the data we have collected, and the talk also compares and contrasts our methods with those of other projects.

Wednesday December 7, 2016 17:00 - 17:15 CST
Auditorium CTEC