tony rees irmng 2015 presentation

Constructing a Biodiversity database: the IRMNG* project (*Interim Register of Marine and Nonmarine Genera) Tony Rees New South Wales, Australia (previously: CSIRO Marine & Atmospheric Research, Hobart) with Leen Vandepitte, Bart Vanhoorne & Wim Decock VLIZ (Flanders Marine Institute, Belgium)

Post on 14-Apr-2017

Page 1: Tony Rees IRMNG 2015 presentation

Constructing a Biodiversity database: the IRMNG* project

(*Interim Register of Marine and Nonmarine Genera)

Tony Rees New South Wales, Australia

(previously: CSIRO Marine & Atmospheric Research, Hobart)

with Leen Vandepitte, Bart Vanhoorne & Wim Decock

VLIZ (Flanders Marine Institute, Belgium)

Page 2: Tony Rees IRMNG 2015 presentation

Biodiversity Informatics – NOT Bioinformatics

Bioinformatics – mainly concerned with genomics / DNA sequencing

Biodiversity Informatics – Biodiversity as a computable resource, e.g. …
◦ Numbers of species, genera, families, etc. in the world; what are their names, where and when published (and by whom)
◦ A navigable “taxonomic backbone” for biodiversity data systems
◦ Name recognition / spell checking
◦ A machine-searchable repository of attributes (traits) e.g. marine/nonmarine, extant/fossil, more…
◦ Cross referencing old names (synonyms) to current names
◦ Tracking taxonomic dynamism (new names / taxonomic changes through time)

Page 3: Tony Rees IRMNG 2015 presentation

What is this “Biodiversity” anyway?

One answer: “the variety of plant and animal life in the world or in a particular habitat”

For a number of reasons, may wish to include fossil as well as extant taxa in a comprehensive treatment:
◦ Enable a view of evolutionary processes / serve more than one community
◦ New names must not be preoccupied in either extant or fossil realms
◦ Provide a facility to distinguish between extant and fossil taxa in stored or incoming data.

Page 4: Tony Rees IRMNG 2015 presentation

An “artistic” representation…

http://www.evogeneao.com/tree-of-life/tree-of-life.htm

Page 7: Tony Rees IRMNG 2015 presentation

… need (1) a compendium of, and (2) a means to navigate / query this information (big task!)

Page 10: Tony Rees IRMNG 2015 presentation

Some metrics…

~1.9m valid, extant species + ?0.3m fossils – say 2.2m valid names total, + more for synonyms (maybe another 2-3m…)

…no figures available for genera, but guesstimate would be around 10% of these values – maybe 200k valid, another 200-300k synonyms
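The back-of-envelope estimate on this slide can be written out as a short Python sketch. All figures are the slide's own rough numbers; the variable names are illustrative only, and the 10% genus fraction is the slide's guesstimate rather than a measured value.

```python
# Rough estimates quoted on the slide, in millions of names.
valid_extant_species = 1.9
valid_fossil_species = 0.3           # flagged with "?" on the slide
synonym_species_range = (2.0, 3.0)   # "maybe another 2-3m"

valid_species_total = valid_extant_species + valid_fossil_species

# Slide's guesstimate: genus counts are roughly 10% of the species figures.
GENUS_FRACTION = 0.10
valid_genera_k = valid_species_total * GENUS_FRACTION * 1000
synonym_genera_k = tuple(x * GENUS_FRACTION * 1000 for x in synonym_species_range)

print(f"valid species: ~{valid_species_total:.1f}m")
print(f"valid genera:  ~{valid_genera_k:.0f}k")
print(f"genus synonyms: ~{synonym_genera_k[0]:.0f}k to {synonym_genera_k[1]:.0f}k")
```

Taken literally, 10% of 2.2m gives about 220k valid genera, which the slide rounds to "maybe 200k".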

Page 11: Tony Rees IRMNG 2015 presentation

Taxonomy is not static…

~20,000 new species described every year, also ~2,000 new genera, dozens/hundreds of families – no single source of these at this time (but IPNI, ION* help)

Probably >1,000 species and some genus names move into/out of synonymy each year – specialist databases/publications keep track of these to a degree

Major new treatments of particular groups e.g. flowering plants (APG I/II/III), birds, protists can radically affect higher taxonomy.

*International Plant Name Index / Index of Organism Names

Page 12: Tony Rees IRMNG 2015 presentation

Why use a database to manage this information?

Print works – need a physical copy (or pdf), no integration, limited computer querying

Wikipedia / Google / Google Scholar – ad hoc treatments, no/few common terms, hierarchies set up for human reading but not computers, cannot detect data gaps or generate lists

Databases (“relational databases”) – standardise content and relations between data items enabling rapid and efficient queries, e.g. …
◦ Search millions of entries in seconds or milliseconds
◦ Easy checks for data inconsistencies e.g. same author names spelled different ways, “extant” species in “fossil” genus, etc.
◦ Produce summary statistics on-the-fly
◦ Provide intuitive navigation tools e.g. drill down or go up tree, etc., all derived from the data (no separate maintenance overhead)
◦ Enter the data once, query / report in multiple ways according to current and future user needs / ideas
◦ Support both human- and machine-based queries (=data services).
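As a minimal sketch of two of the points above (an inconsistency check and an on-the-fly summary), the following uses Python's built-in sqlite3 module. The two-table schema, the column names and the genus "Examplesaurus" with its species are invented for illustration and are not the IRMNG schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genus (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    is_fossil INTEGER NOT NULL          -- 1 = fossil, 0 = extant
);
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    genus_id INTEGER NOT NULL REFERENCES genus(id),
    epithet TEXT NOT NULL,
    is_fossil INTEGER NOT NULL
);
""")

# "Examplesaurus" and its species are hypothetical test records.
conn.executemany("INSERT INTO genus VALUES (?, ?, ?)", [
    (1, "Homo", 0), (2, "Physeter", 0), (3, "Examplesaurus", 1),
])
conn.executemany("INSERT INTO species VALUES (?, ?, ?, ?)", [
    (1, 1, "sapiens", 0),
    (2, 2, "macrocephalus", 0),
    (3, 3, "primus", 1),
    (4, 3, "dubius", 0),   # extant species in a fossil genus: inconsistent
])

# Inconsistency check: "extant" species placed in a "fossil" genus.
inconsistent = conn.execute("""
    SELECT g.name || ' ' || s.epithet
    FROM species s JOIN genus g ON g.id = s.genus_id
    WHERE g.is_fossil = 1 AND s.is_fossil = 0
""").fetchall()

# Summary statistics on the fly: number of species held per genus.
summary = conn.execute("""
    SELECT g.name, COUNT(s.id)
    FROM genus g LEFT JOIN species s ON s.genus_id = g.id
    GROUP BY g.id, g.name
    ORDER BY g.name
""").fetchall()

print(inconsistent)
print(summary)
```

Both answers are derived from the same stored rows, which is the "enter once, query in multiple ways" point made on the slide.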

Page 13: Tony Rees IRMNG 2015 presentation

Work in progress elsewhere

Catalogue of Life – attempt to “stitch together” existing species-level databases for particular groups, where these exist
◦ Extant only (2002-15), fossils just starting
◦ Where no global species database/s (GSDs), no data (even for genera/families)
◦ ~45% complete for species in 2006, 85% in 2015 (less so for synonyms and genus names)

Paleobiology Database (previously PaleoDB) – community-based activity for fossil taxa & occurrences
◦ Does include genus, family level information, also content types not in CoL e.g. references, geological range, collecting localities
◦ 50,000(?) names in 2006, 320,000 names in 2015 (all ranks), degree of completeness not known

Page 14: Tony Rees IRMNG 2015 presentation

Index of Organism Names (ION) – based on Zoological Record ongoing indexing of the scientific literature (big task – 5,000 journals scanned + more)
◦ Contains 5m+ animal names (extant + fossil), but mix of “clean” and “dirty” data (including many duplicates)
◦ Proprietary product: cannot obtain entire database (or portion of it) as an export file although can query via web
◦ Some “key” content e.g. abstracts, authorship of publications, user-defined searches, is behind paywall (limited to subscribers to Zoological Record) and protected by IP assertions
◦ ION IDs a useful concept for subset of the names (those linked to primary publication instances) – potential for re-use in other systems.

Page 15: Tony Rees IRMNG 2015 presentation

Other resources – genus level/ species level (examples)

Nomenclator Zoologicus, Index Nominum Genericorum are fairly complete genus-level compilations (to ~2004) for animals and plants, respectively (including fossils)

Other databases attempt to be complete for specific groups (including species as well as genera) – many updated on ongoing basis

Page 16: Tony Rees IRMNG 2015 presentation

The IRMNG project – 2006-current

Genera easier than species (10x fewer to catalogue – 0.5m vs. 5m), still useful because part of every species name e.g. Homo sapiens, Physeter macrocephalus

Use most of the resources mentioned (especially genus level compilations) to provide as comprehensive as possible “complete” survey of biodiversity (no gaps), i.e. plant + animal, extant + fossil, also prokaryotes and viruses – initially for OBIS use

Some sources (e.g. the Plant List, PaleoBioDB, recent versions of Cat. of Life) not yet included, could be added in future

“Missing” animal species names could be included from other resources as time available (but lower priority)

“Interim” = some limitations, but useable in the absence of anything better or more “finished”.

2015:
- 479,000/488,000 genus names (valid + synonyms + “not known”)
- 1.9m species names (incl. 0.6m synonyms)
- extant plus fossil taxa, all groups, most with marine/nonmarine and extant/fossil flags

Page 17: Tony Rees IRMNG 2015 presentation

Initial clients / drivers

OBIS (Ocean Biogeographic Information System) / CoML (Census of Marine Life) – need to discriminate marine from nonmarine, extant from fossil names in incoming data

CSIRO Marine Research – complement / extension to existing “marine, Australian only” dataset (CAAB) – including cross checks and new information/records

Author’s personal interest / background – including experience with algae, higher plants, marine fauna, palynology and microfossils – plus challenge of creating something new and useful

Subsequent interest from other projects – including GBIF, Atlas of Living Australia (ALA), SeaLifeBase, Global Names, Encyclopedia of Life, Open Tree of Life, WoRMS and more.

Page 18: Tony Rees IRMNG 2015 presentation

• Terminal tips are species (extant or extinct i.e. fossil)
• Groups of species are genera
• Groups of genera are families
• Groups of families are orders, then classes, then phyla, then kingdoms (intermediate ranks e.g. superfamilies, subphyla could be added later if desired)

Result: like this, but with real names (data)
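The ranked hierarchy described on this slide can be sketched as a simple parent/child tree. The `Taxon` class and its methods are hypothetical names chosen for illustration, and the placements below follow a conventional textbook classification rather than being read from IRMNG.

```python
class Taxon:
    """One node in a ranked hierarchy (kingdom down to species)."""

    def __init__(self, name, rank, parent=None):
        self.name = name
        self.rank = rank
        self.parent = parent
        self.children = []

    def add(self, name, rank):
        """Create a child taxon and return it, so calls can be chained."""
        child = Taxon(name, rank, parent=self)
        self.children.append(child)
        return child

    def lineage(self):
        """Go up the tree: names from the root down to this taxon."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))

    def descendants(self, rank):
        """Go down the tree: names of all descendants at the given rank."""
        found = []
        for child in self.children:
            if child.rank == rank:
                found.append(child.name)
            found.extend(child.descendants(rank))
        return found


animalia = Taxon("Animalia", "kingdom")
mammalia = animalia.add("Chordata", "phylum").add("Mammalia", "class")
homo = mammalia.add("Primates", "order").add("Hominidae", "family").add("Homo", "genus")
sapiens = homo.add("Homo sapiens", "species")
physeter = mammalia.add("Cetacea", "order").add("Physeteridae", "family").add("Physeter", "genus")
physeter.add("Physeter macrocephalus", "species")

print(" > ".join(sapiens.lineage()))
print(mammalia.descendants("genus"))
```

Intermediate ranks such as superfamily would slot in as extra nodes without changing either traversal.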

Page 20: Tony Rees IRMNG 2015 presentation

Example in Mammalia (species omitted):

(etc.)

• Mammal genus syns. are well resolved (varies with group)
• Mix of extant and fossil (†) taxa
• Species completeness variable (extant currently much better than fossil)
• Some genus names not yet in families, just in placeholder categories like “Mammalia – awaiting allocation” – to be addressed in due course (part of “Interim” nature of project)

Page 21: Tony Rees IRMNG 2015 presentation

Present IRMNG content…

Page 22: Tony Rees IRMNG 2015 presentation

IRMNG content sources

Initial lists of families from Parker and Benton print works (1982, 1983) for extant and fossil taxa, respectively; updated for many groups from more recent sources

Genus names from
◦ Taxonomicon/SN200 (private Dutch compilation, 2006)
◦ Nomenclator Zoologicus (animals, to 2004)
◦ Index Nominum Genericorum, The Plant List and GRIN (plants, to c. 2010)
◦ more animal genera from ION/Zoological Record (to c. 2009) + updates in progress
◦ Index Fungorum (2009 + 2013 update)
◦ prokaryotes and viruses from LPSN (2009) and virus DB (2006)
◦ more…

Species from
◦ Catalogue of Life (2006 version)
◦ Australian Faunal Directory (2007 version)
◦ NZ Biodiversity Inventory (2008 preprint) – 56k living + 15k fossil species
◦ Aphia/ERMS/WoRMS (2006 + 2013 update) – 220k valid species + 159k synonyms
◦ Joel Hallan Biology Catalog (2012 version)
◦ Museum Victoria KEmu database (2006 version)
◦ Print sources for some fossil groups
◦ more…

Page 23: Tony Rees IRMNG 2015 presentation

OK, so what can “the system” do…?

Page 24: Tony Rees IRMNG 2015 presentation

… Input a name / set of names, get back a “standard suite” of IRMNG data for each name (taxonomic placement, extant/fossil and marine/nonmarine status, more)

Page 25: Tony Rees IRMNG 2015 presentation

… Traverse the taxonomic hierarchy: enter at top or any point, go down/up/sideways etc. (also resolve synonyms to current valid names, where known)
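A minimal sketch of the two behaviours just described, returning a standard set of fields per input name and resolving synonyms to the current valid name, might look like the following. The record layout, the function names and the placeholder genera "Aus", "Bus" and "Cus" are all assumptions made for illustration; none of this is the IRMNG data model.

```python
# Hypothetical records keyed by genus name. "Aus", "Bus", "Cus" and the
# family "Aidae" are placeholder names, not real taxa.
RECORDS = {
    "Aus": {"status": "valid", "family": "Aidae", "marine": True, "extant": True},
    "Bus": {"status": "synonym", "valid_name": "Aus"},
    "Cus": {"status": "synonym", "valid_name": "Bus"},   # chained synonym
}


def resolve(name, records=RECORDS):
    """Follow synonym links until a valid name is reached, else None."""
    seen = set()
    current = name
    while current in records and current not in seen:
        seen.add(current)                  # guards against circular links
        record = records[current]
        if record["status"] == "valid":
            return current
        current = record.get("valid_name")
    return None


def lookup(names, records=RECORDS):
    """Return (found, not_found) for a list of input names."""
    found, not_found = {}, []
    for name in names:
        valid = resolve(name, records)
        if valid is None:
            not_found.append(name)
        else:
            found[name] = {"valid_name": valid, **records[valid]}
    return found, not_found


found, not_found = lookup(["Bus", "Cus", "Zus"])
print(found)
print(not_found)
```

Keeping unmatched names in a separate list leaves room for a later near-match pass over just those names.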

Page 26: Tony Rees IRMNG 2015 presentation

… See original publication details for “most” genera, link to same in ION for a subset of species

Page 27: Tony Rees IRMNG 2015 presentation

… Get authorship and check correct spelling for “most” names as held at genus and species level – spell checking is with author’s custom “Taxamatch” algorithm (world leading system)
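To illustrate the general idea of near matching for spell checking, here is a small sketch that uses plain Levenshtein edit distance. This is a stand-in technique chosen for brevity and is not the Taxamatch algorithm itself; the function names, the distance threshold of 2 and the misspelled query are assumptions, while the target names are taken from the demo list in this deck.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_matches(query, targets, max_distance=2):
    """Return (distance, name) pairs within max_distance, closest first."""
    q = query.lower()
    hits = []
    for target in targets:
        distance = levenshtein(q, target.lower())
        if distance <= max_distance:
            hits.append((distance, target))
    return sorted(hits)


TARGETS = ["Mytilus edulis", "Mytilus trossulus", "Ostrea edulis", "Pecten maximus"]

print(near_matches("Mytilus edullis", TARGETS))   # one extra "l" in the query
print(near_matches("Pectin maximus", TARGETS))    # "i" typed for "e"
```

A production matcher would need more than raw edit distance (for example, handling of authorities and genus/species parts separately), which this sketch does not attempt.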

Page 28: Tony Rees IRMNG 2015 presentation

… Generate custom lists of names/taxa e.g. filter by taxonomic group, year or author name

Page 30: Tony Rees IRMNG 2015 presentation

IRMNG current status / location: in transition

Author’s original version (2006-current) at CSIRO Hobart, at www.cmar.csiro.au/datacentre/irmng/ - software and content will not be developed further, but has some custom features not yet in VLIZ-hosted version as below

New version / location (2015 on, still under test) at Flanders Marine Institute (VLIZ), Belgium, home of WoRMS + OBIS: www.marinespecies.org/irmng/ (new content will be added to this version)

[Photos: CSIRO Marine Labs, Hobart; Flanders Marine Institute, Oostende]

Page 31: Tony Rees IRMNG 2015 presentation

IRMNG web access point (CSIRO version): www.cmar.csiro.au/datacentre/irmng

(paste names list here)

Page 32: Tony Rees IRMNG 2015 presentation

A quick demo: names from Hillis et al. “Circle of Life”

http://www.zo.utexas.edu/faculty/antisense/downloadfilestol.html

Page 33: Tony Rees IRMNG 2015 presentation

You are here

Page 34: Tony Rees IRMNG 2015 presentation

Example copy-and-paste into IRMNG query – what are these critters?

Page 35: Tony Rees IRMNG 2015 presentation

IRMNG web access point (CSIRO version): www.cmar.csiro.au/datacentre/irmng

Corculum cardissa
Galeomma takii
Ostrea edulis
Crassostrea virginica
Nerita albicilla
Mytilus edulis
Mytilus trossulus
Mytilus galloprovincialis
Mytilus californianus
Geukensia demissa
Mimachlamys varia
Chlamys hastata
Crassadoma gigantea
Pecten maximus
Argopecten gibbus
Argopecten irradians
Placopecten magellanicus
Chlamys islandica
Atrina pectinata
Arca noae
Barbatia virescens
Acanthopleura japonica
Lepidochitona corrugata
Lepidozona coreanica
Eohemithyris grayii
Platidia anomioides
Stenosarina crosnieri
Gryphus vitreus
Thecidellina blochmanii
Cancellothyris hedleyi
Terebratulina retusa
Liothyrella neozelanica
Liothyrella uva
Gwynia capsula
Calloria inconspicua
Gyrothyris mawsoni
Neothyris parva
Terebratalia transversa
Macandrevia cranium
Fallax neocaledonensis
Laqueus californianus
Megerlia truncata
Terebratella sanguinea
Notosaria nigricans
Hemithyris psittaceae
Neocrania anomala
Neocrania huttoni
Discina striata
Glottidia pyramidata
Lingula lingua
Lingula anatina
Phoronis architecta
Phoronis psammophila
Phoronis vancouverensis
Alboglossiphonia heteroclita
Hirudo medicinalis
Haemopis sanguisuga
Barbronia weberi
Eisenia fetida
Lumbricus rubellus
Dero digitata
Xironogiton victoriensis
Sathodrilus attenuatus
Nereis virens
Aphrodita aculeata
Nereis limbata
Capitella capitata
Harmothoe impar
Sabella pavonina

Page 36: Tony Rees IRMNG 2015 presentation

IRMNG real time search result (<5 secs for 1000 input names)

(etc. …)

[Screenshot annotations: full name + authority; family; classification; traits; IRMNG data source; remarks, synonymy where known]

Page 37: Tony Rees IRMNG 2015 presentation

“Names not found” presented in a separate section – including near matches when present:

(etc. …)

Page 38: Tony Rees IRMNG 2015 presentation

Other web functions include search/match at higher taxonomic levels e.g. genus, family; traverse taxonomic hierarchy; filter by taxonomic group; search by author, year, more; partial search e.g. by 3 or more leading characters…

At database level, can do many other custom searches e.g. look for genera with no children or verified source (could be misspellings), look for genera awaiting family allocation, etc. etc.
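The two database-level searches mentioned above can be sketched with Python's built-in sqlite3 module. The schema and column names are assumptions, the genera "Aus", "Buss" and "Cus" and the family "Aidae" are invented placeholders, and "no children or verified source" is read here as "neither children nor a verified source".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genus (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    family TEXT NOT NULL,
    source_verified INTEGER NOT NULL    -- 1 = found in a major compilation
);
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    genus_id INTEGER NOT NULL REFERENCES genus(id),
    epithet TEXT NOT NULL
);
""")

# "Aus", "Buss", "Cus" and "Aidae" are hypothetical placeholder records.
conn.executemany("INSERT INTO genus VALUES (?, ?, ?, ?)", [
    (1, "Homo", "Hominidae", 1),
    (2, "Aus", "Mammalia – awaiting allocation", 1),
    (3, "Buss", "Aidae", 0),   # no species, unverified: possible misspelling
    (4, "Cus", "Aidae", 0),    # unverified, but has a species
])
conn.executemany("INSERT INTO species VALUES (?, ?, ?)", [
    (1, 1, "sapiens"), (2, 2, "bus"), (3, 4, "dus"),
])

# Genera with neither child species nor a verified source.
suspects = [row[0] for row in conn.execute("""
    SELECT g.name
    FROM genus g LEFT JOIN species s ON s.genus_id = g.id
    WHERE s.id IS NULL AND g.source_verified = 0
    ORDER BY g.name
""")]

# Genera still held in a placeholder category rather than a family.
awaiting = [row[0] for row in conn.execute("""
    SELECT name FROM genus
    WHERE family LIKE '%awaiting allocation'
    ORDER BY name
""")]

print(suspects)
print(awaiting)
```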

VLIZ copy is intended to support multiple remote editing (different editors for particular groups), as per current WoRMS (World Register of Marine Species), allowing for future distribution of effort, also leverage of data entry into WoRMS down the track…

(NB conscious effort here to transition IRMNG from a “single editor” project to more collective ownership / ongoing maintenance & development; will share software environment with WoRMS going forward…)

Page 40: Tony Rees IRMNG 2015 presentation

Selected IRMNG / Taxamatch user responses…

“I've never seen someone's chin hit the floor so quickly when I showed them IRMNG. I have been asked to show everyone in the WAM [West Australian Museum] about it, and send out an email to let them know it is there. At the morning tea it was discussed and there was a good level of excitement… Part of this work is reviewing taxonomic names in the Kimberley. We are going to keep an eye out for any names that we identify as errors and can feed that back.”

– Piers Higgs, Gaia Resources, Australia, 21 May 2009

“IRMNG is the most useful web tool I've ever used - I also used it to place Micropalaeontology genera into families (ostracods and foraminifera) - Here there is always a single match in these groups, sometimes also a match in Mollusca and Arthropoda, but I just use the former ones because I know that these names are of specimens belonging to those groups… Thank you very much for developing this fantastic tool.”

– Willem Coetzer, South African Institute for Aquatic Biodiversity, 2 April 2011

Page 41: Tony Rees IRMNG 2015 presentation

IRMNG to-do tasks – in no particular order!

Fill gap in more recent genus names (e.g. 2008-present) – start made over past 12 months (9k animal genera added to 2011, additional 5k in pipeline)

Improve taxonomic resolution for ~100k genus names not yet placed to family (big task, also many may be older synonyms)

Check ~10k presently “unverified” genus names – mix of misspellings and “good” names not in major compilations

Revisit and do additional QA on higher taxonomy (Kingdom -> Family) using most recent sources

Add in newly available “quality” species lists: The Plant List (2010/2013), PaleoBioDB, Catalogue of Life post 2006

Update prokaryote and virus species lists

Get additional animal species names from elsewhere e.g. ION (big task!), fungal species from Index Fungorum / Mycobank

Think about data flows in the bigger picture – who tracks new names, who wants them, how to notify and transport between projects with similar needs.

Page 42: Tony Rees IRMNG 2015 presentation

Future path for IRMNG

Intention is to create a new future for IRMNG with the handover to VLIZ

VLIZ with Tony will investigate best ongoing use for the system, synergies with WoRMS etc.

Other projects now active in this space e.g. EOL, OTOL, GBIF, Global Names all have “own systems” – possible scope for further collaboration and discussions

Watch this space…

Thank you!

Tony Rees
Leen Vandepitte + team at VLIZ

Page 43: Tony Rees IRMNG 2015 presentation

Additional slides

Page 44: Tony Rees IRMNG 2015 presentation

IRMNG content (genus level) as at mid 2015 – valid names plus synonyms/misspellings as held

Hexapoda (Insects): 175k (of which 7.5k fossil)
◦ includes Coleoptera 62k, Lepidoptera 30k, Hemiptera 21k, Diptera 21k, Hymenoptera 19k

Vertebrata: 55k (incl. 15.5k fossil)
◦ Mammalia 13k (incl. 6.9k fossil)
◦ Aves 13k (incl. 0.6k fossil)
◦ Pisces 19k (incl. 4.1k fossil)
◦ Reptilia 8k (incl. 3.2k fossil)
◦ Amphibia 2k (incl. 0.6k fossil)

Land Plants: (Mosses -> Angios) 50k (incl. 8.6k fossil) – includes Angiosperms 40k

Mollusca: 41k (incl. 17.2k fossil)
Chelicerata: 20k (incl. 1.1k fossil)
Crustacea: 20k (incl. 4.8k fossil)
Fungi: 17k (incl. 0.5k fossil)
Protista (excl. Algae): 13k (incl. 4.2k fossil)
“Algae”: 9k (incl. 2.7k fossil)
Cnidaria: 9k (incl. 4.1k fossil)
Echinodermata: 8k (incl. 4.3k fossil)
Platyhelminthes: 7k (incl. 0.2k fossil)
Brachiopoda: 6k (incl. 5.9k fossil)
Trilobitomorpha: 6k (all fossil)
Porifera: 5k (incl. 2.1k fossil)
Nematoda: 5k (incl. 0.1k fossil)
Bacteria/Cyanos/Archaea: 3k (incl. 0.4k fossil)
smaller invert. groups: 0k -> 3k each
Viruses: 0.4k (0 fossil)

Approx. mean ratio of species to genus names varies from around 5:1 through 10:1 (Hexapoda) to 20:1 (Land plants)

Page 45: Tony Rees IRMNG 2015 presentation

IRMNG status (end 2014)

>469,000 genus names (incl. 94,000 fossil) and 1.9m species names held (1.3m valid, 0.6m synonyms) – latter from CoL, WoRMS and elsewhere e.g. national lists

Genus coverage good up to ~pub. yr. 2002 (2,000+ per year), then declines:

2002: 2,488
2003: 2,385
2004: 1,942
2005: 1,372
2006: 1,554
2007: 1,656
2008: 1,232
2009: 859
2010: 478
2011: 335
2012: 253
2013: 38
2014: 0

Indicates backlog exists to be filled, see next slide

N.B., ~100k animal genus names from Nomenclator Zoologicus still need allocation to family (presently e.g. “Mammalia”, “Mollusca” only)

Page 47: Tony Rees IRMNG 2015 presentation

Next data upload (mid 2015)

2014-2015 activity: parsing and preparing c.9k missed animal genus names from ION database (Zoological Record) to 2012, via R. Page “Bionames” data dump

Genus names by publication year:

Year   Existing   Add     New batch #2
2002   2,488      108
2003   2,385      64
2004   1,942      625
2005   1,372      939
2006   1,554      1,002
2007   1,656      1,033
2008   1,232      1,277   39
2009   859        1,661   31
2010   478        1,661   81
2011   335        962     595
2012   253        0       1,790
2013   38         0       1,783
2014   0          0       1,187

Need an ongoing mechanism to trap new genus names as published (also need new botanical genera c.2009 onwards)
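As a quick cross-check of the per-year figures on this slide, the sketch below totals each column. The numbers are transcribed from the slide; the dictionary layout and names are illustrative, and years with no "New batch #2" value shown (2002 to 2007) are assumed to contribute zero.

```python
# Genus names per publication year: (existing, add, new batch #2).
# Batch #2 is assumed to be 0 where the slide shows no value.
coverage = {
    2002: (2488, 108, 0),
    2003: (2385, 64, 0),
    2004: (1942, 625, 0),
    2005: (1372, 939, 0),
    2006: (1554, 1002, 0),
    2007: (1656, 1033, 0),
    2008: (1232, 1277, 39),
    2009: (859, 1661, 31),
    2010: (478, 1661, 81),
    2011: (335, 962, 595),
    2012: (253, 0, 1790),
    2013: (38, 0, 1783),
    2014: (0, 0, 1187),
}

existing_total = sum(row[0] for row in coverage.values())
add_total = sum(row[1] for row in coverage.values())
batch2_total = sum(row[2] for row in coverage.values())

# Combined coverage per year once both uploads are applied.
combined = {year: sum(row) for year, row in coverage.items()}

print(f"existing 2002-2014: {existing_total:,}")
print(f"add column:         {add_total:,}")
print(f"new batch #2:       {batch2_total:,}")
print(f"2009 combined:      {combined[2009]:,}")
```

The "Add" column sums to 9,332, in line with the "c.9k" quoted on the slide, and the second batch sums to 5,506, in line with the "additional 5k in pipeline" on the to-do slide.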

Page 48: Tony Rees IRMNG 2015 presentation

Tony Rees solution: “Taxamatch”, developed 2007 onwards

Page 49: Tony Rees IRMNG 2015 presentation

Taxamatch operation is a 9 (10) step process

[Diagram labels: input name; target names; near matches]

Page 50: Tony Rees IRMNG 2015 presentation

Successful submission based on IRMNG, Taxamatch and 2 other systems developed by the author, 2002-2014

(2014 GBIF keynote address available on YouTube)