Importing Existing Data: Great Lakes Invasives
Before fully implementing the Zoological Museum’s Specify, I used the UW mollusks dataset from the Great Lakes Invasives Network to test our prototype. Because that Symbiota portal publishes its data as Darwin Core Archive (DwC-A) files via iDigBio, the import process was much easier than it would otherwise have been.
A taxonomic thesaurus was needed before the specimens could be imported. Instead of importing the full Animalia tree from ITIS, I chose to import only the taxa associated with specimens in the source database. The same approach will apply to most collections imported into the prototype, because the taxonomy is sometimes not fully registered in ITIS or other taxonomic repositories. Because Symbiota does not provide an automated process for exporting the taxonomic tree, I used the DwC-A files to generate the trees as needed.
Generating taxonomic thesaurus with DwC-A files and OpenRefine
- Created a new project in OpenRefine, importing the DwC-A with only the columns pertinent to the taxonomic thesaurus import (`kingdom`, `phylum`, `class`, `order`, `family`, `scientificName`, `taxonID`, `scientificNameAuthorship`, `genus`, `specificEpithet`, `taxonRank`, `infraspecificEpithet`, `taxonRemarks`).
- Selected row mode by clicking `Show as: rows`.
- Sorted by `scientificName` or `taxonID` (by clicking the column arrow, then `sort`).
- Selected the `Sort` menu, then clicked `Reorder rows permanently`.
- On either `scientificName` or `taxonID`, clicked the column arrow, then `Edit cells`, then `Blank down`, to blank out repeated values.
- Clicked the `scientificName` or `taxonID` column arrow, then `Facet`, then `Text facet`. In the facet panel on the left, selected the `(blank)` facet to select all records that were cleared.
- Selected the `All` column arrow, then `Edit rows`, then `Remove all matching rows` to remove the duplicates.
- Clicked the `(blank)` facet again and excluded it from the dataset.
- Exported as `.csv` to manually edit the hierarchy in MS Excel.
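For repeat imports, the sort, blank down, and remove steps above could also be scripted. The following is a minimal Python sketch of the same de-duplication, not the procedure actually used here. It assumes the occurrence table has already been extracted from the DwC-A as CSV rows; the function name `unique_taxa` and the sample rows are hypothetical.

```python
import csv
import io

# Columns kept for the thesaurus, as listed in the first OpenRefine step.
TAXON_COLUMNS = [
    "kingdom", "phylum", "class", "order", "family", "scientificName",
    "taxonID", "scientificNameAuthorship", "genus", "specificEpithet",
    "taxonRank", "infraspecificEpithet", "taxonRemarks",
]


def unique_taxa(rows, key="scientificName"):
    """Return one row per distinct `key` value, sorted by `key`.

    Rows with an empty `key` are skipped, and only TAXON_COLUMNS are kept.
    """
    seen = set()
    result = []
    for row in sorted(rows, key=lambda r: (r.get(key) or "")):
        value = (row.get(key) or "").strip()
        if not value or value in seen:
            continue  # blank or already seen: same effect as blank down + remove
        seen.add(value)
        result.append({col: (row.get(col) or "") for col in TAXON_COLUMNS})
    return result


if __name__ == "__main__":
    # Tiny in-memory stand-in for the occurrence file (hypothetical data).
    sample = io.StringIO(
        "scientificName,family,catalogNumber\n"
        "Dreissena polymorpha,Dreissenidae,1\n"
        "Corbicula fluminea,Cyrenidae,2\n"
        "Dreissena polymorpha,Dreissenidae,3\n"
    )
    for taxon in unique_taxa(csv.DictReader(sample)):
        print(taxon["scientificName"], taxon["family"])
```

Passing `key="taxonID"` mirrors the alternative of de-duplicating on `taxonID` instead of `scientificName`.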
Preparing taxonomic thesaurus in MS Excel
The .csv exported from the DwC-A has to be modified so that the manual import into Symbiota runs smoothly. The thesaurus should be imported in two steps. First, the higher taxonomy is added, from kingdom down to tribe. The order matters here: a higher rank must always be imported before its children. Second, all taxa at genus rank and below are added.
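As an illustration of the two-step order, the sketch below splits a list of taxa into a higher-taxonomy batch and a genus-and-below batch, each sorted from highest to lowest rank. It assumes every row carries a DwC `taxonRank` value. Only `kingdom=10` and `species=220` come from this post; the other rank numbers are assumed defaults and should be checked against the portal's own `taxonunits` table. The function name `split_batches` is hypothetical.

```python
# Rank numbers: kingdom and species are from the text; the rest are
# ASSUMED default values and must be verified against `taxonunits`.
RANK_IDS = {
    "kingdom": 10, "phylum": 30, "class": 60, "order": 100,
    "family": 140, "tribe": 160, "genus": 180,
    "species": 220, "subspecies": 230,
}
GENUS_RANKID = RANK_IDS["genus"]


def rank_of(taxon):
    """Look up the numeric rank for a row's taxonRank (case-insensitive)."""
    return RANK_IDS[taxon["taxonRank"].strip().lower()]


def split_batches(taxa):
    """Return (higher, lower): kingdom..tribe first, then genus and below.

    Both batches are sorted so parents always come before their children.
    """
    ordered = sorted(taxa, key=rank_of)  # stable sort, highest rank first
    higher = [t for t in ordered if rank_of(t) < GENUS_RANKID]
    lower = [t for t in ordered if rank_of(t) >= GENUS_RANKID]
    return higher, lower
```

A rank missing from `RANK_IDS` raises a `KeyError`, which is deliberate: it is safer to stop than to import a taxon at a guessed position.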
The following fields can be mapped into Symbiota using the graphic interface:
- `kingdom`
- `phylum`
- `class`
- `order`
- `family`
- `scinameinput`: required, full scientific name with or without author
- `sciname`: full scientific name without author
- `author`: author of taxon
- `parentstr`: strongly recommended for building the hierarchy correctly, parent taxon’s `sciname`
- `rankid`: number assigned to the rank of the taxon in table `taxonunits` (for instance, `kingdom=10` and `species=220`)
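To show how the DwC columns might feed these fields, here is a rough Python sketch that builds one import row per taxon. It assumes `scientificName` holds the name without author and `scientificNameAuthorship` holds the author, and it only handles ranks from kingdom to species. The helper names are hypothetical, and all rank numbers other than `kingdom=10` and `species=220` are assumptions to verify against `taxonunits`.

```python
HIGHER_COLUMNS = ["kingdom", "phylum", "class", "order", "family"]

# kingdom and species values are from the text; the others are ASSUMED.
RANK_IDS = {
    "kingdom": 10, "phylum": 30, "class": 60, "order": 100,
    "family": 140, "genus": 180, "species": 220,
}


def parent_name(taxon, rank):
    """Guess the parent's sciname from the nearest filled higher column."""
    if rank == "species":
        return (taxon.get("genus") or "").strip()
    if rank == "genus":
        candidates = HIGHER_COLUMNS
    else:
        candidates = HIGHER_COLUMNS[:HIGHER_COLUMNS.index(rank)]
    for column in reversed(candidates):
        value = (taxon.get(column) or "").strip()
        if value:
            return value
    return ""  # a kingdom has no parent


def to_symbiota_row(taxon):
    """Map one DwC taxon row to the name, parent, and rank fields."""
    rank = taxon["taxonRank"].strip().lower()
    sciname = taxon["scientificName"].strip()
    author = (taxon.get("scientificNameAuthorship") or "").strip()
    return {
        "scinameinput": f"{sciname} {author}".strip(),
        "sciname": sciname,
        "author": author,
        "parentstr": parent_name(taxon, rank),
        "rankid": RANK_IDS[rank],
    }
```

The parent guess is a heuristic; rows whose intermediate ranks are missing in the source data still need a manual check before import.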
Tip: remember to leave empty the rank columns at or below the rank of the taxon being imported. This means the spreadsheet generated with OpenRefine had to be manually cleaned for the higher taxa (for example, the record for class Bivalvia had the columns for class, order, and family cleared).
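The clean-up described in the tip could be sketched in Python as follows. This is an illustration under assumptions, not the manual Excel procedure: it relies on a DwC `taxonRank` column, only covers the five rank columns listed above, and the function name `blank_own_and_lower_ranks` is hypothetical.

```python
HIGHER_COLUMNS = ["kingdom", "phylum", "class", "order", "family"]


def blank_own_and_lower_ranks(taxon):
    """Return a copy of the row with rank columns at or below its own rank emptied.

    Rows ranked below family (genus, species, ...) are returned unchanged,
    because all five higher-rank columns sit above them.
    """
    rank = taxon["taxonRank"].strip().lower()
    cleaned = dict(taxon)  # leave the input row untouched
    if rank in HIGHER_COLUMNS:
        for column in HIGHER_COLUMNS[HIGHER_COLUMNS.index(rank):]:
            cleaned[column] = ""
    return cleaned


if __name__ == "__main__":
    bivalvia = {
        "kingdom": "Animalia", "phylum": "Mollusca", "class": "Bivalvia",
        "order": "Myida", "family": "Dreissenidae",
        "scientificName": "Bivalvia", "taxonRank": "Class",
    }
    print(blank_own_and_lower_ranks(bivalvia))
```

For the Bivalvia record this keeps `kingdom` and `phylum` and empties `class`, `order`, and `family`, matching the example in the tip.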