07 Jun 2023

GBIF data analysis

Format

GBIF Infrastructure: Data processing has a detailed description of the flow
- occurrence.txt is an improved/cleaned/formalized version of verbatim.txt
- metadata
  - meta.xml has a list of all columns, their data types, etc.
    - for all files in the zip!
    - the column links lead to DCMI: DCMI Metadata Terms
  - metadata.xml is things like download doi, license, number of rows, etc.
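A minimal sketch of pulling the column list out of meta.xml with the standard library. It assumes the usual Darwin Core Archive descriptor layout (a `core` element and optional `extension` elements in the `http://rs.tdwg.org/dwc/text/` namespace, each with `files/location` and `field index=... term=...` children). `list_columns` and `SAMPLE_META` are names made up for this example, and the sample is hand-written, not taken from a real download.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive descriptors (meta.xml).
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def list_columns(meta_xml: str) -> dict:
    """Map each data file named in meta.xml to sorted (column index, term name) pairs."""
    root = ET.fromstring(meta_xml)
    columns = {}
    for table in root:  # <core> and <extension> elements
        location = table.find(f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location")
        if location is None:
            continue
        fields = []
        for field in table.findall(f"{DWC_TEXT_NS}field"):
            index = field.get("index")
            if index is None:  # a field with only a default value has no column
                continue
            term = field.get("term", "")
            # Keep only the last URI segment, e.g. ".../terms/scientificName".
            fields.append((int(index), term.rsplit("/", 1)[-1]))
        columns[location.text.strip()] = sorted(fields)
    return columns


# Hand-written sample, much shorter than a real GBIF meta.xml.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.gbif.org/terms/1.0/gbifID"/>
    <field index="1" term="http://purl.org/dc/terms/license"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>"""

for filename, fields in list_columns(SAMPLE_META).items():
    print(filename, fields)
```

For a real archive, the same function should work on the text of meta.xml read from the zip (e.g. via `zipfile.ZipFile(...).read("meta.xml").decode()`), as long as the descriptor follows that layout.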

The .zip downloads are in Darwin Core Archive (DwC-A) format: FAQ

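Reading the files with quote handling enabled fails: some fields contain both single and double quotes, so neither ' nor " works as quotechar. Disabling quoting altogether (quoting=3) gets around this: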

import vaex as vx

df = vx.read_csv(DS_LOCATION, convert="verbatim.hdf5", progress=True, sep="\t", quotechar=None, quoting=3, chunk_size=500_000)
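For reference, `quoting=3` is the numeric value of `csv.QUOTE_NONE`. A small sketch with the standard-library `csv` module shows what it does: quote characters are treated as ordinary text. The sample rows and the `read_rows` helper are invented for this illustration.

```python
import csv
import io

# Invented tab-separated sample: fields mix single and double quotes,
# and one field starts with a double quote that is never closed.
SAMPLE = (
    "gbifID\tlocality\n"
    "1\t5\" pipe near O'Brien's farm\n"
    "2\t\"unbalanced quote\n"
)


def read_rows(text: str) -> list:
    """Parse tab-separated text with quote handling switched off."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_rows(SAMPLE)
print(csv.QUOTE_NONE)  # the constant behind quoting=3
for row in rows:
    print(row)
```

With quoting switched off, every line stays one row and the quote characters end up in the field values unchanged. With the default quoting, a field that starts with `"` would open a quoted section that can swallow the following tabs and newlines.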

Tools

GBIF .zip parser lib:
- BelgianBiodiversityPlatform/python-dwca-reader: 🐍 A Python package to read Darwin Core Archive (DwC-A) files.
  - Tried it; it took a long time both for the zip and for the extracted directory, so I gave up
- gbif/pygbif: GBIF Python client
  - API client, can also do graphs etc., neat!

Analysis

Things to try:

- ~~limit the number of columns to the ‘interesting’ ones through the usecols parameter of pd.read_csv¹~~
- optionally take a smaller subset of the dataset and drop all NaNs
- take column indexes from meta.xml
- See if someone already did this: BelgianBiodiversityPlatform/python-dwca-reader: 🐍 A Python package to read Darwin Core Archive (DwC-A) files.
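A rough sketch of the first three ideas combined, using pandas: read only some columns by position, cap the number of rows, and drop rows with missing values. The sample data, the `INTERESTING` positions and the `nrows` value are all made up here; in practice the positions would come from meta.xml.

```python
import csv
import io

import pandas as pd

# Made-up stand-in for a GBIF occurrence file (tab-separated, with a header row).
SAMPLE = (
    "gbifID\tlicense\tscientificName\tdecimalLatitude\n"
    "1\tCC_BY_4_0\tParus major\t50.85\n"
    "2\tCC0_1_0\t\t\n"
    "3\tCC_BY_4_0\tPica pica\t51.05\n"
)

# Column positions to keep (assumed; e.g. looked up in meta.xml).
INTERESTING = [0, 2, 3]

df = pd.read_csv(
    io.StringIO(SAMPLE),     # a file path works here as well
    sep="\t",
    quoting=csv.QUOTE_NONE,  # same as quoting=3
    usecols=INTERESTING,     # only these columns are parsed
    nrows=1000,              # smaller subset of the dataset
)

# Empty fields are read as NaN by default, so this drops incomplete rows.
subset = df.dropna()
print(subset)
```

Selecting by position avoids depending on the exact header spelling, at the cost of having to keep the positions in sync with the archive.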

¹ pandas.read_csv, pandas 2.0.2 documentation

In the middle of the desert I can say anything I want.
