serhii.net

In the middle of the desert you can say anything you want

07 Jun 2023

GBIF data analysis

Format

  • GBIF Infrastructure: Data processing has a detailed description of the flow
    • occurrence.txt is an improved/cleaned/formalized version of verbatim.txt
    • metadata
      • meta.xml has a list of all columns, their data types, etc.
      • metadata.xml has things like the download DOI, license, number of rows, etc.
  • The .zip downloads are in Darwin Core Archive format: FAQ
    • Quoting has to be disabled when reading them, because some fields contain both single and double quotes, so neither ' nor " works as quotechar:
    import vaex as vx  # quoting=3 below is csv.QUOTE_NONE
    df = vx.read_csv(DS_LOCATION, convert="verbatim.hdf5", progress=True, sep="\t", quotechar=None, quoting=3, chunk_size=500_000)
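
To see why quoting is switched off in the read_csv call above, here is a minimal sketch with the standard-library csv module instead of vaex. The sample rows are made up for illustration, not real GBIF data; the only fact carried over is that quoting=3 is csv.QUOTE_NONE.

```python
import csv
import io

# quoting=3 in the read_csv call is this constant.
assert csv.QUOTE_NONE == 3

# Hypothetical sample: one field starts with an unmatched double quote,
# another contains an apostrophe.
SAMPLE = (
    "gbifID\tlocality\trecordedBy\n"
    "1\t\"unfinished note\tO'Brien\n"
    "2\tplain\tSmith\n"
)


def read_rows(text, **csv_kwargs):
    """Parse tab-separated text and return all rows as lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", **csv_kwargs))


# Default quoting: the stray " opens a quoted field that swallows
# the following tabs and newlines, so rows get merged.
rows_default = read_rows(SAMPLE)

# Quoting disabled: quote characters are ordinary data.
rows_no_quoting = read_rows(SAMPLE, quoting=csv.QUOTE_NONE)

print(len(rows_default), len(rows_no_quoting))
```

With quoting disabled every line stays one row with three fields; with the default settings the second and third lines collapse into a single broken row.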
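
The column list in meta.xml can be read programmatically. A rough sketch, assuming the usual Darwin Core Archive descriptor layout (an archive root with a core element, a files/location entry and indexed field elements). The XML below is a hypothetical, heavily shortened sample, and core_columns is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened meta.xml; a real download lists every column.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>"""

# Namespace of the Darwin Core text descriptor.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def core_columns(meta_xml):
    """Return (data file name, {column index: short term name}) for the core file."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwc:core", NS)
    location = core.find("dwc:files/dwc:location", NS).text
    columns = {
        int(field.get("index")): field.get("term").rsplit("/", 1)[-1]
        for field in core.findall("dwc:field", NS)
    }
    return location, columns


location, columns = core_columns(SAMPLE_META)
print(location, columns)
```

For a real archive, pass the contents of meta.xml instead of SAMPLE_META; the resulting index-to-name mapping can be compared against the header row of the data file.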
    

Tools

Analysis

Things to try:
