GBIF data analysis
Format
- GBIF Infrastructure: Data processing has a detailed description of the flow
- occurrences.txt is an improved/cleaned/formalized verbatim.txt
- metadata
  - meta.xml lists the column data types etc. for all files in the zip!
    - the column links lead to DCMI: DCMI Metadata Terms
  - metadata.xml holds things like the download DOI, license, number of rows, etc.
- The .zips are in Darwin Core Archive format: FAQ
import vaex as vx
df = vx.read_csv(DS_LOCATION, convert="verbatim.hdf5", progress=True, sep="\t", quotechar=None, quoting=3, chunk_size=500_000)
- Quoting is disabled (quotechar=None, quoting=3) because some values contain both single and double quotes, so neither ' nor " works as quotechar.
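The quoting problem can be reproduced on a tiny scale with the standard library's csv module, whose csv.QUOTE_NONE constant has the value 3 (the same number passed as quoting=3). This is only a sketch: the sample rows and the read_rows helper are made up for illustration and do not come from a real GBIF download.

```python
import csv
import io

# Hypothetical tab-separated sample: one value starts with an unbalanced
# double quote, others contain single quotes, as free-text fields may.
SAMPLE = (
    "gbifID\tlocality\trecordedBy\n"
    "1\t\"Big Rock, north side\tO'Brien\n"
    "2\tnear the 'old' mill\tSmith\n"
)


def read_rows(text, **csv_kwargs):
    """Parse tab-separated text and return all rows as lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", **csv_kwargs))


# Default quoting: the field-initial " opens a quoted field, which then
# absorbs the following tabs and line breaks into that single field.
default_rows = read_rows(SAMPLE)

# Quoting disabled: quote characters are kept as ordinary data,
# so every line splits into its three columns.
plain_rows = read_rows(SAMPLE, quoting=csv.QUOTE_NONE)

for row in plain_rows:
    print(row)
```

Comparing default_rows with plain_rows shows the difference: only the second variant keeps one row per input line.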
Tools
- GBIF .zip parser lib:
- BelgianBiodiversityPlatform/python-dwca-reader: 🐍 A Python package to read Darwin Core Archive (DwC-A) files.
- Tried it, took a long time both for the zip and directory, so I gave up
- gbif/pygbif: GBIF Python client
- API client, can also do graphs etc., neat!
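When a dedicated parser is too slow, a quick look inside the archive is possible with the standard library alone. The sketch below assumes nothing beyond the download being an ordinary zip file; list_members and read_member_head are hypothetical helper names, and the in-memory archive only stands in for a real download.

```python
import io
import zipfile


def list_members(zip_source):
    """Return (name, uncompressed size in bytes) for each file in the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def read_member_head(zip_source, member, n_lines=5):
    """Return the first n_lines of one member without extracting it to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            return [line.rstrip("\n") for _, line in zip(range(n_lines), text)]


# Hypothetical in-memory archive standing in for a real GBIF download.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("meta.xml", "<archive/>")
    archive.writestr("verbatim.txt", "gbifID\tscientificName\n1\tParus major\n")

members = list_members(demo)
head = read_member_head(demo, "verbatim.txt", n_lines=1)
print(members)
print(head)
```

With a real download, the path to the .zip would be passed instead of the BytesIO object.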
Analysis
Things to try:
- limit the number of columns to the 'interesting' ones through the usecols parameter of pd.read_csv()
- optionally take a smaller subset of the dataset and drop all NaNs
- take column indexes from meta.xml
- See if someone already did this: BelgianBiodiversityPlatform/python-dwca-reader: 🐍 A Python package to read Darwin Core Archive (DwC-A) files.
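Taking the column indexes from meta.xml could look roughly like the sketch below. It assumes the usual Darwin Core Archive descriptor layout (a core element with field children carrying index and term attributes, in the http://rs.tdwg.org/dwc/text/ namespace). The shortened META_XML, the INTERESTING set, and the column_indexes helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened meta.xml; a real one lists every column.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrences.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://purl.org/dc/terms/license"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
  </core>
</archive>"""

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def column_indexes(meta_xml, wanted):
    """Map short term names (last URL segment) to their column index."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwc:core", NS)
    indexes = {}
    for field in core.findall("dwc:field", NS):
        name = field.get("term").rsplit("/", 1)[-1]
        # Fields that only carry a default value may have no index; skip them.
        if name in wanted and field.get("index") is not None:
            indexes[name] = int(field.get("index"))
    return indexes


# Assumed set of 'interesting' columns, chosen only for this example.
INTERESTING = {"scientificName", "decimalLatitude"}
cols = column_indexes(META_XML, INTERESTING)
usecols = sorted(cols.values())
print(cols, usecols)
```

The resulting usecols list of integer positions can then be passed as usecols= to pd.read_csv(), together with sep="\t" and quoting=3; nrows= is one way to load only a smaller subset before dropping the NaNs.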