Day 1619
You can add underscores to numbers in Python
TIL that for readability, `x = 100000000` can be written as `x = 100_000_000` etc.! Works for all kinds of numbers: ints, floats, hex etc.!
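For instance (plain Python, 3.6+):

```python
# Underscores are purely visual; the values are identical.
x = 100_000_000
assert x == 100000000

# Works in float, hex, binary and octal literals too.
pi_ish = 3.141_592
mask = 0xFF_FF
flags = 0b1010_1010

print(x, pi_ish, hex(mask), bin(flags))
```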
Vaex as faster pandas alternative
I have a larger-than-usual text-based dataset and need to do analysis; pandas is slow (hell, even `wc -l` takes 50 seconds…).
Vaex: Pandas but 1000x faster - KDnuggets - that’s a way to catch one’s attention.
Reading files
I/O Kung-Fu: get your data in and out of Vaex — vaex 4.16.0 documentation
- `vx.from_csv()` reads a CSV into memory; kwargs get passed to pandas' `read_csv()`
- `vx.open()` reads stuff lazily, but I can't find a way to tell it that my `.txt` file is a CSV, and more critically, how to pass params like `sep` etc.
- `vx.from_ascii()` has a parameter called `seperator`?! API documentation for vaex library — vaex 4.16.0 documentation
- The first two support `convert=`, which converts the data to formats like HDF5; optionally, `chunk_size=` is the chunk size in lines. It'll create $N/chunk\_size$ chunks and concat them together at the end.
- Ways to limit stuff:
  - `nrows=` is the number of rows to read; works with `convert=` etc.
  - `usecols=` limits to columns by name or id; speeds things up too
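A vaex-free sketch of what `chunk_size=`, `nrows=` and `usecols=` amount to conceptually, using only the stdlib (the function and its data are made up for illustration): parse N rows at a time, optionally cut rows/columns, and concat the chunks at the end.

```python
import csv
import io

def read_in_chunks(text, chunk_size=2, nrows=None, usecols=None):
    """Roughly what chunked CSV ingestion does: parse chunk_size rows
    at a time, then concatenate the chunks at the end."""
    reader = csv.DictReader(io.StringIO(text))
    chunks, current = [], []
    for i, row in enumerate(reader):
        if nrows is not None and i >= nrows:
            break  # nrows=: stop after this many data rows
        if usecols is not None:
            # usecols=: keep only the interesting columns
            row = {k: v for k, v in row.items() if k in usecols}
        current.append(row)
        if len(current) == chunk_size:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    # "concat together at the end"
    return [row for chunk in chunks for row in chunk]

data = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n"
rows = read_in_chunks(data, chunk_size=2, nrows=3, usecols={"a", "b"})
```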
Writing files
I can do `df.export_hdf5()` in vaex, but pandas can't read that. It may be related to the opposite problem: vaex can't open pandas HDF5 files directly, because one saves them as rows, the other as columns. (See the FAQ)
Doing stuff
Syntax is similar to pandas, but the documentation is somehow… can't put my finger on it, but I don't enjoy it.
Stupid way to find columns that are all NA
```python
l_desc = df.describe()
# Keep columns whose NA count is NOT the length of the dataset
# (i.e. columns that aren't entirely NA)
not_empty_cols = list(l_desc.T[l_desc.T.NA != df.count()].T.columns)
# Filter the description by them
interesting_desc = l_desc[not_empty_cols]
```
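The same idea in plain pandas, on a toy DataFrame (invented data; here the NA counting is done directly with `isna()` instead of going through `describe()`):

```python
import pandas as pd

df = pd.DataFrame({
    "good": [1, 2, 3],
    "empty": [None, None, None],   # all-NA column
    "partial": [1, None, 3],
})

# NA count per column; a column is "empty" if every value is NA.
na_counts = df.isna().sum()
not_empty_cols = list(na_counts[na_counts != len(df)].index)

interesting = df[not_empty_cols]
```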
Using a virtual environment inside jupyter
Use Virtual Environments Inside Jupyter Notebooks & Jupter Lab [Best Practices]
Create and activate it as usual, then:

```shell
python -m ipykernel install --user --name=myenv
```
Timing stuff in jupyter
Difference between %time and %%time in Jupyter Notebook - Stack Overflow
- When measuring execution time, `%time` refers to the line after it; `%%time` refers to the entire cell.
- As we remember:
  - real/wall: the 'consensus reality' time
  - user: the process CPU time (the time it spent doing stuff)
  - sys: the operating system CPU time due to system calls from the process (interactions with the system, r/w etc.)
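The wall vs. CPU distinction can be seen outside Jupyter with the stdlib: `time.perf_counter()` measures wall time, while `time.process_time()` measures the process's CPU time (user + sys), so sleeping advances the former but barely the latter.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)   # waiting costs wall time, but (almost) no CPU time

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")
```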
GBIF data analysis
Format
- GBIF Infrastructure: Data processing has a detailed description of the flow
- `occurrences.txt` is an improved/cleaned/formalized `verbatim.txt`
- Metadata:
  - `meta.xml` has the list of all column data types etc., for all files in the zip!
    - The column links lead to DCMI: DCMI Metadata Terms
  - `metadata.xml` is things like the download DOI, license, number of rows, etc.
- The .zips are in Darwin Core format: FAQ
- Because there are cases with both single and double quotes etc., and neither `'` nor `"` works as `quotechar`:

```python
df = vx.read_csv(DS_LOCATION, convert="verbatim.hdf5", progress=True, sep="\t", quotechar=None, quoting=3, chunk_size=500_000)
```
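`quoting=3` is `csv.QUOTE_NONE` from the stdlib `csv` module: quotes are treated as ordinary characters, which is what you want when a field can contain stray `'` and `"` (toy data below, not GBIF's):

```python
import csv
import io

# quoting=3 in the read_csv call above is this constant:
assert csv.QUOTE_NONE == 3

data = 'name\tnote\nfoo\tit\'s a "mixed" quote case\n'
rows = list(csv.reader(io.StringIO(data), delimiter="\t", quoting=csv.QUOTE_NONE))
```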
Tools
- GBIF .zip parser lib:
- BelgianBiodiversityPlatform/python-dwca-reader: 🐍 A Python package to read Darwin Core Archive (DwC-A) files.
- Tried it, took a long time both for the zip and directory, so I gave up
- gbif/pygbif: GBIF Python client
- API client, can also do graphs etc., neat!
Analysis
Things to try:
- Limit the number of columns through `pd.read_csv(usecols=...)` to the 'interesting' ones
  - take column indexes from `meta.xml`
- Optionally take a smaller subset of the dataset and drop all `NaN`s
- See if someone already did this: BelgianBiodiversityPlatform/python-dwca-reader: 🐍 A Python package to read Darwin Core Archive (DwC-A) files.
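A hedged sketch of the first two points with pandas (the column names and data are invented; the real column list would come from `meta.xml`):

```python
import io
import pandas as pd

# Toy stand-in for a tab-separated occurrence file.
raw = "id\tspecies\tcountry\n1\tA. thaliana\t\n2\t\tNL\n3\tP. major\tDE\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    usecols=["id", "species"],   # only the 'interesting' columns
    nrows=1000,                  # optionally, a smaller subset
)
df = df.dropna()                 # drop rows with NaNs in the kept columns
```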