In the middle of the desert you can say anything you want
I was trying to do a join based on two columns, one of which is a pd.Timestamp.
What I learned: If you’re trying to join/merge two DataFrames not by their indexes,
pandas.DataFrame.merge
is better (yay precise language) than
pandas.DataFrame.join.
For some reason I had issues with df.join(..., on=[col1, col2]), even with df.set_index([col1, col2]).join(df2.set_index...); then it went out of memory and I gave up.
Then an SO answer1 said:
use merge if you are not joining on the index
I tried it, and df.merge(..., on=col2) magically worked!
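A minimal sketch of the kind of merge that worked, with made-up column names (not the real data): merge accepts a list for on= when joining on several columns, a Timestamp column included.

import pandas as pd

# hypothetical data: 'ts' is a pd.Timestamp column, 'key' an ordinary one
left = pd.DataFrame({
    'ts': pd.to_datetime(['2023-01-01', '2023-01-02']),
    'key': ['a', 'b'],
    'x': [1, 2],
})
right = pd.DataFrame({
    'ts': pd.to_datetime(['2023-01-01', '2023-01-02']),
    'key': ['a', 'b'],
    'y': [10, 20],
})

# merge joins on columns rather than the index; on= can be a list
merged = left.merge(right, on=['ts', 'key'], how='inner')
print(merged)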
Spent hours trying to understand what was happening.
TL;DR: categorical dtypes inside groupbys show ALL categories, even if there are no instances of a specific category in the actual data.
# Shows all categories, including OTHER
df_item[df_item['item.item_category'] != "OTHER"].groupby(['item.item_category']).sum()

# Cast the categorical column to plain strings
df_item['item.item_category'] = df_item['item.item_category'].astype(str)

# Now shows only the three remaining categories
df_item[df_item['item.item_category'] != "OTHER"].groupby(['item.item_category']).sum()
Rel. thread: groupby with categorical type returns all combinations · Issue #17594 · pandas-dev/pandas
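A minimal self-contained repro of this behaviour (toy column names, not the real data); note that groupby() also takes an observed= parameter, which as far as I understand is the intended way to drop unused categories without casting to str:

import pandas as pd

df = pd.DataFrame({
    "cat": pd.Categorical(["A", "B", "OTHER"]),
    "val": [1, 2, 3],
})

# With observed=False (the long-time default), the filtered-out "OTHER"
# category still gets a row, because it is still part of the dtype
print(df[df["cat"] != "OTHER"].groupby("cat", observed=False).sum())

# observed=True keeps only the categories actually present in the data
print(df[df["cat"] != "OTHER"].groupby("cat", observed=True).sum())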
Both snippets below work! Seaborn is smart and handles pandas groupby results as-is:
sns.histplot(data=gbc,
             x='items_available',
             hue="item.item_category",
             )
sns.histplot(data=gbc.reset_index(),
             x='items_available',
             hue="item.item_category",
             )
TL;DR:
df.loc[row_indexer, col_indexer] = value
col_indexer can be a column that doesn’t exist yet! And row_indexer can be anything, including something based on a groupby filter.
Below, the groupby filter has dropna=False, which also returns the rows that don’t match the filter, giving a Series with the same index as the main df:
# E.g. this groupby filter - NB dropna=False
df_item.groupby(['item.item_id']).filter(lambda x: x.items_available.max() > 0, dropna=False)['item.item_id']

# Then we use that in the condition - a nice arbitrary example, with `item.item_id` not being the index of the DF
df_item.loc[df_item['item.item_id'] == df_item.groupby(['item.item_id']).filter(lambda x: x.items_available.max() > 0, dropna=False)['item.item_id'], 'item_active'] = True
I’m not sure whether this is the “best” way to incorporate groupby results, but it seems to work OK for now.
In particular, the remaining rows end up with NaN instead of False; this can be worked around, but it’s ugly:
df_item['item_active'] = df_item['item_active'].notna()
# For plotting purposes
sns.histplot(data=df_item.notna(), ... )
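For comparison, a sketch of a possibly cleaner way to do the same thing (assuming the same df_item columns as above; this is not from the original notes): groupby().transform() returns a Series aligned with the original index, so the boolean can be assigned directly and the non-matching rows come out as False rather than NaN.

# max items_available per item.item_id, broadcast back onto every row
df_item['item_active'] = (
    df_item.groupby('item.item_id')['items_available'].transform('max') > 0
)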
Pandas Filter by Column Value - Spark By {Examples} has more examples of conditions:
# From https://sparkbyexamples.com/pandas/pandas-filter-by-column-value/
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
mwouts/itables: Pandas DataFrames as Interactive DataTables:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
It kept “loading”. I set the notebook to ‘trusted’ as per the help, which didn’t help. But this did:
init_notebook_mode(all_interactive=True, connected=True)
(connected=True makes it load the libraries dynamically from the internet, and is not the default.)
This allows more interesting interactive dynamic tables, incl. things like sorting by column etc.
Courtesy of ChatGPT, after 15 minutes of googling didn’t turn up anything immediately copy-pasteable.
I’ll deal with the existential crisis from this later.
To explode the cells containing dictionaries in a pandas DataFrame and append them as additional columns, you can use the apply() method along with the pd.Series() constructor. Here’s an example:
import pandas as pd

# create example DataFrame
df = pd.DataFrame({
    'id': [1, 2],
    'info': [
        {'name': 'Alice', 'age': 25},
        {'name': 'Bob', 'age': 30, 'city': 'New York'}
    ]
})

# use apply() with pd.Series to explode the dictionaries and append them as columns
df = pd.concat([df.drop('info', axis=1), df['info'].apply(pd.Series)], axis=1)
print(df)
Output:
   id   name  age      city
0   1  Alice   25       NaN
1   2    Bob   30  New York
In this example, we first create a DataFrame with two rows, where the info column contains dictionaries. We then use the apply() method with pd.Series() to explode the dictionaries and create additional columns for each key in the dictionaries. Finally, we concatenate the exploded columns with the original DataFrame using pd.concat().
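For reference, a variant that I believe also works, using pd.json_normalize instead of apply(pd.Series) (usually faster on larger frames); same example data as above:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'info': [
        {'name': 'Alice', 'age': 25},
        {'name': 'Bob', 'age': 30, 'city': 'New York'},
    ]
})

# json_normalize builds a DataFrame with one column per dict key
expanded = pd.json_normalize(df['info'].tolist())
df = pd.concat([df.drop(columns='info'), expanded], axis=1)
print(df)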
Related: 230529-1413 Plants datasets taxonomy
Citizen science (similar to [..] participatory/volunteer monitoring) is scientific research conducted with participation from the general public
most citizen science research publications being in the fields of biology and conservation
can mean multiple things, usually using citizens acting as volunteers to help monitor/classify/.. stuff (but also citizens initiating stuff; also: educating the public about scientific methods, e.g. in schools)
allowed users to upload photos of a plant species and its components, enter its characteristics (such as color and size), compare it against a catalog photo and classify it. The classification results are juried by crowdsourced ratings.4
“Here we present two Pl@ntNet citizen science initiatives used by conservation practitioners in Europe (France) and Africa (Kenya).”
@fuccilloAssessingAccuracyCitizen2015 (2015) z>
Volunteers demonstrated greatest overall accuracy identifying unfolded leaves, ripe fruits, and open flowers.
@crallAssessingCitizenScience2011 Assessing citizen science data quality (2011) z>
@chenPerformanceEvaluationDeep2021 (2021) z>
- Georeferenced plant observations from herbarium, plot, and trait records;
- Plot inventories and surveys;
- Species geographic distribution maps;
- Plant traits;
- A species-level phylogeny for all plants in the New World;
- Cross-continent, continent, and country-level species lists.
@ortizReviewInteractionsBiodiversity2021 A review of the interactions between biodiversity, agriculture, climate change, and international trade (2021) z/d>
(e.g. strong colour variation and the transformation of 3D objects after pressing like fruits and flowers) <@waldchenMachineLearningImage2018 (2018) z>
@goeau2021overview (2021) z> <@goeauAIbasedIdentificationPlant2021 (2021) z>
“Lab-based setting is often used by biologist that brings the specimen (e.g. insects or plants) to the lab for inspecting them, to identify them and mostly to archive them. In this setting, the image acquisition can be controlled and standardised. In contrast to field-based investigations, where images of the specimen are taken in-situ without a controllable capturing procedure and system. For fieldbased investigations, typically a mobile device or camera is used for image acquisition and the specimen is alive when taking the picture (Martineau et al., 2017).” <@waldchenMachineLearningImage2018 (2018) z>
@pearseDeepLearningPhenology2021 (2021) z>, but there DL failed less without flowers than non-DL), but sometimes don’t <@walkerHarnessingLargeScaleHerbarium2022 (2022) z/d>
@goodwinWidespreadMistakenIdentity2015 (2015) z/d>
EDIT: separate post about this: 230529-1413 Plants datasets taxonomy
We can classify existing datasets into two types:
@giselssonPublicImageDatabase2017 (2017) z>), common weeds in Denmark dataset <@leminenmadsenOpenPlantPhenotype2020 (2020) z/d> etc.
FloraCapture requests contributors to photograph plants from at least five precisely defined perspectives
There are some special datasets, satellite and whatever, but especially:
@mamatAdvancedTechnologyAgriculture2022 Advanced Technology in Agriculture Industry by Implementing Image Annotation Technique and Deep Learning Approach (2022) z/d> has an excellent overview of these)
Additional info present in datasets or useful:
http://ceur-ws.org/Vol-2936/paper-122.pdf / <@goeau2021overview
(2021) z> ↩︎
https://hal-lirmm.ccsd.cnrs.fr/lirmm-03793591/file/paper-153.pdf / <@goeau2022overview
(2022) z> ↩︎
IBM and SAP open up big data platforms for citizen science | Guardian sustainable business | The Guardian ↩︎
Deep Learning with Taxonomic Loss for Plant Identification - PMC ↩︎
Goal: Interact with Zotero from within Obsidian
Solution: “Citations”1 plugin for Obsidian, “Better Bibtex”2 plugin for Zotero!
Neat bits:
There’s a configurable “Citations: Insert Markdown Citation” thing!
<_`@{{citekey}}` {{titleShort}} ({{year}}) [z]({{zoteroSelectURI}})/[d](https://doi.org/{{DOI}})_>
- {{citekey}}
- {{abstract}}
- {{authorString}}
- {{containerTitle}}
- {{DOI}}
- {{eprint}}
- {{eprinttype}}
- {{eventPlace}}
- {{page}}
- {{publisher}}
- {{publisherPlace}}
- {{title}}
- {{titleShort}}
- {{URL}}
- {{year}}
- {{zoteroSelectURI}}
hans/obsidian-citation-plugin: Obsidian plugin which integrates your academic reference manager with the Obsidian editor. Search your references from within Obsidian and automatically create and reference literature notes for papers and books. ↩︎
retorquere/zotero-better-bibtex: Make Zotero effective for us LaTeX holdouts ↩︎
<C-N> for me.
zotero:// links don’t work for me, and the default .desktop file they provide seems broken - TODO later
Gitstats is the best I know: tomgi/git_stats: GitStats is a git repository statistics generator.
gitstats /path/to/repo /path/to/output/dir
Generates comprehensive static HTML reports with graphs: authors, files, times of the day/week/month, …