# serhii.net

In the middle of the desert you can say anything you want

• ### Day 1319 (11 Aug 2022)

#### Pycharm shelf and changelists and 'Unshelve silently'

So - shelves! Just found out a really neat way to use them

“Unshelve silently” - never used it and never cared, just now - misclick and I did. It put the content of the shelf in a separate changelist named like the shelf, without changing my active changelist.

This is neat!

One of my main uses for both changelists and shelves are “I need to apply this patch locally but don’t want to commit that”, and this basically automates this behaviour.

• ### Day 1318 (10 Aug 2022)

#### Huggingface utils ExplicitEnum python bits

In the Huggingface source found this bit:

class ExplicitEnum(str, Enum):
"""
Enum with more explicit error message for missing values.
"""

@classmethod
def _missing_(cls, value):
raise ValueError(
f"{value} is not a valid {cls.__name__}, please select one of {list(cls._value2member_map_.keys())}"
)


… wow?

(Pdb++) IntervalStrategy('epoch')
<IntervalStrategy.EPOCH: 'epoch'>
(Pdb++) IntervalStrategy('whatever')
*** ValueError: whatever is not a valid IntervalStrategy, please select one of ['no', 'steps', 'epoch']


Was MyEnum('something') allowed the whole time? God I feel stupid.

• ### Day 1317 (09 Aug 2022)

#### Python sorted sorting with multiple keys

So sorted()’s key= argument can return a tuple, then the tuple values are interpreted as multiple sorting keys!

## Pytest logging in pycharm

In Pycharm running config, there are options to watch individual log files which is nice.

But the main bit - all my logging issues etc. were the fault of Pycharm’s Settings for pytest that added automatically a -q flag. Removed that checkmark and now I get standard pytest output that I can modify!

And now caplog1 works:

def test_split_ds(caplog):
caplog.set_level(logging.DEBUG, logger="anhaltai_bbk.data.train_dev_splitter.splitter")
caplog.set_level(logging.DEBUG)
# ...


## Dropping into debugger on uncaught exception + pytest plugin

Anyway, solution was on pytest level, installing this package was the only thing needed: pytest-pycharm · PyPI

Installed it at the same time as this pycharm plugin, might’ve been either of the two:

Anyway now life’s good:

#### Creating representative test sets

Thinking out loud and lab notebook style to help me solve a problem, in this installment - creating representative train/test splits.

## Problem

Goal: create a test set that looks like the train set, having about the same distribution of labels.

In my case - classic NER, my training instances are documents whose tokens can be a number of different labels, non-overlapping, and I need to create a test split that’s similar to the train one. Again, splitting happens per-document.

Added complexity - in no case I want tags of a type ending up only in train or only in test. Say, I have 100 docs and 2 ORGANIZATIONs inside them - my 20% test split should have at least one ORGANIZATION.

Which is why random selection doesn’t cut it - I’d end up doing Bogosort more often than not, because I have A LOT of such types.

Simply ignoring them and adding them manually might be a way. Or intuitively - starting with them first as they are the hardest and most likely to fail

## Implementation details

My training instance is a document that can have say 1 PEOPLE, 3 ORGANIZATIONS, 0 PLACES.

For each dataset/split/document, I have a dictionary counting how many instances of each entity does it have, then changed it to a ratio “out of the total number of labels”.

{
"O": 0.75,
"B-ORGANIZATION": 0.125,
"I-ORGANIZATION": 0,
"B-NAME": 0,
"I-NAME": 0,
}


I need to create a test dataset with the distribution of these labels as close as the train dataset. In both, say, 3 out of 4 labels should be "O".

So - “which documents do I pick so that when their labels are summed up I get a specific distribution”, or close to it. So “pick the numbers from this list that sum up close to X”, except multidimensional.

Initial algo was “iterate by each training instance and put it in the pile it’ll improve the most”.

Started implementing something to do this in
HuggingFace Datasets , and quickly realized that “add his one training instance to this HF Dataset” is not trivial to do, and iterating through examples and adding them to separate datasets is harder than expected.

Generally we’re in the area of concepts like Subset sum problem / Optimization problem / Combinatorial optimization

### Scikit-learn

More usefully, specifically RE datasets, How to Create a Representative Test Set | by Dimitris Poulopoulos | Towards Data Science mentioned sklearn.model_selection.StratifiedKFold.

Which led me to sklearn’s “model selection” functions that have a lot of functions doing what I need! Or almost

API Reference — scikit-learn 1.1.2 documentation

And the User Guide specifically deals with them: 3.1. Cross-validation: evaluating estimator performance — scikit-learn 1.1.2 documentation

Anyway - StratifiedKFold as implemented is “one training instance has one label”, which doesn’t work in my case.

My training instance is a document that has 1 PEOPLE, 3 ORGANIZATIONS, 0 PLACES.

### Other places

Dataset Splitting Best Practices in Python - KDnuggets

## Brainstorming

Main problem: I have multiple labels/ys to optimize for and can’t directly use anything that splits based on a single Y.

Can I hack something like sklearn.model_selection.StratifiedGroupKFold for this?

Can I read about how they do it and see if I can generalize it? (Open source FTW!) scikit-learn/_split.py at 17df37aee774720212c27dbc34e6f1feef0e2482 · scikit-learn/scikit-learn

Can I look at the functions they use to hack something together?

… why can’t I use the initial apporach of adding and then measuring?

Where can I do this in the pipeline? In the beginning on document level, or maybe I can drop the requirement of doing it per-document and do it at the very end on split tokenized training instances? Which is easier?

Can I do a random sample and then add what’s missing?

Will going back to numbers and “in this train set I need 2 ORGANIZATIONS” help me reason about it differently than the current “20% of labels should be ORGANIZATION”?

### Looking at vanilla StratifiedKFold

scikit-learn/_split.py at 17df37aee774720212c27dbc34e6f1feef0e2482 · scikit-learn/scikit-learn

They sort the labels and that way get +/- the number of items needed. Neat but quite hard for me to adapt to my use case.

### OK, NEXT

Can I think of this as something like a sort with multiple keys?..

Can I use the rarity of a type as something like a class weight? Ha, that might work. Assign weights in such a way that each type is 100 and

This feels relevant. Stratified sampling - Wikipedia

Can I chunk them in small pieces and accumulate them based on the pieces, might be faster than by using examples?

THIS looked like something REALLY close to what I need, multiple category names for each example, but ended up being the usual stratified option I think:

python - Split data into train/ test files such that at least one sample is picked for both the files - Stack Overflow

This suggests to multiply the criteria and get a lot of bins - not what I need but I keep moving

Can I stratify by multiple characteristics at once?

I think “stratification of multilabel data” is close to what I need

Found some papers, yes this is the correct term I think

### scikit-multilearn

scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python

In multi-label classification one can assign more than one label/class out of the available n_labels to a given object.

This is really interesting, still not EXACTLY what I need but a whole new avenue of stuff to look at

scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python

The idea behind this stratification method is to assign label combinations to folds based on how much a given combination is desired by a given fold, as more and more assignments are made, some folds are filled and positive evidence is directed into other folds, in the end negative evidence is distributed based on a folds desirability of size.

Yep back to the first method!

They link this lecture explaining the algo: On the Stratification of Multi-Label Data - VideoLectures.NET

### That video was basically what I needed

Less the video than the slides, didn’t watch the video and hope I won’t have to - the slides make it clear enough.

Yes, reframing that as “number of instances of this class that are still needed by this fold” was a better option. And here binary matrices nicely expand to weighted stratification if I have multiple examples of a class in a document. And my initial intuition of starting with the least-represented class first was correct

Basic algorithm:

• Get class with smallest number of instances in the dataset
• Get all training examples with that class and distribute them first
• Go to next class

Not sure if I can use the source of the implementation: scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python

I don’t have a good intuition of what they mean by “order”, for now “try to keep labels that hang out together in the same fold”? Can I hack it to

I still have the issue I tried to avoid with needing to add examples to a fold/Dataset, but that’s not the problem here.

Generally - is this better than my initial approach?

What happens if I don’t modify my initial approach, just the order in which I give it the training examples?

Can I find any other source code for these things? Ones easier to adapt?

## Anyway

I’ll implement the algo myself based on the presentation and video according to my understanding.

The main result of this session was finding more related terminology and a good explanation of the algo I’ll be implementing, with my changes.

I’m surprised I haven’t found anything NER-specific about creating representative test sets based on the distribution of multiple labels in the test instances. Might become a blog post or something sometime.jj

• ### Day 1316 (08 Aug 2022)

#### Huggingface datasets set_transform

So you have the various mapping functions, but there’s a set_transform which executes a transform when getitem() is called.

Main classes

• ### Day 1311 (03 Aug 2022)

#### Slurm pyxis using a docker

If I sent you a link to this you probably want the TL;DR at the bottom

## Context

Problem: I have a docker image in a private docker registry that needs user/pass.

I need to use it in slurm’s pyxis.

The default srun --container-image .. syntax has no obvious place for a Docker registry user/pass.

Trying to use an image from a private registry does this:

$srun --mem=16384 -c2 --gres=gpu:v100:2 --container-image comp/myimage:latest slurmstepd: error: pyxis: child 2505947 failed with error code: 1 slurmstepd: error: pyxis: failed to import docker image slurmstepd: error: pyxis: printing contents of log file ... slurmstepd: error: pyxis: [INFO] Querying registry for permission grant slurmstepd: error: pyxis: [INFO] Authenticating with user: <anonymous> slurmstepd: error: pyxis: [INFO] Authentication succeeded slurmstepd: error: pyxis: [INFO] Fetching image manifest list slurmstepd: error: pyxis: [INFO] Fetching image manifest slurmstepd: error: pyxis: [ERROR] URL https://registry-1.docker.io/[...] returned error code: 401 Unauthorized  Slurm’s pyxis1 uses enroot2 to do the container magic that includes interfacing with Docker. enroot is installed on the box, Docker isn’t, I have no root access. ## Option/attempt 1: Using enroot config to pass a credentials file I need to pass through srun configs to enroot, so it can access the docker registry. To pass credentials to it, create a credentials file in $ENROOT_CONFIG_PATH/.credentials:

# DockerHub


That env var is not set in the base system, set it to /home/me/enroot/ and put the file there - same (no) result.

After googling, found this really detailed thread about the way pyxis handles environment variables: enroot/import.md at master · NVIDIA/enroot Especially this specific comment: pyxis doesn’t use environment variables defined in enroot .env files · Issue #46 · NVIDIA/pyxis

So basically, enroot and pyxis are behaving in opposite ways:

• if a ‘dynamic’ env var is defined in enroot conf files, enroot passes it to the container, but not pyxis
• if it’s not defined in enroot conf files, enroot doesn’t pass it to the container, but pyxis does.

I don’t have write access to the enroot config files, but the $ENROOT_CONFIG_PATH isn’t set there, I should be able to change it. No effect though. Giving up for now, though that would’ve been the most beautiful solution. ## Attempt 2: Get the image separately through enroot I could use pure enroot to get the docker image, then pass the file to srun. Run “Docker” Containers with NVIDIA Enroot To use a oath authentication and a token you would need to sign-up/sign-in and create a token (which you can save for reuse) and then do the container import as, enroot import 'docker://$oauthtoken@nvcr.io#nvidia/tensorflow:21.04-tf1-py3'

Awesome, let’s create a token and try:

… okay, what’s the address of the docker hub? The hub.docker.com one that’s default and ergo not used anywhere, but I need to pass it explicitly?..

Anyway let’s try to get bitnami/minideb from a public repo to pin the syntax down.

hub.docker.com returned 404s, trial and error led me to docker.io:

[INFO] Querying registry for permission grant
[INFO] Permission granted
[INFO] Fetching image manifest list
[ERROR] Could not process JSON input
curl: (23) Failed writing body (1011 != 4220)


registry-1.docker.io actually asked me for a password!

enroot import 'docker://$token@registry-1.docker.io#bitnami/minideb:latest' [INFO] Querying registry for permission grant [INFO] Authenticating with user:$token
Enter host password for user '$token': [ERROR] URL https://auth.docker.io/token returned error code: 401 Unauthorized  Without providing the token the image gets downloaded! Then I found index.docker.io3 that seems to be the correct one. Okay, let’s get my private one me@slurm-box:/slurm/me$ ENROOT_CONFIG_PATH=/home/me/enroot enroot import 'docker://index.docker.io#comp/myimage:latest'


401 error unauthorized, still ignoring my .credentials or env variable pointing to it.

enroot import 'docker://mydockerusername@index.docker.io#comp/myimage:latest'


Asks me for a password and then imports correctly! And creates a file called myimage.sqsh in the current dir.

Woohoo, working way to get docker images from private registry!

$enroot start myimage.sqsh enroot-nsenter: failed to create user namespace: Operation not permitted  Okay, so I’m not allowed to start them with enroot - not that I had any reason to.  srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/data --container-workdir /data  --pty bash


Drops me inside a shell in the container - it works!

Next step - using the Docker token.

Docker seems to see it as password replacement, this conflicts with official docus:

# Import Tensorflow 19.01 from NVIDIA GPU Cloud
$enroot import --output tensorflow.sqsh 'docker://$oauthtoken@nvcr.io#nvidia/tensorflow:19.01-py3'


On further googling - that’s a thing specific for nvcr.io, Docker Hub uses Docker stuff and I use that token as password replacement, period. Okay.

Had issues with mounting stuff as /data by default, but that specific bit is used in the docker image too - used something else.

The Dockerfile also has an ENTRYPOINT and sbin wants something to execute, true can be passed. Couldn’t get this to work, no true means sbin refuses to start, passing true makes it ignore the entrypoint altogether. --[no-]container-entrypoint from docu didn’t help - leaving for later.

Final line:

srun  --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable python3 -m trainer_module -i /data/ -o /SLURMdata/Checkpoints/ --config-file /SLURMdata/config.yaml  This: • makes the image writable, so huggingface and friends can download stuff • makes /slurm/me/data available as /SLURMdata inside the image; • passes a config file to it that I have inside /data/config.yaml to the trainer (that accesses it as /SLURMdata/config.yaml) • runs the training on a dataset inside the directory that the Dockerfile puts inside /data in the image itself (the one that conflicted with mine earlier), • puts training results in a directory inside /SLURMdata which means it’s available to me after sbin is done in my /slurm/me/data directory. ## TODO / for later • Try again to find a way to use a .credentials file, one command less to run then • How to run my docker image’s ENTRYPOINT ## (More) resources ## TL;DR Two ways I found, passing credentials for the docker registry didn’t work, separately downloading the image and then running it did. Read the entire post if you want details on most of this. ### Getting the image: enroot import 'docker://mydockerusername@index.docker.io#comp/myimage:latest'  Replace mydockerusername with your docker username, comp with companyname and myimage with the name of the image. It will ask you for your Docker pass or Personal Access Token. Will download the image into a *.sqsh file in the current directory or whatever you pass through the -o parameter. ### Running the image srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/$(id -u -n)/data:/SLURMdata --container-writable your_command_to_run

# or - if you are running the thing I'm running - ...

srun  --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./Docker/myimage.sqsh --container-mounts=/slurm/\$(id -u -n)/data:/SLURMdata --container-writable python3 -m trainer_module -i /data/ -o /SLURMdata/Checkpoints/ --config-file /SLURMdata/config.yaml


In decreasing order of interest/generality:

• pass the downloaded *.sqsh file to --container-image.
• Environment variables get passed as-is in most cases. If you’d do docker run --env ENV_VAR_NAME, here you’d say ENV_VAR_NAME=whatever srun ... or just export ... it before running and it should work.
• --container-writable is needed to make the filesystem writable, huggingface needs that to write cache files
• --container-mounts
• are /dir_in_your_fs:/dir_inside_docker_image
• Make sure the Docker itself doesn’t have anything unexpected located at /dir_inside_docker_image

• ### Day 1304 (27 Jul 2022)

#### Huggingface dataset analysis tool

Really nice, and the blog post introducing it has a lot of general info about datasets that I found very interesting.

• ### Day 1303 (26 Jul 2022)

#### Python typing classmethods return type

If you have a classmethod and want to annotate the return value as that same class you’re now defining, you can actually do the logical thing!

from __future__ import annotations

class Whatever:
# ...
@classmethod what(cls) -> Whatever:
return cls()


#### Python dataclass libraries, pydantic and dataclass-wizard

It started with writing type hints for a complex dict, which led me to TypedDict, slowly went into “why can’t I just do a dataclass as with the rest”.

Found two libraries:

• pydantic
• Really nice esp. for validation
• MyClass(**somedict)
• Couldn’t find an easy way to parse nested dataclasses from the same dict
• Dataclass Wizard