In the middle of the desert you can say anything you want
Spent hours tracking down a bug that boiled down to:

A if args.sth.lower == "a" else B

Guess what: args.sth.lower is a callable, and a callable will never be equal to a string, so args.sth.lower == "a" is always False. Of course I needed args.sth.lower().
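A minimal repro of the same trap (the Args class and its value are made up for illustration):

class Args:
    sth = "A"  # stand-in for what argparse would have parsed

args = Args()
# Comparing the bound method itself to a string is always False:
print(args.sth.lower == "a")    # False
# Calling it returns the lowercased string, which compares as intended:
print(args.sth.lower() == "a")  # True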
Python sets have two kinds of methods:
- a.intersection(b), which returns the intersection
- a.intersection_update(b), which updates a by removing elements not found in b

The docs call the function-like ones (the ones that return the result) operators, as opposed to the ..._update() ones.
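A quick illustration of the difference (throwaway example sets):

a = {1, 2, 3}
b = {2, 3, 4}
print(a.intersection(b))   # {2, 3}; a is unchanged
a.intersection_update(b)   # modifies a in place, returns None
print(a)                   # {2, 3}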
Previously: 220622-1744 Directory structure for python research-y projects, 220105-1142 Order of directories inside a python project
Datasets.
HF has recommendations about how to Structure your repository: where/how to put .csv/.json files for the various splits/shards/configurations.
Dataset directories structured this way can also be easily loaded with load_dataset(), despite being plain CSV/JSON files. Filenames containing 'train' are considered part of the train split, same for 'test' and 'valid'.
And indeed I could, without issues, create a Dataset through ds = datasets.load_dataset(my_directory_with_jsons).
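A minimal sketch of what that looks like, assuming a local directory of JSON-lines files (directory and file names are made up):

# my_dataset/
#   train.jsonl
#   test.jsonl
#   valid.jsonl
import datasets

ds = datasets.load_dataset("./my_dataset")
# or, more explicitly, via the json builder:
ds = datasets.load_dataset("json", data_dir="./my_dataset")
print(ds)  # DatasetDict with the splits inferred from the filenames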
Given an argument -l, I needed to pass multiple values to it.
python - How can I pass a list as a command-line argument with argparse? - Stack Overflow is an extremely detailed answer with all the options, but the TL;DR is:

nargs:
parser.add_argument('-l', '--list', nargs='+', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 2345 3456 4567
append:
parser.add_argument('-l', '--list', action='append', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 -l 2345 -l 3456 -l 4567
Details about values for nargs:
# This is the correct way to handle accepting multiple arguments.
# '+' == 1 or more.
# '*' == 0 or more.
# '?' == 0 or 1.
# An int is an explicit number of arguments to accept.
parser.add_argument('--nargs', nargs='+')
Related: a couple of days ago I used nargs to allow an empty value (explicitly passing -o without an argument, which then becomes None) while still providing a default value that is used if -o is omitted completely:
parser.add_argument(
"--output-dir",
"-o",
help="Target directory for the converted .json files. (%(default)s)",
type=Path,
default=DEFAULT_OUTPUT_DIR,
nargs="?",
)
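A self-contained sketch of the three resulting cases (DEFAULT_OUTPUT_DIR and the argument values are placeholders):

import argparse
from pathlib import Path

DEFAULT_OUTPUT_DIR = Path("converted")  # stand-in for the real default

parser = argparse.ArgumentParser()
parser.add_argument(
    "--output-dir", "-o",
    help="Target directory for the converted .json files. (%(default)s)",
    type=Path,
    default=DEFAULT_OUTPUT_DIR,
    nargs="?",
)

print(parser.parse_args([]).output_dir)             # DEFAULT_OUTPUT_DIR (-o omitted)
print(parser.parse_args(["-o"]).output_dir)         # None (-o given without a value; const defaults to None)
print(parser.parse_args(["-o", "out"]).output_dir)  # Path('out')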
A training that worked on my laptop gets killed on the slurm node.
sstat was hard to parse and read, and I wasn't sure what I wanted from it.
Find out the CPU time and memory usage of a slurm job - Stack Overflow:
- sstat is for running jobs, sacct is for finished jobs
- sacct's examples told me that column name capitalization doesn't matter

Ended up with this:
sacct -j 974 --format=jobid,jobname,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,alloccpus,elapsed,state,exitcode,reqcpufreqmax,reqcpufreqgov,reqmem
For running jobs:
sstat -j 975 --format=jobid,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,reqcpufreqmax,reqcpufreqgov
(Half can be removed, but my goal was to just get it to fit on screen)
W|A is still the best for conversions: 18081980K in gb - Wolfram|Alpha
Other things I learned:
- You can use suffixes in args like --mem=200G
- --mem=0 should give access to all the memory, though it doesn't work for me
You can do a task farm to run many instances of the same command with diff params: Slurm task-farming for Python scripts | Research IT | Trinity College Dublin
Found more helpful places
Things that work for my specific instance:
- ssh-copy-id to log in via public key
- kitty +kitten ssh shamotskyi@v-slurm-login
- sshfs
- set -o vi in ~/.bashrc
Problem: how to install my python packages?
Sample from the documentation about using pyxis:
srun --mem=16384 -c4 --gres=gpu:v100:2 \
--container-image tensorflow/tensorflow:latest-gpu \
--container-mounts=/slurm/$(id -u -n):/data \
--container-workdir /data \
python program.py
Sadly my code needs some additional packages that aren't installed by default there (or anywhere), and I need to install spacy language packs etc.
I have a Docker image with everything installed that I could use, but it's not on any public registry and I'm not going to set one up just for this.
You can start interactive jobs; in this case it's inside a docker container, and it drops you into a shell:
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image tensorflow/tensorflow:latest-gpu --container-mounts=/slurm/$(id -u -n):/data --container-workdir /data --pty bash
Couldn't add users or install packages because nothing was writable, so I opened the documentation and found interesting flags there:
--container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
[pyxis] the image to use for the container
filesystem. Can be either a docker image given as
an enroot URI, or a path to a squashfs file on the
remote host filesystem.
--container-name=NAME [pyxis] name to use for saving and loading the
container on the host. Unnamed containers are
removed after the slurm task is complete; named
containers are not. If a container with this name
already exists, the existing container is used and
the import is skipped.
--container-save=PATH [pyxis] Save the container state to a squashfs
file on the remote host filesystem.
--container-writable [pyxis] make the container filesystem writable
--container-readonly [pyxis] make the container filesystem read-only
So, I can get an image from Docker Hub, save that container locally, and then provide that saved one instead of the image from the registry. Nice.
Or just give it a name, and it will reuse the existing container instead of re-importing the image.
I can also make it writable.
=> I can create my own docker image, install everything there, and just go inside it to start trainings?
Final command:
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./test_saved_path --container-save ./test_saved_path_2 --container-mounts=/slurm/$(id -u -n)/data:/data --container-workdir /data --container-name my_container_name --container-writable --pty bash
It:
- starts from my locally saved image ./test_saved_path, reuses the named writable container my_container_name, and additionally saves the container state to ./test_saved_path_2, just in case the open-the-named-container-by-name approach ever fails me.

And a folder that I have mounted locally with sshfs, and that the docker image also has transparent access to, makes the entire workflow fast.
That was the final solution.
(But I still wonder how everyone else is doing it; I can't believe this is the common way to run stuff that needs an installed package…)
Magic line to remove all docker containers and volumes:
docker rm -f $(docker ps -aq) && docker volume rm -f $(docker volume ls -q)
Was unhappy about the order of completion suggestions in PyCharm: it favors recent stuff I can remember over the arguments of a function I don't.
Started looking for ways to reorder them, but then realized that what I ACTUALLY want is documentation for the thing under the cursor - which I have and use in vim/jedi, but somehow not in PyCharm.
Code reference information | PyCharm:
- <Ctrl-Shift-I> does this "Quick definition"
- "View -> Quick type definition" exists too! Can be bound to a key, but is available through the menu.

That menu has A LOT of stuff that is going to be transformative for the way I code. Describing it here in full to remember it; it's worth it.
My understanding is:
- Quick definition answers "It's a function defined as def ou()..", "It's a variable the function got through this part of the signature: a: str,". <C-S-i> by default.
- Quick documentation: <Alt-K> for me, default <Ctrl-P>.
- Type info answers "It's a str!". <Alt-P> for me, default <Ctrl-Shift-P>.
- Quick type definition shows the definition of str itself - well, now I know that a str has a long definition. No default shortcut; <Alt-q> for me.

<Alt-K> is now quick documentation, <Alt-P> is now type info. Onwards!
A DatasetInfo object contains dataset metadata like version etc.
Adding pre-existing attributes is described here: Create a dataset loading script. But apparently you can't add custom ones through it.
Build and load touches on the topic and suggests subclassing BuilderConfig, the class that is then used by the DatasetBuilder.
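A minimal sketch of what that subclassing could look like; the class names, the converter_version attribute and the features are all made up for illustration:

import datasets

class MyConfig(datasets.BuilderConfig):
    def __init__(self, converter_version: str = "0.0.1", **kwargs):
        super().__init__(**kwargs)
        # custom attribute that a plain BuilderConfig doesn't have
        self.converter_version = converter_version

class MyDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = MyConfig
    BUILDER_CONFIGS = [MyConfig(name="default", converter_version="0.0.1")]

    def _info(self):
        return datasets.DatasetInfo(
            description="Toy dataset with a custom config attribute",
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["neg", "pos"]),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={})]

    def _generate_examples(self):
        # self.config is a MyConfig instance, so the custom attribute is reachable here
        yield 0, {"text": f"converted with {self.config.converter_version}", "label": "pos"}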
Fine-tuning with custom datasets — transformers 3.2.0 documentation
An example is shown there, not for this exact problem, and I don't really like it, but whatever.
Ended up just not adding metadata; I basically needed things that can be recovered anyway from a Features object with ClassLabels.
The lack of easy support for custom metadata is really strange to me - it sounds like something quite useful to many people ("Dataset created with version XX of the converter program"), and I see no reason why HF doesn't do this.
Strong intuitive feeling that I'm misunderstanding the logic on some level, and that the answer I need is closer in spirit to "why would you want to add custom attributes to X, you could just ….".
Does everyone use separate keys/values in the dataset itself, or something?
EDIT: https://huggingface.co/datasets/allocine/edit/main/README.md is a cool example.
The * operator works to get a list from dictionary keys!
- my_dict.keys() returns a dict_keys object.
- [*my_dict.keys()] returns the keys as a list of str.
- list(..) would do the same, but in a more readable way :)

Anyway, filing this under "cool stuff I won't ever use".
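A tiny demo (my_dict is a throwaway example):

my_dict = {"a": 1, "b": 2}
print(my_dict.keys())        # dict_keys(['a', 'b'])
print([*my_dict.keys()])     # ['a', 'b']
print(list(my_dict.keys()))  # ['a', 'b'] - the more readable version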