In the middle of the desert you can say anything you want
Spent hours tracking down a bug that boiled down to:

A if args.sth.lower == "a" else B

Guess what: args.sth.lower is a callable, and a callable will never be equal to a string, so args.sth.lower == "a" is always False. Of course I needed args.sth.lower().
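A minimal repro of the same trap (the Args class and its value are made up for illustration):

class Args:
    sth = "A"  # stand-in for what argparse would have parsed

args = Args()
# Comparing the bound method itself to a string is always False:
print(args.sth.lower == "a")    # False
# Calling it returns the lowercased string, which compares as intended:
print(args.sth.lower() == "a")  # True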
Python sets have two kinds of methods:
- a.intersection(b), which returns the intersection
- a.intersection_update(b), which updates a by removing elements not found in b

The docs call the function-like ones (the ones that return the result) operators, as opposed to the ..._update() ones.
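A quick illustration of the difference (throwaway example sets):

a = {1, 2, 3}
b = {2, 3, 4}
print(a.intersection(b))   # {2, 3}; a is unchanged
a.intersection_update(b)   # modifies a in place, returns None
print(a)                   # {2, 3}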
Previously: 220622-1744 Directory structure for python research-y projects, 220105-1142 Order of directories inside a python project
Datasets.
HF has recommendations about how to Structure your repository: where/how to put .csv/.json files for the various splits/shards/configurations.
Dataset directories structured this way can also be easily loaded with load_dataset(), despite being plain CSV/JSON files. Filenames containing 'train' are considered part of the train split, same for 'test' and 'valid'.
And indeed I could, without issues, create a Dataset through ds = datasets.load_dataset(my_directory_with_jsons).
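A minimal sketch of what that looks like, assuming a local directory of JSON-lines files (directory and file names are made up):

# my_dataset/
#   train.jsonl
#   test.jsonl
#   valid.jsonl
import datasets

ds = datasets.load_dataset("./my_dataset")
# or, more explicitly, via the json builder:
ds = datasets.load_dataset("json", data_dir="./my_dataset")
print(ds)  # DatasetDict with the splits inferred from the filenames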
Given an argument -l, I needed to pass multiple values to it.
python - How can I pass a list as a command-line argument with argparse? - Stack Overflow is an extremely detailed answer with all the options, but the TL;DR is:

nargs:
parser.add_argument('-l', '--list', nargs='+', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 2345 3456 4567
append:
parser.add_argument('-l', '--list', action='append', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 -l 2345 -l 3456 -l 4567
Details about values for nargs:
# This is the correct way to handle accepting multiple arguments.
# '+' == 1 or more.
# '*' == 0 or more.
# '?' == 0 or 1.
# An int is an explicit number of arguments to accept.
parser.add_argument('--nargs', nargs='+')
Related: a couple of days ago I used nargs to allow an empty value (explicitly passing -o without an argument, which then becomes None) while still providing a default value that is used if -o is omitted completely:
parser.add_argument(
"--output-dir",
"-o",
help="Target directory for the converted .json files. (%(default)s)",
type=Path,
default=DEFAULT_OUTPUT_DIR,
nargs="?",
)
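A self-contained sketch of the three resulting cases (DEFAULT_OUTPUT_DIR and the argument values are placeholders):

import argparse
from pathlib import Path

DEFAULT_OUTPUT_DIR = Path("converted")  # stand-in for the real default

parser = argparse.ArgumentParser()
parser.add_argument(
    "--output-dir", "-o",
    help="Target directory for the converted .json files. (%(default)s)",
    type=Path,
    default=DEFAULT_OUTPUT_DIR,
    nargs="?",
)

print(parser.parse_args([]).output_dir)             # DEFAULT_OUTPUT_DIR (-o omitted)
print(parser.parse_args(["-o"]).output_dir)         # None (-o given without a value; const defaults to None)
print(parser.parse_args(["-o", "out"]).output_dir)  # Path('out')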
A training that worked on my laptop gets killed on the slurm node.
sstat was hard to parse and read, and I wasn't sure what I wanted from it.
Find out the CPU time and memory usage of a slurm job - Stack Overflow:
- sstat is for running jobs, sacct is for finished jobs
- sacct's examples told me that column name capitalization doesn't matter

Ended up with this:
sacct -j 974 --format=jobid,jobname,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,alloccpus,elapsed,state,exitcode,reqcpufreqmax,reqcpufreqgov,reqmem
For running jobs:
sstat -j 975 --format=jobid,maxvmsize,avevmsize,maxrss,averss,maxpages,avepages,avecpu,reqcpufreqmax,reqcpufreqgov
(Half can be removed, but my goal was to just get it to fit on screen)
W|A is still the best for conversions: 18081980K in gb - Wolfram|Alpha
Other things I learned:
- You can use suffixes in args like --mem=200G
- --mem=0 should give access to all the memory, though it doesn't work for me
You can do a task farm to run many instances of the same command with diff params: Slurm task-farming for Python scripts | Research IT | Trinity College Dublin
Found more helpful places
Things that work for my specific instance:
- ssh-copy-id to log in via public key
- kitty +kitten ssh shamotskyi@v-slurm-login
- sshfs
- set -o vi in ~/.bashrc
Problem: how to install my python packages?
Sample from the documentation about using pyxis:
srun --mem=16384 -c4 --gres=gpu:v100:2 \
--container-image tensorflow/tensorflow:latest-gpu \
--container-mounts=/slurm/$(id -u -n):/data \
--container-workdir /data \
python program.py
Sadly my code needs some additional packages that aren't installed by default there (or anywhere), and I need to install spacy language packs etc.
I have a Docker image with everything installed that I could use, but it's not on any public registry and I'm not going to set one up just for this.
You can start interactive jobs; in this case it's inside a docker container, and it drops you into a shell:
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image tensorflow/tensorflow:latest-gpu --container-mounts=/slurm/$(id -u -n):/data --container-workdir /data --pty bash
Couldn't add users or install packages because nothing was writable, so I opened the documentation and found interesting flags there:
--container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
[pyxis] the image to use for the container
filesystem. Can be either a docker image given as
an enroot URI, or a path to a squashfs file on the
remote host filesystem.
--container-name=NAME [pyxis] name to use for saving and loading the
container on the host. Unnamed containers are
removed after the slurm task is complete; named
containers are not. If a container with this name
already exists, the existing container is used and
the import is skipped.
--container-save=PATH [pyxis] Save the container state to a squashfs
file on the remote host filesystem.
--container-writable [pyxis] make the container filesystem writable
--container-readonly [pyxis] make the container filesystem read-only
So, I can get an image from Docker Hub, save that container locally, and then provide that saved one instead of the image from the registry. Nice.
Or just give it a name, and it will reuse the existing container instead of re-importing the image.
I can also make it writable.
=> I can create my own docker image, install everything there, and just go inside it to start trainings?
Final command:
srun --mem=16384 -c4 --gres=gpu:v100:2 --container-image ./test_saved_path --container-save ./test_saved_path_2 --container-mounts=/slurm/$(id -u -n)/data:/data --container-workdir /data --container-name my_container_name --container-writable --pty bash
It:
- starts from my locally saved image ./test_saved_path, reuses the named writable container my_container_name, and additionally saves the container state to ./test_saved_path_2, just in case the open-the-named-container-by-name approach ever fails me.

And a folder that I have mounted locally with sshfs, and that the docker image also has transparent access to, makes the entire workflow fast.
That was the final solution.
(But I still wonder how everyone else is doing it; I can't believe this is the common way to run stuff that needs an installed package…)
Magic line to remove all docker containers and volumes:
docker rm -f $(docker ps -aq) && docker volume rm -f $(docker volume ls -q)
Was unhappy about the order of completion suggestions in PyCharm: it favors recent stuff I can remember over the arguments of a function I don't.
Started looking for ways to reorder them, but then realized that what I ACTUALLY want is documentation for the thing under the cursor - which I have and use in vim/jedi, but somehow not in PyCharm.
Code reference information | PyCharm:
- <Ctrl-Shift-I> does this "Quick definition"
- "View -> Quick type definition" exists too! Can be bound to a key, but is available through the menu.

That menu has A LOT of stuff that is going to be transformative for the way I code. Describing it here in full to remember it; it's worth it.
My understanding is:
- Quick definition answers "It's a function defined as def ou()..", "It's a variable the function got through this part of the signature: a: str,". <C-S-i> by default.
- Quick documentation: <Alt-K> for me, default <Ctrl-P>.
- Type info answers "It's a str!". <Alt-P> for me, default <Ctrl-Shift-P>.
- Quick type definition shows the definition of str itself - well, now I know that a str has a long definition. No default shortcut; <Alt-q> for me.

<Alt-K> is now quick documentation, <Alt-P> is now type info. Onwards!
A DatasetInfo object contains dataset metadata like version etc.
Adding pre-existing attributes is described here: Create a dataset loading script. But apparently you can't add custom ones through it.
Build and load touches on the topic and suggests subclassing BuilderConfig, the class that is then used by the DatasetBuilder.
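A minimal sketch of what that subclassing could look like; the class names, the converter_version attribute and the features are all made up for illustration:

import datasets

class MyConfig(datasets.BuilderConfig):
    def __init__(self, converter_version: str = "0.0.1", **kwargs):
        super().__init__(**kwargs)
        # custom attribute that a plain BuilderConfig doesn't have
        self.converter_version = converter_version

class MyDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = MyConfig
    BUILDER_CONFIGS = [MyConfig(name="default", converter_version="0.0.1")]

    def _info(self):
        return datasets.DatasetInfo(
            description="Toy dataset with a custom config attribute",
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["neg", "pos"]),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={})]

    def _generate_examples(self):
        # self.config is a MyConfig instance, so the custom attribute is reachable here
        yield 0, {"text": f"converted with {self.config.converter_version}", "label": "pos"}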
Fine-tuning with custom datasets — transformers 3.2.0 documentation
An example is shown there, not for this exact problem, and I don't really like it, but whatever.
Ended up just not adding metadata; I basically needed things that can be recovered anyway from a Features object with ClassLabels.
The lack of easy support for custom metadata is really strange to me - it sounds like something quite useful to many people ("Dataset created with version XX of the converter program"), and I see no reason why HF doesn't do this.
Strong intuitive feeling that I'm misunderstanding the logic on some level, and that the answer I need is closer in spirit to "why would you want to add custom attributes to X, you could just ….".
Does everyone use separate keys/values in the dataset itself, or something?
EDIT: https://huggingface.co/datasets/allocine/edit/main/README.md is a cool example.
The * operator works to get a list from dictionary keys!
- my_dict.keys() returns a dict_keys object.
- [*my_dict.keys()] returns the keys as a list of str.
- list(..) would do the same, but in a more readable way :)

Anyway, filing this under "cool stuff I won't ever use".
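A tiny demo (my_dict is a throwaway example):

my_dict = {"a": 1, "b": 2}
print(my_dict.keys())        # dict_keys(['a', 'b'])
print([*my_dict.keys()])     # ['a', 'b']
print(list(my_dict.keys()))  # ['a', 'b'] - the more readable version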