In the middle of the desert you can say anything you want
A really nice Google Colab showing more advanced `Dataset` bits in addition to what’s on the label: Custom Named Entity Recognition with BERT.ipynb - Colaboratory
Pasting this example from there:
```python
import numpy as np
import torch
from torch.utils.data import Dataset


class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        # step 1: get the sentence and word labels
        sentence = self.data.sentence[index].strip().split()
        word_labels = self.data.word_labels[index].split(",")

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        encoding = self.tokenizer(sentence,
                                  is_split_into_words=True,  # named is_pretokenized in older transformers versions
                                  return_offsets_mapping=True,
                                  padding='max_length',
                                  truncation=True,
                                  max_length=self.max_len)

        # step 3: create token labels only for first word pieces of each tokenized word
        # (labels_to_ids is a {label: id} dict defined elsewhere in the notebook)
        labels = [labels_to_ids[label] for label in word_labels]
        # code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
        # create an array of -100 of length max_length
        encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100

        # set only labels whose first offset position is 0 and the second is not 0
        i = 0
        for idx, mapping in enumerate(encoding["offset_mapping"]):
            if mapping[0] == 0 and mapping[1] != 0:
                # overwrite label
                encoded_labels[idx] = labels[i]
                i += 1

        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels)

        return item

    def __len__(self):
        return self.len
```
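A minimal sketch of how it’d be instantiated (toy data; note that `labels_to_ids` is read as a global inside `__getitem__`, so it has to exist beforehand):

```python
import pandas as pd
from transformers import BertTokenizerFast

labels_to_ids = {"O": 0, "B-PER": 1, "B-LOC": 2}  # toy label map, normally built from the dataset

df = pd.DataFrame({
    "sentence": ["John lives in Berlin"],
    "word_labels": ["B-PER,O,O,B-LOC"],
})
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

training_set = dataset(df, tokenizer, max_len=32)
item = training_set[0]
print(item["input_ids"].shape)  # torch.Size([32]), padded/truncated to max_len
print(item["labels"][:8])       # label ids on first word pieces, -100 everywhere else
```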
For aligning tokens, there’s Code To Align Annotations With Huggingface Tokenizers. It has a repo: LightTag/sequence-labeling-with-transformers: Examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models
Also the official tutorial (Token classification) has a function to do something similar:
```python
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
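In the tutorial this gets applied to the whole dataset with a batched `map`; roughly (dataset and checkpoint names as in that tutorial, and `word_ids()` requires a fast tokenizer):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

wnut = load_dataset("wnut_17")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# batched=True, so examples["tokens"] and examples["ner_tags"] arrive as lists of lists
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
```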
`git rebase -i SHA_of_commit_to_delete^` drops you into the usual interactive screen; there you can change `pick` to `drop` in the first line (or any others) to just delete that commit.
Generally, On undoing, fixing, or removing commits in git seems like The README for that.
- `git branch -d some-branch` deletes a local branch
- `git push origin --delete some-branch` deletes a remote branch

(As usual, remembering that branches are pointers to commits.)
Changing the timeout delay for wrong logins on linux has a lot of details; in my case the TL;DR was:
- `/etc/pam.d/login`: change the number, in microseconds
- `/etc/pam.d/common-auth`: add `nodelay` to `auth [success=1 default=ignore] pam_unix.so nullok_secure nodelay`

The second one also works for everything inheriting from it, which is a lot.
debugging - I have a hardware detection problem, what logs do I need to look into? - Ask Ubuntu:
> Then, causing the problem to happen, and listing the system’s logs in reverse order of modification time: `ls -lrt /var/log`, `tail -n 25` on recently modified log files (for reasonable values of 25), and `dmesg`. Read, wonder, think, guess, test, repeat as needed.
Causing the problem and then looking at the recently modified logs is common sense but brilliant.
And saving `ls -lrt` as “list by modification time”: `-t` is “sort by modification time” and is easy to remember, and `-r` reverses the order so the freshest files end up at the bottom.
When debugging an issue I had with my monitor, I found a mention of `inxi`, which seems to colorfully output basic system (incl. hardware) info.
The post asked for `inxi -SMCGx`; inxi’s help told me `inxi -F` gives the fullest possible output.
Neat!
So, noisetorch says it’s potentially compromised: Release POTENTIAL COMPROMISE · noisetorch/NoiseTorch.
An improvement over the previous, more dramatic formulation: Community code review? · noisetorch/NoiseTorch@b4bb8e6
> This project is dead, i’ve failed you.
Thoughts and prayers (honestly! I loved it), with a heavy heart I keep looking.
Option 1: werman/noise-suppression-for-voice: Noise suppression plugin based on Xiph’s RNNoise
Reading how to install it made me very sad, kept looking.
Saw EasyEffects mentioned, but it runs on Pipewire.
TIL Pipewire is a Pulseaudio replacement.
Installed via this guide: How to install PipeWire on Ubuntu Linux - Linux Tutorials - Learn Linux Configuration
Installed and ran EasyEffects using flatpak:

```sh
flatpak install easyeffects
flatpak run com.github.wwmm.easyeffects
```
EasyEffects’ GUI looks awesome!
Had to choose another input source in pavucontrol; once the input is piped through it, the effect “Noise Reduction” works! Removes both keyboard and random background white noise.
You can even save the config as a preset and make it run automagically on startup!
TIL about `git bisect`. `git help bisect` for help.
TL;DR: uses binary search to find a commit that introduced a change. You run it, it gives you a commit, you tell it if it’s good or bad, and it keeps narrowing down the options.
`git bisect start` -> `git bisect good` -> `git bisect bad` -> `git bisect reset`
HF Datasets’ README links this nice google colab that explains the basics: HuggingFace datasets library - Overview - Colaboratory
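The flavor of it, as a sketch from memory (not copied from the colab):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")
print(ds[0]["question"])  # examples behave like plain dicts
print(ds.features)        # typed schema of the columns
# transforms like filter/map are memory-mapped and cached on disk
short = ds.filter(lambda ex: len(ex["context"]) < 500)
```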
I use `# TODO`s for “do later”. If any exist, PyCharm asks me every time before committing whether I really want to.
I guess the idea is to use them to mark things to do before committing, so much smaller scale and here-and-now?
sanitize-filename · PyPI does what it says on the box.
It’s more complex than the simple replacing of `/` that I had in mind: sanitize_filename/sanitize_filename.py · master · jplusplus / sanitize-filename · GitLab
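Usage is a one-liner, if I read the package right (untested sketch):

```python
from sanitize_filename import sanitize

print(sanitize("my/unsafe:file*name?.txt"))  # strips characters invalid in filenames
print(sanitize("CON"))  # Windows-reserved names get handled too, per the source above
```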
And intuition tells me that using external semi-unknown libraries like this might be a security risk.
TODO: what is the best practice for user-provided values that might become filenames?.. Something not smelling of either injection vulns or dependency vulns?