In the middle of the desert you can say anything you want
New phone, need to set up sync and friends to my VPS again - I'll document it this time.
This is part of the success story of "almost completely de-Googling my life", one of the better changes I've ever made.
Goal: separate commands running separate taskwarrior reports/filters. But also usable to add tasks etc.
Previously (Day 728 - serhii.net) I used things like this in my zshrc:
th () {task s project.not:w sprint.not:s "$*"}
Found a better way:
## TASKWARRIOR
# All todos from both work and home
TW_WORK="rc.default.project:w rc.default.command:s"
TW_HOME="rc.default.project: rc.default.command:th"
# "Important tasks"
TW_I="rc.default.command:i"
# Work
alias s="task $TW_WORK"
# Home
alias t="task $TW_HOME"
# All pending tasks from all projects
alias ta="task rc.default.command:next"
# "Important" tags - report `i`
alias ti="task $TW_I"
This means: `s` runs taskwarrior with the `s` report, which shows work-only tasks; and if I do `s add whatever`, the task gets added automatically inside `project:w`.
For completeness, the code for each of these reports (`~/.taskrc`):
############
# REPORTS
############
report.s.description='Work tasks'
report.s.columns=id,project,tags,due.relative,description
report.s.labels=ID,Project,T,D,Desc
#report.s.sort=due+
report.s.sort=project-/,urgency+
report.s.filter=status:pending -s
report.s.filter=status:pending ((project:w -s) or (+o or +a or +ACTIVE))
report.i.description='Important / priority'
report.i.columns=id,project,tags,due.relative,description
report.i.labels=ID,Project,T,D,Desc
report.i.sort=project-/,urgency+
report.i.filter=status:pending (+o or +a or +ACTIVE)
report.th.description='Home tasks'
report.th.columns=id,project,tags,due.relative,description
report.th.labels=ID,Project,T,D,Desc
report.th.sort=project-/,urgency+
report.th.filter=status:pending -s
# report.th.filter=status:pending ((project.not:w project.not:l -srv -sd) or (+o or +a or +w or +ACTIVE))
report.th.filter=status:pending ((project.not:w project.not:l -srv -sd) or (+o or +a or +ACTIVE))
#Someday
report.sd.columns=id,start.age,depends,est,project,tags,sprint,recur,scheduled.countdown,due.relative,until.remaining,description,urgency
report.sd.labels=D,Active,Deps,E,Project,Tag,S,Recur,S,Due,Until,Description,Urg
report.sd.filter=status:pending (sprint:s or +sd)
# srv -- for continuously needed tasks like starting to work etc
report.srv.description='srv'
report.srv.columns=id,project,tags,pri,est,description,urgency
report.srv.labels=ID,Project,T,P,E,Description,U
report.srv.sort=urgency-
report.srv.filter=status:pending +srv
# Currently active task - for scripts
report.a.description='Currently active task'
report.a.columns=id,description #,project
report.a.labels=ID,D #,P
report.a.filter=+ACTIVE
report.next.filter=status:pending -srv -sd
urgency.user.tag.o.coefficient=10
urgency.user.tag.a.coefficient=5
urgency.user.tag.w.coefficient=3
Problem: the tokenizer keeps the sentence-final period glued to number tokens, which I don't want. I also want it to split words separated by a dash. And `p.a.` at the end of a sentence always became `p.a..`, with the end-of-sentence period glued to the token.
Examples: `100,000,000.00`, `What-ever`, `p.a..`
The default rules for various languages are fun to read:
German:
General for all languages: spaCy/char_classes.py at master · explosion/spaCy
`nlp.tokenizer.explain()` shows the rules matched when tokenizing.
Docs about customizing tokenizers and adding special rules: Linguistic Features · spaCy Usage Documentation
Solution:
from spacy.lang.char_classes import ALPHA
from spacy.util import compile_infix_regex, compile_suffix_regex

# `pipeline` is the loaded spaCy pipeline (e.g. spacy.load("de_core_news_sm"))

# Period at the end of line/token
trailing_period = r"\.$"
new_suffixes = [trailing_period]
suffixes = list(pipeline.Defaults.suffixes) + new_suffixes
suffix_regex = compile_suffix_regex(suffixes)

# Add infix dash between words
bindestrich_infix = r"(?<=[{a}])-(?=[{a}])".format(a=ALPHA)
infixes = list(pipeline.Defaults.infixes)
infixes.append(bindestrich_infix)
infix_regex = compile_infix_regex(infixes)

# Add special rule for "p.a." with a trailing period
# Usually the two trailing periods become a suffix and a single token "p.a.."
special_case = [{"ORTH": "p.a."}, {"ORTH": "."}]
pipeline.tokenizer.add_special_case("p.a..", special_case)

pipeline.tokenizer.suffix_search = suffix_regex.search
pipeline.tokenizer.infix_finditer = infix_regex.finditer
The `p.a..` case was interesting: `p.a.` is an explicit special case for German, but the two trailing dots got parsed as a SUFFIX for some reason (ty `explain()`). Still no idea why, but since special rules override suffixes, I added a special rule specifically for that case - `p.a..` with two periods at the end - and it worked.
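A quick way to sanity-check the result, as a minimal sketch (the example sentence is made up, and the exact rule names in the output are what I'd expect from `explain()`, not verified output):
# Assuming `pipeline` is the customized pipeline from above:
# tokenizer.explain() returns (rule, token_text) pairs, so we can see
# which rule produced each token.
for rule, token_text in pipeline.tokenizer.explain("Es kostet 5 % p.a.."):
    print(rule, repr(token_text))

# The special case should now show up along the lines of:
#   SPECIAL-1 'p.a.'
#   SPECIAL-2 '.'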
`python3 -m pdb your_script.py` is the usual way.
For modules it’s unsurprisingly intuitive:
python3 -m pdb -m your.module.name
For commands etc:
python3 -m pdb -c 'until 320' -m your.module.name
fnmatch — Unix filename pattern matching — Python 3.10.6 documentation:
Similar to Unix shell patterns, but without special handling of path separators; otherwise identical, and much simpler than regexes:
`*` matches everything
`?` matches any single character
`[seq]` matches any character in seq
`[!seq]` matches any character not in seq
I have a list of names, and I let the user select one or more of them by providing either a single string or a glob, returning whatever matches.
First it was two parameters and “if both are passed X takes precedence, but if it doesn’t have matches then fallback is used …”.
Realized that a simple string is a glob matching itself - so I can use the same field for both, which simplifies things A LOT. Users who don't know about globs can just pass plain strings and everything's fine. Still unsure if it's a good idea, but nice to have as an option.
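A minimal sketch of that idea (`select_names` and the sample data are made up for illustration):
import fnmatch

def select_names(names: list[str], selector: str) -> list[str]:
    # A plain string is just a glob that matches only itself,
    # so the same parameter works for both cases.
    return fnmatch.filter(names, selector)

names = ["model-small", "model-large", "baseline"]
print(select_names(names, "baseline"))  # ['baseline'] - plain string
print(select_names(names, "model-*"))   # ['model-small', 'model-large'] - glob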
Then - OK, what happens if the user's string is an invalid glob? Will this lead to an “invalid regex” type of exception?
Well - I couldn't find info about this; in the source code globs are converted to regexes and I see no exceptions raised, and I couldn't provoke any errors myself.
Globs with only mismatched brackets etc. always match themselves, but the best one:
>>> fnmatch.filter(['aa]ab','bb'],"aa]*a[bc]")
['aa]ab']
It ignores the mismatched bracket while correctly interpreting the matched ones!
So - I just have to make sure that a "name" doesn't happen to be a correctly formulated glob, like `[this one]`.
So - shelves! Just found out a really neat way to use them.
“Unshelve silently” - I never used it and never cared, but just now - a misclick, and I did. It put the content of the shelf into a separate changelist named like the shelf, without changing my active changelist.
This is neat!
One of my main uses for both changelists and shelves is “I need to apply this patch locally but don’t want to commit that”, and this basically automates that behaviour.
In the HuggingFace source I found this bit:
from enum import Enum

class ExplicitEnum(str, Enum):
    """
    Enum with more explicit error message for missing values.
    """

    @classmethod
    def _missing_(cls, value):
        raise ValueError(
            f"{value} is not a valid {cls.__name__}, please select one of {list(cls._value2member_map_.keys())}"
        )
… wow?
(Pdb++) IntervalStrategy('epoch')
<IntervalStrategy.EPOCH: 'epoch'>
(Pdb++) IntervalStrategy('whatever')
*** ValueError: whatever is not a valid IntervalStrategy, please select one of ['no', 'steps', 'epoch']
Was `MyEnum('something')` allowed the whole time? God, I feel stupid.
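For future me: lookup by value is standard `Enum` behaviour, nothing HuggingFace-specific. A tiny sketch (`Color` is just an illustrative enum):
from enum import Enum

class Color(Enum):
    RED = "red"
    GREEN = "green"

print(Color("red"))   # Color.RED - lookup by value
print(Color["RED"])   # Color.RED - lookup by member name, for comparison
Color("blue")         # raises ValueError: 'blue' is not a valid Color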
So the function passed as `sorted()`'s `key=` argument can return a tuple, and then the tuple values are interpreted as multiple sorting keys!
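A quick illustration with made-up data - sort by age ascending, then by name for equal ages:
people = [("bob", 30), ("alice", 25), ("carol", 30)]
print(sorted(people, key=lambda p: (p[1], p[0])))
# [('alice', 25), ('bob', 30), ('carol', 30)]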
In the PyCharm run configuration there are options to watch individual log files, which is nice.
But the main bit - all my logging issues etc. were the fault of PyCharm's settings for pytest, which automatically added a `-q` flag. Removed that checkmark and now I get the standard pytest output that I can modify!
And now `caplog` works:
import logging

def test_split_ds(caplog):
    caplog.set_level(logging.DEBUG, logger="anhaltai_bbk.data.train_dev_splitter.splitter")
    caplog.set_level(logging.DEBUG)
    # ...
So, previously I thought about this here: 220214-1756 python run pdb on exception
Anyway, the solution was on the pytest level; installing this package was the only thing needed: pytest-pycharm · PyPI
Installed it at the same time as this pycharm plugin, might’ve been either of the two:
pytest imp - IntelliJ IDEA & PyCharm Plugin | Marketplace / theY4Kman/pycharm-pytest-imp: PyCharm pytest improvements plugin
Anyway now life’s good:
Thinking out loud, lab-notebook style, to help me solve a problem; in this installment - creating representative train/test splits.
Goal: create a test set that looks like the train set, having about the same distribution of labels.
In my case - classic NER: my training instances are documents whose tokens can have a number of different, non-overlapping labels, and I need to create a test split that's similar to the train one. Again, splitting happens per-document.
Added complexity - in no case do I want tags of a type ending up only in train or only in test. Say, I have 100 docs and 2 ORGANIZATIONs inside them - my 20% test split should have at least one ORGANIZATION.
Which is why random selection doesn’t cut it - I’d end up doing Bogosort more often than not, because I have A LOT of such types.
Simply ignoring them and adding them manually might be a way. Or, intuitively - starting with them first, as they are the hardest and most likely to fail.
My training instance is a document that can have, say, 1 PEOPLE, 3 ORGANIZATIONS, 0 PLACES.
For each dataset/split/document, I have a dictionary counting how many instances of each entity it has, then changed it to a ratio "out of the total number of labels".
{
"O": 0.75,
"B-ORGANIZATION": 0.125,
"I-ORGANIZATION": 0,
"B-NAME": 0,
"I-NAME": 0,
}
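A minimal sketch of how such a ratio dictionary can be computed from one document's tag sequence (`label_ratios` and the label list are names I'm making up here):
from collections import Counter

LABELS = ["O", "B-ORGANIZATION", "I-ORGANIZATION", "B-NAME", "I-NAME"]

def label_ratios(tags: list[str]) -> dict[str, float]:
    # Share of each label "out of the total number of labels" in one document.
    counts = Counter(tags)
    return {label: counts[label] / len(tags) for label in LABELS}

print(label_ratios(["O", "O", "O", "O", "O", "O", "B-ORGANIZATION", "B-NAME"]))
# {'O': 0.75, 'B-ORGANIZATION': 0.125, 'I-ORGANIZATION': 0.0, 'B-NAME': 0.125, 'I-NAME': 0.0}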
I need to create a test dataset with the distribution of these labels as close as possible to the train dataset. In both, say, 3 out of 4 labels should be `"O"`.
So - “which documents do I pick so that when their labels are summed up I get a specific distribution”, or close to it. So “pick the numbers from this list that sum up close to X”, except multidimensional.
The initial algo was "iterate over each training instance and put it in the pile it'll improve the most".
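A rough sketch of that initial greedy idea, under my own assumptions (documents represented by Counters of their label counts; none of this is final code):
from collections import Counter

def greedy_split(docs: list[Counter], test_ratio: float = 0.2) -> tuple[list[int], list[int]]:
    # Greedily assign each document to train or test, always picking the side
    # whose label counts end up closer to its share of the global label counts.
    total = sum(docs, Counter())

    def distance(counts: Counter, share: float) -> float:
        return sum(abs(counts[label] - share * total[label]) for label in total)

    train, test = Counter(), Counter()
    train_idx, test_idx = [], []
    for i, doc in enumerate(docs):
        cost_if_train = distance(train + doc, 1 - test_ratio) + distance(test, test_ratio)
        cost_if_test = distance(train, 1 - test_ratio) + distance(test + doc, test_ratio)
        # Ties go to train; a smarter version would shuffle the documents first.
        if cost_if_test < cost_if_train:
            test += doc
            test_idx.append(i)
        else:
            train += doc
            train_idx.append(i)
    return train_idx, test_idx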
Started implementing something to do this in HuggingFace Datasets, and quickly realized that "add this one training instance to this HF `Dataset`" is not trivial to do, and iterating through examples and adding them to separate datasets is harder than expected.
Generally we’re in the area of concepts like Subset sum problem / Optimization problem / Combinatorial optimization
More usefully, specifically RE datasets, How to Create a Representative Test Set | by Dimitris Poulopoulos | Towards Data Science mentioned sklearn.model_selection.StratifiedKFold.
Which led me to sklearn's "model selection" module, which has a lot of functions doing what I need! Or almost.
API Reference — scikit-learn 1.1.2 documentation
And the User Guide specifically deals with them: 3.1. Cross-validation: evaluating estimator performance — scikit-learn 1.1.2 documentation
Anyway - StratifiedKFold as implemented is “one training instance has one label”, which doesn’t work in my case.
My training instance is a document that has 1 PEOPLE, 3 ORGANIZATIONS, 0 PLACES.
Dataset Splitting Best Practices in Python - KDnuggets
Main problem: I have multiple labels/`y`s to optimize for and can't directly use anything that splits based on a single `y`.
Can I hack something like sklearn.model_selection.StratifiedGroupKFold for this?
Can I read about how they do it and see if I can generalize it? (Open source FTW!) scikit-learn/_split.py at 17df37aee774720212c27dbc34e6f1feef0e2482 · scikit-learn/scikit-learn
Can I look at the functions they use to hack something together?
… why can’t I use the initial apporach of adding and then measuring?
Where can I do this in the pipeline? In the beginning on document level, or maybe I can drop the requirement of doing it per-document and do it at the very end on split tokenized training instances? Which is easier?
Can I do a random sample and then add what’s missing?
Will going back to numbers and “in this train set I need 2 ORGANIZATIONS” help me reason about it differently than the current “20% of labels should be ORGANIZATION”?
scikit-learn/_split.py at 17df37aee774720212c27dbc34e6f1feef0e2482 · scikit-learn/scikit-learn
They sort the labels and that way get +/- the number of items needed. Neat but quite hard for me to adapt to my use case.
Can I think of this as something like a sort with multiple keys?..
Can I use the rarity of a type as something like a class weight? Ha, that might work. Assign weights in such a way that each type is 100 and
This feels relevant. Stratified sampling - Wikipedia
Can I chunk them into small pieces and accumulate those, which might be faster than going example by example?
THIS looked like something REALLY close to what I need, multiple category names for each example, but ended up being the usual stratified option I think:
This suggests multiplying the criteria to get a lot of bins - not what I need, but I keep moving.
Can I stratify by multiple characteristics at once?
I think “stratification of multilabel data” is close to what I need
Found some papers, yes this is the correct term I think
YES! scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python:
In multi-label classification one can assign more than one label/class out of the available n_labels to a given object.
This is really interesting, still not EXACTLY what I need but a whole new avenue of stuff to look at
scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python
The idea behind this stratification method is to assign label combinations to folds based on how much a given combination is desired by a given fold, as more and more assignments are made, some folds are filled and positive evidence is directed into other folds, in the end negative evidence is distributed based on a folds desirability of size.
Yep back to the first method!
They link this lecture explaining the algo: On the Stratification of Multi-Label Data - VideoLectures.NET
Less the video than the slides - I didn't watch the video and hope I won't have to; the slides make it clear enough.
Yes, reframing that as "number of instances of this class that are still needed by this fold" was a better option. And here binary matrices nicely expand to weighted stratification if I have multiple examples of a class in a document. And my initial intuition of starting with the least-represented class first was correct.
Basic algorithm:
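Roughly, as I understand it from the slides - this is my own weighted sketch with my own names, not the scikit-multilearn implementation: repeatedly take the rarest not-yet-distributed label and hand its documents to whichever fold still needs that label the most.
import random
from collections import Counter

def iterative_stratification(docs: list[Counter], ratios: list[float]) -> list[list[int]]:
    # docs: per-document label counts; ratios: desired fold sizes, e.g. [0.8, 0.2].
    total = sum(docs, Counter())
    # How many instances of each label every fold still "wants".
    desired = [Counter({label: r * total[label] for label in total}) for r in ratios]
    folds = [[] for _ in ratios]
    remaining = set(range(len(docs)))

    while remaining:
        # 1. Find the label with the fewest remaining (unassigned) instances.
        left = Counter()
        for i in remaining:
            left.update(docs[i])
        label = min(left, key=left.get, default=None)
        # Distribute the documents containing that label (or everything left).
        candidates = [i for i in remaining if label is None or docs[i][label] > 0]

        for i in candidates:
            # 2. Give the document to the fold that still wants this label the most;
            #    break ties by overall remaining demand, then randomly.
            best = max(
                range(len(folds)),
                key=lambda f: (desired[f][label] if label is not None else 0.0,
                               sum(desired[f].values()),
                               random.random()),
            )
            folds[best].append(i)
            remaining.discard(i)
            # 3. Update what that fold still wants.
            for lbl, count in docs[i].items():
                desired[best][lbl] -= count
    return folds
Compared to the binary indicator version in the slides, the per-document counts act as weights here - the extension mentioned above.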
Not sure if I can use the source of the implementation: scikit-multilearn: Multi-Label Classification in Python — Multi-Label Classification for Python
I don’t have a good intuition of what they mean by “order”, for now “try to keep labels that hang out together in the same fold”? Can I hack it to
I still have the issue I tried to avoid, needing to add examples to a fold/`Dataset`, but that's not the problem here.
Generally - is this better than my initial approach?
What happens if I don’t modify my initial approach, just the order in which I give it the training examples?
Can I find any other source code for these things? Ones easier to adapt?
I’ll implement the algo myself based on the presentation and video according to my understanding.
The main result of this session was finding more related terminology and a good explanation of the algo I’ll be implementing, with my changes.
I’m surprised I haven’t found anything NER-specific about creating representative test sets based on the distribution of multiple labels in the test instances. Might become a blog post or something sometime.jj