serhii.net

In the middle of the desert you can say anything you want

16 Aug 2022

Spacy custom tokenizer rules

Problem: tokenizer adds trailing dots to the token in numbers, which I don’t want to. I also want it to split words separated by a dash. Also p.a. at the end of the sentences always became p.a.., the end-of-sentence period was glued to the token.

100,000,000.00, What-ever, p.a..

The default rules for various languages are fun to read:

German:

General for all languages: spaCy/char_classes.py at master · explosion/spaCy

nlp.tokenizer.explain() shows the rules matched when doing tokenization.

Docu about customizing tokenizers and adding special rules: Linguistic Features · spaCy Usage Documentation

Solution:

# Period at the end of line/token
trailing_period = r"\.$"
new_suffixes = [trailing_period]
suffixes = list(pipeline.Defaults.suffixes) + new_suffixes
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
                                                                      
# Add infix dash between words
bindestrich_infix = r"(?<=[{a}])-(?=[{a}])".format(a=ALPHA)
infixes = list(pipeline.Defaults.infixes)
infixes.append(bindestrich_infix)
infix_regex = compile_infix_regex(infixes)
                                                                      
# Add special rule for "p.a." with trailing period
# Usually two traling periods become a suffix and single-token "p.a.."
special_case = [{'ORTH': "p.a."}, {'ORTH': "."}]
pipeline.tokenizer.add_special_case("p.a..", special_case)
                                                                      
pipeline.tokenizer.suffix_search = suffix_regex.search
pipeline.tokenizer.infix_finditer = infix_regex.finditer

The p.a.. was interesting - p.a. was an explicit special case for German, but the two trailing dots got parsed as SUFFIX for some reason (ty explain()). Still no idea why, but given that special rules override suffixes I added a special rule specifically for that case, p.a.. with two periods at the end, it worked.

Nel mezzo del deserto posso dire tutto quello che voglio.