spaCy custom tokenizer rules
Problem: the tokenizer keeps trailing periods attached to number tokens, which I don't want. I also want it to split words separated by a dash. And `p.a.` at the end of a sentence always became `p.a..`: the end-of-sentence period got glued to the token. Examples: `100,000,000.00`, `What-ever`, `p.a..`.
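A minimal sketch of the unwanted default behavior (assuming the German `de_core_news_sm` model; the exact output depends on the spaCy version):

```python
import spacy

nlp = spacy.load("de_core_news_sm")  # assumption: any German pipeline works here
print([t.text for t in nlp("What-ever kostet 100,000,000.00 p.a..")])
# per the observations above: "p.a.." stays a single token
# and "What-ever" is not split at the dash
```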
The default rules for various languages are fun to read:
German:
- spaCy/punctuation.py at master · explosion/spaCy
- spaCy/tokenizer_exceptions.py at master · explosion/spaCy
General for all languages: spaCy/char_classes.py at master · explosion/spaCy
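Those defaults can also be inspected from Python; a quick sketch (using `spacy.blank` to get a bare German pipeline, no trained model needed):

```python
import spacy

nlp = spacy.blank("de")
print(len(nlp.Defaults.suffixes), "suffix rules,", len(nlp.Defaults.infixes), "infix rules")
print("p.a." in nlp.Defaults.tokenizer_exceptions)  # the German special case mentioned below
```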
`nlp.tokenizer.explain()` shows the rules matched during tokenization.
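For example, a sketch assuming a loaded German pipeline `nlp`:

```python
for pattern, substring in nlp.tokenizer.explain("Das kostet 3 % p.a.."):
    print(pattern, repr(substring))
# one (rule, substring) pair per token: TOKEN, SUFFIX, PREFIX,
# or SPECIAL-1/SPECIAL-2 for tokenizer exceptions
```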
Docs about customizing tokenizers and adding special rules: Linguistic Features · spaCy Usage Documentation
Solution:
```python
import spacy
from spacy.lang.char_classes import ALPHA
from spacy.util import compile_infix_regex, compile_suffix_regex

pipeline = spacy.load("de_core_news_sm")  # assumption: a German pipeline

# Split a period at the end of a line/token off as a suffix
trailing_period = r"\.$"
new_suffixes = [trailing_period]
suffixes = list(pipeline.Defaults.suffixes) + new_suffixes
suffix_regex = compile_suffix_regex(suffixes)

# Add an infix rule for a dash ("Bindestrich") between words
bindestrich_infix = r"(?<=[{a}])-(?=[{a}])".format(a=ALPHA)
infixes = list(pipeline.Defaults.infixes)
infixes.append(bindestrich_infix)
infix_regex = compile_infix_regex(infixes)

# Add a special rule for "p.a." with a trailing period;
# usually the two trailing periods become a suffix and the single token "p.a.."
special_case = [{"ORTH": "p.a."}, {"ORTH": "."}]
pipeline.tokenizer.add_special_case("p.a..", special_case)

pipeline.tokenizer.suffix_search = suffix_regex.search
pipeline.tokenizer.infix_finditer = infix_regex.finditer
```
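A quick check that the customized tokenizer behaves as intended (hypothetical sentence; the expected output is derived from the rules above):

```python
doc = pipeline("What-ever kostet 100,000,000.00 p.a..")
print([t.text for t in doc])
# expected: ['What', '-', 'ever', 'kostet', '100,000,000.00', 'p.a.', '.']
```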
The `p.a..` case was interesting: `p.a.` was an explicit special case for German, but the two trailing dots got parsed as SUFFIX for some reason (thanks, `explain()`). Still no idea why, but since special rules override suffixes, I added a special rule specifically for that case, `p.a..` with two periods at the end, and it worked.
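`explain()` can confirm that the special case now wins (a sketch, reusing the `pipeline` object from above):

```python
for pattern, substring in pipeline.tokenizer.explain("p.a.."):
    print(pattern, repr(substring))
# with the special case registered this should show:
# SPECIAL-1 'p.a.'
# SPECIAL-2 '.'
```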