In the middle of the desert you can say anything you want

10 Oct 2023

Adventures in UTF8

Had PDF files, extracted the text with PyMuPDF, and some of the output txts had weird strings:

# sometimes with real chars mixed in
# sometimes - often - not
������ ������

Tried to understand what the “�” characters actually are, guess the encoding etc. The encoding was always utf8, according to Python's chardet and Debian's uchardet.

Remembered CyberChef and tried it: it showed the whole thing as one identical code point, repeated.

hexdump showed me that they actually ARE repeating code points!

Remembered vim can do this, and it can - vim’s g8 binding shows the UTF-8 bytes of the char under the cursor, and :as shows info about its code point. Both confirmed it - it’s all one character, specifically (g8) ef bf bd.
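The same check can be sketched in Python. This assumes the extracted text is already loaded into a string; the `sample` value below is a stand-in, not the actual PDF output.

```python
# Stand-in for one of the weird strings from the extracted text (assumption).
sample = "\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd"

# Distinct code points in the string, ignoring whitespace.
code_points = {f"U+{ord(ch):04X}" for ch in sample if not ch.isspace()}

# UTF-8 bytes of a single character as hex, comparable to what vim's g8 shows.
first_char_hex = sample[0].encode("utf-8").hex(" ")

print(code_points)      # a set with a single entry means one repeated character
print(first_char_hex)
```

If the set has more than one entry, some real characters survived the extraction and are mixed in.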

I googled that string and found that it’s Unicode Character ‘REPLACEMENT CHARACTER’ (U+FFFD).

Basically, it’s what a decoder emits when the input is not valid UTF8: the offending bytes get replaced with that symbol. The original characters are lost.
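A small sketch of the mechanism, using Python's built-in decoder. How the PDF text actually ended up this way is not known; the byte strings here are made-up examples, chosen only because they are not valid UTF-8.

```python
# Latin-1 bytes for "café": the byte 0xE9 on its own is not valid UTF-8.
raw = "café".encode("latin-1")

# Decoding as UTF-8 with errors="replace" substitutes U+FFFD for the bad byte.
decoded = raw.decode("utf-8", errors="replace")

# A different bad byte collapses to the very same replacement character,
# so the original cannot be recovered from the decoded text.
other = b"caf\xe8".decode("utf-8", errors="replace")

print(decoded == "caf\ufffd")
print(decoded == other)
```

Both comparisons come out true, which is why the replacement is lossy: "café" and "cafè" become indistinguishable.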

Python’s unicodedata has name(), which directly returns 'REPLACEMENT CHARACTER' for it.
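A minimal sketch of that lookup, using `unicodedata.name` from the standard library:

```python
import unicodedata

ch = "\ufffd"

# Look up the official Unicode name of the character.
name = unicodedata.name(ch)

print(f"U+{ord(ch):04X} {name}")
```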

This explains why all the encoding detection tools said utf-8 - it was valid utf-8, the exact same character over and over in fact, haha.
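To see why a detector has nothing to complain about, here is a rough sketch with a strict decode. The helper name `is_valid_utf8` and the sample bytes are made up for illustration; this is not how chardet or uchardet work internally.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without any errors."""
    try:
        raw.decode("utf-8")  # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


# Stand-in for a file full of replacement characters (assumption).
data = "\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\n".encode("utf-8")

print(is_valid_utf8(data))        # the replacement characters are valid UTF-8
print(is_valid_utf8(b"caf\xe9"))  # a stray Latin-1 byte is not
```

The damage happened before the file was written, so the bytes on disk are perfectly well-formed.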
