Adventures in UTF8
Had PDF files, extracted the text with PyMuPDF, and in some of the output txt files I had weird strings:
# sometimes with real chars mixed in �������#&'���()��"#��*����������%� # sometimes - often - not ������ ������
Tried to understand what the “�” actually are, guess the encoding etc. The encoding was always UTF-8, according to Python's chardet and Debian.
Remembered and tried CyberChef, it returned it all as identical repeating code points.
hexdump showed me that they actually ARE repeating code points!
Remembered vim can do this, and it can[1]: the g8 binding shows the UTF-8 bytes of the character under the cursor, and :as shows info about that character. That confirmed it - it’s all one character, specifically ef bf bd.
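The same check works outside vim. A minimal Python sketch, assuming the suspicious text is already loaded into a string; the `sample` value and the `utf8_hex` helper are made up for illustration:

```python
# Hypothetical sample: replacement characters with one real char mixed in.
sample = "\ufffd\ufffd#\ufffd"


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated hex,
    roughly what vim's g8 prints for the character under the cursor."""
    return ch.encode("utf-8").hex(" ")


for ch in sample:
    # Code point on the left, UTF-8 bytes on the right.
    print(f"U+{ord(ch):04X}  {utf8_hex(ch)}")
```

Every "weird" character should print as U+FFFD with the bytes ef bf bd, while the real characters show their own values.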
I googled that string and found[2] that it’s Unicode Character ‘REPLACEMENT CHARACTER’ (U+FFFD).
Basically, when the input is not valid UTF-8, the decoder replaces each byte sequence it cannot decode with that symbol. The original characters are lost.
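To see the loss in action, here is a small Python sketch. It only illustrates how a UTF-8 decoder with replacement behaves on made-up input; it is not necessarily what happened inside the PDF extraction.

```python
# Illustrative input: "é" encoded as Latin-1 is a single byte (0xE9),
# which is not a valid UTF-8 sequence on its own.
raw = "é".encode("latin-1") + b"abc"

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
decoded = raw.decode("utf-8", errors="replace")
print(repr(decoded))

# Encoding back to UTF-8 yields ef bf bd, not the original 0xE9 byte.
roundtrip = decoded.encode("utf-8")
print(roundtrip.hex(" "))
print(roundtrip == raw)
```

Once the text has gone through such a step, nothing in the output says which byte used to be there, so guessing the encoding afterwards cannot recover it.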
Python also has unicodedata.name(), which returns that name directly.
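A minimal sketch of that lookup using the standard library, with the character written as an escape so the snippet does not depend on the file's own encoding:

```python
import unicodedata

ch = "\ufffd"  # the character behind the bytes ef bf bd
name = unicodedata.name(ch)

print(f"U+{ord(ch):04X} {name}")  # prints: U+FFFD REPLACEMENT CHARACTER
```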
This explains why all the character detection bits said utf-8 - it was utf-8 characters, the exact same one in fact, haha.