Adventures in UTF8
Had PDF files, extracted text with PyMuPDF, and in some of the output txt files I got weird strings:
# sometimes with real chars mixed in
�������#&'���()��"#��*����������%�
# sometimes - often - not
������ ������
Tried to understand what the “�” actually are, guess the encoding etc. The encoding was always UTF-8, according to Python’s chardet and Debian’s uchardet.
Remembered and tried CyberChef, it returned it all as identical repeating code points. hexdump showed me that they actually ARE repeating code points!
Remembered vim can do this, and it can[1]: vim’s g8 binding shows the UTF-8 bytes of the char under the cursor, and :as shows info about its code point. That confirmed it: it’s all one character, specifically (g8) ef bf bd.
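The same check can be sketched in Python instead of vim. This is only an illustration: the sample below is a stand-in built from the bytes vim reported, not the actual PDF output.

```python
from collections import Counter

# Stand-in for a garbled line: the bytes vim reported, repeated, with a space in between.
raw = bytes.fromhex("ef bf bd") * 6 + b" " + bytes.fromhex("ef bf bd") * 6
text = raw.decode("utf-8")  # strict decoding succeeds, so this is valid UTF-8

# Count code points, ignoring whitespace
counts = Counter(f"U+{ord(ch):04X}" for ch in text if not ch.isspace())
print(counts)  # Counter({'U+FFFD': 12})

# Bytes of the first character, similar to what g8 shows in vim (needs Python 3.8+)
print(text[0].encode("utf-8").hex(" "))  # ef bf bd
```

A single entry in the counter means the whole string is one character repeated.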
I googled that string and found[2] that it’s Unicode Character ‘REPLACEMENT CHARACTER’ (U+FFFD).
Basically, when the input is not valid UTF-8, the decoder replaces whatever it can’t decode with that symbol. The original characters are lost.
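A minimal sketch of how such a character can appear, using Python's own decoder. The Latin-1 "café" input is a made-up example, and this is not necessarily what happened inside PyMuPDF or the PDFs themselves.

```python
# Latin-1 bytes for "café"; the lone byte 0xE9 is not valid UTF-8.
raw = "café".encode("latin-1")  # b'caf\xe9'

# errors="replace" swaps the undecodable byte for U+FFFD instead of raising
decoded = raw.decode("utf-8", errors="replace")
print(decoded.encode("utf-8"))  # b'caf\xef\xbf\xbd'

# The original byte 0xE9 is gone; re-encoding cannot bring it back.
print(decoded.encode("utf-8") == raw)  # False
```

Once the replacement has happened, nothing in the output says what the original byte was, so the fix has to happen at extraction time.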
Python’s unicodedata module has unicodedata.name(), which directly returns 'REPLACEMENT CHARACTER' for it.
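For example, the lookup plus a rough way to flag damaged output files. The helper replacement_ratio is hypothetical, something I made up for illustration, not part of unicodedata or PyMuPDF.

```python
import unicodedata

print(unicodedata.name("\ufffd"))  # REPLACEMENT CHARACTER


def replacement_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD (hypothetical helper)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch == "\ufffd" for ch in chars) / len(chars)


print(replacement_ratio("abc\ufffd"))  # 0.25
```

A high ratio for an extracted txt file would suggest the text layer of that PDF did not survive extraction.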
This explains why all the encoding detection tools said UTF-8: the files really were valid UTF-8, just the exact same character over and over, haha.