In the middle of the desert you can say anything you want
Had PDF files, extracted text with PyMuPDF; some of the output txt files contained weird strings:
# sometimes with real chars mixed in
�������#&'���()��"#��*����������%�
# sometimes - often - not
������ ������
Tried to understand what the "�" characters actually are, guess the encoding, etc. The encoding was always UTF-8, according to Python's chardet and Debian's uchardet.
Remembered and tried CyberChef, which showed it all as identical repeating code points. hexdump confirmed that they actually ARE repeating code points!
Remembered vim can do this too: its g8 binding shows the bytes of the character under the cursor, and :as shows info about it. That confirmed it: it's all one character, specifically (per :as) ef bf bd.
I googled that byte sequence and found that it's the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD). Basically, when input is not valid UTF-8, the offending bytes get replaced with that symbol; the original characters are lost.
Python's unicodedata has unicodedata.name(), which directly returns 'REPLACEMENT CHARACTER'.
This explains why all the character-detection tools said UTF-8: the text was valid UTF-8, the exact same character repeated, in fact, haha.
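A quick stdlib check of all of the above (the bytes, the code point, and how the character gets there in the first place):

```python
import unicodedata

# the byte sequence hexdump / vim's g8 showed
ch = b"\xef\xbf\xbd".decode("utf-8")
print(hex(ord(ch)))           # 0xfffd
print(unicodedata.name(ch))   # REPLACEMENT CHARACTER

# and this is how it appears: decoding invalid UTF-8 with
# errors="replace" substitutes U+FFFD for the bad bytes
print(b"\xff\xfe ok".decode("utf-8", errors="replace"))
```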
python - Split / Explode a column of dictionaries into separate columns with pandas - Stack Overflow taught me about pandas.json_normalize — pandas 2.1.1 documentation:
In: JSON-like (dict, list, …); out: pandas DataFrame!
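A minimal sketch of what that looks like (the record contents are made up; nested keys become dotted column names):

```python
import pandas as pd

records = [
    {"name": "a", "meta": {"pages": 3, "lang": "uk"}},
    {"name": "b", "meta": {"pages": 5, "lang": "en"}},
]
df = pd.json_normalize(records)
# nested dicts are flattened into columns like 'meta.pages', 'meta.lang'
print(df)
```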
eval-UA-tion / eval_UA_tion, if they end up in the name.
Going with eval-UA-tion, since it works both with colors and as plain monospace text! Some drafts I did in Inkscape:
And just for fun:
ChatGPT generated this; its internal prompt for the picture, based on inspect element, was:
alt="Logo design for 'eval-UA-tion', a benchmark for Ukrainian language models. Incorporate the word 'eval-UA-tion' in a stylish font, with a sunflower replacing the letter 'o'. Add elements that give a Ukrainian touch, such as traditional Ukrainian patterns or colors (blue and yellow). The design should be modern, clear, and professional, suitable for a technical and academic setting."
# 2 digits after the decimal point
pd.set_option("display.precision", 2)
# Suppress scientific notation
pd.options.display.float_format = "{:.0f}".format
# for more natural 100,233.230-like output
pd.options.display.float_format = "{:,.3f}".format
Setting it as a context:
with pd.option_context('display.float_format', lambda x: f'{x:,.3f}'):
    display(df.describe())
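A self-contained check of the context-manager version (the option only affects rendering inside the with-block; the sample values are made up):

```python
import pandas as pd

s = pd.Series([100233.2, 0.5])
with pd.option_context("display.float_format", "{:,.3f}".format):
    print(s)  # renders as 100,233.200 and 0.500
print(s)      # back to the default rendering outside the block
```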
Also: I can format a float column ('temporarily') not just how I always did, but also in a way simpler way:
# before
ds["percent"].apply(lambda x: f"{x:.2%}")
# after
ds["percent"].apply("{:.2%}".format)
I forgot you can do "string".format(variable)!
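A quick sanity check that the two spellings are equivalent (column name and values are made up):

```python
import pandas as pd

ds = pd.DataFrame({"percent": [0.1234, 0.5]})
before = ds["percent"].apply(lambda x: f"{x:.2%}")   # the old way
after = ds["percent"].apply("{:.2%}".format)         # the simpler way
print(after.tolist())  # ['12.34%', '50.00%']
assert before.equals(after)
```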
Also TIL about display() for Jupyter notebooks, for when the value isn't the cell's return value (e.g. when exiting a context, a bare df.describe() there would not have shown the description).
One way to do it, if the same aggregations apply to all columns:
df.groupby("collection")[
    ["num_pages", "num_chars", "num_tokens", "num_sentences"]
].agg(
    [
        # "count",
        "sum",
        "mean",
        # "std",
    ]
)
An even better way:
# ...
].agg(
    num_documents=("num_pages", "count"),
    num_pages=("num_pages", "sum"),
    mean_pages=("num_pages", "mean"),
    mean_tokens=("num_tokens", "mean"),
)
They are literally named tuples! Yay for Named Aggregation!
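A runnable toy version of the named aggregation above (the data is made up, column names mirror the snippet):

```python
import pandas as pd

df = pd.DataFrame({
    "collection": ["a", "a", "b"],
    "num_pages": [10, 20, 5],
    "num_tokens": [100, 200, 50],
})
out = df.groupby("collection").agg(
    num_documents=("num_pages", "count"),
    num_pages=("num_pages", "sum"),
    mean_tokens=("num_tokens", "mean"),
)
# out.loc["a", "num_pages"] is 30, out.loc["b", "mean_tokens"] is 50.0
print(out)
```

pd.NamedAgg(column="num_pages", aggfunc="sum") is the explicit spelling of the same (column, aggfunc) tuple.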
Draft.
Context: 230529-2208 Seaborn matplotlib labeling data points
Given: need to make the limits larger to fit the text; the last lines here:
data = df_pages.reset_index().sort_values("num_pages")
ax = sns.barplot(data, y="collection", x="num_pages")
# label points
for i in ax.axes.containers:
    ax.bar_label(i)
# make the labels fit the limits
xlim = ax.axes.get_xlim()[1]
new_xlim = xlim + 14600
ax.axes.set_xlim(0, new_xlim)
Question: by how much?
Answer:
for i in ax.axes.containers:
    an = ax.bar_label(i)
# `an` is a list of all Annotations
an[0].get_window_extent()
>>> Bbox([[88.66956472198585, 388.99999999999994], [123.66956472198585, 402.99999999999994]])
def get_text_size(anno):
    """anno: a matplotlib Annotation.

    TODO: take an array of annotations, find the leftmost one etc."""
    bbox = anno.get_window_extent()
    ext = bbox.bounds  # (x0, y0, width, height)
    # e.g. (91.43835300441604, 336.19999999999993, 35.0, 14.0)
    width = ext[2]
    height = ext[3]
    return width, height

# For example:
# ano = an[1]
# bbox = ano.get_window_extent()
# bbox.bounds
# > (91.43835300441604, 336.19999999999993, 35.0, 14.0)
get_text_size(an[6])
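One way to answer "by how much" without a magic constant: take the labels' pixel extents and map them back into data coordinates with the inverted data transform. A sketch assuming matplotlib >= 3.4 (for bar_label); the function name and the toy bars are mine:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; any backend works
import matplotlib.pyplot as plt

def widen_xlim_for_labels(ax, annotations, pad_frac=0.02):
    """Extend the right x-limit so the given bar-label annotations fit."""
    ax.figure.canvas.draw()        # extents are only valid after a draw
    inv = ax.transData.inverted()  # display (pixel) -> data coordinates
    lo, hi = ax.get_xlim()
    needed = hi
    for anno in annotations:
        bbox = anno.get_window_extent()
        # right edge of the label, converted to data coordinates
        right_in_data = inv.transform((bbox.x1, bbox.y0))[0]
        needed = max(needed, right_in_data)
    ax.set_xlim(lo, needed + pad_frac * (needed - lo))

fig, ax = plt.subplots()
bars = ax.barh(["a", "b"], [120, 14600])
labels = ax.bar_label(bars)
widen_xlim_for_labels(ax, labels)
```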
GitLab introduced tasks, and they get shown by default in the issue list. Filtering with Type != task in the search leaves only the issues.
Can one save search templates?..
Is this needed, or can I just use one of the existing ones? I'll use one of the existing ones!
So this is notes about choosing one and adapting my own tasks to it.
First of all, I’d like the generator things to be runnable through Docker, especially the pravda crawler!
exact match: true/false
multiple choice
lmentry/lmentry/predict.py at main · aviaefrat/lmentry contains the prediction code used to evaluate it with different kinds of models - I'll need this.
SWAG seems the closest of the modern benchmarks to UA-CBT (one-word completions etc.). I should look into what exactly they do.
NarrativeQA!
The online version has cool tests at the end!
Generally: a lot of it is about languages/power, indigenous languages etc. Might be interesting for me wrt. UA/RU and colonialism
https://peps.python.org/pep-0673/
from typing import Self

class Shape:
    def set_scale(self, scale: float) -> Self:
        self.scale = scale
        return self
Related: 220726-1638 Python typing classmethods return type
I remember writing about the typevar approach but cannot find it…
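For reference, a minimal sketch of that TypeVar approach (the pre-3.11 way, since typing.Self only exists from Python 3.11; the class names are illustrative):

```python
from typing import TypeVar

TShape = TypeVar("TShape", bound="Shape")

class Shape:
    def set_scale(self: TShape, scale: float) -> TShape:
        self.scale = scale
        return self

class Circle(Shape):
    def set_radius(self, r: float) -> "Circle":
        self.radius = r
        return self

# The bound TypeVar preserves the subclass type through chaining,
# so Circle().set_scale(0.5).set_radius(2.0) type-checks as Circle.
c = Circle().set_scale(0.5).set_radius(2.0)
```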