In the middle of the desert you can say anything you want
One way to do it, when every aggregation applies to every column:
```python
df.groupby("collection")[
    ["num_pages", "num_chars", "num_tokens", "num_sentences"]
].agg(
    [
        # "count",
        "sum",
        "mean",
        # "std",
    ]
)
```
An even better way:
```python
# ...
].agg(
    num_documents=("num_pages", "count"),
    num_pages=("num_pages", "sum"),
    mean_pages=("num_pages", "mean"),
    mean_tokens=("num_tokens", "mean"),
)
```
They are literally named tuples! Yay for Named Aggregation1!
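A self-contained sketch of the named-aggregation version, on a toy DataFrame (the values are made up, not the real corpus stats):

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values).
df = pd.DataFrame(
    {
        "collection": ["a", "a", "b"],
        "num_pages": [10, 20, 5],
        "num_tokens": [100, 200, 50],
    }
)

# Named aggregation: output column name = (input column, aggregation function).
stats = df.groupby("collection").agg(
    num_documents=("num_pages", "count"),
    num_pages=("num_pages", "sum"),
    mean_pages=("num_pages", "mean"),
    mean_tokens=("num_tokens", "mean"),
)
```

Unlike the list-of-functions version, this produces flat, self-descriptive column names instead of a MultiIndex.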
Draft.
Context: 230529-2208 Seaborn matplotlib labeling data points
Given: I need to make the limits larger to fit the text; the last lines here:
```python
data = df_pages.reset_index().sort_values("num_pages")
ax = sns.barplot(data, y="collection", x="num_pages")

# label points
for i in ax.axes.containers:
    ax.bar_label(i)

# make the labels fit the limits
xlim = ax.axes.get_xlim()[1]
new_xlim = xlim + 14600
ax.axes.set_xlim(0, new_xlim)
```
Question: by how much?
Answer:
```python
for i in ax.axes.containers:
    an = ax.bar_label(i)

# `an` is a list of all Annotations
an[0].get_window_extent()
# >>> Bbox([[88.66956472198585, 388.99999999999994], [123.66956472198585, 402.99999999999994]])

def get_text_size(anno):  # anno: Annotation
    """TODO: get array of annos, find the leftmost one etc."""
    bbox = anno.get_window_extent()
    ext = bbox.bounds
    # > (91.43835300441604, 336.19999999999993, 35.0, 14.0)
    x = ext[2]
    y = ext[3]
    return x, y

"""
ano = an[1]
bbox = ano.get_window_extent()
bbox.bounds
> (91.43835300441604, 336.19999999999993, 35.0, 14.0)
"""

get_text_size(an[6])
```
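To answer "by how much?" without hardcoding 14600, one option (a sketch on a toy plot rather than `df_pages`) is to convert each label's window extent from display (pixel) coordinates back into data coordinates with `ax.transData.inverted()`, then extend the x-limit to the rightmost label edge:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
containers = ax.barh(["a", "b"], [10, 25])
annotations = ax.bar_label(containers)  # list of Annotations, as above
fig.canvas.draw()  # window extents are only valid after a draw

def needed_xlim(ax, annotations):
    """Smallest right x-limit (in data units) that fits every bar label."""
    inv = ax.transData.inverted()
    right = ax.get_xlim()[1]
    for anno in annotations:
        bbox = anno.get_window_extent()
        # convert the bbox corners from display (pixel) to data coordinates
        (_, _), (x1, _) = inv.transform([(bbox.x0, bbox.y0), (bbox.x1, bbox.y1)])
        right = max(right, x1)
    return right

ax.set_xlim(0, needed_xlim(ax, annotations))
```

In practice you would add a small padding on top of the returned value, and strictly speaking the extents change slightly once the limits move, so a second draw-and-check pass is safer.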
Gitlab introduced tasks, and they get shown by default in the issue list. `Type != task` in the search leaves only the issues.
Can one save search templates?..
Is this needed, or can I just use one of the existing ones? I'll use one of the existing ones!
Then these are notes about choosing one and adapting my own tasks to it.
First of all, I’d like the generator things to be runnable through Docker, especially the pravda crawler!
Related:
General:
Other / useful libraries:
exact match: true/false
multiple choice
lmentry/lmentry/predict.py at main · aviaefrat/lmentry contains the prediction code used to evaluate it using different kinds of models - I'll need this.
SWAG seems the closest of the modern datasets to UA-CBT (one-word completions etc.); I should look into what exactly they do.
NarrativeQA!
Also: 231002-2311 Meta about writing a Masterarbeit
Relevant papers in Zotero will have a ’toread’ tag.
When can we trust model evaluations? — LessWrong
How truthful is GPT-3? A benchmark for language models — LessWrong
Code:
lists: AI Evaluations - LessWrong
Datasets - The Best Ukrainian Language Datasets of 2022 | Twine (some aren't ones I added)
Victoria Amelina: Ukraine and the meaning of home | Ukraine | The Guardian
Ukrainian and Russian: Two Separate Languages and Peoples – Ukrainian Institute of America
Bender and friends:
Eval
"Python for the advanced group of linguists, 2020-2021" (lecture): klyshinsky/AdvancedPyhon_2020_21
I should read through everything here: A quick tour
(@inclusion) "The state and fate of linguistic diversity and inclusion in the NLP world" (2020) / Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit Choudhury: https://arxiv.org/abs/2004.09095 ↩︎
Literature Review For Academic Outsiders: What, How, and Why — LessWrong
'Literature review', the process, is a way to become familiar with what work has already been done in a particular field or subject, by searching for and studying previous work.
Every time I do research I perform a simple thought experiment: assuming somewhere in the world exists evidence that would prove or disprove my hypothesis, where is it?
Citations are a hierarchy of ideas
My old note about tenses in a bachelor thesis: Day 155 - serhii.net linking to the excellent Effective Writing | Learn Science at Scitable
The Leipzig Glossing Rules seem to be the key for me:
Markdown and python and stuff
Markdown
<span style="font-variant:small-caps;">Hello World</span>
Python
The online version1 has cool tests at the end!
Generally: a lot of it is about languages/power, indigenous languages etc. Might be interesting for me wrt. UA/RU and colonialism
https://peps.python.org/pep-0673/

```python
from typing import Self

class Shape:
    def set_scale(self, scale: float) -> Self:
        self.scale = scale
        return self
```
Related: 220726-1638 Python typing classmethods return type
I remember writing about the typevar approach but cannot find it…
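The pre-3.11 TypeVar approach mentioned above, as a quick sketch (class names are illustrative, mirroring the PEP 673 example):

```python
from typing import TypeVar

# Pre-3.11 alternative to typing.Self: a TypeVar bound to the base class.
TShape = TypeVar("TShape", bound="Shape")

class Shape:
    def set_scale(self: TShape, scale: float) -> TShape:
        self.scale = scale
        return self

class Circle(Shape):
    def set_radius(self, radius: float) -> "Circle":
        self.radius = radius
        return self

# set_scale() on a Circle is typed as Circle, so chaining keeps working:
circle = Circle().set_scale(0.5).set_radius(2.7)
```

`Self` is just the same idea without having to declare and thread the TypeVar through every method.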
pravda.com.ua1 has articles in three languages:
The difference seems to be only in that one part of the URL!
Article; title; tags; date; author.
Then article title+classification might be one of the benchmark tasks!
Is there anything stopping me from scraping the hell out of all of it?
Google finds 50k articles in /eng/ and 483k in /rus/; assumption: all English articles were translated to Russian as well.
=> For each English article, try to get the Russian and Ukrainian one from the URI.
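That lookup could be sketched like this (hypothetical helper; it assumes, per the observation above, that only the language segment of the path differs, and additionally that the Ukrainian version has no language segment at all — both would need checking against real URLs):

```python
def language_variants(eng_url: str) -> dict[str, str]:
    """Given an /eng/ article URL, guess the URLs of its translations.

    Assumption: the language versions differ only in the language
    segment of the path (/eng/ vs /rus/ vs none for Ukrainian).
    """
    return {
        "en": eng_url,
        "ru": eng_url.replace("/eng/", "/rus/", 1),
        "uk": eng_url.replace("/eng/", "/", 1),
    }

variants = language_variants("https://www.pravda.com.ua/eng/news/2023/01/1/123/")
```

The candidate URLs would still have to be fetched and checked, since not every English article necessarily has both counterparts.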
©2000-2023, Українська правда. Use of the site's materials is allowed only with a link (for online media, a hyperlink) to "Українська правда", no lower than the third paragraph.
Related: ua-datasets/ua_datasets/src/text_classification at main · fido-ai/ua-datasets
Related: facebook/flores · Datasets at Hugging Face, from Wikinews in many languages including UA!
e.g. could other languages help for that?2
Same goes for Економічна правда and friends. ↩︎
Detailed walkthrough of procedure to uncensor models : LocalLLaMA ↩︎
Officially - I’m doing this!
This post will be about dumping ideas and stuff.
Related posts for my first paper on this topic:
Procedural:
Github
#nlp #benchmarks Repository search results
Cool model with links to datasets etc.! robinhad/kruk: Ukrainian instruction-tuned language models and datasets
Datasets UA, almost exclusively
Benchmarks UA
ua_datasets is a collection of Ukrainian language datasets. Our aim is to build a benchmark for research related to natural language processing in Ukrainian.
UA grammar/resources/…
```sh
curl -F json=false -F data='привіт мене звати Сірьожа' -F tokenizer= -F tagger= -F parser= https://api.mova.institute/udpipe/process
```
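For scripting (e.g. inside the Docker-ized crawlers), the same request can be driven from Python; a minimal sketch that just rebuilds the curl invocation above as an argument list for `subprocess.run`, so the request itself stays exactly as tested:

```python
def build_udpipe_cmd(text: str) -> list[str]:
    """Recreate the curl call above as an argument list for subprocess.run."""
    fields = {
        "json": "false",
        "data": text,
        "tokenizer": "",
        "tagger": "",
        "parser": "",
    }
    cmd = ["curl", "-s"]
    for key, value in fields.items():
        cmd += ["-F", f"{key}={value}"]
    cmd.append("https://api.mova.institute/udpipe/process")
    return cmd

cmd = build_udpipe_cmd("привіт мене звати Сірьожа")
# run it with: subprocess.run(cmd, capture_output=True, text=True)
```

A pure-Python multipart POST (requests or urllib) would work too; shelling out to curl just keeps the behaviour identical to the known-good command.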
General evaluation bits:
Random UA:
@ruisLargeLanguageModels2022
(2022) Reproducing ARC Evals’ recent report on language model agents — LessWrong ↩︎
@labaContextualEmbeddingsUkrainian2023: Contextual Embeddings for Ukrainian (2023) / Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology ↩︎