This will be the Markdown draft of my Master thesis, I’ll jot things down and then expand.
List without newline:
And:
quote without newline
Нації вмирають не від інфаркту. Спочатку їм відбирає мову.
Ліна Костенко
Nations don’t die from heart attacks. They go mute first.1
(Lina Kostenko, Ukrainian poetess)
evals are surprisingly often all you need
(Greg Brockman, OpenAI President)2
The Ukrainian language is not at risk of dying, and as of 2023, this much is certain. But before 2014, the quote above was so incisive it hurt.
The last 10 years have led to a resurgence of the Ukrainian language, especially in informal and non-academic contexts. This was followed by an increase in resources dedicated to its study and use.
In a 2020 survey3 on linguistic diversity in NLP, the Ukrainian language was classed under “rising stars”: languages with a thriving community online but let down by insufficient labeled data.
This Thesis introduces the first Ukrainian-language LM benchmark, and as part of it introduces a number of novel labeled datasets.
L’Ukraine a toujours aspiré à être libre
“Ukraine has always aspired to be free.” Voltaire, 1731 4
A significant number of people in Ukraine are bilingual (Ukrainian and Russian languages), and most Ukrainians can understand both Russian and Ukrainian 5.
The reasons for this include Ukraine’s geographical and cultural proximity to Russia, as well as the consistent policies first of the Russian Empire and then of the Soviet Union.
This section sketches the history of the language, describes the bilingual nature of Ukraine’s society and the impact of historical state policies on its modern development.
(TODO mention how and which tasks are impacted by this; sources for ‘many people believe’; todo tie it with Ukrainians realizing stuff)
The Ukrainian language belongs to the Slavic family of the Indo-European languages (which also contains languages such as Polish, Czech, Serbian, Bulgarian), specifically to the East Slavic branch, which contains Belarusian, Russian, and Ukrainian8. Towards the end of the 10th century the East Slavonic group of dialects was relatively uniform, with the differences separating Ukrainian, Russian and Belarusian appearing since then as the result of linguistic and political processes. 9
While all three are mutually intelligible to a certain extent, Ukrainian has more in common with Belarusian than with Russian 9; outside the branch, Ukrainian has partial intelligibility with Polish10.
This stems from the fact that in the 15th century, parts of what is now Ukraine and Belarus were part of the Polish-Lithuanian commonwealth, with Polish becoming the lingua franca of Ukrainian-Belarusian lands.
As a result, a large proportion of the Ukrainian lexicon consists of borrowings from the Polish language, and vocabulary remains the component of the language where the difference with Russian is most immediately noticeable. 9
In the Russian Empire, the broader imperial ideology sought to assimilate various ethnicities into a single Russian identity (with Russian as the dominant language), and policies aimed at diminishing Ukrainian national self-consciousness were a facet of that.11
Ukrainian (then officially called little Russian 9 and officially a dialect) was12 stigmatized as a strange dialect of Russian, with its literature not taken seriously; the general attitude being that Ukrainians needed to be “civilized” by Russia, by its language and developed culture.11
Attempts to extinguish a separate Ukrainian identity weren’t limited to stigmatization: the history of Ukrainian language bans is long enough to merit a separate Wikipedia page with the list, 13 with the more notable ones in the Russian Empire being the 1863 Valuev Circular (forbidding the use of Ukrainian in religious and educational printed literature)1415 and the 1876 Ems Ukaz, a decree by Emperor Alexander II banning the use of the Ukrainian language in print (except for reprinting old documents) and forbidding the import of Ukrainian publications and the staging of plays or lectures in Ukrainian16.
The first decade of the Soviet Union brought Ukrainisation as part of a new Soviet nationalities policy, leading to a short-lived period of flourishing for Ukrainian literature and culture in general.17
Many of the Ukrainian writers and intellectuals of that period later became known as “the executed Renaissance”18: most19 of them were purged in the years that followed7, after the Soviet Union took a sharp turn towards Russification in the late 1920s, and in the multiple waves of purges afterwards.
Those purged included many of the members of the committee that in 1928 created the first unified Ukrainian spelling rules.20
A new ‘orthographic’ reform was drafted in 1933, this time without public discussion 17. It had the stated goal of removing alleged “bourgeois nationalist” and “pro-Polish” influences in the previous one, especially by the withdrawal of “artificial barriers” between the Ukrainian and Russian languages20. In practice, it brought the Ukrainian language closer to Russian in many ways: banning the letter ґ (absent in Russian), changing grammatical forms 20, relying almost entirely on Russian when spelling loanwords and changing the gender of many of them to match Russian, and reducing Ukrainian-specific vocabulary17, especially scientific terminology.
The role of Russian in Soviet society was openly declared to be not just the language of all Soviet peoples, but also the source language for the enrichment of the other languages in the Soviet Union.9
Towards the end of the Soviet Era, “it is possible to speak of diglossia in Ukraine, with Russian as the High variety used in formal, administrative, and educational domains, and Ukrainian is less formal, home settings.” 8
After the fall of the Soviet Union, there were many proposals for restoring the original orthography, but only the letter ґ was restored. In 2019 a new version of the Ukrainian orthography was approved, which restored some of the original rules as ’legal’ variants but without mandating any of them.
Around 2012, I stumbled upon a forum thread with the topic “I’m moving to Ukraine, which language should I learn, Ukrainian or Russian?”. One answer was “It doesn’t really matter, and if someone will care too much about which language you speak, they are not the people you want to speak to anyway” — not an uncommon sentiment at the time.
For most Ukrainians, the language spoken was/is just not part of one’s self-identification as Ukrainian. Among those surveyed across Ukraine in 2012-2017, only 2.7-4.9% considered the language spoken to be what determines their nationality (among those who considered themselves Ukrainian it was 1.8-2.5%, among those who considered themselves Russian, 8.8-15.9%) 5.
It is typical to speak e.g. Russian at school and Ukrainian at home 21, or different languages with different family members (for example, my entire life I spoke Ukrainian with my father and Russian with my mother).
Conversations where different people use Russian or Ukrainian (without any effort, awkwardness or negative effects) were (and are) normal as well. This is illustrated by a 2017 survey22 of 2,007 respondents across Ukraine. It found that in the presence of a Ukrainian speaker, 17% of people will speak Russian and ~18% both Russian and Ukrainian (in the other case, ~29% will speak Ukrainian and ~23% both Russian and Ukrainian).
Just as typical is code-switching — changing the language or dialect spoken within the same conversation, sometimes within the same sentence 23. The Parliamentary Code-Switching Corpus paper23 shows examples of this happening for different reasons, such as: inserting quotes/idioms in Russian, using Ukrainian legalese/cliches or law names, switching the language for stylistic purposes (e.g. distinguishing between the official Ukrainian position and a personal one), triggered code-switching (switching the language after using a word or name in the other language), inserting individual words in the other language or just heavily mixing both without clear motivation.
The latter is related to Surzhyk, mixed Russian-Ukrainian speech (variously defined as “a hybrid language that involves Russian and Ukrainian in its creation”24 or “a pejorative collective label for non-standard language varieties”25)[^45], widely spoken (and more rarely written) across Ukraine, especially its eastern, southern and central parts24.
For many, the Russian attack on Crimea in 2014 led to a stronger attachment to Ukraine and alienation from Russia, with surveys between 2012 and 2017 showing “a consistent and substantial shift”21 from Russian linguistic and ethnic identification towards Ukrainian5, and the full-scale invasion of 2022 accelerated this process, as seen in Rating Group’s March 2022 “Language Issue in Ukraine” survey26.
This was also quantified by an analysis 21 of Ukrainian Twitter data between 13th January 2020 and 10th October 2022, reporting behavioural language changes across Russian-Ukrainian-English while controlling for user turnover (users joining or leaving Twitter).
The plot (adapted from Figure 4 of 21) in Figure XXX shows an increase of the use of Ukrainian over Russian (purple) starting before the full-scale invasion and sharply increasing afterwards.
Notably, of the 1,363 users tweeting predominantly (>80%) in Russian before the outbreak of the war, 61% tweeted in Ukrainian more after the outbreak, and ~25% (341) started tweeting predominantly (>80%) in Ukrainian (hard-switch from Russian to Ukrainian). There were only 3% hard-switches from UA to RU in that period.
Ukrainian Twitter users are not a representative sample of the Ukrainian population for several reasons, but the study is likely indicative of wider societal trends.
The authors interpret the switch as users’ conscious choice towards a more Ukrainian identity.27
TODO fit the below somewhere:
With more people switching to Ukrainian partially or full-time, for different reasons, the importance of Ukrainian NLP grows correspondingly.
In the taxonomy of languages based on data availability 3 (see below), Ukrainian is classified in class 3, “the rising stars”: languages with a thriving online cultural community that got an energy boost from unsupervised pre-training, but let down by insufficient efforts in labeled data collection. Sample languages from that group include Indonesian, Cebuano, Afrikaans, Hebrew. (Russian is in class 4, English and German are in class 5.)
3 as quoted in Why You Should Do NLP Beyond English
From a different angle, looking at estimates of languages used on the Internet (as estimated percentages of the top 10M websites), as of October 2023 Ukrainian is at number 19 (0.6%), between Arabic and Greek2829. English is #1 (53.0%), Russian #3 (4.6%), German at #4 (4.6% as well).
Ukrainian Wikipedia is 15th by daily views and by number of articles30.
Emily M. Bender in 201131 formulated what would come to be known as the Bender rule32: “Name the languages we study”.
Her original 2011 paper — written in the pre-LLM era — discusses the problem of language independence, that is the extent to which NLP research/technology can scale over multiple (or ‘all’) languages. In her more recent writing on the topic, she notes how work on languages other than English is often considered “language specific” and thus viewed as less important 32, and the underlying misconception that English is a sufficiently representative language and therefore work on English is not language specific.
An NLP system that works for English is not guaranteed to behave similarly for other languages, unless explicitly designed and tested for that. Or in different words, “English is Neither Synonymous with Nor Representative of Natural Language”. 32
She highlights 8 properties of English that demonstrate its shortcomings in representing all languages, four of which are relevant to Ukrainian: little inflectional morphology, fixed word order, possible matches to database field names or ontology entries, and massive amounts of training data available.
In the context of this thesis, an interesting facet of this issue was my intuitive assumption that Python’s `sort()` would sort the letters in their alphabetical order, which is what it does in English but, as it turned out, not in Ukrainian. In hindsight this is absolutely unsurprising, but I find it fascinating that for many English-only speakers many things just work, like Python’s `sort()` doing the intuitively correct thing, and that this is taken for granted (along with the assumption that it works for other languages just as well, and that results and approaches generalize).
Having sorted Ukrainian letters in Python for the first time, I realize how all-encompassing such world models can be.
(For details about the sorting issue, see subsection XXX about the LMentry-static-UA task.)
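The behaviour can be reproduced in a few lines. This is a minimal sketch with made-up example words (not code or data from the thesis): Python compares strings by Unicode code point, and ґ, є, і and ї sit outside the contiguous block from а to я, so words starting with them end up after я.

```python
# Example words are illustrative, not taken from the thesis datasets.
words = ["ґанок", "гора", "дім", "їжак", "яблуко", "єнот"]

# sorted() compares Unicode code points, not positions in the alphabet.
by_code_point = sorted(words)
print(by_code_point)

# The code points explain the result: я comes before є, ї and ґ.
for letter in "яєїґ":
    print(letter, hex(ord(letter)))
```

In dictionary order the same list would read гора, ґанок, дім, єнот, їжак, яблуко.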
(TODO what do I want to say here exactly?)
This master thesis tackles the following problems in the context of Ukrainian language:
Additional research questions are:
Inclusion to other big benchmarks
Implementations for important eval harnesses
Throughout this section, a notation system loosely based on the Leipzig Glossing Rules34 (LGR) for interlinear glossing will be used in examples showcasing Ukrainian language phenomena and translations to English and occasionally German.
Interlinear glosses will not be interlinear, but each gloss will be a superscript to the word it refers to.
For each word, it will be formatted thus:
Not all words of the example will be annotated, only the ones relevant to the example being made. Words already in English will not be translated.
Each translation will be provided on a separate line, with the language marked as ISO 639-3 code: `eng` for English, `ukr` for Ukrainian, `deu` for German, `rus` for Russian.
For example:
eng: the manNOM.SG sawPST the dogNOM.SG
ukr: чоловікman-NOM.SG побачивsaw-PST.MASC.SG собакydog-ACC.SG
In the cases where glosses on morpheme level are needed, the (relevant) segmentable morphemes in the word will be separated by hyphens, and each will have its gloss in its superscript35. The absence of a morpheme needing a corresponding gloss will be marked as $\varnothing$ (LGR Rule 6).
ukr: 5 собакdog-$\varnothing$GEN.PL
Ungrammaticality (examples of grammatically incorrect language) will be denoted by a single asterisk (*) preceding the sentence or the specific word:
ukr: мій *друзь
These abbreviations are used inside glosses. They are mostly conventional LGR abbreviations36 but contain non-LGR ones as well, given as a separate list.
The Ukrainian alphabet is written in Cyrillic and has 33 letters; in writing, the apostrophe and hyphen are also used. It differs from Russian by the absence of the letters ё, ъ, ы and э, and the presence of ґ, є, і, and ї.
This helps (but doesn’t completely solve the problem of) differentiating the two languages, which is needed relatively often: Russian-language fragments within otherwise Ukrainian text (e.g. untranslated quotes in text intended for a bilingual audience) are a typical problem, and one that needs to be solved when building reference corpora or datasets.39
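As a rough illustration of why the alphabet difference helps but does not fully solve the problem, the following sketch counts language-specific letters. It is a naive heuristic written for this example (the function name `guess_language` is hypothetical), not the method used for any corpus mentioned here; text containing none of the eight distinguishing letters stays undecided.

```python
# Letters present in only one of the two alphabets (see the paragraph above).
UKR_ONLY = set("ґєії")
RUS_ONLY = set("ёъыэ")


def guess_language(text):
    """Return 'ukr', 'rus', or None if the letters give no evidence."""
    lowered = text.lower()
    ukr_hits = sum(ch in UKR_ONLY for ch in lowered)
    rus_hits = sum(ch in RUS_ONLY for ch in lowered)
    if ukr_hits > rus_hits:
        return "ukr"
    if rus_hits > ukr_hits:
        return "rus"
    return None  # undecided: no distinguishing letters, or a tie


print(guess_language("Нації вмирають не від інфаркту"))
print(guess_language("Это был тёплый вечер"))
print(guess_language("кот"))
```

Short fragments and mixed speech such as Surzhyk are exactly where a letter-based check like this breaks down.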
Ukrainian is a synthetic40 inflected language41, that is, it can express different grammatical categories (case, number, gender, ..) as part of word formation. In other words, information about grammatical categories tends to be encoded inside the words themselves.42
(German, too, is a fusional language, but with a smaller degree of inflection. English, on the other hand, largely abandoned the inflectional case system43 and is an analytic language, conveying grammatical information through word order and prepositions.)
Specifically, Ukrainian:
The standard word order is Subject-Verb-Object (SVO), but the inflectional paradigm allows free word order. In English the SVO word order in “the man saw the dog” (vs “the dog saw the man”) determines who saw whom. In Ukrainian it’s the last letter of the object (dog) that marks it as such.
eng: the manNOM.SG saw the dogNOM.SG
ukr: чоловікman-NOM.SG побачивsaw собакydog-ACC.SG
This allows word order to be used for additional emphasis or shades of meaning (similar to German).
A more extensive example:
eng: we foundPST a greenADJ cup NOUN on the table NOUN
ukr: миwe знайшли found-PST.1PL зелену green-ADJ.F.SG.ACC чашку cup-F.SG.ACC наon столі table-M.SG.LOC
deu: wirwe fandenfound-PST.1PL einea-INDEF.F.SG.ACC grünegreen-ADJ.F.SG.ACC Tassecup-F.SG.ACC aufon demthe-DEF.M.SG.DAT Tischtable-M.SG.DAT
The number of categories conveyed by the nouns is roughly similar to German.
Morphology in verbs works in a very similar way. Additionally, unlike other Slavic languages, Ukrainian has an inflectional future tense (formed by a suffix in the verb) in addition to the standard compound future formed by using an auxiliary word бути (“to be”). 45 All this makes longer verbs quite common.
For example, the verb ви́користатиuse-INF.PFV is in perfective aspect, therefore it’s a completed action (“use up” or “utilize completely”) or one seen as a whole even if not completed (“Tomorrow I’ll use my cane to get the pencil from under the bed”)46. It can be transformed into використовуватимутьсяuse-IPFV-FUT-3PL-REFL4748 (3rd person plural imperfect-reflexive-future) thus (in bold the changes):
Minimal equivalent sentences:
eng: they 3PL willFUT bePASS usedPST.PTCP
deu: siethey werdenwill-FUT.PL verwendetused-PST.PTCP werdenbe-PASS
ukr: вониthey використовуватимутьсяuse-IPFV-FUT-3PL-REFL
rus: ониthey будутbe-FUT.3PL использоватьсяuse-INF-FUT-REFL
Todo (This is not a contrived example, використовуватимуться is a natural word in everyday speech.)
Ukrainian numerals can be cardinal (one), ordinal (first) and adverbial (once). They change to varying extent based on case, number49, gender.
The inflection of nouns for (grammatical) number has two classes, singular and plural. Old East Slavic (from which Ukrainian is descended) had a third grammatical number, the dual, since lost50. Some of its traces are in the agreement of nouns and numerals (1 dog, 4 sheep, …).
A simplified51 breakdown follows.
Numerals ending with the following numbers require nouns to:
In practice, this means that “4 dogs” and “5 dogs” have a different plural form for “dog”:
чотириfour-NOM собакdogs-иNOM.PL
пʼятьfive-NOM собакdogs-$\varnothing$GEN.PL
This also means that the numerals (that can be inflected themselves!) have to agree with the noun as well, for example the numeral ‘one’ in ‘one dog’ differs based on case:
ukr: одинone-MASC.NOM.SG собакаdog-MASC.NOM.SG
eng: one dog
ukr: немаєthere’s no одногоone-GEN.MASC.SG собакиdog-GEN.MASC.SG
eng: one dog is missing
Lastly, the same holds for larger numerals (“four million”, “five million”) even if they don’t have to agree with any nouns: “million” (thousand, billion, ..) acts as a noun and four/five acts as a numeral, bringing agreement issues even to a stand-alone cardinal number.
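The “4 dogs” versus “5 dogs” contrast above follows the standard simplified rule for a numeral phrase in the nominative, which can be sketched as below. The helper name `noun_form_for` is hypothetical, and the sketch covers only whole numbers in a nominative context: oblique cases, fractions and collective numerals are out of scope.

```python
def noun_form_for(n, nom_sg, nom_pl, gen_pl):
    """Pick the noun form required after the integer n (nominative phrase)."""
    last_two = abs(n) % 100
    last = last_two % 10
    if 11 <= last_two <= 14:
        return gen_pl   # 11-14 always take the genitive plural
    if last == 1:
        return nom_sg   # 1, 21, 31, ...
    if 2 <= last <= 4:
        return nom_pl   # 2-4, 22-24, ...
    return gen_pl       # 0, 5-9 and everything else


dog = ("собака", "собаки", "собак")
for n in (1, 4, 5, 11, 21, 22, 100):
    print(n, noun_form_for(n, *dog))
```

Only the last two digits matter, which is why 21 behaves like 1 and 112 like 12.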
Todo
TODO
excellent: http://kulturamovy.univ.kiev.ua/KM/pdfs/Magazine13-16.pdf
list: Узгодження числiвникiв з iменниками - Українська мова: від фонетики до морфології
complex examples: Узгодження числівника з іменником – Українська мова та література
All the above has direct implications for NLP, for example:
In the context of this Thesis, inflecting words correctly has been the most challenging aspect:
`GRND` (corresponding to the Russian/Ukrainian POS деепричастие54/дієприслівник, the adverbial participle) are encoded in Universal Dependencies as POS `VERB` with feature `VerbForm=Conv` 55 to represent the same concept and are, therefore, detected as such by spacy’s Morphology.
This meant that what spacy detects as `VERB`s required an additional morphological filtering step to exclude what pymorphy2 would see as `GRND`, because pymorphy2 isn’t able to inflect between `GRND` and `VERB` (different POS from its perspective).
For a list of other typological features of the language, see its page on the World Atlas of Language Structures5657, as well as the excellent “UD for Ukrainian” page on the Universal Dependencies website58.
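The filtering step for `VERB` tokens that are really converbs could look roughly like the sketch below. It relies only on the spaCy token attributes `pos_` and `morph.get("VerbForm")`; the function names are hypothetical, and the `fake_token` stand-ins exist only so that the example runs without a Ukrainian spaCy model. The actual thesis code may differ.

```python
from types import SimpleNamespace


def is_converb(token):
    """True for tokens tagged VERB with VerbForm=Conv (pymorphy2's GRND)."""
    return token.pos_ == "VERB" and "Conv" in token.morph.get("VerbForm")


def inflectable_verbs(tokens):
    """Keep VERB tokens that can safely be passed on for verb inflection."""
    return [t for t in tokens if t.pos_ == "VERB" and not is_converb(t)]


# Stand-in for spaCy tokens (assumption: mimics only the attributes used above).
def fake_token(text, pos, verb_forms):
    morph = SimpleNamespace(
        get=lambda field: verb_forms if field == "VerbForm" else []
    )
    return SimpleNamespace(text=text, pos_=pos, morph=morph)


tokens = [
    fake_token("читав", "VERB", ["Fin"]),     # finite verb: kept
    fake_token("читаючи", "VERB", ["Conv"]),  # converb: filtered out
    fake_token("книга", "NOUN", []),          # not a verb at all
]
kept = [t.text for t in inflectable_verbs(tokens)]
print(kept)
```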
цар
ua-datasets
Explicitly mention if it’s google translate or real people did it
Belebele Dataset | Papers With Code is a " multiple-choice machine reading comprehension (MRC) dataset", 122 languages
KGQA/QALD_9_plus: QALD-9-Plus Dataset for Knowledge Graph Question Answering - one of the 9 langs is Ukrainian! One could theoretically convert the entities into text
… somewhere: why can’t one just google translate existing benchmarks and be done with it? precision, eval, etc.
The benchmark contains 2 main tasks:
The tasks and the datasets connected to them are uploaded to the HuggingFace Hub, and EleutherAI lm-evaluation-harness (widely used in literature) ’tasks’ are implemented for each (though not included in the harness itself).
TODO mention how I fulfill the criteria laid out in:
As a first step, spot-checks of various training instances of the datasets were performed as sanity check.
LMentry-static-UA contained exclusively algorithmically generated tasks with little randomness involved, and there the validity of the training instances depended especially strongly on the code that generated them, and after looking at enough examples of “what’s the Nth word in this sentence”, one could safely assume the rest were likely to be correct as well. So only a limited subset was manually checked.
The only issue found was wrong ground truth in the task about alphabetical ordering: the canonical order of the Ukrainian alphabet is different from what python’s sorting does (with the Ukrainian-only letters і ї є ґ being sorted at the very end instead of their usual place in the Ukrainian alphabet). The relevant code was rewritten to force the correct expected ordering. (Section XXX* has some reflections on the implications of this in the context of the Bender rule.)
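One way to force the expected ordering is to use an explicit alphabet string as the sort key. The sketch below is an assumption about what such a fix can look like, not the code used in the thesis: `UKR_ALPHABET` and `ukr_sort_key` are hypothetical names, and characters outside the alphabet (apostrophe, hyphen, Latin letters) are simply ranked after all Ukrainian letters.

```python
# The 33 letters of the Ukrainian alphabet in canonical order.
UKR_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"
_RANK = {letter: index for index, letter in enumerate(UKR_ALPHABET)}


def ukr_sort_key(word):
    """Map a word to a list of alphabet positions, usable as a sort key."""
    # Unknown characters are placed after every alphabet letter (assumption).
    return [_RANK.get(ch, len(UKR_ALPHABET) + ord(ch)) for ch in word.lower()]


words = ["ґанок", "гора", "дім", "їжак", "яблуко", "єнот"]
alphabetical = sorted(words, key=ukr_sort_key)
print(alphabetical)
```

Locale-aware collation would be an alternative, but an explicit alphabet keeps the ground truth independent of the machine the dataset is generated on.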
For the CBT-UA task (which involved creating training instances based on data gained through ML approaches), the filtering of the resulting dataset was much more involved.
There were two rough classes of error sources: those caused by language and those caused by logic.
All the failure modes and their numbers are described in subsection XXX, but suffice it to say that occasional incorrect lemmatization and POS detection by spacy, incorrect normalization and detection (and therefore inflection) by pymorphy2, and the best-guess approach used in the `pymorphy-spacy-disambiguation` package (written specifically for this Thesis) created a large area of uncertainty.
On the logic side, there were the unavoidable errors stemming from the task creation approach (despite reasonable safeguards being put in place where practical), such as multiple possible answers, unknowable answers, etc.
This dataset is a set of tasks loosely based on the original LMentry evaluation task63 described in section XXX.
The original LMentry 63 had a list of 20-XXX partly repetitive tasks, e.g. “bigger number” and “smaller number” being separate ones.
TODO pic taxonomy of LMentry tasks:
LMentry-static-UA (in addition to applying the ideas to Ukrainian) contains the following conceptual changes:
`CompareTwoThings` is a parent type of `LetterCount` (containing both ‘more’ and ‘less’ letters) and `NumberComparison` (bigger+smaller number). This was done to reduce repetitive code and to decrease the number of tasks to contain only conceptually different ones.
The LMentry-static-UA dataset is shared on Huggingface under the link XXX.
Since the individual tasks are different, multiple configs are contained in the dataset, with e.g. the `NumberComparison` subtask being available as
dataset = load_dataset("shamotskyi/lmentry-static-UA", "numbercomparison")
As with other tasks, agreement of Ukrainian numerals and nouns (see section XXX) has taken a large amount of time.
The different templates contained different nouns in the same role (first word, word one, first position, etc.) that required cardinal and ordinal numerals. They had to agree with the noun in gender (number as well, but in practice only singular was needed TODO):
eng: The third word in the sentence is …
ukr: Третєthird-3SG.N.ORD словоword-3SG.N …
This raised two problems.
When creating a template, where/how to encode whether the template requires an ordinal or a cardinal numeral, and which grammatical categories the numeral has to agree with.
SOLUTION: including capitalized numerals in the correct form in the template itself and automatically parsing the grammatical categories needed from them:
eng: The FIRSTORD word in the sentence is …
eng: Word number ONECARD in the sentence is …
ukr: ПЕРШЕfirst-3SG.N.ORD словоword-3SG.N …
This made it possible to create templates using natural language and simplified the data structures involved.
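The template side of this idea can be sketched as follows: the capitalized numeral is located in the template and later swapped for the numeral that the concrete instance needs. The function names are hypothetical and the sketch covers only finding and replacing the placeholder. Deriving the grammatical categories from it (done with pymorphy2 in the thesis) is left out.

```python
import re


def find_numeral_placeholder(template):
    """Return the first all-caps word longer than one letter, or None."""
    for match in re.finditer(r"\w+", template):
        word = match.group()
        if len(word) > 1 and word.isupper():
            return word
    return None


def fill_template(template, numeral):
    """Replace the capitalized placeholder with an already inflected numeral."""
    placeholder = find_numeral_placeholder(template)
    if placeholder is None:
        raise ValueError("template contains no capitalized numeral")
    return template.replace(placeholder, numeral, 1)


print(find_numeral_placeholder("ПЕРШЕ слово у реченні"))
print(fill_template("Слово номер ОДИН у реченні", "три"))
```

Requiring more than one letter avoids treating single capital letters, such as a sentence-initial Я, as placeholders.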
When constructing the actual training instances from the templates:
NOM.M.SG)
The implementation of this was challenging, and resulted in the creation of a separate Python package, ukr_numbers, which creates numerals based on an input integer and a natural language description of the needed inflection:
>>> from ukr_numbers import Numbers
>>> Numbers().convert_to_auto(15,"перший")
'пʼятнадцятий'
# loosely paraphrasing to English:
>>> convert_to_auto(15, "first")
"fifteenth"
Under the hood, it uses num2words to generate Ukrainian ordinals/cardinals in normal form and the already mentioned pymorphy2 to parse the natural language form and inflect the numeral.
The otherwise excellent `num2words` was not able to inflect Ukrainian ordinals by case, necessitating manual pymorphy2 inflection logic and leading to many edge cases:
`make_agree_with_number` that depended on it, leading to a bug report65 and a cumbersome workaround on my side
Not all edge cases are solved, but in all cases relevant to the LMentry-static-UA tasks it works as expected and produces grammatically and semantically correct output.
TODO The following terms will be used throughout this section:
_____).
During manual task instance filtering, the task instances were classified into usable and unusable, with the latter removed from the dataset. There were different reasons an instance would be unusable. These reasons were formalized into a simple taxonomy. This was originally done for the people helping with the filtering, in the form of annotation guidelines and with checkboxes in the labeling interface serving chiefly as reminders of the problems to look for.
The errors can be divided into three different (albeit fuzzy) types:
The Lion liked the Cat and Turtle’s coat/work. Both tailors/animals were happy.
Whiskers was happy that he was a cat: he was fast and could climb trees. One morning, he heard his owner say: “Our Whiskers/cat is the fastest cat I know”.
She yelled/speaking at both dogs/cats/butterfly.
Some of these issues were dealt with through fixes/rewrites of the code, e.g.:
Original English thing62
My current task notes page is [[231024-1704 Master thesis task CBT]]
Get Ukrainian book with good OCR, POS-tag, generate questions, manually check
Mention how it’s more interesting in Ukrainian than English because morphology - need to do agreements etc.
paper:
Similar: demelin/understanding_fables · Datasets at Hugging Face
Corner cases:
Safety
Вовк і лисиця підстерегли черепаху в лісі і напали на неї. Черепаха не могла втекти і захиститися і стала благати про пощаду. Але вовк і лисиця були безжальні і розірвали черепаху на шматки.
(Pdb++) response.prompt_feedback
block_reason: SAFETY
safety_ratings {
category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_HATE_SPEECH
probability: NEGLIGIBLE
}
safety_ratings {
category: HARM_CATEGORY_HARASSMENT
probability: MEDIUM
}
safety_ratings {
category: HARM_CATEGORY_DANGEROUS_CONTENT
probability: NEGLIGIBLE
}
Fixing gpt4 stories with gemini works!
Леопард, відчуваючи респект, кивнув у знак схвалення, і Жук також не міг приховати свого здивування тонкістю роботи.
Леопард, відчуваючи повагу, кивнув у знак схвалення, а Жук не міг приховати свого здивування тонкістю роботи.
Gemini is better at other languages: neulab/gemini-benchmark
SECTION LOCATED HERE: 231213-1710 Ukrainska Pravda dataset
for ideas about it, see truthfulQA paper33 as well as any multi-lingual benchmark paper
openAI API
On LM harness scores for multiple choice acc VS acc_norm
Instructions
Do UP news classification with different models, do pretty graph about how it correlates with my benchmark results.
![[231213-1710 Ukrainska Pravda dataset#Appendixes A regexes for skipping paragraphs in UPravda dataset]]
This config file contains lemma fixes, word replacements and word blacklists, as well as the distractors used during CBT instance generation.
lemma_fixes:
  миш: миша # people named Михайло
  люди: люди # people named Люда
  люда: люди
  кота: кіт # not кот
  кот: кіт # not кот
  # and not вбивець
  # EDIT ACTUALLY it exists, though infrequently https://goroh.pp.ua/%D0%A7%D0%B0%D1%81%D1%82%D0%BE%D1%82%D0%B0/%D1%83%D0%B1%D0%B8%D0%B2%D0%B5%D1%86%D1%8C
  # pymorphy2 and spacy both use вбивець
  вбивці: вбивця
word_replacements:
  заяць: заєць
word_blacklist:
  - шати
  # - мати
  - бути
  - стати
  - могти
distractors:
  NAMED_ENTITY:
    animal:
      male:
        # - собака
        # - кіт
        - їжак
        # - птах
        # - метелик
        - ведмідь
        - півень
        - жираф
        # - дракон
        - слон
        # - ворона
      female:
        - коза
        - жаба
        # - кішка
        - свиня
        - мавпа
        - зозуля
      neutral:
        # TODO add more
        - котеня
        - слоненя
        - зайченя
        - жабеня
        - козеня
        - мавпеня
        - тигреня
        - козеня
        - вовчисько
    human:
      male:
        # - чоловік
        - син
        - багатир
        - Петро
        - лісник
        - селянин
        - чорт
        - домовик
        # - брат
      neutral:
        - дівча
        - дитя
        - немовля
      female:
        - селянка
        - відьма
        - жінка
        - дочка
        - сестра
        - мати
        - королева
  COMMON_NOUN:
    male:
      - автомобіль
      - будинок
      - шлях
      - ящик
      - меч
      - замок
      - стіл
    neutral:
      - дерево
      - яйце
      - ім'я
      - яблуко
      - місто
      - озеро
      - поле
      - вікно
      - ліжко
      - листя
      - шиття
      - мистецтво
    female:
      - гривня
      - природа
      - трава
      - річка
      - книга
      - дорога
      - кімната
‘Go mute first’ variation taken from here: Translations
Greg Brockman on X: “evals are surprisingly often all you need”
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit Choudhury (2020). “The state and fate of linguistic diversity and inclusion in the NLP world”.
TODO format citation: Debunking the myth of a divided Ukraine - Atlantic Council, citing Oeuvres complètes de Voltaire - Voltaire - Google Books
Volodymyr Kulyk (2018). “Shedding Russianness, recasting Ukrainianness: The post-Euromaidan dynamics of ethnonational identifications in Ukraine”.
Bohdan Krawchenko (1987). “Social change and national consciousness in twentieth-century Ukraine”.
George Stephen Nestor Luckyj (1987). “Keeping a record: Literary purges in Soviet Ukraine (1930s), a bio-bibliography”. https://cir.nii.ac.jp/crid/1130282272476965120
Lenore A Grenoble (2010). “Contact and the development of the Slavic languages”.
Ian Press, Stefan Pugh (2015). “Ukrainian: A comprehensive grammar”.
Jochen Rehbein, Olena Romaniuk (2014). “How to check understanding across languages. An introduction into the Pragmatic Index of Language Distance (PILaD) usable to measure mutual understanding in receptive multilingualism, illustrated by conversations in Russian, Ukrainian and Polish”.
Andreas Kappeler (2014). “Ukraine and Russia: Legacies of the imperial past and competing memories”. https://doi.org/10.1016/j.euras.2014.05.005
The primary source11 states that, to a certain extent, it still is among many Russians and some Europeans.
Also memorably stating that “a separate Little Russian language has never existed, does not exist and cannot exist, and that their dialect, used by commoners, is just the Russian Language, only corrupted by the influence of Poland”72
Volodymyr Dibrova (2017). “The Valuev Circular and the end of Little Russian literature”.
Johannes Remy, others (2017). “Despite the Valuev Directive: Books permitted by the censors in violation of the restrictions against Ukrainian publishing, 1864-1904”.
George Y. Shevelov (1987). “The language question in the Ukraine in the twentieth century (1900-1941)”. http://www.jstor.org/stable/41036243
Halyna Hryn (2004). “The executed renaissance paradigm revisited”. http://www.jstor.org/stable/41036862
“Of those [lost to Ukrainian literature] 236 were writers. (…) 1,087 writers were active in Ukraine, the loss amounted to 33 per cent.. In terms of figures alone the losses were quite significant, but in terms of literary quality and originality they were devastating.” 7
Kateryna Karunyk (2017). “The Ukrainian spelling reforms, half-reforms, non-reforms and anti-reforms as manifestation of the Soviet language policy”.
Daniel Racek, Brittany I. Davidson, Paul W. Thurner, Xiao Xiang Zhu, Göran Kauermann (2024). “The Russian war in Ukraine increased Ukrainian language use on social media”. https://www.nature.com/articles/s44271-023-00045-6 (doi: 10.1038/s44271-023-00045-6)
Nataliya Matveyeva (2017). “Modern language situation (on the basis of the 2017 survey)”. http://lcmp.ukma.edu.ua/article/view/123368 (doi: 10.18523/lcmp2522-92812017123368)
Olha Kanishcheva, Tetiana Kovalova, Maria Shvedova, Ruprecht Von Waldenfels (2023). “The Parliamentary Code-Switching Corpus: Bilingualism in the Ukrainian Parliament in the 1990s-2020s”. https://aclanthology.org/2023.unlp-1.10 (doi: 10.18653/v1/2023.unlp-1.10)
Nataliya Sira, Giorgio Maria Di Nunzio, Viviana Nosilia (2019). “Towards an automatic recognition of mixed languages: The Ukrainian-Russian hybrid language Surzhyk”. http://arxiv.org/abs/1912.08582
<_(@bernsand2001surzhyk) “Surzhyk and national identity in Ukrainian nationalist language ideology” (2001) / Niklas Bernsand: z / / _> %% %%[^45]: Some 70 even hypothesize two subtypes of it: an older one, created during the times of Russian language dominance when Ukrainian speakers had to adapt, and a newer post-1990 one, born when Russian speakers had to at least partially turn to Ukrainian. ↩︎
<_(@ratinggroupSixthNational) “The sixth national poll: The language issue in Ukraine (March 19th, 2022) — Ratinggroup.Ua” (2022) / : z / https://ratinggroup.ua/en/research/ukraine/language_issue_in_ukraine_march_19th_2022.html / _> ↩︎
Switching from Russian to Ukrainian, for a Russian speaker, is hard, including emotionally. Mother Tongue: The Story of a Ukrainian Language Convert - New Lines Magazine71 is one of the best articles I’ve read in 2023 and is an excellent description of the topic. ↩︎
<_(@enwiki:1182341232) “Languages used on the internet — Wikipedia, the free encyclopedia” (2023) / Wikipedia contributors: z / https://en.wikipedia.org/w/index.php?title=Languages_used_on_the_Internet&oldid=1182341232 / _> ↩︎
quoting Usage Statistics and Market Share of Content Languages for Websites, September 2023 ↩︎
<_(@wiki:xxx) “List of Wikipedias/Table2 — Meta, discussion about wikimedia projects” (2022) / Meta: z / https://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias/Table2&oldid=23936182 / _> ↩︎
<_(@bender) “On achieving and evaluating language-independence in NLP” (2011) / Emily M Bender: z / / _> ↩︎ ↩︎
<_(@benderpost) “The #BenderRule: On naming the languages we study and why it matters” (2019) / Emily Bender: z / https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/ / _> ↩︎ ↩︎ ↩︎
TruthfulQA/TruthfulQA.csv at main · sylinrl/TruthfulQA ↩︎ ↩︎
<_(@comrie2008leipzig) “The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses” (2008) / Bernard Comrie, Martin Haspelmath, Balthasar Bickel: z / / _> ↩︎
Unless a segmentation is needed only to have an adjacent morpheme that does need a gloss segmented correctly — then such a morpheme may not have a gloss. ↩︎
See List of glossing abbreviations - Wikipedia for a full list. ↩︎
Not to be confused with PERF (perfect tense), not used in this Thesis. ↩︎
Sometimes used, but absent from LGR proper since they are not glosses for morphological values.
↩︎Authors also use placeholders for generic elements in schematicized parsing, such as may be used to illustrate morpheme or word order in a language. Examples include head or hd ‘head’; root or rt ‘root’; stem or st ‘stem’; pref, prfx or px ‘prefix’; suff, sufx or sx ‘suffix’; clit, cl or encl ‘(en)clitic’; prep ‘preposition’ and pos or post ‘postposition’, png ‘person–number–gender element’ and tam ’tense–aspect–mood element’ (also ng number–gender, pn person–number, ta tense–aspect, tame tense–aspect–mood–evidential) etc. These are not listed below as they are not glosses for morphological values. (List of glossing abbreviations - Wikipedia) TODO remove this
<_(@9648705) “Ukrainian text preprocessing in GRAC” (2021) / Vasyl Starko, Andriy Rysin, Maria Shvedova: z / / 10.1109/CSIT52700.2021.9648705 _> ↩︎
as opposed to analytic languages; Wikipedia has cool bits in Synthetic language - Wikipedia e.g. antidisestablishmentarianism ↩︎
also known as fusional language:Fusional language - Wikipedia ↩︎
Another way to say this is that synthetic languages are characterized by a higher morpheme-to-word ratio. ↩︎
except for personal pronouns; English grammar - Wikipedia ↩︎
including the vocative case, absent in Russian, used when addressing someone (e.g. собакdog-аNOM when addressed becomes собакdog-oVOC) ↩︎
As an added layer of complexity, word stress can also impact grammatical categories. - TODO emphasize if I actually do a homonym-like task ↩︎
nice explanation: TODO removePerfective aspect - Wikipedia / Imperfective aspect - Wikipedia ↩︎
Or Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin in CoNLL-U FEATS format. ↩︎
TODO thank him for this word? Daniel Broomfield 🇺🇦🇬🇧 on X: “The hardest words in the Ukrainian language for me: використовуватимуться високопоставленими абищиця (I never remember where to put the stress 😑)” / X ↩︎
Some nouns can be used only in plural, e.g. in одні окуляри (one pair of glasses) the numeral one is plural! ↩︎
Parts of it — to history, other parts — explicitly forbidden in the 1932 grammar reform. ↩︎
This is only a partial description of both noun agreement and numeral declension. ↩︎
Mostly for some nouns of masculine gender (два громадянина) ↩︎
<_(@Syvokon2022) “UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language” (2022) / Oleksiy Syvokon, Olena Nahorna: z / / _> ↩︎ ↩︎ ↩︎
Notation for grammemes (Russian language) - pymorphy2 morphological analyzer ↩︎
<_(@wals) “WALS Online (V2020.3)” (2013) / : z / / 10.5281/zenodo.7385533 _> ↩︎
<_(@Korobov2015) “Morphological Analyzer and Generator for Russian and Ukrainian Languages” (2015) / Mikhail Korobov: z / http://arxiv.org/abs/1503.07283 / _> ↩︎ ↩︎
<_(@Korobov) “Morphological analyzer and generator for russian and ukrainian languages” () / Mikhail Korobov: z / http://dx.doi.org/10.1007/978-3-319-26123-2_31 / 10.1007/978-3-319-26123-2_31 _> ↩︎
<_(@labaContextualEmbeddingsUkrainian2023) “Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation” (2023) / Yurii Laba, Volodymyr Mudryi, Dmytro Chaplynskyi, Mariana Romanyshyn, Oles Dobosevych: z / https://aclanthology.org/2023.unlp-1.2 / 10.18653/v1/2023.unlp-1.2 _> ↩︎
<_(@taskCBT) “The goldilocks principle: Reading children’s books with explicit memory representations” (2015) / Felix Hill, Antoine Bordes, Sumit Chopra, Jason Weston: z / / 10.48550/ARXIV.1511.02301 _> ↩︎ ↩︎
<_(@bm_lmentry) “LMentry: A language model benchmark of elementary language tasks” (2022) / Avia Efrat, Or Honovich, Omer Levy: z / https://arxiv.org/abs/2211.02069 / 10.48550/ARXIV.2211.02069 _> ↩︎ ↩︎ ↩︎
<_(@linTruthfulQAMeasuringHow2022) “TruthfulQA: Measuring How Models Mimic Human Falsehoods” (2022) / Stephanie Lin, Jacob Hilton, Owain Evans: z / / _> ↩︎
Numbers and declension problems in the parses of all Ukrainian words · Issue #169 · pymorphy2/pymorphy2 ↩︎
<_(@danylyuk2022main) “The main features of the ukrainian grammar” (2022) / Nina Danylyuk, Tetiana Masytska, Douglas O’Brien, Oksana Rohach: z / / _> ↩︎
Strictly speaking, кота can be either ACC or GEN case. ↩︎
<_(@synchak2023feminine) “Feminine personal nouns in ukrainian: Dynamics in a corpus” (2023) / Vasyl Starko, Olena Synchak: z / / _> ↩︎
https://chat.openai.com/share/c694b707-4f23-4e57-8ee8-1e560dd3febe ↩︎
<_(@hentschel2020ukrainisch) “Ukrainisch-russisches und russisch-ukrainisches Code-Mixing. Untersuchungen in drei Regionen im Süden der Ukraine” (2020) / Gerd Hentschel, Tilmann Reuther: z / / _> ↩︎
<_(@newlinesmagMotherTongue) “Mother Tongue: The Story of a Ukrainian Language Convert — Newlinesmag.Com” (2023) / : z / https://newlinesmag.com/first-person/mother-tongue-the-story-of-a-ukrainian-language-convert/ / _> ↩︎
<_(@enwikisource:13111073) “Translation:Valuyev circular — Wikisource,” (2023) / Wikisource: z / https://en.wikisource.org/w/index.php?title=Translation:Valuyev_Circular&oldid=13111073 / _> ↩︎
Context: 230928-1527 Evaluation benchmark for DE-UA text Here I’ll keep random interesting benchmarks I find.
code: GLUECoS/Code at master · microsoft/GLUECoS
Context: 230928-1527 Evaluation benchmark for DE-UA text
ua_datasets
task@taskCBT
(2015) z/d/>
On the semantic front, exploit polysemy and homonymy differences. Formulate sentences with words that have multiple meanings in Russian, but those meanings have distinct equivalents in Ukrainian. This will challenge the model to accurately discern the intended sense based on context.
From 10, automatically generated!
I could also use a graph-based approach? As in create an ontology, ask questions about it?..
Or split it into multiple sub-tasks! one for time, one for y/n, etc.?
Find some popular website with comments and ratings, do sentiment analysis: can I scrape
https://rozetka.com.ua/jagermeister_4067700015532_/p4971091/comments/ ?
Not all comments are in UA but I can filter it.
cat task_text.txt | rot13
or whatever)From fido-ai/ua-datasets: A collection of datasets for Ukrainian language:
@laiChatGPTEnglishComprehensive2023 ChatGPT Beyond English (2023)
This is a dictionary that has homonyms as a column in the CSV: tamila-krashtan/UkrEtymDict: Revised database of Ukrainian Etymological Dictionary
ParlAI/parlai/tasks/squad2/test/squad2_index_test.yml at main · facebookresearch/ParlAI ↩︎
matheuss/google-translate-api: A free and unlimited API for Google Translate :dollar::no_entry_sign: ↩︎
lang-uk/ukrainian-word-stress-dictionary: Dictionary of word stresses in the Ukrainian language 🇺🇦 ↩︎
<_(@Todorov2022) “An Assessment of the Impact of OCR Noise on Language Models” (2022) / Konstantin Todorov, Giovanni Colavizza: z / / _> ↩︎
<_(@synchak2023feminine) “Feminine personal nouns in ukrainian: Dynamics in a corpus” (2023) / Vasyl Starko, Olena Synchak: z / / _> ↩︎
Babi: <@westonAICompleteQuestionAnswering2015
Towards AI-Complete Question Answering (2015) z/d/> / Holistic Evaluation of Language Models (HELM) ↩︎
You don’t know what to do with those lives anyway. And wherever you look, you are still looking into the fire in which your life is burning. The mercy is that instead of crematoria you have televisions and supermarkets. And the truth is that their function is the same. (Pelevin, “Generation ‘П’”)1
Recently I got my hands on an edition of Seneca’s “Moral Letters to Lucilius” (Wikipedia), almost identical to the one I read many times in my last years of school and first years of university. I loved the book immensely even then, and for me it is, together with Kostiantyn Moskalets’s “The Cell of the Tea Rose”, the most important book of that period, or, I’m not afraid to say it, the most important book, full stop.
That first copy (scribbled over, worn and crumpled, as a book that gets read should be) was left behind in the seat of a UIA plane, and I forgot it existed. Until today, when its sister ended up in my hands: the same cover, the same translation, only (no idea how) slightly thicker than the previous one.
I read the first 15 letters almost in one breath, recalling both individual thoughts and specific sentences/phrasings, underlining the same quotes in the same places, enjoying every second.
I have a tradition: I date every book I read on page 13, and while reading I annotate and underline a lot, … (If there is nothing to underline in a book, the book isn’t worth reading at all). Then every time I reread the book I do the same thing, signing and dating page thirteen and so on, but in a different color. That way it is clearly visible which things seemed important to me back then, which I didn’t notice, which I didn’t understand, which I stopped understanding, etc.: a history of my perception of the book, so to speak.
(One of the worst memories of my life is tied to that very copy of Seneca: I had read that book before my depression and underlined a lot, and rereading it in search of comfort at the very bottom, I suddenly realized that half of my marks were on sentences that were now too long for me, and that I didn’t have enough working memory to understand them, although two years earlier I had…)3
So, with this book I clearly remembered what I had liked in it, what I had disagreed with, what had struck me, and so on.
And how exaggerated, elitist and irrelevant Letter VII had seemed to me.
The book is considered one of the two most important books of the philosophical school of Stoicism (the other being Marcus Aurelius’s “Meditations”). It is framed as letters from an older friend to a younger one, and it unobtrusively lays out Seneca’s thoughts on how to live rightly.
Seneca’s biography and his philosophy are hard to reconcile, and he understood this himself. There’s “If you are so smart why aren’t you rich”, and then there’s writing moral letters while having risen to power under the emperor Caligula and been Nero’s personal tutor (and having reached the peak of wealth and power precisely under him)5. But all my life I have believed that one can learn something even from a bad orator, and Seneca writes rather well. The fact that Seneca did not live his life the way he preached, and that his track record as a teacher is not brilliant (Nero…), does not stop me from finding his thoughts interesting and valuable. I’m sure Seneca would agree. He wrote a lot about how everything well written is great, whoever the author.6
So (emphasis mine; here and below the quotes follow Andrii Sodomora’s wonderful Ukrainian translation, rendered into English):
Seneca greets his Lucilius!
You ask what you should avoid above all? The crowd. It is still dangerous for you to mingle with it. I, at least, do not hide my weakness: I can never return home with the character I took out with me intact. Something of what I had put in order gets unsettled; something I had got rid of comes back.
[…]
Contact with many is harmful. There is always someone who either urges some vice upon us, or passes it on to us, or imperceptibly smears us with it. So the denser the crowd we plunge into, the greater the danger for us. But nothing is as ruinous for good character as frequenting spectacles of any kind. That is when, together with pleasure, vices easily steal into the soul. Do you understand what I mean? I come back more greedy, more vain, more fastidious, even more cruel and less humane: I have been among people.
[…]
“But more than one of them went out robbing, killed a man.” So what? He killed, and now he is paying for it. And you, what have you done, wretch, to be watching this?.. “Kill, flog, burn! Why does he run onto the sword so timidly? Why does he kill so hesitantly? Why does he go to his death so sluggishly? Drive them onto the blade with the whip, let them take each other’s blows with bare chests!” An intermission in the show? “Let people die in the intermission too, so that not a minute is left unfilled!”
Do you really not understand that bad examples turn against those who set them? Thank the immortal gods that you are teaching cruelty to one who is too dull to learn it. Yes. Whoever’s soul is still too tender and does not yet hold firmly enough to the good must stay as far from the crowd as possible: he easily goes over to the side of the majority. […]
However much we temper our character, hardly anyone could withstand such an onslaught of vices. A single example of extravagance or of avarice brings much evil with it; the company of a spoiled man gradually softens and enervates us too; a rich neighbour fans our greed; even a bright, sincere soul will be permeated by vice, as if by rust, from a wicked companion. What then, do you think, will remain of our good character when the whole populace moves against it? Inevitably you will either imitate it or come to hate it. Both must be avoided: do not become like the wicked because they are many, but do not be an enemy of the many because they are unlike you. Withdraw, as far as possible, into yourself.
Befriend only those who can make you better. The benefit here is mutual: people, in teaching, learn.[…]
Seneca was writing about the harm of the crowd and of spectacles. Spectacles in that context are gladiator fights. (The crowd is just the crowd, literally and figuratively, as far as I can tell.)
The relevance of gladiator fights’ influence on the morals and psyche of citizens seemed to me, in 21st-century Kyiv, to be zero; the ban on talking to the crowd/people, lest, God forbid, you pick up the worst from them, seemed unjustifiably elitist (although if my compatriots went to watch executions out of boredom, who knows; in any case the Letters were written when Seneca had become almost totally disillusioned with his fellow citizens and, hm, the authorities).
Reading it at the level of the obvious metaphors for the one and the other: well OK, yes, I can get the message, but it’s not wow and it’s not my favorite letter.
This happens often when you read relatively ancient books: times change, people change, morals change, what was considered abnormal in 1965 is obvious now, so what can you expect from a book from the year 65 (sixty-five). In general, the idea that gladiators are not OK because they breed cruelty, at the very least, was probably, if not revolutionary, then certainly not mainstream at the time the book was written. There are obviously great bits (“Befriend only those who can make you better.”), there always are, and otherwise you finish reading, turn the page, and move on.
Reading this Letter now, recalling how almost condescendingly I used to skip past it, I suddenly understood the relevance of these lines and examples (at least for me).
(Source: Reddit - Dive into anything)
I come back more greedy, more vain, more fastidious, even more cruel and less humane: I have been
among people in the comments of a Telegram post about a taxi driver in Odesa who refused to turn off Russian music.
Oh, ohhh, riiight, got it, that makes sense, so that’s what it was about!
“for it is not the words one must hold to, but the thought” / “for one must stay faithful not to the words but to the thoughts”
(Letter IX)
Thank you, Seneca, thank you, because what follows is my interpretation of what you would have meant today, and I will be no less faithful to it than to the literal reading (what if I get invited to a fight between a gladiator and a lion, who knows), and I will attribute both to you.
Below I’ll try to gather thoughts on many topics that all touch on the same problems; it’s not impossible that Seneca, as well as the other authorities-or-not-quite whom I’ll mention, would consider this garbage and me an idiot, and that this post will ruin your life and burn down your house.
As often happens on this blog, this post is meant to help me at least7, everything else is a bonus.
Caveat emptor, and while we’re at it: lasciate ogni speranza voi, che entrate.
(Або: list of things that make life worth living but we can’t let people enjoy things, can we)
Three posts about different things; I adore all three and recommend reading them in full together with Seneca:
If we’ve forgotten what kinds of entertainment exist and want a broader list than gladiators and the shows in between, fear not. Here is a list of games the Buddha did not play, considering them a “cause for negligence”: List of games that Buddha would not play - Wikipedia. The sutra it comes from10 also has a list of entertainments:
“Or he might say: ‘Whereas some honourable recluses and brahmins, while living on food offered by the faithful, attend unsuitable shows, such as:
- shows featuring dancing, singing, or instrumental music;
- theatrical performances;
- narrations of legends;
- music played by hand-clapping, cymbals, and drums;
- picture houses;
- acrobatic performances;
- combats of elephants, horses, buffaloes, bulls, goats, rams, cocks and quails;
- stick-fights, boxing and wrestling, sham-fights, roll-calls, battle-arrays, and regimental reviews—
the recluse Gotama abstains from attending such unsuitable shows.’
If even that is not enough, here is my own attempt:
And here the question is the following.
Life is short. People are complicated. The point of art is to comfort those who suffer, and to stir unease or at least doubt in those for whom all is well.
And the point of this post is to make those who think all is well with them notice the fire around them in which their life is burning, the swamp in which they are slowly drowning, and the moss that grows on them and with every minute gets softer and warmer and cozier.
Life is short. People are complicated. So complicated that the construction “what do you want to want” is intuitively understandable, in the absence of more complex modal verbs that could convey all the shades of “I don’t want to do X but do it anyway”.
What an interesting paradox that we can consciously do what we consciously do not want to do.
All this time, deep inside, or not that deep, there is a feeling of eternity, of light, an understanding that there is a particle of god inside us, that we are human, that we can do anything. There is a feeling that we are betraying the eternal in us. That we were not created for mindless consumption of lowest common denominator content.
That we are ashamed..?
That there is a matrix around us, that the world around us is built so as not to let us see this eternal thing in ourselves, or at least not to hold it in consciousness for too long. That there are ways to free ourselves from…
Humans are the only animal for whom their own existence is a problem.
Meditation is the solution and salvation, the absence of Games and people is salvation
Inviting Mara to Tea (archived12)
Become Superhuman: Maximize Your Potential and Fulfill Your Destiny | The Art of Manliness
Social media, comments on TG are like theater, culture of being offended
Loss of time, of willpower, brings no benefit
no SM during vacation led to feelings of happiness, depth, the ability/desire to create, to tidy up
lowest common denom content
ask yourself how to read a text deeply, throw out the unnecessary to feel boredom
ask yourself questions that need answering before reading the material
Time passes, life is short, a cup of tea
Factfulness, infornography,
Let people enjoy things
“When God says no, He is saying, ‘Don’t hurt yourself.’” (Emerson Eggerichs)
Whoa, look where Google took me while searching for the source of this quote: https://kprf.ru/library/classics/fiction/3764.html ↩︎
Why Do All Online Recipe Blogs Start With a Long Life Story? | Good/Bad Marketing ↩︎
I thank God for the experience. The first time it was existential and catastrophic; the following times it was already part of the map, and even a set of convenient landmarks by which you can gauge where you are now, often before other signs. Now, when I start forgetting to turn off the tap or close the fridge, it’s a really handy little bell and one of the first ↩︎
I’ll take the sin upon my soul. The letter is short and can be read in full. If you don’t have the book, the quotes can be googled and lead to pages where they are mentioned: Сенека вітає свого Луцілія! Питаєш, чого слід уникати передусім? Юрби. Тобі ще небезпечно стикатися з нею. - Google Suche ↩︎
Towards the very end, Nero condemned him to suicide. ↩︎
For example: “What is true is mine. I will keep quoting Epicurus to you, so that those who, blindly clinging to words, consider only who speaks and not what is said, may know: everything that is flawlessly said is common property.” (Letter XII) ↩︎
↩︎Well said, too, by someone else (it is not certain to whom the saying belongs), when asked why polish a work so much if barely a few readers will pick it up. “A few are enough for me,” he replied, “one is enough, and it is enough if there is none at all.” Famous too is a third saying, by Epicurus, who wrote to one of his companions, partners in philosophical studies: “I say this for you, not for the public: we are, each for the other, a crowded theater” (Written by obviously-who, at the end of the same Letter VII.)
Sabbath hard and go home — LessWrong / Sabbath hard and go home | Compass Rose ↩︎
relevant (remembering everything Seneca wrote about phrases from murky sources):
↩︎⚡️ IT HAS BEGUN!
Also known as “It’s Happening”. The emotional background of the whole politics-adjacent internet is a wild, nervous anticipation that somewhere it will finally Blow Up and All Hell Will Break Loose.
This evening that turned out to be the situation in Serbia. People today even wrote openly that the situation reminded them of February 24, in an almost sweet anticipation of carnage. The carnage was not delivered today; on Telegram they write “tonight we sleep”. But the interesting thing is what draws people to this endless IT HAS BEGUN.
The answer, in my view, is fairly simple. People oversaturated with dopamine from mass media are waiting to finally be hit with a sensation so strong it goes WHOA, after which everything will be different. The general backdrop of this anticipation is a feeling of the imagined hopelessness of everyday life and a hope for an Upheaval. And from there, everyone builds on their own feelings.
The ’new’ function in matplotlib for this is matplotlib.pyplot.bar_label — Matplotlib 3.8.0 documentation (ty Revisions to Display count on top of seaborn barplot [duplicate] - Stack Overflow):
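import seaborn as sns

# df is the DataFrame being plotted; langs_parsed is one of its columns
ax = sns.histplot(df.langs_parsed)
# ax.set_xlabel("Language(s)")
# ax.set_ylabel("# of files")
# label every bar container with its value
for container in ax.containers:
    ax.bar_label(container)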
The second link has infos about barplot, catplot, and countplot too!
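A minimal, self-contained sketch of the same idea with plain matplotlib and no seaborn; the language codes and counts are made-up illustration data, and the Agg backend is only there so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, assumption: no GUI available
import matplotlib.pyplot as plt

# hypothetical data: number of files per language
fig, ax = plt.subplots()
ax.bar(["uk", "en", "de"], [3, 7, 5])

# every call to ax.bar() adds one BarContainer to ax.containers
label_texts = []
for container in ax.containers:
    annotations = ax.bar_label(container)  # returns the created text labels
    label_texts.extend(a.get_text() for a in annotations)

print(label_texts)
plt.close(fig)
```

`bar_label` needs Matplotlib 3.4 or newer; seaborn's bar-like plots draw through the same containers, which is why the loop above carries over unchanged.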
If a label runs past the top of the axes (for example beyond the light-gray background of seaborn’s theme), raise the upper y-limit:
ylim = ax.axes.get_ylim()[1]
new_ylim = ylim + 300
ax.axes.set_ylim(0, new_ylim)
# you can also set the label padding (in points) and any Text (https://matplotlib.org/stable/api/text_api.html#matplotlib.text.Text) properties:
for container in g.axes.containers:
    g.bar_label(container, padding=-10, fontsize=5)
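The fixed `+ 300` only fits one particular plot. A hedged variant that scales the headroom with the current limit instead; the 10% factor and the data are arbitrary assumptions, not a recommendation from the docs:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, assumption: no GUI available
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
bars = ax.bar(["a", "b"], [120, 950])  # made-up values
ax.bar_label(bars, padding=3, fontsize=8)  # padding is in points

# add 10% headroom above the automatically chosen upper limit
top = ax.get_ylim()[1]
ax.set_ylim(0, top * 1.10)
new_top = ax.get_ylim()[1]
plt.close(fig)
```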
EDIT 2023-10-06:
To disable scientific notation, pass the fmt= argument (see the bar_label docs); it takes a format string, including (in Matplotlib 3.7+) the str.format style with {}:
for i in ax.axes.containers:
    ans = ax.bar_label(
        i,
        fmt="{:,.2f}",
    )
There’s also a parameter that decides at which point to start to use sci. notation, I think I closed the tab with the link though.
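If the installed Matplotlib is too old for `{}`-style `fmt`, one workaround (a sketch, with made-up values) is to format the numbers yourself and hand them over through the `labels=` argument of `bar_label`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, assumption: no GUI available
import matplotlib.pyplot as plt

values = [1234567.891, 2500000.0]  # hypothetical large values
fig, ax = plt.subplots()
bars = ax.bar(["x", "y"], values)

# build the label strings by hand: thousands separators, two decimals
texts = ax.bar_label(bars, labels=[f"{v:,.2f}" for v in bars.datavalues])
formatted = [t.get_text() for t in texts]
plt.close(fig)
```

This keeps full control over the text at the cost of bypassing `fmt` entirely.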
It includes a really cool list of corpora!
And at the end has a list of other such pages for other languages etc.
Also: deutschland · PyPI: “A python package that gives you easy access to the most valuable datasets of Germany.”
The LREC Author’s Kit prints everything in the .bib file, and it uses \nocite{*} for that.
The Internet from 2009 agrees that’s the way to go: Biblatex - Printing all entries in .bib file (cited and not)
Removing this line removes the printout.
Lastly, the link above shows printing separate bibliographies; the LREC Author’s kit does something different for the same:
\subsection{Language Resource References}
Language resource references should be listed in alphabetical order at the end of the paper.
\nocite{*}
\section{Bibliographical References}\label{sec:reference}
\bibliographystyle{lrec-coling2024-natbib}
\bibliography{lrec-coling2024-example}
\section{Language Resource References}
\label{lr:ref}
\bibliographystylelanguageresource{lrec-coling2024-natbib}
\bibliographylanguageresource{languageresource}
TL;DR: \newpage~\newpage~\newpage~\newpage for 3 empty pages.
\newpage doesn’t always work well for me, esp. not in the IEEE and LREC templates. Either only one column is cleared, or there are issues with the positions of images/tables/….
\clearpage works for me in all cases I’ve tried.
EDIT: but only one page, not multiple! For multiple empty pages one after the other this1 does the trick:
\newpage
~\newpage
ChatGPT thinks it works because ~, being a non-breaking space, counts as content on the otherwise empty page: \newpage does nothing on a page that has no content yet, so without the ~ the consecutive \newpage commands collapse into a single page break.
When saving seaborn images there was weirdness going on, with borders either cutting labels or being too big.
Solution:
# bad: cut corners
ax.figure.savefig("inat_pnet_lorenz.png")
# good: no cut corners and OK bounding box
ax.figure.savefig("inat_pnet_lorenz.png", bbox_inches="tight")
EDIT 2023-12-14
Paper reviewer suggested exporting in PDF, which led me to graphics - Good quality images in pdflatex - TeX - LaTeX Stack Exchange:
Both gnuplot and matplotlib can export to vector graphics; file formats for vector graphics are e.g. eps or pdf or svg (there are many more). As you are using pdfLaTeX, you should choose pdf as output format, because it will be easy to include in your document using the graphicx package and the \includegraphics{} command.
Awesome! So I can save to PDF and then include using the usual code (edit - eps works as well). Wow!
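Putting the two notes together, a small sketch that writes a tightly cropped vector PDF; the file name and the temporary output directory are assumptions for illustration only:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, assumption: no GUI available
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel("a fairly long x-axis label")

# the output format is inferred from the file extension;
# bbox_inches="tight" crops the canvas to the drawn content
out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "plot.pdf")
fig.savefig(pdf_path, bbox_inches="tight")
plt.close(fig)
```

The resulting file can then be included with `\includegraphics{plot.pdf}` as described in the quoted answer.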
Static image export in Python:
fig.write_image("images/fig1.png")
PDF works as-is as well, EPS needs the poppler library but then works the same way
For excessive margins in the output PDFs:
fig.update_layout( margin=dict(l=20, r=20, t=20, b=20), )
When including a PDF plot, I get this sometimes:
This is a problem only when viewing the PDF inside qutebrowser/Overleaf, in a normal PDF viewer it’s fine!
Didn’t find this in the documentation, but:
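gg = ds.groupby(by=["species"])
# gg.groups is a dict of {group name: row labels}; a dict is not an iterator, so wrap it in iter()
lg = next(iter(gg.groups))
# lg is the name of the first group; depending on the pandas version it is a plain string or a one-element tuple
group_df = gg.get_group(lg)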
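A self-contained version of the same trick on a toy DataFrame (the column names and values are invented). Grouping by a plain string rather than a one-element list keeps the group keys as simple scalars, which sidesteps the version-dependent tuple behaviour:

```python
import pandas as pd

# hypothetical data
ds = pd.DataFrame(
    {
        "species": ["cat", "dog", "cat", "owl"],
        "weight": [4.0, 20.0, 5.0, 1.5],
    }
)

gg = ds.groupby("species")

# take the first group name without looping over all groups
first_key = next(iter(gg.groups))

# fetch the sub-DataFrame that belongs to that group
group_df = gg.get_group(first_key)
print(first_key, len(group_df))
```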