03 Dec 2023

Masterarbeit eval task LMentry-static-UA

Context: 220120-1959 taskwarrior renaming work tasks from previous work

First notes

Just tested this: DAMN!

User
Can you, in English, name one word for each of these tasks:
1. Rhymes with "chair"
2. Is a number larger than eleven
3. Has two letters "a"
4. Ends with the letter "k"
5. In the sentence "the cat had four paws and a good mood" is BEFORE the word "paws"

Also:
6. A sentence that starts with the word "dogs"
7. A sentence that ends with the word "beaver"
8. A sentence that uses the word "metal" twice

https://chat.openai.com/share/3fdfaf05-5c13-44eb-b73f-d66f33b73c59

lmentry/data/all_words_from_category.json at main · aviaefrat/lmentry

Not all of it needs code and regexes! lmentry/data/bigger_number.json at main · aviaefrat/lmentry

I can really do a small lite-lite subset containing only tasks that are evaluatable as a dataset.

LMentry-micro-UA

// minimal, micro, pico

Plan:

Go methodically through all of those tasks, divide them into regex and non-regex, clone the code, translate the prompts, generate the dataset.

Decision on 231010-1003 Masterarbeit Tagebuch#LMentry-micro-UA: doing a smaller version works!

LMentry-static-UA

Basics

Will contain only a subset of tasks, the ones not needing regex. There are surprisingly many of them.

The code will generate a json dataset for all tasks.
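Rough sketch of what "generate a json dataset for all tasks" could look like. The `TaskInstance` fields and the `dump_tasks` helper are placeholders made up for illustration, not a settled schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TaskInstance:
    # All field names are placeholders, not a settled schema.
    task: str        # e.g. "first_letter"
    question: str    # the filled-in prompt
    answer: str      # gold answer
    meta: dict = field(default_factory=dict)  # anything useful for later analysis


def dump_tasks(instances: list[TaskInstance]) -> str:
    """Serialize instances to one JSON document, grouped by task name."""
    grouped: dict[str, list[dict]] = {}
    for inst in instances:
        grouped.setdefault(inst.task, []).append(asdict(inst))
    # ensure_ascii=False keeps Ukrainian text readable in the file
    return json.dumps(grouped, ensure_ascii=False, indent=2)


instances = [
    TaskInstance("first_letter", "Яка перша літера в слові 'кіт'?", "к"),
    TaskInstance("last_letter", "Яка остання літера в слові 'кіт'?", "т"),
]
dataset_json = dump_tasks(instances)
```

Keeping a free-form `meta` dict per instance leaves room for things like word frequency or length without changing the schema later.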

Implementation

Original task/code/paper analysis

lmentry/resources at main · aviaefrat/lmentry has the JSONs with static words etc. used to generate the tasks
lmentry/lmentry/predict.py at main · aviaefrat/lmentry contains the predicting code used to evaluate it using different kinds of models - I’ll need this later.
tasks enumeration:
- lmentry/lmentry/tasks/lmentry_tasks.py at main · aviaefrat/lmentry
- from the paper ¹
lmentry/lmentry/constants.py at main · aviaefrat/lmentry list of all tested models
static tasks (~~won’t implement~~, completed):
- ~~Sentence containing word~~
- ~~Sentence not containing word~~
- ~~Word containing letter~~
- ~~Word not containing letter~~
- Most associated word
- Least associated word
- Any words from category
- All words from category
- First alphabetically
- More letters
- Less letters
- Bigger number
- Smaller number
- Rhyming word
- ~~Homophones~~ Won’t do because eight/ate won’t work in ukr
- Word after in sentence
- Word before in sentence
- ~~Sentence starting with word~~
- ~~Sentence ending with word~~
- ~~Word starting with letter~~
- ~~Word ending with letter~~
- First word of the sentence
- Last word of the sentence
- First letter of the word
- Last letter of the word

Distractors

order of words
template content
adj tasks VS arg. content

My changes

I’d like to have words separated by:
- frequency
- length,
- … and maybe do cool analyses based on that
(DONE) in addition to first/last letter/word in word/sentence, add arbitrary “what’s the fourth letter in the word ‘word’?”
Longer/shorter words: add same length as option
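The "arbitrary n-th letter" idea can be sketched like this. English only; `ORDINALS_EN` and `nth_letter_instance` are names invented for the example, and a Ukrainian version would need a correctly inflected ordinal instead of the lookup.

```python
ORDINALS_EN = {1: "first", 2: "second", 3: "third", 4: "fourth", 5: "fifth"}


def nth_letter_instance(word: str, n: int) -> dict:
    """Build one 'n-th letter of the word' question (n is 1-based)."""
    if not 1 <= n <= len(word):
        raise ValueError(f"'{word}' has no letter number {n}")
    return {
        "question": f"What is the {ORDINALS_EN[n]} letter of the word '{word}'?",
        "answer": word[n - 1],
        # raw parameters are kept so results can be sliced by position/length
        "meta": {"word": word, "n": n, "word_length": len(word)},
    }


example = nth_letter_instance("word", 4)
```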

My code bits

I have to write it in a way that I can analyze it for stability wrt morphology etc. later

Ukrainian numerals creation

Problem: ‘1’ -> один/перший/(на) першому (місці)/першою, i.e. “one” / “first” / “(in) first (place)” / “first” (feminine, instrumental)
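To make the problem concrete: the same digit needs a different surface form depending on numeral type, case and gender. A toy lookup with only the four forms listed here; the grammatical labels (OpenCorpora-style grammemes) are an assumed reading of those forms, and real code would generate the forms instead of listing them.

```python
# (kind, case, gender) -> surface form for the number 1.
# Only the four forms from the note; the keys are an assumption
# about how one might index them.
FORMS_OF_ONE = {
    ("cardinal", "nomn", "masc"): "один",
    ("ordinal", "nomn", "masc"): "перший",
    ("ordinal", "loct", "neut"): "першому",  # (на) першому (місці)
    ("ordinal", "ablt", "femn"): "першою",
}


def numeral_form(kind: str, case: str, gender: str) -> str:
    """Look up a form; raises KeyError for combinations not listed."""
    return FORMS_OF_ONE[(kind, case, gender)]


template = "Слово стоїть на {n} місці."
sentence = template.format(n=numeral_form("ordinal", "loct", "neut"))
```

A template can only be filled correctly if the slot knows which form it needs, which is why plain digit-to-word conversion is not enough.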

Existing solutions:

pymorphy can inflect existing words but needs
savoirfairelinux/num2words: Modules to convert numbers to words. 42 –> forty-two - can’t do ordinals + case
- https://github.com/savoirfairelinux/num2words/blob/master/num2words/lang_UK.py
- https://github.com/savoirfairelinux/num2words/blob/974f7a8669682b18e2e14aadf23f4626c7c41b6e/tests/test_uk.py

Created my own! TODO document

TODO https://webpen.com.ua/pages/Morphology_and_spelling/numerals_declination.html

More tagsets fun

Parse(word='перша', tag=OpencorporaTag('ADJF,compb femn,nomn'), normal_form='перший', score=1.0, methods_stack=((DictionaryAnalyzer(), 'перша', 76, 9),))

compb

Nothing in docu, found it only in the Ukr dict converter tagsets mapping: LT2OpenCorpora/lt2opencorpora/mapping.csv at master · dchaplinsky/LT2OpenCorpora

I assume it should get converted to comp but doesn’t - yet another future bug report to pymorphy4

Even more tagsets fun

pymorphy2 doesn’t add the sing tag for Ukrainian singular words. Then any inflection that deals with number fails.

Same issue I had in 231024-1704 Master thesis task CBT

Found a way around it:

# needs: from pymorphy2.analyzer import Parse
@staticmethod
def _add_sing_to_parse(parse: Parse) -> Parse:
	"""
	pymorphy2 sometimes doesn't add the singular tag for Ukrainian words
	(and then fails when it needs to inflect them to plural etc.).

	This creates a new Parse with "sing" added.
	"""
	# the tag already has a number, nothing to fix
	if parse.tag.number is not None:
		return parse

	new_tag_str = f"{parse.tag},sing"
	new_tag = parse._morph.TagClass(tag=new_tag_str)
	new_best_parse = Parse(
		word=parse.word,
		tag=new_tag,
		normal_form=parse.normal_form,
		score=parse.score,
		methods_stack=parse.methods_stack,
	)
	# re-attach the analyzer, otherwise inflecting the new Parse fails
	new_best_parse._morph = parse._morph
	return new_best_parse


# Not needed for LMentry, but I'll need it for CBT anyway...
@staticmethod
def _make_agree_with_number(parse: Parse, n: int) -> Parse:
	# grammemes the word needs in order to agree with the number n
	grams = parse.tag.numeral_agreement_grammemes(n)
	new_parse = Numbers._inflect(parse=parse, new_grammemes=grams)
	return new_parse

parse._morph is the MorphAnalyzer instance; without it attached, inflecting the new Parse fails.
TagClass follows the recommendation of the docu²: use it rather than creating a new OpencorporaTag directly, even though both give the same class.

Notes by task

Taxonomy

Comparing two things

Words of different lengths, alphabet order of words, etc.

Main relationship is kind=less|more, where less means “word closer to beginning of the alphabet”, “smaller number”, “word with fewer letters” etc., more is the opposite.
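A minimal sketch of how one `kind=less|more` switch could cover all of these comparisons. `pick` and its `key` argument are invented for illustration; raising on ties is just one option (the "same length" case would need its own answer instead).

```python
from typing import Callable, Literal

Kind = Literal["less", "more"]


def pick(a: str, b: str, kind: Kind, key: Callable[[str], object]) -> str:
    """Return whichever of a/b is 'less' or 'more' under the given key."""
    if key(a) == key(b):
        raise ValueError("tie: neither item is less/more")
    chooser = min if kind == "less" else max
    return chooser((a, b), key=key)


# same function, different notions of less/more
shorter = pick("кіт", "собака", "less", key=len)  # fewer letters
bigger = pick("12", "7", "more", key=int)         # bigger number
```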

Alphabet order of words

(DONE) which word is closer to beginning of alphabet
Are these words in alphabet order?
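One thing to watch for here: plain string comparison in Python goes by code point, and ґ, є, і, ї do not sit in alphabet order in Unicode. A sketch with an explicit alphabet; `ua_key` and `in_alphabet_order` are made-up helper names, and skipping apostrophes and hyphens is a simplification.

```python
UA_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"
_RANK = {ch: i for i, ch in enumerate(UA_ALPHABET)}


def ua_key(word: str) -> list[int]:
    """Sort key following Ukrainian alphabet order (case-insensitive).

    Characters outside the alphabet (apostrophe, hyphen) are skipped.
    """
    return [_RANK[ch] for ch in word.lower() if ch in _RANK]


def in_alphabet_order(words: list[str]) -> bool:
    """True if every word sorts no later than the one after it."""
    return all(ua_key(a) <= ua_key(b) for a, b in zip(words, words[1:]))


first = min(["їжак", "яблуко"], key=ua_key)
naive_first = min(["їжак", "яблуко"])  # plain code-point comparison
```

Here `first` and `naive_first` disagree, because ї has a higher code point than я although it comes much earlier in the alphabet.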

Which word is longer

TODO Which number is bigger

use the one-million bits and add to the text that this is why I needed to care about agreement
do comparisons of entities! one box has a million pencils, the other has five hundred thousand. Which has more pencils?

GPT4 agreement issues

https://chat.openai.com/share/b52baed7-5d56-4823-af3e-75a4ea8d5b8c: 1.5 errors, but I’m not sure myself about the fourth one.

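    LIST = [
        # "Which word is closer to the beginning of the alphabet: '{t1}' or '{t2}'?"
        "Яке слово стоїть ближче до початку алфавіту: '{t1}' чи '{t2}'?",
        # "What comes further in the alphabet: '{t1}' or '{t2}'?"
        "Що є далі в алфавіті: '{t1}' чи '{t2}'?",
        # "Between '{t1}' and '{t2}', which word is closer to the end of the alphabet?"
        "Між '{t1}' та '{t2}', яке слово розташоване ближче до кінця алфавіту?",
        # TODO - в алфавіті?
        # "Comparing '{t1}' and '{t2}', which word is closer to A in the alphabet?"
        "У порівнянні '{t1}' і '{t2}', яке слово знаходиться ближче до A в алфавіті?",
        # ChatGPT used the wrong grammatical case (відмінок) in the one below:
        #  "Визначте, яке з цих слів '{t1}' або '{t2}' знаходиться далі по алфавіті?",
    ]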

HF Dataset

I want a ds with multiple configs.
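One way to get multiple configs, as far as the Hub's manual-configuration docs go, is a `configs:` list in the YAML front matter of the dataset's README.md, one entry per task. A sketch that builds that front matter; the helper name and the file paths are invented for the example.

```python
def readme_configs_yaml(task_files: dict[str, str]) -> str:
    """Build README.md front matter mapping config names to data files."""
    lines = ["---", "configs:"]
    for config_name, path in task_files.items():
        lines.append(f"- config_name: {config_name}")
        lines.append(f'  data_files: "{path}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


# hypothetical task names and paths
front_matter = readme_configs_yaml({
    "first_letter": "data/first_letter.jsonl",
    "bigger_number": "data/bigger_number.jsonl",
})
```

Each task would then be loadable on its own by passing the config name as the second argument to `load_dataset`.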

Base patterns

HA! Lmentry explicitly lists base patterns: lmentry/lmentry/scorers/first_letter_scorer.py at main · aviaefrat/lmentry

starts = "(starts|begins)"

base_patterns = [
rf"The first letter is {answer}",
rf"The first letter {of} {word} is {answer}",
rf"{answer} is the first letter {of} {word}",
rf"{word} {starts} with {answer}",
rf"The letter that {word} {starts} with is {answer}",
rf"{answer} is the starting letter {of} {word}",
rf"{word}: {answer}",
rf"First letter: {answer}",
]
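A cut-down sketch of how such base patterns can be turned into a scorer. This is not lmentry's actual scorer code: the `of` alternation, the optional quotes and the `is_correct` helper are guesses made for illustration, and only a few of the patterns are included.

```python
import re


def first_letter_patterns(word: str, answer: str) -> list[str]:
    """A few base patterns in the spirit of the list above (not the full set)."""
    q = "['\"]?"                 # optional quote around word/answer
    of = "(of|in)"               # assumption: stand-in for lmentry's {of}
    starts = "(starts|begins)"
    word = f"{q}{re.escape(word)}{q}"
    answer = f"{q}{re.escape(answer)}{q}"
    return [
        rf"The first letter is {answer}",
        rf"The first letter {of} {word} is {answer}",
        rf"{word} {starts} with {answer}",
        rf"First letter: {answer}",
    ]


def is_correct(prediction: str, word: str, answer: str) -> bool:
    """True if the whole prediction matches any base pattern."""
    text = prediction.strip().rstrip(".")
    return any(
        re.fullmatch(p, text, flags=re.IGNORECASE)
        for p in first_letter_patterns(word, answer)
    )
```

Matching the whole string with `re.fullmatch` is deliberately strict: an answer that mentions the right letter somewhere inside a longer ramble does not count.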

For more: lmentry/lmentry/scorers/more_letters_scorer.py at main · aviaefrat/lmentry

Looking for example sentences

spacy example sentences
political ones from UP!
implemented
for words, I really should use some normal dictionary.

Assoc. words and resources

Another dictionary I found: slavkaa/ukraine_dictionary: Словник слів українською (слова, словоформи, синтаксичні данні, літературні джерела)

Excel_word_v10.xslx
sql as well
a lot of columns

Next tasks

List
- Most associated word
- Least associated word
- Any words from category
- All words from category

All basically need words and their categories. E.g. Animals: dog/cat/racoon
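With a word-to-category mapping in hand, the gold labels for the two category tasks are one-liners. Toy data only; `CATEGORIES` and both helper names are invented for the example.

```python
CATEGORIES = {
    # toy data; real lists would come from a dictionary or WordNet
    "animals": {"dog", "cat", "raccoon"},
    "professions": {"teacher", "doctor"},
}


def any_from_category(words: list[str], category: str) -> bool:
    """Gold label for 'does any of these words belong to the category?'"""
    return any(w in CATEGORIES[category] for w in words)


def all_from_category(words: list[str], category: str) -> bool:
    """Gold label for 'do all of these words belong to the category?'"""
    return all(w in CATEGORIES[category] for w in words)
```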

I wonder how many different categories I’d need

Ah, the O.G. benchmark has 5 categories: lmentry/resources/nouns-by-category.json at main · aviaefrat/lmentry

Anyway - I can find no easy dictionary about this.

options:

Wikipedia:
- Категорія:Тварини за алфавітом — Вікіпедія
- Категорія:Категорії за алфавітом — Вікіпедія

Wordnet

Other Wordnet: olgakanishcheva/WordNet-Affect-UKR: WordNet-Affect (http://wndomains.fbk.eu/wnaffect.html) is an extension of WordNet Domains, including a subset of synsets suitable to represent affective concepts correlated with affective words.
- multi-lang incl. Ukr
- emotions only
- txts are tab-separated, columns are lists of words in the column's language
WordNet — someone started it for UKR!
- wordnet/resources/wn-ua-2015 at main · lang-uk/wordnet
  - wordnet/assumptions.md at main · lang-uk/wordnet this is so cool!
  - wordnet/notebooks/wn-translation-analysis-khrystyna.md at main · lang-uk/wordnet 10/10 would love to work on/at/with that/them
- 2015
  - all-dict-entries: definitions
  - all-in-one-file: relationships between these definitions
WordNet - Wikipedia:
- hypernym: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog)
- hyponym: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine)
- holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window)
- meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building)
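Hypernym and hyponym are the same relation read in opposite directions, so a category's word list is just everything reachable downwards from the category word. A sketch with toy pairs (only dog/canine comes from the definitions above, the rest is made up):

```python
from collections import defaultdict

# (X, Y) pairs meaning "Y is a hypernym of X"; toy examples
HYPERNYM_PAIRS = [("dog", "canine"), ("wolf", "canine"), ("canine", "animal")]

hypernyms: dict[str, set[str]] = defaultdict(set)
hyponyms: dict[str, set[str]] = defaultdict(set)
for x, y in HYPERNYM_PAIRS:
    hypernyms[x].add(y)  # canine is a hypernym of dog
    hyponyms[y].add(x)   # dog is a hyponym of canine


def category_members(category: str) -> set[str]:
    """All direct and indirect hyponyms of a category word."""
    found: set[str] = set()
    stack = [category]
    while stack:
        for child in hyponyms.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found
```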

for all-in-one:

> grep -o "_\(.*\)(" all-in-one-file.txt | sort | uniq -c
     49 _action(
      8 _action-and-condition(
     58 _holonym(
    177 _hyponym(
     43 _meronym(
     12 _related(
     51 _sister(
    102 _synonym(
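The same count in Python, in case the relations need further processing afterwards. The sample lines are made up purely to show the call; the real file format may well look different.

```python
import re
from collections import Counter


def count_relations(lines) -> Counter:
    """Count relation markers like '_hyponym(' the way the grep pipeline does."""
    pattern = re.compile(r"_.*\(")  # greedy, same as the grep expression
    counts = Counter()
    for line in lines:
        for match in pattern.finditer(line):
            counts[match.group(0)] += 1
    return counts


# made-up lines, only to show the call
sample = ["x _hyponym(a, b)", "y _hyponym(c, d)", "z _synonym(e, f)"]
relation_counts = count_relations(sample)
```

For the real data, an open file handle for all-in-one-file.txt (opened with `encoding="utf-8"`) can be passed in place of `sample`.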

Looking through it, it’s sadly prolly too small.

2009’s hyponym.txt is nice and much easier to parse.

GPT

Ideas: WordNet Search - 3.1

Ask it to give me a list of:

emotions
professions
sciences
body parts
animals
times (dow, months, evening, etc.)
sports

It suggests also:

musical instruments
dishes
clothing

¹ Avia Efrat, Or Honovich, Omer Levy (2022): “LMentry: A language model benchmark of elementary language tasks”. DOI: 10.48550/ARXIV.2211.02069
² API Reference (auto-generated), pymorphy2 morphological analyzer documentation

serhii.net