Master's thesis diary
## 2023-10-10 10:03
- First conversation with CH about the topic
- Bits:
- I have a lower bound of about 1000 instances/examples for my own tasks
- Asking for help translating is OK!
- Using existing datasets, tasks, translated tasks is OK (if I cite ofc)
- A simple task measuring perplexity on a dataset is not required anymore in benchmarks but it’s possible
- As I thought, easy tasks are OK because not everyone uses GPT; most train their own models for their tasks
- In the theory part I don’t need to explain the very basics, just Transformers+ and LLMs should be enough
- Decisions:
- I’ll use an existing eval harness!
- I’ll test the existing LMs on it at the end
- Also: finally set up Obsidian+Zotero and can now easily add citations! (231010-2007 A new attempt at Zotero and Obsidian)
## 2023-10-12 16:12
- Spontaneously started writing the UP crawler thing and loving it!
- Read the first chapters of the basics of linguistics book, with more concentration this time, loving it too
- Wrote part of the chapter about UA from a linguistics perspective
## 2023-10-16 17:22
- Almost finished the UP crawler! It now:
- Accepts a date range and saves the URIs of articles posted in the days in that date range
- Crawls and saves all these URIs, with unified tags (their id + their Russian and Ukrainian names)
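The date-range step can be sketched roughly like this. It is only an illustration: `ARCHIVE_URL_TEMPLATE`, `days_in_range` and `archive_urls` are hypothetical names, and the URL layout is a placeholder, not the real structure of the crawled site.

```python
from datetime import date, timedelta
from typing import Iterator, List

# Placeholder template: the real per-day archive URL of the site is an assumption.
ARCHIVE_URL_TEMPLATE = "https://example.com/archives/date_{day:%d%m%Y}/"


def days_in_range(start: date, end: date) -> Iterator[date]:
    """Yield every day from start to end, both inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def archive_urls(start: date, end: date) -> List[str]:
    """Build one archive-page URL per day in the range."""
    return [ARCHIVE_URL_TEMPLATE.format(day=d) for d in days_in_range(start, end)]


if __name__ == "__main__":
    for url in archive_urls(date(2023, 10, 14), date(2023, 10, 16)):
        print(url)
```

Each per-day page would then be fetched and parsed for article URIs; that part depends on the site's HTML and is left out here.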
## 2023-10-17 09:57
- Conversation with CH about possible tasks
- Deemphasize perplexity in general
- UP dataset can be used as downstream task to compare scores w/ benchmark
- OK for interference, OK for gendered language
## 2023-11-09 22:37
- UA-CBT day!
- Refactored UA-CBT code so it’s much cleaner!
- ADDED AGREEMENT/morphology to the options! Word shape too!
<SingleTask context='Одного разу селянин пішов у поле орати. Дружина зібрала йому обід. У селянина був семирічний син. Каже він матері: — Мамо, дай-но я віднесу обід батькові. — Синку, ти ще малий, не знайдеш батька, — відповіла мати. — Не бійтеся, матінко.' question='Дорогу я знаю, обід віднесу. Мати врешті погодилась, зав’язала хліб у вузлик, приладнала йому на спину, вариво налила у миску, дала синові в руки та й відправила у поле. Малий не заблукав, доніс обід батькові. — Синку, як ти мене знайшов? — запитав батько. — Коли вже так, віднеси обід до ______ , я туди прийду і поїмо. — Ні, батьку, — сказав син.' options=['цар', 'рибки', 'хлопця', 'сина', 'джерела'] answer='джерела' >,
- Found out that pymorphy2 is not as good as I hoped :(
- **Can I use spacy for getting morphology info, and pymorphy only for inflecting?**
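For reference, a minimal sketch of how an instance like the `SingleTask` above could be built. The field names come from the printed example; `GAP`, `make_task` and the plain string replacement are assumptions for illustration, not the actual UA-CBT code (which also has to handle agreement and word shape).

```python
from dataclasses import dataclass
from typing import List

GAP = "______"  # gap marker as it appears in the printed example


@dataclass
class SingleTask:
    context: str
    question: str
    options: List[str]
    answer: str


def make_task(context: str, question: str, target: str, distractors: List[str]) -> SingleTask:
    """Replace the first occurrence of `target` in `question` with a gap.

    Simplification: plain substring matching, so a target that is also part
    of a longer word would be matched too.
    """
    if target not in question:
        raise ValueError(f"target {target!r} not found in question")
    gapped = question.replace(target, GAP, 1)
    # Keep the order, drop duplicates, make sure the answer is among the options.
    options = list(dict.fromkeys([*distractors, target]))
    return SingleTask(context=context, question=gapped, options=options, answer=target)


if __name__ == "__main__":
    task = make_task(
        context="Одного разу селянин пішов у поле орати.",
        question="Коли вже так, віднеси обід до джерела, я туди прийду і поїмо.",
        target="джерела",
        distractors=["цар", "рибки", "хлопця", "сина"],
    )
    print(task)
```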
## 2023-11-27 23:33
- Decided to work on morphology just a bit,
- it turned into a deep dive into morphology tagging systems (FEATS, Russian OpenCorpora, etc.) that I documented in [231024-1704 Master thesis task CBT](/dtb/2023-10-24-231024-1704-master-thesis-task-cbt/)
- Realized that picking the correct result from pymorphy is critical for me because I need it for correct changing-into-different-morphology later
- for this I need some similarity metric
- there's a project to convert between different tagging systems [OpenCorpora/russian-tagsets: Russian morphological tagset converters library.](https://github.com/OpenCorpora/russian-tagsets) that will help me a lot, and then I can convert OpenCorpora to spacy/UD
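One possible similarity metric, as a rough sketch. It assumes both the pymorphy candidates and the spacy analysis have already been converted into the same UD-style feature dictionaries (the conversion itself is not shown). `similarity` and `pick_best` are hypothetical names, not part of pymorphy2 or spacy.

```python
from typing import Dict, List

Features = Dict[str, str]  # e.g. {"Case": "Gen", "Number": "Sing"}


def similarity(candidate: Features, reference: Features) -> float:
    """Share of the reference features that the candidate reproduces exactly."""
    if not reference:
        return 0.0
    matches = sum(1 for key, value in reference.items() if candidate.get(key) == value)
    return matches / len(reference)


def pick_best(candidates: List[Features], reference: Features) -> Features:
    """Return the candidate analysis closest to the reference analysis.

    On ties the earliest candidate wins, so the analyzer's own ordering
    acts as a tie-breaker.
    """
    if not candidates:
        raise ValueError("no candidate analyses given")
    return max(candidates, key=lambda cand: similarity(cand, reference))


if __name__ == "__main__":
    reference = {"Case": "Gen", "Number": "Sing", "Gender": "Neut"}
    candidates = [
        {"Case": "Nom", "Number": "Plur", "Gender": "Neut"},
        {"Case": "Gen", "Number": "Sing", "Gender": "Neut"},
    ]
    print(pick_best(candidates, reference))
```

Counting exact feature matches is the simplest option. Weighting some features (say, part of speech) more heavily would be an easy extension.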
## 2023-11-29 19:55
- started and am finishing writing the program that discriminates between pymorphy2 morphologies based on spacy data! will be a separate Python package maybe
## 2023-12-01 16:53
- Uploaded the package to a public github repo!
- [pchr8/pymorphy-spacy-disambiguation: A package that picks the correct pymorphy2 morphology analysis based on morphology data from spacy](https://github.com/pchr8/pymorphy-spacy-disambiguation)
- No real release process etc., but now it can be not just `pip install git+https://github.com/pchr8/pymorphy-spacy-disambiguation`, but also `poetry add git+https://github.com/pchr8/pymorphy-spacy-disambiguation`
- The UA-CBT task is surprisingly complex because pymorphy2-ua seems to have no singular tags for Ukrainian etc., as described in [231024-1704 Master thesis task CBT](/dtb/2023-10-24-231024-1704-master-thesis-task-cbt/). Which leads me to: **Given how long and interesting each step is, maybe I can write blog posts for each of the CBT tasks?** With more details than used in the Master thesis, AND with an easy place to copy-paste from.
## 2023-12-02
- [231201-1401 How to read and write a paper according to hackernews](/dtb/2023-12-01-231201-1401-how-to-read-and-write-a-paper-according-to-hackernews/) suggested drawing a mindmap on paper, I did, and DAMN IT WAS HELPFUL
- Got a much better idea of the entire structure, but mostly - got many more new fresh awesome ideas
- Creativity does work much better when you're doing pen and paper!
## 2023-12-04
- Spent the day doing more brainstorming and mindmaps but mainly looking into Python LLM packages
- Was able to write a working example of a Python program that generates instances for the [231204-1642 Masterarbeit evaluation task new UA grammar and feminitives](/dtb/2023-12-04-231204-1642-masterarbeit-evaluation-task-new-ua-grammar-and-feminitives/) task by talking to OpenAI and getting back JSON!
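The "get JSON back and turn it into instances" part could look roughly like this. Everything here is an assumption for illustration: the schema in `REQUIRED_KEYS`, the prompt wording, and the `ask_model` callable, which stands in for whatever client actually talks to the model. No real API call is made.

```python
import json
from typing import Callable, Dict, List

# Hypothetical schema for one generated instance.
REQUIRED_KEYS = {"sentence", "options", "answer"}


def build_prompt(n: int) -> str:
    """Ask for exactly n instances as a bare JSON list (wording is illustrative)."""
    return (
        f"Generate {n} task instances. Reply ONLY with a JSON list of objects "
        f"with the keys: {', '.join(sorted(REQUIRED_KEYS))}."
    )


def parse_instances(raw_reply: str) -> List[Dict]:
    """Parse the reply and keep only well-formed instances."""
    data = json.loads(raw_reply)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of instances")
    valid = []
    for item in data:
        if (
            isinstance(item, dict)
            and REQUIRED_KEYS.issubset(item)
            and isinstance(item["options"], list)
            and item["answer"] in item["options"]
        ):
            valid.append(item)
    return valid


def generate_instances(ask_model: Callable[[str], str], n: int) -> List[Dict]:
    """`ask_model` takes a prompt and returns the model's raw text reply."""
    return parse_instances(ask_model(build_prompt(n)))


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Canned, made-up reply standing in for a real model response.
        return json.dumps([
            {"sentence": "placeholder ______", "options": ["a", "b"], "answer": "a"},
            {"sentence": "broken instance", "options": ["a"], "answer": "z"},
        ])

    print(generate_instances(fake_model, 2))
```

Validating every returned object matters because models do not always follow the requested format, so malformed instances are dropped rather than trusted.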