serhii.net

In the middle of the desert you can say anything you want

10 Oct 2023

Masterarbeit Tagebuch

2023-10-10 10:03

  • First conversation with CH about the topic
  • Bits:
    • I have a lower bound of about 1000 instances/examples for my own tasks
      • Asking for help translating is OK!
      • Using existing datasets, tasks, translated tasks is OK (if I cite ofc)
    • A simple task measuring perplexity on a dataset is not required anymore in benchmarks but it’s possible
    • As I thought, easy tasks are OK because not everyone uses gpt, most train their own models for tasks
    • In the theory part I don’t need to explain the very basics, just Transformers+ and LLMs should be enough
  • Decisions:
    • I’ll use an existing eval harness!
    • I’ll test the existing LMs on it at the end
  • Also: finally set up Obsidian+Zotero and can now easily add citations! (231010-2007 A new attempt at Zotero and Obsidian)

2023-10-12 16:12

  • Spontaneously started writing the UP crawler thing and loving it!
  • Read the first chapters of the basics of linguistics book, with more concentration this time, loving it too
  • Wrote part of the chapter about UA from a linguistics perspective

2023-10-16 17:22

  • Almost finished the UP crawler! It now:
    • Accepts a date range and saves the URIs of articles posted in the days in that date range
    • Crawls and saves all these URIs, with unified tags (their id + their Russian and Ukrainian names)
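The date-range step can be sketched roughly like this. It is only an illustration, not the crawler's actual code: `ARCHIVE_URL_TEMPLATE` and both function names are hypothetical, and the real per-day archive URL format is an assumption.

```python
from datetime import date, timedelta
from typing import Iterator

# Hypothetical placeholder: the real archive URL pattern is an assumption.
ARCHIVE_URL_TEMPLATE = "https://example.com/archive/{day:%Y-%m-%d}/"


def days_in_range(start: date, end: date) -> Iterator[date]:
    """Yield every day from start to end, both inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def archive_urls(start: date, end: date) -> list[str]:
    """Build one (hypothetical) archive page URL per day in the range."""
    return [ARCHIVE_URL_TEMPLATE.format(day=day) for day in days_in_range(start, end)]
```

Each archive page would then be fetched and the article URIs on it saved, which is the part that depends on the site's actual markup and is left out here.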

2023-10-17 09:57

  • Conversation with CH about possible tasks
    • Deemphasize perplexity in general
    • UP dataset can be used as downstream task to compare scores w/ benchmark
    • OK for interference, OK for gendered language

2023-11-09 22:37

  • UA-CBT day!
    • Refactored UA-CBT code so it’s much cleaner!
    • ADDED AGREEMENT/morphology to the options! Word shape too!
      <SingleTask
      context='Одного разу селянин пішов у поле орати. Дружина зібрала йому
      обід. У селянина був семирічний син. Каже він матері:  Мамо, дай-но я віднесу обід
      батькові.  Синку, ти ще малий, не знайдеш батька,  відповіла мати.  Не бійтеся,
      матінко.'
      question='Дорогу я знаю, обід віднесу. Мати врешті погодилась, зав’язала
      хліб у вузлик, приладнала йому на спину, вариво налила у миску, дала синові в руки та й
      відправила у поле. Малий не заблукав, доніс обід батькові.  Синку, як ти мене знайшов? 
      запитав батько.  Коли вже так, віднеси обід до ______ , я туди прийду і поїмо.  Ні,
      батьку,  сказав син.'
      options=['цар', 'рибки', 'хлопця', 'сина', 'джерела']
      answer='джерела'
      >,
      
- Found out that pymorphy2 is not as good as I hoped :(
	- **Can I use spacy for getting morphology info, and pymorphy only for inflecting?**
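
The `SingleTask` instance above suggests a simple record type. A minimal sketch, assuming only the four field names visible in that repr; the types, the `make_task` helper and the naive string replacement are my assumptions, not the actual UA-CBT code:

```python
from dataclasses import dataclass

GAP = "______"  # gap marker as it appears in the example instance


@dataclass
class SingleTask:
    """Field names copied from the repr above; the types are assumed."""
    context: str
    question: str
    options: list[str]
    answer: str


def make_task(context: str, question: str, answer: str,
              distractors: list[str]) -> SingleTask:
    """Replace the first occurrence of `answer` in `question` with a gap.

    Naive substring replacement: a real version would work on tokens.
    """
    if answer not in question:
        raise ValueError("answer must occur in the question text")
    gapped = question.replace(answer, GAP, 1)
    return SingleTask(context=context, question=gapped,
                      options=[*distractors, answer], answer=answer)
```

Picking distractors that agree morphologically with the answer is the hard part and is not covered by this sketch.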
 
## 2023-11-27 23:33
- Decided to work on morphology just a bit
- It turned into a deep dive into morphology tagging systems (FEATS, Russian OpenCorpora, etc.), which I documented in [231024-1704 Master thesis task CBT](/dtb/2023-10-24-231024-1704-master-thesis-task-cbt/)
- Realized that picking the correct analysis from pymorphy is critical for me, because I need it to inflect words into a different morphology correctly later
	- for this I need some similarity metric 
	- there's a project to convert between different tagging systems [OpenCorpora/russian-tagsets: Russian morphological tagset converters library.](https://github.com/OpenCorpora/russian-tagsets) that will help me a lot, and then I can convert OpenCorpora to spacy/UD
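
One possible similarity metric, sketched under the assumption that both the pymorphy candidates and the spacy analysis have already been converted to UD-style FEATS strings such as `Case=Gen|Number=Sing` (the conversion itself is not shown). All function names are hypothetical, and this is not necessarily the metric I will end up using:

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Parse a UD FEATS string like 'Case=Nom|Number=Sing' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def feats_similarity(candidate: str, reference: str) -> float:
    """Share of features with identical values, over all features seen in either analysis."""
    cand, ref = parse_feats(candidate), parse_feats(reference)
    all_keys = cand.keys() | ref.keys()
    if not all_keys:
        return 1.0  # two empty analyses are treated as identical
    matching = sum(1 for key in all_keys
                   if key in cand and cand[key] == ref.get(key))
    return matching / len(all_keys)


def pick_best(candidates: list[str], reference: str) -> str:
    """Return the candidate whose features agree most with the reference."""
    return max(candidates, key=lambda c: feats_similarity(c, reference))
```

For an ambiguous form like "джерела" (genitive singular or nominative plural), `pick_best` would choose whichever candidate shares more feature values with the analysis spacy produced in context.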

## 2023-11-29 19:55
- Started, and am now finishing, the program that discriminates between pymorphy2 morphologies based on spacy data! It will maybe become a separate Python package


## 2023-12-01 16:53
- Uploaded the package to a public github repo! 
	-  [pchr8/pymorphy-spacy-disambiguation: A package that picks the correct pymorphy2 morphology analysis based on morphology data from spacy](https://github.com/pchr8/pymorphy-spacy-disambiguation)
	- No real release process etc., but it can now be installed not just with `pip install git+https://github.com/pchr8/pymorphy-spacy-disambiguation` but also with `poetry add git+https://github.com/pchr8/pymorphy-spacy-disambiguation`
 - UA-CBT task is surprisingly complex because pymorphy2-ua seems to have no singular tags for Ukrainian languages etc., described in [231024-1704 Master thesis task CBT](/dtb/2023-10-24-231024-1704-master-thesis-task-cbt/). Which leads me to:  **Given how long and interesting each step is, maybe I can write blog posts for each of the CBT tasks**? With more details than used in the Master thesis, AND with an easy place to copypaste from.
 

## 2023-12-02
- [231201-1401 How to read and write a paper according to hackernews](/dtb/2023-12-01-231201-1401-how-to-read-and-write-a-paper-according-to-hackernews/) suggested drawing a mindmap on paper, I did, and DAMN IT WAS HELPFUL
- Got a much better idea of the entire structure, but mostly - got many more new fresh awesome ideas
- Creativity does work much better when you're doing pen and paper!

## 2023-12-04
- Spent the day doing more brainstorming and mindmaps but mainly looking into python LLM packages
- Was able to write a working example of a Python program that generates instances for the [231204-1642 Masterarbeit evaluation task new UA grammar and feminitives](/dtb/2023-12-04-231204-1642-masterarbeit-evaluation-task-new-ua-grammar-and-feminitives/) task by talking to OpenAI and getting back JSON!
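
The "getting back JSON" part can be sketched without the API call itself. Below, `complete` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its raw text reply; the prompt wording and the `sentence`/`answer` keys are assumptions for illustration, not the ones from my actual program:

```python
import json
from typing import Callable

# Hypothetical prompt and JSON keys, for illustration only.
PROMPT = (
    "Generate {n} task instances as a JSON list. "
    'Each item must be an object with the keys "sentence" and "answer".'
)
REQUIRED_KEYS = {"sentence", "answer"}


def generate_instances(complete: Callable[[str], str], n: int) -> list[dict]:
    """Ask the model for n instances and keep only the well-formed ones."""
    reply = complete(PROMPT.format(n=n))
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(items, list):
        return []  # valid JSON, but not the list we asked for
    return [item for item in items
            if isinstance(item, dict) and REQUIRED_KEYS <= item.keys()]
```

Passing the model call in as a function keeps the parsing and validation testable offline, e.g. with a lambda that returns a canned reply.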