In the middle of the desert you can say anything you want
In the context of 240127-2101 Checklist for backing up an Android phone, I wanted to back up my TrackAndGraph data, for which I a) manually created a file export, and b) just in case, created a backup through Google Drive/Sync/One/whatever.
I then forgot to move the backup file :( but fear not: instead of a clean start I could use the Google Drive backup of all apps, and of that app specifically. Except it was missing.
It was present in the Google backup info as seen in the Google account / devices / backups interface, but absent in the phone recovery thing during setup.
Installed the app through Google Play again, still nothing; did another factory reset, still nothing.
Googled how to access the information from device backups through Google Drive without a device: you can't.
Was sad about losing 6 months of quantified-self data, thought about how to do it better (sell my soul to Google and let it sync things from the beginning?) and gave up.
Then I installed the excellent Sentien Launcher through F-Droid (it wasn't part of the backup either, but I didn't care) and noticed it had my old favourites.
Aha. Aha.
Okay. I see.
Android 13, Samsung phone.
I’ll be more minimalistic though
```
> adb shell pm list packages | ag "(lazada|faceb|zalo)"
package:com.facebook.appmanager
package:com.facebook.system
package:com.lazada.android
package:com.facebook.services
package:com.facebook.katana
```
```
adb shell pm uninstall -k --user 0 com.facebook.appmanager
adb shell pm uninstall -k --user 0 com.facebook.system
adb shell pm uninstall -k --user 0 com.lazada.android
adb shell pm uninstall -k --user 0 com.facebook.services
adb shell pm uninstall -k --user 0 com.facebook.katana
adb shell pm uninstall -k --user 0 com.samsung.android.bixby.agent
adb shell pm uninstall -k --user 0 com.samsung.android.bixby.wakeup
adb shell pm uninstall -k --user 0 com.samsung.android.bixbyvision.framework
```
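(If I ever need one of these back: since `-k` keeps the app data and the uninstall only applies to user 0, `adb shell cmd package install-existing <package>` should restore a package without a reflash. My assumption, not tested here.)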
First heard about them here: Samsung’s privacy policy for Oct 1st is crazy. : Android
```
adb shell sh /sdcard/Android/data/moe.shizuku.privileged.api/start.sh
```
Related:
220904-1605 Setting up again Nextcloud, dav, freshRSS sync etc. for Android phone
from old DTB: Day 051: Phone ADB full backup - serhii.net:
```
adb backup -apk -shared -all -f backup-file.adb
adb restore backup-file.adb
adb devices -l
adb shell
```
(after enabling USB debugging in Settings -> Dev.)
Elsewhere: Cool apps I love and want to remember about (see 220124-0054 List of good things):
- NewPipeData-xxx.zip
- Export_xxx.osf
- PodcastAddict_xxx.backup file; OPML exists
- bookmarks.html
- mappings_xxx.zip
- Internal storage/Books
(because “more secure options are available”) ↩︎
Relevant: Samsung’s privacy policy for Oct 1st is crazy. : Android ↩︎
Instead of doing graphs for 231024-1704 Master thesis task CBT, went with generating prompts for stories.
If you want a story involving a fox and a raven, the story will contain cheese — so I need to inject randomness.
Prompt used:
Write an interesting story in the style of Terry Pratchett, with at least 2 recurring major characters and 3 minor, of different genders. Don’t start the story by listing the characters. None of the characters should have names. All should be animals. The story should be graduate student reading level. Please make it rather detailed than generic - don’t mention that X had to Y, but show/tell him doing that. It should be about an apprentice’s revenge over his master. The main plot point should revolve around corruption. The story should have a bad ending. Write the story in the Ukrainian language. Start with the words: Одного разу,
..is really good if you tell it what you want. In Ukrainian as well.
The instruction that the story should be in Ukrainian should be at the end, or in the “start the story with” bit.
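Re: injecting randomness, a minimal sketch of what I mean by randomized prompt assembly; the option lists here are made up, the real ones live in the template:

```python
import random

# hypothetical option lists for the changing parts of the prompt
PLOTS = ["an apprentice's revenge over his master", "a cunning camel not learning anything"]
ENDINGS = ["a bad ending", "an unhappy ending"]

def make_prompt() -> str:
    return (
        "Write an interesting story in the style of Terry Pratchett. "
        f"It should be about {random.choice(PLOTS)}. "
        f"The story should have {random.choice(ENDINGS)}. "
        "Write the story in the Ukrainian language. Start with the words: Одного разу,"
    )
```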
Одного разу, у містечку Анімалопіліс, де жили різні тварини, трапилася ця історія. В місті панувала глибока корупція, але це не виглядало так просто, як здається. У містечку жили два головні персонажі: Майстер Сова і Лицар Заєць. Також були три менш важливі персонажі: Лисичка, Ведмідь та Жаба. … Але план зійшов на нет, коли виявилось, що Майстер Сова дізнався про зраду. Він впіймав Лицаря Зайця, коли він намагався передати докази Лисичці, Ведмедю та Жабі. Майстер Сова вирішив застосувати свою владу та покарати Лицаря Зайця.
Зараз Лицар Заєць перебуває у в’язниці, а корупція в Анімалопілісі продовжує рости. Лисичка, Ведмідь та Жаба вже не намагаються боротися з нею, а мешканці міста стали байдужими до цього проблеми. Так завершується ця історія, де корупція перемогла, а справедливість поступилася владі та грошам.
meister Fuchs (German: “master Fox”)
Codellama-34b-instruct:
primitive but in Ukrainian
here’s a list of friends of the fox Lisa (= the Russian word for fox written in Ukrainian letters) that will help her with her revenge on the wolf Wolf:
- a rabbit named Rabbit
- an otter named Otter
- a frog named Frog
mixtral-7b-instruct
Fazit
It would be interesting to parametrize/analyze:
Looking for Ukrainian tales as controls, I found this:
When folklore began to be intensively collected, eastern Ukraine was part of the Russian Empire and western Ukraine was part of Austro-Hungary. In eastern Ukraine, activities that might encourage Ukrainian nationalist feelings were banned, but folklore was not. Russians saw Ukraine as a backward, border place: Little Russia, as Ukraine was so often called. They also saw folklore as ignorant, country literature, appropriate to their perception of Ukraine. Russians felt that the collection of Ukrainian folklore, by perpetuating the image of Ukrainian backwardness, would foster the subjugation of Ukraine. Therefore, they permitted the extensive scholarly activity from which we draw so much of our information today. Ironically, when Ukrainian folklore was published, it was often published not as Ukrainian material, but as a subdivision of Russian folklore. Thus Aleksandr Afanas’ev’s famous collection, Russian Folk Tales, is not strictly a collection of Russian tales at all, but one that includes Ukrainian and Belarusian tales alongside the Russian ones. Because Ukraine was labeled Little Russia and its language was considered a distant dialect of Russian, its folklore was seen as subsumable under Russian folklore. Russia supposedly consisted of three parts: Great Russia, what we call Russia today; Little Russia, or Ukraine; and White Russia, what we now call Belarus. The latter two could be, and often were, included under Great Russia. Some of the material drawn on here comes from books that nominally contain Russian folktales or Russian legends. We know that they are actually Ukrainian because we can easily distinguish the Ukrainian language from Russian. Sometimes Ukrainian tales appear in Russian translation to make them more accessible to a Russian reading public. In these instances we can discern their Ukrainian origin if the place where a tale or legend was collected is given in the index or the notes. 1
This feels relevant as well: The Politics of innocence: Soviet and Post-Soviet Animation on Folklore topics | Journal of American Folklore | Scholarly Publishing Collective
```
Tokens Used: 3349
	Prompt Tokens: 300
	Completion Tokens: 3049
Successful Requests: 2
Total Cost (USD): $0.09447
```
So it’s about $0.05 per generated story? Somehow way more than I expected.
~300 stories (3 instances from each) would be around 15€.
I mean, I can happily generate around 100 manually per day from the ChatGPT interface. And I can immediately proofread each one as I go, while a different story is being generated. (I can also manually fix GPT-3 stories generated for 1/10th of the price.)
I guess not that much more of a workload. And most importantly, it would give me better insight into possible issues with the stories, so I can change the prompts quickly, instead of generating 300 ‘bad’ ones.
I need to think of a workflow to (grammatically) correct these stories. I assume: write each story to a file named after its row, correct it manually, then automatically add it back as a new column? A sketch below.
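A minimal sketch of that workflow, assuming the stories live in a pandas DataFrame (file and column names are made up):

```python
from pathlib import Path
import pandas as pd

df = pd.read_csv("stories.csv")  # hypothetical file with a "story" column
out = Path("to_correct")
out.mkdir(exist_ok=True)

# 1) dump each story into a file named after its row
for i, story in df["story"].items():
    (out / f"{i}.txt").write_text(story, encoding="utf-8")

# 2) ...correct the files manually...

# 3) read the corrected versions back into a new column
df["story_corrected"] = [(out / f"{i}.txt").read_text(encoding="utf-8") for i in df.index]
df.to_csv("stories_corrected.csv", index=False)
```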
(Either way, having generated 10 stories for 40 cents, I’ll analyze them at home and think about it all.)
It boils down to how many training instances I can get from a story — tomorrow I’ll experiment with it and we’ll see.
The stories contain errors, but ChatGPT can fix them! Manual checking is still heavily needed though, and, well, this will also be part of the Masterarbeit.
The fixes sometimes are really good and sometimes not:
I tried to experiment with telling it to avoid errors and Russian, with inconclusive results. I won’t add this to the prompt.
Колись давним-давно, у лісі, де дерева шепотіли таємницями, а квіти вигравали у вічному танці з вітром, жила духмяна метелик. (“…there lived a fragrant butterfly”, with the feminine духмяна/жила for the masculine метелик)
(and then goes on to use the feminine gender for it throughout the entire tale)
On second thought, this could be better:
All should be animals. None of the characters should have names, but should be referred to by pronouns and the capitalized name of their species.
I can use the capitalized nouns as keys, and then “до мудрого Сови” (“to the wise Owl”) doesn’t feel awkward?..
This might be even better:
None of the characters should have names: they should be referred to by the capitalized name of their species (and pronouns), and their gender should be the same as that name of their species.
The story should be about an owl helping their mentor, a frog, with an embarrassing problem. The story should be in the Ukrainian language.
And also remove the bit about different genders, or same gender, just let it be.
Yes, let this be prompt v3. Fixed the genders in the options, removed the gender limit in the prompt.
None of the characters should have names: they should be referred to by the name of their species, and their gender should be the same as that name of their species. {ALL_SHOULD_BE_ANIMAL
Takes about 4 cents and 140 seconds per story:
```
1%|▌          | 4/300 [11:39<14:23:02, 174.94s/it]
INFO:__main__:Total price for the session: 0.22959999999999997 (5 stories).
```
“Кішка обіцяла довести, що вона гідний син” (“the Cat promised to prove she was a worthy son”): that one’s on me.
Removed the gendered “son” from the prompt.
Через деякий час до верблюда прийшла газель, яка просила допомоги. Її стадо зазнало нападу від лева, і вона шукала поради, як уникнути подібних інцидентів у майбутньому. Верблюд порадив газелі знайти нові пасовища, де леви не полюють, і навчити стадо бути більш обережними та уважними.
метелик відчула непереборне бажання знайти найсолодший квітка в лісі (the butterfly felt — feminine відчула for the masculine метелик — an irresistible desire to find the sweetest flower: masculine найсолодший for the feminine квітка)
Ця історія відбулась у місті, де вулиці були вимощені старою бруківкою, а будинки були ніби з казок. Миша, зі своїм сірим хутром та блискучими очима, була відома серед мешканців міста своєю ненаситною жадібністю. Вона не просто збирала зернятка та крихти, що їй вдавалося знайти, але й намагалася відібрати їжу у інших.
This has 264 tokens for bert-base-uncased, 106 for bert-base-multilingual-uncased. Counted roughly like this:
```python
from transformers import AutoTokenizer

# compare token counts for the same Ukrainian text across tokenizers
text = "Ця історія відбулась у місті..."  # the story paragraph above
for name in ("bert-base-uncased", "bert-base-multilingual-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, len(tokenizer.tokenize(text)))
```
A ’typical’ longer story is 901 multilingual tokens. Damn. (Or 709 for KoichiYasuoka/roberta-base-ukrainian · Hugging Face)
Practically this means shorter stories? Cutting bits at the end? Separating them? Using tokens instead of sentences when creating tasks?
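A sketch of the token-budget check this implies (model from above; the 512-token limit is my assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-base-ukrainian")
story_text = "..."  # one generated story
n_tokens = len(tokenizer(story_text)["input_ids"])
if n_tokens > 512:  # assumed context limit: shorten, cut, or split the story
    print(f"too long: {n_tokens} tokens")
```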
… The story should have an unhappy ending. It should be no more than 300 words long.
The story should be about a cunning camel not learning anything.
Колись давним-давно, в піщаних просторах пустелі, жив хитрий верблюд. Він був відомий своєю вмінням уникати праці, перекладаючи свої обов'язки на менш кмітливих сусідів - невеликого єнота та серйозного орла. Вони терпеливо виконували важку роботу, в той час як верблюд ласував найсолодшими пагонами.
Одного дня, коли вода в оазі на межі висихання, верблюд вирішив, що єнот і орел повинні відправитись у небезпечну подорож за новим джерелом. "Тільки ви маєте кмітливість і силу знайти воду," - лукаво мовив верблюд.
Єнот і орел, виснажені його маніпуляціями, нарешті усвідомили хитрість верблюда і вирішили діяти спільно. Вони пішли, обіцяючи верблюду привести воду, але насправді вони планували знайти нову оазу лише для себе.
Залишившись на самоті, верблюд швидко зрозумів, що його власна лінь і хитрість привели до катастрофи. Орел і єнот знайшли нове місце, а верблюд, не здатний самостійно вижити, був змушений мандрувати пустелею у пошуках води і допомоги.
Але пустеля була невблаганною, і верблюд, нарешті, зрозумів, що хитрість без мудрості і співпраці - це шлях до самотності та відчаю. Саме ця думка була його останньою, перш ніж пустеля поглинула його.
175 words, 298 tokens for roberta-base-ukrainian, 416 for bert-base-multilingual-uncased. 10 sentences. I think I’ll add this to the template v4.
Problem: animacy detection is shaky at best:
```
(Pdb++) for a in matches: print(a, a[0].morph.get("Animacy"))
верблюду ['Inan']
воду ['Inan']
оазу ['Inan']
самоті ['Inan']
верблюд ['Anim']
```
```
(Pdb++) for w in doc: print(w, w.morph.get("Animacy")[0]) if w.morph.get("Animacy")==["Anim"] else None
верблюд Anim
кмітливих Anim
сусідів Anim
невеликого Anim
єнота Anim
серйозного Anim
орла Anim
верблюд Anim
верблюд Anim
орел Anim
ви Anim
верблюд Anim
Єнот Anim
орел Anim
верблюда Anim
верблюд Anim
Орел Anim
верблюд Anim
верблюд Anim
```
OK, so Anim has higher precision than recall. And adjectives can also be animate (they agree with their noun), which is logical!
I think I can handle the errors.
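A minimal sketch of the filtering this implies, assuming a Ukrainian spaCy pipeline (the model name is an assumption):

```python
import spacy

nlp = spacy.load("uk_core_news_sm")  # any Ukrainian pipeline with morph features
doc = nlp("Колись давним-давно, в піщаних просторах пустелі, жив хитрий верблюд.")

# Animacy=Anim is also set on adjectives (agreement), so restrict to nouns;
# given the precision/recall above, expect missed animate nouns, few false ones.
animate_nouns = [t for t in doc if t.pos_ == "NOUN" and t.morph.get("Animacy") == ["Anim"]]
print(animate_nouns)
```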
More issues:
```
(Pdb++) t.morph
Animacy=Anim|Case=Nom|Gender=Fem|NameType=Giv|Number=Sing
(Pdb++) tt = [t for t in doc if t.pos_ == "PROPN"]
(Pdb++) tt
[Миша, Собака, Миша, Кіт, Мишею, Ластівка, Ластівка, Мишу, Миша, Ластівка, Миша, Кіт, Миша, Миша, Миша, Миші, Мишу, Ластівка, Миші, Миші, Миша]
```
Damn. OK, so PROPN gets assigned based on capitalization alone? Wow.
Next up:
```
ERROR:ua_cbt:Верблюд же, відчуваючи полегшення, що зміг уникнути конфлікту, повернувся до своєї тіні під пальмою, де продовжив роздумувати про важливість рівноваги та справедливості у світі, де кожен шукає своє місце під сонцем.
пальмою -> ['пустелею', 'водити', 'стороною', 'історією']
```
Fixed.
Fixed вОди
```
Верблюдиця та шакал опинилися наодинці у безкрайній пустелі, позбавлені підтримки та провізії.
Верблюдиця -> ['Верблюдиця', 'Люда', 'Люди']
```
Fixed Люда 1 and 2.
cbt · Datasets at Hugging Face
Is quite good at generating stories if given a Ukrainian prompt!
Has trouble following the bits about the number of characters, but the grammar is much better. Though it can stop randomly.
https://g.co/bard/share/b410fb1181be
The Magic Egg and Other Tales from Ukraine. Retold by Barbara J. Suwyn; drawings by author; edited and with an introduction by Natalie O. Kononenko., found in Ukrainian fairy tale - Wikipedia ↩︎
I like to do

```python
what = some_dict()
for k, v in what.items():
    # ...
```

But

```python
for k in what:
    # do_sth(k, what[k])
```

is sometimes much more readable, and it's one less variable to name. I should do it more often.
Goal: find identical words with diff embeddings in RU and UA, use that to generate examples.
The link is broken, but I think I found the download page for the vectors.
Their blog is also down, but the howto is linked from the archive: Aligning vector representations – Sam’s ML Blog
Download: fastText/docs/crawl-vectors.md at master · facebookresearch/fastText
```
axel https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.bin.gz
axel https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ru.300.bin.gz
```
It’s taking a while.
EDIT: Ah damn, it had to be the text ones, not bin. :( Starting again.
EDIT2: THIS is the place: fastText/docs/pretrained-vectors.md at master · facebookresearch/fastText
```
https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.uk.vec
https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ru.vec
```
UKR has 900k lines, RUS has 1.8M — damn, it’s not going to be easy.
What do I do next, assuming this works?
Assuming I find out that RU-кит is far in the embedding space from UKR-кіт, what do I do next?
How do I test for false friends?
Maybe these papers about Surzhyk might come in handy now, especially <_(@Sira2019) “Towards an automatic recognition of mixed languages: The Ukrainian-Russian hybrid language Surzhyk” (2019) / Nataliya Sira, Giorgio Maria Di Nunzio, Viviana Nosilia: z / http://arxiv.org/abs/1912.08582 / _>.
Took forever and then got killed by Linux (OOM, presumably).
```python
from fasttext import FastVector

# ru_dictionary = FastVector(vector_file='wiki.ru.vec')
ru_dictionary = FastVector(vector_file='/home/sh/uuni/master/code/ru_interference/DATA/wiki.ru.vec')
uk_dictionary = FastVector(vector_file='/home/sh/uuni/master/code/ru_interference/DATA/wiki.uk.vec')

uk_dictionary.apply_transform('alignment_matrices/uk.txt')
ru_dictionary.apply_transform('alignment_matrices/ru.txt')

print(FastVector.cosine_similarity(uk_dictionary["кіт"], ru_dictionary["кот"]))
```
Gensim it is.
To load:
```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

ru_dictionary = 'DATA/small/wiki.ru.vec'
uk_dictionary = 'DATA/small/wiki.uk.vec'

model_ru = KeyedVectors.load_word2vec_format(datapath(ru_dictionary))
model_uk = KeyedVectors.load_word2vec_format(datapath(uk_dictionary))
```
Did `model_ru.save(...)`, and then I can load it with `KeyedVectors.load("ru_interference/src/ru-model-save")`, which is faster. Shouldn’t have used the text format to begin with, but that’s on me.
```python
from gensim.models import TranslationMatrix

# word_pairs: a list of (ru_word, uk_word) training pairs
tm = TranslationMatrix(model_ru, model_uk, word_pairs)
```
```
(Pdb++) r = tm2.translate(ukrainian_words, topn=3)
(Pdb++) pp(r)
OrderedDict([('сонце', ['завишня', 'скорбна', 'вишня']),
             ('квітка', ['вишня', 'груша', 'вишнях']),
             ('місяць', ['любить…»', 'гадаю…»', 'помилуй']),
             ('дерево', ['яблуко', '„яблуко', 'яблуку']),
             ('вода', ['вода', 'риба', 'каламутна']),
             ('птах', ['короваю', 'коровай', 'корова']),
             ('книга', ['читати', 'читати»', 'їсти']),
             ('синій', ['вишнях', 'зморшках', 'плакуча'])])
```
OK, so definitely more word pairs would be needed for the translation.
Either way I don’t need the translation itself, I need the mapped space, roughly described here: mapping - How do i get a vector from gensim’s translation_matrix - Stack Overflow
Next time:

- get more words, e.g. from a dictionary
- get a space
- play with translations
python - Combining/adding vectors from different word2vec models - Stack Overflow mentions transvec · PyPI that allows accessing the vectors
Anyway, my only reason for fastText was the multilingual alignment; I can use other embeddings now.
`word in model.key_to_index` (which is a dict) works.

```
*** RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'transvec.transformers.TranslationWordVectorizer'> with constructor (self, target: 'gensim.models.keyedvectors.KeyedVectors', *sources: 'gensim.models.keyedvectors.KeyedVectors', alpha: float = 1.0, max_iter: Optional[int] = None, tol: float = 0.001, solver: str = 'auto', missing: str = 'raise', random_state: Union[int, numpy.random.mtrand.RandomState, NoneType] = None) doesn't follow this convention.
```
ah damn. Wasn’t an issue with the older one, though the only thing that changed is https://github.com/big-o/transvec/compare/master...Jpsaris:transvec:master
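Judging by the constructor signature in the error (target space first, then `*sources`), the usage I was attempting looks roughly like this; treat it as a sketch, not the library’s documented API:

```python
from transvec.transformers import TranslationWordVectorizer

# word_pairs: (ru, uk) training pairs, as for TranslationMatrix above
tv = TranslationWordVectorizer(model_uk, model_ru).fit(word_pairs)
```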
Decided to leave this till better times, but I’ll play with it for one more hour today.
Coming back to mapping - How do i get a vector from gensim’s translation_matrix - Stack Overflow: I need `mapped_source_space`.
I should have used PyCharm at a much earlier stage in the process.
`mapped_source_space` contains a matrix with the 4 vectors mapped to the target space. Why does `source_space` have 1.8k words, while the source embedding space has 200k?
Ah, `tm.translate()` can translate words not found in the source space. Interesting!
AHA: the source/target space gets built only from the words provided for training (1.8k in my case), and the translation matrix is built based on that.
BUT in `translate()` the target matrix gets built from the entire vocabulary!
Which means:
Results!
```
real
картошка/картопля -> 0.28
дом/дім -> 1.16
чай/чай -> 1.17
паспорт/паспорт -> 0.40
зерно/зерно -> 0.46
нос/ніс -> 0.94

false
неделя/неділя -> 0.34
город/город -> 0.35
он/он -> 0.77
речь/річ -> 0.89
родина/родина -> 0.32
сыр/сир -> 0.99
папа/папа -> 0.63
мать/мати -> 0.52
```
Let’s normalize:
```
real
картошка/картопля -> 0.64
дом/дім -> 0.64
чай/чай -> 0.70
паспорт/паспорт -> 0.72
зерно/зерно -> 0.60

false
неделя/неділя -> 0.55
город/город -> 0.44
он/он -> 0.33
речь/річ -> 0.54
родина/родина -> 0.50
сыр/сир -> 0.66
папа/папа -> 0.51
мать/мати -> 0.56
```
OK, so it mostly works! With good enough thresholds it can work. Words that are totally different aren’t similar (он); words that share some meanings (мать/мати) are closer.
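For reference, a minimal sketch of the comparison; that the normalization above is plain unit-length scaling is my assumption, and `mapped_ru` is a hypothetical dict of RU vectors mapped into the UK space:

```python
import numpy as np

def sim(u, v):
    # cosine similarity between a mapped RU vector and a UK vector;
    # normalizing to unit length makes the scores comparable across words
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(u @ v)

# e.g. sim(mapped_ru["неделя"], model_uk["неділя"]) should come out low(ish): false friend
```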
Ways to improve this:
https://github.com/frekwencja/most-common-words-multilingual
Created pairs out of the words in the two dictionaries that are spelled identically (so not кот/кіт/кит); will look at the similarity between the Russian word and the Ukrainian word of each pair.
422 such words in common
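A sketch of how such pairs can be collected, assuming the `model_ru`/`model_uk` KeyedVectors from above:

```python
# identically-spelled words present in both vocabularies are false-friend candidates
shared = [w for w in model_uk.key_to_index if w in model_ru.key_to_index]
# intersecting with the most-common-words lists linked above gives the 422 words
```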
Sorted by similarity (lower values = more false-friend-y). Nope, mostly doesn’t make sense. But rare words seem to be the most ‘different’ ones:
{'поза': 0.3139531, 'iphone': 0.36648884, 'галактика': 0.39758587, 'Роман': 0.40571105, 'дюйм': 0.43442175, 'араб': 0.47358453, 'друг': 0.4818558, 'альфа': 0.48779228, 'гора': 0.5069237, 'папа': 0.50889325, 'проспект': 0.5117553, 'бейсбол': 0.51532406, 'губа': 0.51682216, 'ранчо': 0.52178365, 'голова': 0.527564, 'сука': 0.5336818, 'назад': 0.53545296, 'кулак': 0.5378426, 'стейк': 0.54102343, 'шериф': 0.5427336, 'палка': 0.5516712, 'ставка': 0.5519752, 'соло': 0.5522958, 'акула': 0.5531602, 'поле': 0.55333376, 'астроном': 0.5556448, 'шина': 0.55686104, 'агентство': 0.561674, 'сосна': 0.56177, 'бургер': 0.56337166, 'франшиза': 0.5638794, 'фунт': 0.56592, 'молекула': 0.5712515, 'браузер': 0.57368404, 'полковник': 0.5739758, 'горе': 0.5740198, 'шапка': 0.57745415, 'кампус': 0.5792211, 'дрейф': 0.5800869, 'онлайн': 0.58176875, 'замок': 0.582287, 'файл': 0.58236635, 'трон': 0.5824338, 'ураган': 0.5841942, 'диван': 0.584252, 'фургон': 0.58459675, 'трейлер': 0.5846335, 'приходить': 0.58562565, 'сотня': 0.585832, 'депозит': 0.58704704, 'демон': 0.58801174, 'будка': 0.5882363, 'царство': 0.5885376, 'миля': 0.58867997, 'головоломка': 0.5903712, 'цент': 0.59163713, 'казино': 0.59246653, 'баскетбол': 0.59255254, 'марихуана': 0.59257627, 'пастор': 0.5928912, 'предок': 0.5933549, 'район': 0.5940658, 'статистика': 0.59584284, 'стартер': 0.5987516, 'сайт': 0.5988183, 'демократ': 0.5999011, 'оплата': 0.60060596, 'тендер': 0.6014088, 'орел': 0.60169894, 'гормон': 0.6021177, 'метр': 0.6023728, 'меню': 0.60291564, 'гавань': 0.6029945, 'рукав': 0.60406476, 'статуя': 0.6047057, 'скульптура': 0.60497975, 'вагон': 0.60551536, 'доза': 0.60576916, 'синдром': 0.6064756, 'тигр': 0.60673815, 'сержант': 0.6070389, 'опера': 0.60711193, 'таблетка': 0.60712767, 'фокус': 0.6080196, 'петля': 0.60817575, 'драма': 0.60842395, 'шнур': 0.6091568, 'член': 0.6092182, 'сервер': 0.6094157, 'вилка': 0.6102615, 'мода': 0.6106603, 'лейтенант': 0.6111004, 'радар': 0.6117528, 'галерея': 0.61191505, 'ворота': 0.6125873, 'чашка': 0.6132187, 'крем': 0.6133907, 'бюро': 0.61342597, 'черепаха': 0.6146957, 'секс': 0.6151523, 'носок': 0.6156026, 'подушка': 0.6160687, 'бочка': 0.61691606, 'гольф': 0.6172053, 'факультет': 0.6178817, 'резюме': 0.61848575, 'нерв': 0.6186257, 'король': 0.61903644, 'трубка': 0.6194198, 'ангел': 0.6196466, 'маска': 0.61996806, 'ферма': 0.62029755, 'резидент': 0.6205579, 'футбол': 0.6209573, 'квест': 0.62117445, 'рулон': 0.62152386, 'сарай': 0.62211347, 'слава': 0.6222329, 'блог': 0.6223742, 'ванна': 0.6224452, 'пророк': 0.6224489, 'дерево': 0.62274456, 'горло': 0.62325376, 'порт': 0.6240524, 'лосось': 0.6243047, 'альтернатива': 0.62446254, 'кровоточить': 0.62455964, 'сенатор': 0.6246379, 'спортзал': 0.6246594, 'протокол': 0.6247676, 'ракета': 0.6254694, 'салат': 0.62662274, 'супер': 0.6277698, 'патент': 0.6280118, 'авто': 0.62803495, 'монета': 0.628338, 'консенсус': 0.62834597, 'резерв': 0.62838227, 'кабель': 0.6293858, 'могила': 0.62939847, 'небо': 0.62995523, 'поправка': 0.63010347, 'кислота': 0.6313528, 'озеро': 0.6314377, 'телескоп': 0.6323617, 'чудо': 0.6325846, 'пластик': 0.6329929, 'процент': 0.63322043, 'маркер': 0.63358307, 'датчик': 0.6337889, 'кластер': 0.633797, 'детектив': 0.6341895, 'валюта': 0.63469064, 'банан': 0.6358283, 'фабрика': 0.6360865, 'сумка': 0.63627976, 'газета': 0.6364525, 'математика': 0.63761103, 'плюс': 0.63765526, 'урожай': 0.6377103, 'контраст': 0.6385834, 'аборт': 0.63913494, 'парад': 0.63918126, 'формула': 0.63957334, 'арена': 0.6396606, 'парк': 0.6401386, 'посадка': 0.6401986, 
'марш': 0.6403458, 'концерт': 0.64061844, 'перспектива': 0.6413666, 'статут': 0.6419941, 'транзит': 0.64289963, 'параметр': 0.6430252, 'рука': 0.64307654, 'голод': 0.64329326, 'медаль': 0.643804, 'фестиваль': 0.6438755, 'небеса': 0.64397913, 'барабан': 0.64438117, 'картина': 0.6444177, 'вентилятор': 0.6454438, 'ресторан': 0.64582723, 'лист': 0.64694726, 'частота': 0.64801234, 'ручка': 0.6481528, 'ноутбук': 0.64842474, 'пара': 0.6486577, 'коробка': 0.64910173, 'сенат': 0.64915174, 'номер': 0.64946175, 'ремесло': 0.6498537, 'слон': 0.6499266, 'губернатор': 0.64999187, 'раковина': 0.6502305, 'трава': 0.6505385, 'мандат': 0.6511373, 'великий': 0.6511585, 'ящик': 0.65194154, 'череп': 0.6522753, 'ковбой': 0.65260696, 'корова': 0.65319675, 'честь': 0.65348136, 'легенда': 0.6538656, 'душа': 0.65390354, 'автобус': 0.6544202, 'метафора': 0.65446657, 'магазин': 0.65467703, 'удача': 0.65482104, 'волонтер': 0.65544796, 'сексуально': 0.6555309, 'ордер': 0.6557747, 'точка': 0.65612084, 'через': 0.6563236, 'глина': 0.65652716, 'значок': 0.65661323, 'плакат': 0.6568083, 'слух': 0.65709555, 'нога': 0.6572164, 'фотограф': 0.65756184, 'ненависть': 0.6578564, 'пункт': 0.65826315, 'берег': 0.65849876, 'альбом': 0.65849936, 'кролик': 0.6587049, 'масло': 0.6589803, 'бензин': 0.6590406, 'покупка': 0.65911734, 'параграф': 0.6596477, 'вакцина': 0.6603271, 'континент': 0.6609991, 'расизм': 0.6614046, 'правило': 0.661452, 'симптом': 0.661881, 'романтика': 0.6626457, 'атрибут': 0.66298646, 'олень': 0.66298693, 'кафе': 0.6635062, 'слово': 0.6636568, 'машина': 0.66397023, 'джаз': 0.663977, 'пиво': 0.6649644, 'слуга': 0.665489, 'температура': 0.66552, 'море': 0.666358, 'чувак': 0.6663854, 'комфорт': 0.66651237, 'театр': 0.66665906, 'ключ': 0.6670032, 'храм': 0.6673037, 'золото': 0.6678767, 'робот': 0.66861665, 'джентльмен': 0.66861814, 'рейтинг': 0.6686267, 'талант': 0.66881114, 'флот': 0.6701237, 'бонус': 0.67013747, 'величина': 0.67042017, 'конкурент': 0.6704642, 'конкурс': 0.6709986, 'доступ': 0.6712131, 'жанр': 0.67121863, 'пакет': 0.67209935, 'твердо': 0.6724718, 'клуб': 0.6724739, 'координатор': 0.6727365, 'глобус': 0.67277336, 'карта': 0.6731522, 'зима': 0.67379165, 'вино': 0.6737963, 'туалет': 0.6744124, 'середина': 0.6748006, 'тротуар': 0.67507124, 'законопроект': 0.6753582, 'земля': 0.6756074, 'контейнер': 0.6759613, 'посольство': 0.67680794, 'солдат': 0.6771952, 'канал': 0.677311, 'норма': 0.67757475, 'штраф': 0.67796284, 'маркетинг': 0.67837185, 'приз': 0.6790007, 'дилер': 0.6801595, 'молитва': 0.6806114, 'зона': 0.6806243, 'пояс': 0.6807122, 'автор': 0.68088144, 'рабство': 0.6815858, 'коридор': 0.68208706, 'пропаганда': 0.6826943, 'журнал': 0.6828874, 'портрет': 0.68304217, 'фермер': 0.6831401, 'порошок': 0.6831531, 'сюрприз': 0.68327177, 'камера': 0.6840434, 'фаза': 0.6842661, 'природа': 0.6843757, 'лимон': 0.68452585, 'гараж': 0.68465877, 'рецепт': 0.6848821, 'свинина': 0.6863143, 'атмосфера': 0.6865022, 'режим': 0.6870908, 'характеристика': 0.6878463, 'спонсор': 0.6879278, 'товар': 0.6880773, 'контакт': 0.6888988, 'актриса': 0.6891222, 'диск': 0.68916976, 'шоколад': 0.6892894, 'банда': 0.68934155, 'панель': 0.68947715, 'запуск': 0.6899455, 'травма': 0.690045, 'телефон': 0.69024855, 'список': 0.69054323, 'кредит': 0.69054526, 'актив': 0.69087565, 'партнерство': 0.6909646, 'спорт': 0.6914842, 'маршрут': 0.6915196, 'репортер': 0.6920864, 'сегмент': 0.6920909, 'бунт': 0.69279015, 'риторика': 0.69331145, 'школа': 0.6933826, 'оператор': 0.69384277, 'ветеран': 0.6941337, 'членство': 0.69435036, 'схема': 
0.69441277, 'манера': 0.69451445, 'командир': 0.69467854, 'формат': 0.69501007, 'сцена': 0.69557995, 'секрет': 0.6961215, 'курс': 0.6964162, 'компонент': 0.69664925, 'патруль': 0.69678336, 'конверт': 0.6968681, 'символ': 0.6973544, 'насос': 0.6974678, 'океан': 0.69814134, 'критик': 0.6988366, 'доброта': 0.6989736, 'абсолютно': 0.6992678, 'акцент': 0.6998319, 'ремонт': 0.70108724, 'мама': 0.7022723, 'тихо': 0.70254886, 'правда': 0.7040037, 'транспорт': 0.704239, 'книга': 0.7051158, 'вода': 0.7064695, 'кухня': 0.7070433, 'костюм': 0.7073295, 'дикий': 0.70741034, 'прокурор': 0.70768344, 'консультант': 0.707697, 'квартира': 0.7078515, 'шанс': 0.70874536, 'сила': 0.70880103, 'хаос': 0.7089504, 'дебют': 0.7092187, 'завтра': 0.7092679, 'горизонт': 0.7093906, 'модель': 0.7097884, 'запах': 0.710207, 'сама': 0.71082854, 'весна': 0.7109366, 'орган': 0.7114152, 'далекий': 0.7118393, 'смерть': 0.71213734, 'медсестра': 0.71224624, 'молоко': 0.7123647, 'союз': 0.71299064, 'звук': 0.71361446, 'метод': 0.7138604, 'корпус': 0.7141677, 'приятель': 0.71538115, 'центр': 0.716277, 'максимум': 0.7162813, 'страх': 0.7166886, 'велосипед': 0.7168154, 'контроль': 0.7171681, 'ритуал': 0.71721196, 'команда': 0.7175366, 'молоток': 0.71759546, 'цикл': 0.71968937, 'жертва': 0.7198437, 'статус': 0.7203152, 'пульс': 0.7206338, 'тренер': 0.72116625, 'сектор': 0.7221448, 'музей': 0.72323525, 'сфера': 0.7245963, 'пейзаж': 0.7246053, 'вниз': 0.72528857, 'редактор': 0.7254647, 'тема': 0.7256167, 'агент': 0.7256874, 'дизайнер': 0.72618955, 'деталь': 0.72680634, 'банк': 0.7270782, 'союзник': 0.72750694, 'жест': 0.7279984, 'наставник': 0.7282404, 'тактика': 0.72968495, 'спектр': 0.7299538, 'проект': 0.7302779, 'художник': 0.7304505, 'далеко': 0.7306006, 'ресурс': 0.73075294, 'половина': 0.7318293, 'явно': 0.7323554, 'день': 0.7337892, 'юрист': 0.73461473, 'широко': 0.73490566, 'закон': 0.7372453, 'психолог': 0.7373602, 'сигарета': 0.73835427, 'проблема': 0.7388488, 'аргумент': 0.7389784, 'старший': 0.7395191, 'продукт': 0.7395814, 'ритм': 0.7406945, 'широкий': 0.7409786, 'голос': 0.7423325, 'урок': 0.74272805, 'масштаб': 0.74474066, 'критика': 0.74535364, 'правильно': 0.74695253, 'авторитет': 0.74697924, 'активно': 0.74720675, 'причина': 0.7479735, 'сестра': 0.74925977, 'сигнал': 0.749686, 'алкоголь': 0.7517742, 'регулярно': 0.7521055, 'мотив': 0.7527843, 'бюджет': 0.7531772, 'плоский': 0.754082, 'посол': 0.75505507, 'скандал': 0.75518423, 'дизайн': 0.75567746, 'персонал': 0.7561288, 'адвокат': 0.7561835, 'принцип': 0.75786924, 'фонд': 0.7583069, 'структура': 0.75888604, 'дискурс': 0.7596848, 'вперед': 0.76067656, 'контур': 0.7607424, 'спортсмен': 0.7616756, 'стимул': 0.7622434, 'партнер': 0.76245433, 'стиль': 0.76301545, 'сильно': 0.7661394, 'текст': 0.7662303, 'фактор': 0.76729685, 'герой': 0.7697237, 'предмет': 0.775718, 'часто': 0.7780384, 'план': 0.77855974, 'рано': 0.78059715, 'факт': 0.782439, 'конкретно': 0.78783923, 'сорок': 0.79080343, 'аспект': 0.79219675, 'контекст': 0.7926827, 'роль': 0.796745, 'президент': 0.8007479, 'результат': 0.80227, 'десять': 0.8071967, 'скоро': 0.80976427, 'тонкий': 0.8100516, 'момент': 0.8120169, 'нести': 0.81280494, 'документ': 0.8216758, 'просто': 0.8222313, 'очевидно': 0.8242744, 'точно': 0.83183587, 'один': 0.83644223, 'пройти': 0.84026355}
Ways to improve:

- remove potentially bad words from the training set
- expand the search for candidate words by doing predictable changes a la <_(@Sira2019) “Towards an automatic recognition of mixed languages: The Ukrainian-Russian hybrid language Surzhyk” (2019) / Nataliya Sira, Giorgio Maria Di Nunzio, Viviana Nosilia: z / http://arxiv.org/abs/1912.08582 / _>
- add weighting based on frequency: rarer words will have less stable embeddings
- look at other trained vectors, ideally something more processed
And actually thinking about it — is there anything I can solve through this that I can’t solve by parsing one or more dictionaries, maybe even making embeddings of the definitions of the various words?
Fazit: leaving this alone till after the Masterarbeit, as a side project. It’s incredibly interesting but probably not directly practical. Sad.
By default it’s `<Esc>`, a bad idea for the same reason it’s a bad idea in vim.
AND my xkeymap-level keyboard mapping for Esc doesn’t seem to work here.
Default #2 is `<C-]>`, which is impossible because of my custom keyboard layout.
It will be `<C-=>`:
```json
{
    "command": "vim:leave-insert-mode",
    "selector": ".jp-NotebookPanel[data-jp-vim-mode='true'] .jp-Notebook.jp-mod-editMode",
    "keys": ["Ctrl ="]
}
```
(I can’t figure out why `,l` etc. don’t work in jupyterlab for this purpose; `<leader>` is `,`.)
"Insert mode mappings
" Leave insert mode
imap <leader>l <Esc>
imap qj <Esc>
" Write, write and close
imap ,, <Esc>:x<CR>
map ,. :w<CR>
… I will have a unified set of bindings for this someday, I promise.
EDIT: this is becoming a more generic thingy for everything I’d ever need to refer to when writing a paper; I’ll clean up this mess later.
Resources – DREAM Lab links to https://dream.cs.umass.edu/wp-content/uploads/2020/04/Tips-and-Best-Practices.pdf. Until I set up a system to save PDF info, I’ll paste it as screenshots here:
ChatGPT summarized the relevant pages of the PDF file, but didn’t do it well, so I’m mostly rewriting it myself:
- `multi-discipli\-nary`: discretionary hyphens for words LaTeX can’t hyphenate itself
- `\begin{sloppypar}...` for paragraphs where LaTeX goes over the margin
- `\begin{figure}[t]` + `\centering` for aligning tables and figures
- `sth~\cite{whatever}`
- `\emph` over bold or `\textit`
- `\newcommand{\system}{SQuID\xspace}`: `\xspace` here adds a space unless at the end of a sentence (package `\usepackage{xspace}`)
- `\smallskip`, `\medskip`, and `\bigskip` instead of `\vspace`
- `\linewidth` or `\textwidth`; `\resizebox` with appropriate dimensions
- Compression hacks (see pics)
Paper writing hacks:
best practices - When should I use non-breaking space? - TeX - LaTeX Stack Exchange lists ALL the places where Knuth wanted people to put nonbreaking spaces, incl:
- `1)~one 2)~two`
- `Donald~E. Knuth`
- `1,~2`
- `Chapter~12`

Less obvious and not from him:

- `I~am`
ALSO
ChatGPT says that citations should come before footnotes, to prioritize the scholarly source over unimportant info. So this: “[32]³”, and not this: “³[32]”. Basically, footnotes go after all punctuation and after citations. OK
EDIT: NOT PARENTHESES, THEY SHOULD BE WITHIN PARENTHESES.⁴ DAMN
ALSO
I sometimes write “and around ~50%”, forgetting that `~` is a nbsp. Hard to catch when reading the text.
ALSO
As when writing code I like to add some `assert False` (or a failing test) so that I know where I stopped the last time, a `\latexstopcompiling here` is a neat way to make sure I REALLY finish a certain line I started but didn’t finish (the undefined command makes compilation stop right there).
Rounding.
Previously: 211018-1510 Python rounding behaviour, with the TL;DR that Python uses banker’s rounding: .5 rounds towards the even number.
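Quick sanity check of that tie-breaking in Python:

```python
# Python 3's round() rounds ties to the even neighbour ("banker's rounding")
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2
assert round(3.5) == 4
```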
Floor/ceil have their usual LaTeX notation via `\lfloor x \rfloor` and `\lceil x \rceil` (see LaTeX/Mathematics - Wikibooks, open books for an open world, at ‘delimiters’).
“Normal” rounding (towards nearest integer) has no standard notation: ceiling and floor functions - What is the mathematical notation for rounding a given number to the nearest integer? - Mathematics Stack Exchange
let XXX denote the standard rounding function
Banker’s rounding (which Python and everyone else uses for tie-breaking .5 in “normal” rounding) has no standard notation either.
Let $\lfloor x \rceil$ denote "round half to even" rounding (a.k.a. "Banker's rounding"), consistent with Python's built-in round() and NumPy's np.round() functions.
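A minimal LaTeX sketch of that convention (the macro name is my own):

```latex
% floor, ceiling, and "round half to even" (banker's rounding)
\newcommand{\round}[1]{\left\lfloor #1 \right\rceil}
% usage: $\lfloor 2.7 \rfloor = 2$, $\lceil 2.3 \rceil = 3$, $\round{2.5} = 2$
```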