In the middle of the desert you can say anything you want
Goal: find identically-spelled words with different embeddings in RU and UA, and use that to generate examples.
Link broken but I think I found the download page for the vectors
Their blog is also down, but the howto is linked from the archive: Aligning vector representations – Sam’s ML Blog
Download: fastText/docs/crawl-vectors.md at master · facebookresearch/fastText
axel https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.bin.gz
axel https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ru.300.bin.gz
It’s taking a while.
EDIT: Ah damn, had to be the text ones, not bin. :( starting again
EDIT2: THIS is the place: fastText/docs/pretrained-vectors.md at master · facebookresearch/fastText
https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.uk.vec
https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ru.vec
UKR has 900k lines, RUS has 1.8M — damn, it’s not going to be easy.
What do I do next, assuming this works?
Assuming I found out that RU-кит is far in the embedding space from UKR-кіт, what do I do next?
How do I test for false friends?
Maybe these papers about Surzhyk might come in handy now, especially Sira, Di Nunzio & Nosilia (2019), “Towards an automatic recognition of mixed languages: The Ukrainian-Russian hybrid language Surzhyk”, http://arxiv.org/abs/1912.08582.
Took infinite time & then got killed by Linux (the OOM killer, presumably).
from fasttext import FastVector

# ru_dictionary = FastVector(vector_file='wiki.ru.vec')
ru_dictionary = FastVector(vector_file='/home/sh/uuni/master/code/ru_interference/DATA/wiki.ru.vec')
uk_dictionary = FastVector(vector_file='/home/sh/uuni/master/code/ru_interference/DATA/wiki.uk.vec')

# align both spaces with the pre-computed alignment matrices
uk_dictionary.apply_transform('alignment_matrices/uk.txt')
ru_dictionary.apply_transform('alignment_matrices/ru.txt')

print(FastVector.cosine_similarity(uk_dictionary["кіт"], ru_dictionary["кот"]))
Gensim it is.
To load:
from gensim.models import KeyedVectors

# (dropped gensim.test.utils.datapath: it resolves paths relative to gensim's
# bundled test-data directory, not the current directory)
ru_dictionary = 'DATA/small/wiki.ru.vec'
uk_dictionary = 'DATA/small/wiki.uk.vec'
model_ru = KeyedVectors.load_word2vec_format(ru_dictionary)
model_uk = KeyedVectors.load_word2vec_format(uk_dictionary)
Did ru_model.save(...) and then I can load it as >>> KeyedVectors.load("ru_interference/src/ru-model-save")
Which is faster; shouldn’t have used the text format to begin with, but that’s on me.
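I.e. the save/load round-trip (a quick sketch of the above; paths as in my setup):

from gensim.models import KeyedVectors

model_ru = KeyedVectors.load_word2vec_format("DATA/small/wiki.ru.vec")  # slow text parsing
model_ru.save("ru-model-save")                 # gensim's native format
model_ru = KeyedVectors.load("ru-model-save")  # much faster on subsequent loads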
from gensim.models import TranslationMatrix

# word_pairs: seed translations as (ru, uk) tuples, e.g. [("кот", "кіт"), ...]
tm = TranslationMatrix(model_ru, model_uk, word_pairs)
(Pdb++) r = tm2.translate(ukrainian_words,topn=3)
(Pdb++) pp(r)
OrderedDict([('сонце', ['завишня', 'скорбна', 'вишня']),
('квітка', ['вишня', 'груша', 'вишнях']),
('місяць', ['любить…»', 'гадаю…»', 'помилуй']),
('дерево', ['яблуко', '„яблуко', 'яблуку']),
('вода', ['вода', 'риба', 'каламутна']),
('птах', ['короваю', 'коровай', 'корова']),
('книга', ['читати', 'читати»', 'їсти']),
('синій', ['вишнях', 'зморшках', 'плакуча'])])
OK, then definitely more words would be needed for the translation.
Either way I don’t need it, I need the space, roughly described here: mapping - How do i get a vector from gensim’s translation_matrix - Stack Overflow
Next time:
- get more words, e.g. from a dictionary
- get a space (sketch below)
- play with translations
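Roughly what “getting the space” would look like, going by the Stack Overflow answer above (a sketch, not verified: tm.translation_matrix is the learned mapping, models as loaded earlier):

import numpy as np

ru_vec = model_ru["кот"]                   # 300-dim vector in the RU space
mapped = ru_vec @ tm.translation_matrix    # the same vector, mapped into the UK space

# cosine similarity with the UK word it should land near
uk_vec = model_uk["кіт"]
print(np.dot(mapped, uk_vec) / (np.linalg.norm(mapped) * np.linalg.norm(uk_vec)))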
python - Combining/adding vectors from different word2vec models - Stack Overflow mentions transvec · PyPI, which allows accessing the vectors.
Anyway, my only reason for these particular vectors was the fastText multilingual alignment; now I can use other embeddings too.
word in model.key_to_index (which is a dict) works.

Trying transvec:

*** RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'transvec.transformers.TranslationWordVectorizer'> with constructor (self, target: 'gensim.models.keyedvectors.KeyedVectors', *sources: 'gensim.models.keyedvectors.KeyedVectors', alpha: float = 1.0, max_iter: Optional[int] = None, tol: float = 0.001, solver: str = 'auto', missing: str = 'raise', random_state: Union[int, numpy.random.mtrand.RandomState, NoneType] = None) doesn't follow this convention.
ah damn. Wasn’t an issue with the older one, though the only thing that changed is https://github.com/big-o/transvec/compare/master...Jpsaris:transvec:master
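For reference, roughly what I was running, reconstructed from the constructor signature in the error above (a hedged sketch: target model first, then source models; word_pairs as before; I didn’t get past this error):

from transvec.transformers import TranslationWordVectorizer

bilingual = TranslationWordVectorizer(model_uk, model_ru).fit(word_pairs)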
Decided to leave this till better times, but play with this one more hour today.
Coming back to mapping - How do i get a vector from gensim’s translation_matrix - Stack Overflow, I need mapped_source_space.
I should have used pycharm at a much earlier stage in the process.
mapped_source_space contains a matrix with the 4 vectors mapped to the target space.
Why does source_space have 1.8k words, while the source embedding space has 200k?
Ah, tm.translate() can translate words not found in the source space. Interesting!
AHA - the source/target spaces get built only from the words provided for training, 1.8k in my case. The translation matrix is then computed based on those.
BUT in translate() the target space gets built from the entire vocabulary!
Which means:
Results!
real (genuine pairs):
картошка/картопля -> 0.28
дом/дім -> 1.16
чай/чай -> 1.17
паспорт/паспорт -> 0.40
зерно/зерно -> 0.46
нос/ніс -> 0.94
false (false friends):
неделя/неділя -> 0.34
город/город -> 0.35
он/он -> 0.77
речь/річ -> 0.89
родина/родина -> 0.32
сыр/сир -> 0.99
папа/папа -> 0.63
мать/мати -> 0.52
Let’s normalize:
real (genuine pairs):
картошка/картопля -> 0.64
дом/дім -> 0.64
чай/чай -> 0.70
паспорт/паспорт -> 0.72
зерно/зерно -> 0.60
false (false friends):
неделя/неділя -> 0.55
город/город -> 0.44
он/он -> 0.33
речь/річ -> 0.54
родина/родина -> 0.50
сыр/сир -> 0.66
папа/папа -> 0.51
мать/мати -> 0.56
OK, so it mostly works! With good enough thresholds it can work. Words that are totally different aren’t similar (он), words that share some meanings (мать/мати) are closer.
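What “normalize” presumably means here (my reconstruction: unit-length vectors, so the dot product becomes plain cosine similarity and the values are comparable across pairs):

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

ru_vec = model_ru["мать"] @ tm.translation_matrix  # RU vector mapped into the UK space
uk_vec = model_uk["мати"]
sim = float(np.dot(unit(ru_vec), unit(uk_vec)))    # мать/мати scored 0.56 in the run above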
Ways to improve this:
https://github.com/frekwencja/most-common-words-multilingual
Created pairs out of the words in the dictionaries that are identically spelled (not кот/кіт/кит); will look at the similarity of each Russian word and its identically-spelled Ukrainian counterpart.
422 such words in common.
Sorted by similarity (lower values = more false-friend-y). Nope, mostly doesn’t make sense. But rare words seem to be the most ‘different’ ones:
{'поза': 0.3139531, 'iphone': 0.36648884, 'галактика': 0.39758587, 'Роман': 0.40571105, 'дюйм': 0.43442175, 'араб': 0.47358453, 'друг': 0.4818558, 'альфа': 0.48779228, 'гора': 0.5069237, 'папа': 0.50889325, 'проспект': 0.5117553, 'бейсбол': 0.51532406, 'губа': 0.51682216, 'ранчо': 0.52178365, 'голова': 0.527564, 'сука': 0.5336818, 'назад': 0.53545296, 'кулак': 0.5378426, 'стейк': 0.54102343, 'шериф': 0.5427336, 'палка': 0.5516712, 'ставка': 0.5519752, 'соло': 0.5522958, 'акула': 0.5531602, 'поле': 0.55333376, 'астроном': 0.5556448, 'шина': 0.55686104, 'агентство': 0.561674, 'сосна': 0.56177, 'бургер': 0.56337166, 'франшиза': 0.5638794, 'фунт': 0.56592, 'молекула': 0.5712515, 'браузер': 0.57368404, 'полковник': 0.5739758, 'горе': 0.5740198, 'шапка': 0.57745415, 'кампус': 0.5792211, 'дрейф': 0.5800869, 'онлайн': 0.58176875, 'замок': 0.582287, 'файл': 0.58236635, 'трон': 0.5824338, 'ураган': 0.5841942, 'диван': 0.584252, 'фургон': 0.58459675, 'трейлер': 0.5846335, 'приходить': 0.58562565, 'сотня': 0.585832, 'депозит': 0.58704704, 'демон': 0.58801174, 'будка': 0.5882363, 'царство': 0.5885376, 'миля': 0.58867997, 'головоломка': 0.5903712, 'цент': 0.59163713, 'казино': 0.59246653, 'баскетбол': 0.59255254, 'марихуана': 0.59257627, 'пастор': 0.5928912, 'предок': 0.5933549, 'район': 0.5940658, 'статистика': 0.59584284, 'стартер': 0.5987516, 'сайт': 0.5988183, 'демократ': 0.5999011, 'оплата': 0.60060596, 'тендер': 0.6014088, 'орел': 0.60169894, 'гормон': 0.6021177, 'метр': 0.6023728, 'меню': 0.60291564, 'гавань': 0.6029945, 'рукав': 0.60406476, 'статуя': 0.6047057, 'скульптура': 0.60497975, 'вагон': 0.60551536, 'доза': 0.60576916, 'синдром': 0.6064756, 'тигр': 0.60673815, 'сержант': 0.6070389, 'опера': 0.60711193, 'таблетка': 0.60712767, 'фокус': 0.6080196, 'петля': 0.60817575, 'драма': 0.60842395, 'шнур': 0.6091568, 'член': 0.6092182, 'сервер': 0.6094157, 'вилка': 0.6102615, 'мода': 0.6106603, 'лейтенант': 0.6111004, 'радар': 0.6117528, 'галерея': 0.61191505, 'ворота': 0.6125873, 'чашка': 0.6132187, 'крем': 0.6133907, 'бюро': 0.61342597, 'черепаха': 0.6146957, 'секс': 0.6151523, 'носок': 0.6156026, 'подушка': 0.6160687, 'бочка': 0.61691606, 'гольф': 0.6172053, 'факультет': 0.6178817, 'резюме': 0.61848575, 'нерв': 0.6186257, 'король': 0.61903644, 'трубка': 0.6194198, 'ангел': 0.6196466, 'маска': 0.61996806, 'ферма': 0.62029755, 'резидент': 0.6205579, 'футбол': 0.6209573, 'квест': 0.62117445, 'рулон': 0.62152386, 'сарай': 0.62211347, 'слава': 0.6222329, 'блог': 0.6223742, 'ванна': 0.6224452, 'пророк': 0.6224489, 'дерево': 0.62274456, 'горло': 0.62325376, 'порт': 0.6240524, 'лосось': 0.6243047, 'альтернатива': 0.62446254, 'кровоточить': 0.62455964, 'сенатор': 0.6246379, 'спортзал': 0.6246594, 'протокол': 0.6247676, 'ракета': 0.6254694, 'салат': 0.62662274, 'супер': 0.6277698, 'патент': 0.6280118, 'авто': 0.62803495, 'монета': 0.628338, 'консенсус': 0.62834597, 'резерв': 0.62838227, 'кабель': 0.6293858, 'могила': 0.62939847, 'небо': 0.62995523, 'поправка': 0.63010347, 'кислота': 0.6313528, 'озеро': 0.6314377, 'телескоп': 0.6323617, 'чудо': 0.6325846, 'пластик': 0.6329929, 'процент': 0.63322043, 'маркер': 0.63358307, 'датчик': 0.6337889, 'кластер': 0.633797, 'детектив': 0.6341895, 'валюта': 0.63469064, 'банан': 0.6358283, 'фабрика': 0.6360865, 'сумка': 0.63627976, 'газета': 0.6364525, 'математика': 0.63761103, 'плюс': 0.63765526, 'урожай': 0.6377103, 'контраст': 0.6385834, 'аборт': 0.63913494, 'парад': 0.63918126, 'формула': 0.63957334, 'арена': 0.6396606, 'парк': 0.6401386, 'посадка': 0.6401986, 
'марш': 0.6403458, 'концерт': 0.64061844, 'перспектива': 0.6413666, 'статут': 0.6419941, 'транзит': 0.64289963, 'параметр': 0.6430252, 'рука': 0.64307654, 'голод': 0.64329326, 'медаль': 0.643804, 'фестиваль': 0.6438755, 'небеса': 0.64397913, 'барабан': 0.64438117, 'картина': 0.6444177, 'вентилятор': 0.6454438, 'ресторан': 0.64582723, 'лист': 0.64694726, 'частота': 0.64801234, 'ручка': 0.6481528, 'ноутбук': 0.64842474, 'пара': 0.6486577, 'коробка': 0.64910173, 'сенат': 0.64915174, 'номер': 0.64946175, 'ремесло': 0.6498537, 'слон': 0.6499266, 'губернатор': 0.64999187, 'раковина': 0.6502305, 'трава': 0.6505385, 'мандат': 0.6511373, 'великий': 0.6511585, 'ящик': 0.65194154, 'череп': 0.6522753, 'ковбой': 0.65260696, 'корова': 0.65319675, 'честь': 0.65348136, 'легенда': 0.6538656, 'душа': 0.65390354, 'автобус': 0.6544202, 'метафора': 0.65446657, 'магазин': 0.65467703, 'удача': 0.65482104, 'волонтер': 0.65544796, 'сексуально': 0.6555309, 'ордер': 0.6557747, 'точка': 0.65612084, 'через': 0.6563236, 'глина': 0.65652716, 'значок': 0.65661323, 'плакат': 0.6568083, 'слух': 0.65709555, 'нога': 0.6572164, 'фотограф': 0.65756184, 'ненависть': 0.6578564, 'пункт': 0.65826315, 'берег': 0.65849876, 'альбом': 0.65849936, 'кролик': 0.6587049, 'масло': 0.6589803, 'бензин': 0.6590406, 'покупка': 0.65911734, 'параграф': 0.6596477, 'вакцина': 0.6603271, 'континент': 0.6609991, 'расизм': 0.6614046, 'правило': 0.661452, 'симптом': 0.661881, 'романтика': 0.6626457, 'атрибут': 0.66298646, 'олень': 0.66298693, 'кафе': 0.6635062, 'слово': 0.6636568, 'машина': 0.66397023, 'джаз': 0.663977, 'пиво': 0.6649644, 'слуга': 0.665489, 'температура': 0.66552, 'море': 0.666358, 'чувак': 0.6663854, 'комфорт': 0.66651237, 'театр': 0.66665906, 'ключ': 0.6670032, 'храм': 0.6673037, 'золото': 0.6678767, 'робот': 0.66861665, 'джентльмен': 0.66861814, 'рейтинг': 0.6686267, 'талант': 0.66881114, 'флот': 0.6701237, 'бонус': 0.67013747, 'величина': 0.67042017, 'конкурент': 0.6704642, 'конкурс': 0.6709986, 'доступ': 0.6712131, 'жанр': 0.67121863, 'пакет': 0.67209935, 'твердо': 0.6724718, 'клуб': 0.6724739, 'координатор': 0.6727365, 'глобус': 0.67277336, 'карта': 0.6731522, 'зима': 0.67379165, 'вино': 0.6737963, 'туалет': 0.6744124, 'середина': 0.6748006, 'тротуар': 0.67507124, 'законопроект': 0.6753582, 'земля': 0.6756074, 'контейнер': 0.6759613, 'посольство': 0.67680794, 'солдат': 0.6771952, 'канал': 0.677311, 'норма': 0.67757475, 'штраф': 0.67796284, 'маркетинг': 0.67837185, 'приз': 0.6790007, 'дилер': 0.6801595, 'молитва': 0.6806114, 'зона': 0.6806243, 'пояс': 0.6807122, 'автор': 0.68088144, 'рабство': 0.6815858, 'коридор': 0.68208706, 'пропаганда': 0.6826943, 'журнал': 0.6828874, 'портрет': 0.68304217, 'фермер': 0.6831401, 'порошок': 0.6831531, 'сюрприз': 0.68327177, 'камера': 0.6840434, 'фаза': 0.6842661, 'природа': 0.6843757, 'лимон': 0.68452585, 'гараж': 0.68465877, 'рецепт': 0.6848821, 'свинина': 0.6863143, 'атмосфера': 0.6865022, 'режим': 0.6870908, 'характеристика': 0.6878463, 'спонсор': 0.6879278, 'товар': 0.6880773, 'контакт': 0.6888988, 'актриса': 0.6891222, 'диск': 0.68916976, 'шоколад': 0.6892894, 'банда': 0.68934155, 'панель': 0.68947715, 'запуск': 0.6899455, 'травма': 0.690045, 'телефон': 0.69024855, 'список': 0.69054323, 'кредит': 0.69054526, 'актив': 0.69087565, 'партнерство': 0.6909646, 'спорт': 0.6914842, 'маршрут': 0.6915196, 'репортер': 0.6920864, 'сегмент': 0.6920909, 'бунт': 0.69279015, 'риторика': 0.69331145, 'школа': 0.6933826, 'оператор': 0.69384277, 'ветеран': 0.6941337, 'членство': 0.69435036, 'схема': 
0.69441277, 'манера': 0.69451445, 'командир': 0.69467854, 'формат': 0.69501007, 'сцена': 0.69557995, 'секрет': 0.6961215, 'курс': 0.6964162, 'компонент': 0.69664925, 'патруль': 0.69678336, 'конверт': 0.6968681, 'символ': 0.6973544, 'насос': 0.6974678, 'океан': 0.69814134, 'критик': 0.6988366, 'доброта': 0.6989736, 'абсолютно': 0.6992678, 'акцент': 0.6998319, 'ремонт': 0.70108724, 'мама': 0.7022723, 'тихо': 0.70254886, 'правда': 0.7040037, 'транспорт': 0.704239, 'книга': 0.7051158, 'вода': 0.7064695, 'кухня': 0.7070433, 'костюм': 0.7073295, 'дикий': 0.70741034, 'прокурор': 0.70768344, 'консультант': 0.707697, 'квартира': 0.7078515, 'шанс': 0.70874536, 'сила': 0.70880103, 'хаос': 0.7089504, 'дебют': 0.7092187, 'завтра': 0.7092679, 'горизонт': 0.7093906, 'модель': 0.7097884, 'запах': 0.710207, 'сама': 0.71082854, 'весна': 0.7109366, 'орган': 0.7114152, 'далекий': 0.7118393, 'смерть': 0.71213734, 'медсестра': 0.71224624, 'молоко': 0.7123647, 'союз': 0.71299064, 'звук': 0.71361446, 'метод': 0.7138604, 'корпус': 0.7141677, 'приятель': 0.71538115, 'центр': 0.716277, 'максимум': 0.7162813, 'страх': 0.7166886, 'велосипед': 0.7168154, 'контроль': 0.7171681, 'ритуал': 0.71721196, 'команда': 0.7175366, 'молоток': 0.71759546, 'цикл': 0.71968937, 'жертва': 0.7198437, 'статус': 0.7203152, 'пульс': 0.7206338, 'тренер': 0.72116625, 'сектор': 0.7221448, 'музей': 0.72323525, 'сфера': 0.7245963, 'пейзаж': 0.7246053, 'вниз': 0.72528857, 'редактор': 0.7254647, 'тема': 0.7256167, 'агент': 0.7256874, 'дизайнер': 0.72618955, 'деталь': 0.72680634, 'банк': 0.7270782, 'союзник': 0.72750694, 'жест': 0.7279984, 'наставник': 0.7282404, 'тактика': 0.72968495, 'спектр': 0.7299538, 'проект': 0.7302779, 'художник': 0.7304505, 'далеко': 0.7306006, 'ресурс': 0.73075294, 'половина': 0.7318293, 'явно': 0.7323554, 'день': 0.7337892, 'юрист': 0.73461473, 'широко': 0.73490566, 'закон': 0.7372453, 'психолог': 0.7373602, 'сигарета': 0.73835427, 'проблема': 0.7388488, 'аргумент': 0.7389784, 'старший': 0.7395191, 'продукт': 0.7395814, 'ритм': 0.7406945, 'широкий': 0.7409786, 'голос': 0.7423325, 'урок': 0.74272805, 'масштаб': 0.74474066, 'критика': 0.74535364, 'правильно': 0.74695253, 'авторитет': 0.74697924, 'активно': 0.74720675, 'причина': 0.7479735, 'сестра': 0.74925977, 'сигнал': 0.749686, 'алкоголь': 0.7517742, 'регулярно': 0.7521055, 'мотив': 0.7527843, 'бюджет': 0.7531772, 'плоский': 0.754082, 'посол': 0.75505507, 'скандал': 0.75518423, 'дизайн': 0.75567746, 'персонал': 0.7561288, 'адвокат': 0.7561835, 'принцип': 0.75786924, 'фонд': 0.7583069, 'структура': 0.75888604, 'дискурс': 0.7596848, 'вперед': 0.76067656, 'контур': 0.7607424, 'спортсмен': 0.7616756, 'стимул': 0.7622434, 'партнер': 0.76245433, 'стиль': 0.76301545, 'сильно': 0.7661394, 'текст': 0.7662303, 'фактор': 0.76729685, 'герой': 0.7697237, 'предмет': 0.775718, 'часто': 0.7780384, 'план': 0.77855974, 'рано': 0.78059715, 'факт': 0.782439, 'конкретно': 0.78783923, 'сорок': 0.79080343, 'аспект': 0.79219675, 'контекст': 0.7926827, 'роль': 0.796745, 'президент': 0.8007479, 'результат': 0.80227, 'десять': 0.8071967, 'скоро': 0.80976427, 'тонкий': 0.8100516, 'момент': 0.8120169, 'нести': 0.81280494, 'документ': 0.8216758, 'просто': 0.8222313, 'очевидно': 0.8242744, 'точно': 0.83183587, 'один': 0.83644223, 'пройти': 0.84026355}
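Roughly how this list was produced (a sketch; assumes the aligned models and translation matrix from earlier, plus hypothetical ru_words/uk_words lists read from the frekwencja dictionaries):

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# identically-spelled words present in both lists (422 of them here)
common = set(ru_words) & set(uk_words)

sims = {
    w: float(np.dot(unit(model_ru[w] @ tm.translation_matrix), unit(model_uk[w])))
    for w in common
    if w in model_ru.key_to_index and w in model_uk.key_to_index
}

# lower similarity = more false-friend-y
for w, s in sorted(sims.items(), key=lambda kv: kv[1]):
    print(w, round(s, 3))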
Ways to improve:
- remove potential bad words from the training set
- expand the search for candidate words by making predictable changes a la Sira et al. (2019), http://arxiv.org/abs/1912.08582
- add weighting based on frequency, since rarer words will have less stable embeddings (sketch below)
- look at other trained vectors, ideally something more processed
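A sketch for the frequency-weighting idea (assumption: the wiki .vec files are sorted by corpus frequency, so vocabulary rank works as a cheap frequency proxy):

def rank_weight(word, model, cap=100_000):
    """1.0 for the most frequent words, decaying linearly to 0.0 at `cap`."""
    rank = model.key_to_index.get(word, cap)
    return 1.0 - min(rank, cap) / cap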
And actually, thinking about it: is there anything I can solve through this that I can’t solve by parsing one or more dictionaries, maybe even making embeddings of the definitions of the various words?
Conclusion: leaving this alone till after the Masterarbeit as a side project. It’s incredibly interesting but probably not directly practical. Sad.
By default it’s <Esc>, a bad idea for the same reason it’s a bad idea in vim.
AND my xkeymap-level keyboard mapping for Esc doesn’t seem to work here.
The second default is <C-]>, which is impossible because of my custom keyboard layout.
So it will be <C-=>.
{
    "command": "vim:leave-insert-mode",
    "selector": ".jp-NotebookPanel[data-jp-vim-mode='true'] .jp-Notebook.jp-mod-editMode",
    "keys": ["Ctrl ="]
}
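(If I remember right, this goes into Settings → Advanced Settings Editor → Keyboard Shortcuts, into the “shortcuts” list in User Preferences.)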
(I can’t figure out why ,l etc. don’t work in jupyterlab for this purpose)
(<leader> is ,)
"Insert mode mappings
" Leave insert mode
imap <leader>l <Esc>
imap qj <Esc>
" Write, write and close
imap ,, <Esc>:x<CR>
map ,. :w<CR>
… I will have a unified set of bindings for this someday, I promise.
EDIT: this is becoming a more generic thingy for everything I’d ever need to refer to when writing a paper, TODO clean this at some point.
Resources – DREAM Lab links to https://dream.cs.umass.edu/wp-content/uploads/2020/04/Tips-and-Best-Practices.pdf. Until I set up a system to save PDF info, I’ll paste it as screenshots here:
ChatGPT summarized the relevant pages of the PDF, but didn’t do it well, so I mostly rewrote it myself:
- multi-discipli\-nary (\- marks allowed hyphenation points)
- \begin{sloppypar}... for paragraphs where latex goes over the margin.
- \begin{figure}[t]\centering for aligning tables and figures.
- sth~\cite{whatever}
- \emph over bold or \textit.
- \newcommand{\system}{SQuID\xspace}; \xspace here adds a space unless at the end of a sentence. Package \usepackage{xspace}.
- \smallskip, \medskip, and \bigskip instead of \vspace.
- \linewidth or \textwidth.
- \resizebox with appropriate dimensions.
- Compact lists:
  \begin{itemize}
    \setlength{\itemsep}{0pt}
    \setlength{\parskip}{0pt}
best practices - When should I use non-breaking space? - TeX - LaTeX Stack Exchange lists ALL the places where Knuth wanted people to put nonbreaking spaces, incl:
- 1)~one 2)~two
- Donald~E. Knuth
- 1,~2
- Chapter~12
Less obvious and not from him:
- I~am
Also:
- and around ~50%
Forgetting that ~ is a nbsp is hard to catch when reading the text.
assert False (or a failing test) so that I know where I stopped the last time; a \latexstopcompiling (undefined command, so compilation stops there) is a neat way to make sure I REALLY finish a certain line I started but didn’t finish.
Rounding.
Previously: 211018-1510 Python rounding behaviour, with the TL;DR that python uses banker’s rounding: .5 rounds towards the even number.
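Quick sanity check of that behaviour (standard Python):

>>> round(0.5), round(1.5), round(2.5)
(0, 2, 2)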
Floor/ceil have their usual latex notation as \rceil, \rfloor (see LaTeX/Mathematics - Wikibooks, open books for an open world at ‘delimiters’)
“Normal” rounding (towards nearest integer) has no standard notation: ceiling and floor functions - What is the mathematical notation for rounding a given number to the nearest integer? - Mathematics Stack Exchange
“let XXX denote the standard rounding function”
Banker’s rounding (which python and everyone else use for tie-breaking at .5 in normal rounding) has no standard notation either.
Let $\lfloor x \rceil$ denote "round half to even" rounding (a.k.a. "Banker's rounding"), consistent with Python's built-in round() and NumPy's np.round() functions.
Require/Ensure is basically Input/Output and can be renamed thus:
\floatname{algorithm}{Procedure}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
\usepackage{algorithm}
\usepackage{algpseudocode}
% ...
\begin{algorithm}
\caption{Drop Rare Species per Country}
\label{alg:drop}
\begin{algorithmic}
\Require $D_0$: initial set of occurrences
\Ensure $D_1$: set of occurrences after filtering rare species
\State $D_1 \gets \emptyset$
\For{each $c$ in Countries}
    \For{each $s$ in Species}
        \If {$|O_{c,s} \in D_0| \geq 10$} % keep species with at least 10 observations in the country; |.| is set cardinality
            \State{$D_1 \gets D_1 \cup O_{c,s}$}
        \EndIf
    \EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
INTERACTIVE_TABLES = False
USE_BLACK = True
# 100% width table
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
if INTERACTIVE_TABLES:
    from itables import init_notebook_mode

    init_notebook_mode(all_interactive=True, connected=True)
# black formatting
if USE_BLACK:
    %load_ext jupyter_black
# column/row limits removal
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', 100)
# figsize is figsize
plt.rcParams["figure.figsize"] = (6, 8)
plt.rcParams["figure.dpi"] = 100
# CHANGEME
PATH_STR = "xxxxx/home/sh/hsa/plants/inat500k/gbif.metadata.csv"
PATH = Path(PATH_STR)
assert PATH.exists()
List of all map providers, not all included in geopandas and some paid, nevertheless really neat: https://xyzservices.readthedocs.io/en/stable/gallery.html
For the 231024-1704 Master thesis task CBT task of my 230928-1745 Masterarbeit draft, I’d like to create an ontology I can use to “seed” LMs to generate ungoogleable stories.
And it’s gonna be fascinating.
I don’t know the difference between a knowledge graph, an ontology etc. at this point.
I want it to be highly abstract - I don’t care if it’s a forest, if it’s Cinderella etc., I want the relationships.
Let’s try. Cinderella is basically “Rags to riches”, so:
…
Or GPT3’s ideas from before:
"Entities": {
"Thief": {"Characteristics": ["Cunning", "Resourceful"], "Role": "Protagonist"},
"Fish": {"Characteristics": ["Valuable", "Symbolic"], "Role": "Object"},
"Owner": {"Characteristics": ["Victimized", "Unaware"], "Role": "Antagonist"}
},
"Goals": {
"Thief": "Steal Fish",
"Owner": "Protect Property"
},
"Challenges": {
"Thief": "Avoid Detection",
"Owner": "Secure Property"
},
"Interactions": {
("Thief", "Fish"): "Theft",
("Thief", "Owner"): "Avoidance",
("Owner", "Fish"): "Ownership"
},
"Outcomes": {
"Immediate": "Successful Theft",
"Long-term": "Loss of Trust"
},
"Moral Lessons": {
"Actions Have Consequences",
"Importance of Trust",
"Greed Leads to Loss"
}
Here’s it generating an ontology based on the above graph: https://chat.openai.com/share/92ed18ce-88f9-4262-9dd9-f06a07d06acc
And more in UKR: https://chat.openai.com/share/846a5e85-353e-4bb5-adbe-6da7825c51ed
The bits I’m not sure of are in bold. In decreasing order of abstraction, with the first two being the most generic ones and the later ones more fitting for concrete stories.
Characteristics:
- Role: CHARACTER ROLE
- Entity: ENTITY
- Goal: main goal of entity in this context
  - SHORT-TERM: plaintext description
  - LONG-TERM: plaintext description

Remaining issues:
Here’s ChatGPT applying that to Shrek: https://chat.openai.com/share/d96d4be6-d42f-4096-a18f-03f786b802c6
Modifying its answers:
“Using this ontology for abstract fairy tale description, please create a generalized graph structure for THE FIRST HARRY POTTER MOVIE. Focus on the overarching themes and character roles without specific names or unique settings. The graph should include key plot points, character roles, entities, goals, interactions, outcomes, and moral lessons, all described in a manner that is broadly applicable to similar stories.”
Context:
> pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf
# unicode magic
Try running pandoc with --pdf-engine=xelatex.
# thank you
> pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex
# a volley of...
[WARNING] Missing character: There is no о (U+043E) in font [lmroman10-italic]:mapping=tex-text;!
Exporting Hugo to PDF | akos.ma looks nice.
build/pdf/%.pdf: content/posts/%/index.md
	$(PANDOC) --write=pdf --pdf-engine=xelatex \
		--variable=papersize:a4 --variable=links-as-notes \
		--variable=mainfont:DejaVuSans \
		--variable=monofont:DejaVuSansMono \
		--resource-path=$$(dirname $<) --out=$@ $< 2> /dev/null
Let’s try:
pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex --variable=links-as-notes \
--variable=mainfont:DejaVuSans \
--variable=monofont:DejaVuSansMono
Better but not much: HTML is not parsed, and lists count as lists only after a newline, it seems.
pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex --variable=links-as-notes \
--variable=mainfont:DejaVuSans \
--variable=monofont:DejaVuSansMono \
--from=markdown+lists_without_preceding_blankline
Better, but quotes unsolved:
Markdown blockquote shouldn’t require a leading blank line · Issue #7069 · jgm/pandoc
pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex --variable=links-as-notes \
--variable=mainfont:DejaVuSans \
--variable=monofont:DejaVuSansMono \
--from=markdown+lists_without_preceding_blankline
#+blank_before_blockquote
ACTUALLY, -f gfm (github-flavoured markdown) solves basically everything. commonmark doesn’t parse latex; commonmark_x (‘with many md extensions’) on first sight is similar to gfm.
I think HTML is the last one.
Raw HTML says it’s only for strict:
--from=markdown_strict+markdown_in_html_blocks
msword - Pandoc / Latex / Markdown - TeX - LaTeX Stack Exchange suggests md to tex and then tex to pdf, an interesting approach.
6.11 Write raw LaTeX code | R Markdown Cookbook says complex latex code may be too complex for markdown.
This means this except w/o backslashes:
\```{=latex}
$\underset{\text{NOUN-NOM}}{\overset{\text{man}}{\text{чоловік-}\varnothing}}$ $\underset{\text{PST}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{NOUN-ACC}}{\overset{\text{dog}}{\text{собак-у}}}$.
\```
Then commonmark_x can handle that.
EDIT: --standalone!
I don’t need HTML, I need <sub>.
pandoc md has a syntax for this: Pandoc - Pandoc User’s Guide
Options:
- --from=markdown+lists_without_preceding_blankline+blank_before_blockquote? :(
- ChatGPT tried to create a filter but nothing works, I’ll leave it for later: https://chat.openai.com/share/c94fffbe-1e90-4bc0-9e97-6027eeab281a
This produces the best HTML documents:
> pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.html \
--from=gfm --mathjax --standalone
NB: if I add CSS, it should be an absolute path.
It’d be cool to wrap examples in the same environment!
https://forum.obsidian.md/t/rendering-callouts-similarly-in-pandoc/40020:
-- https://forum.obsidian.md/t/rendering-callouts-similarly-in-pandoc/40020/6
--
local stringify = (require "pandoc.utils").stringify

function BlockQuote (el)
  start = el.content[1]
  if (start.t == "Para" and start.content[1].t == "Str" and
      start.content[1].text:match("^%[!%w+%][-+]?$")) then
    _, _, ctype = start.content[1].text:find("%[!(%w+)%]")
    el.content:remove(1)
    start.content:remove(1)
    div = pandoc.Div(el.content, {class = "callout"})
    div.attributes["data-callout"] = ctype:lower()
    div.attributes["title"] = stringify(start.content):gsub("^ ", "")
    return div
  else
    return el
  end
end
Makes:
> [!NOTE]- callout Title
>
> callout content
into
::: {.callout data-callout="note" title="callout Title"}
callout content
:::
.callout {
    color: red; /* set text color to red */
    border: 1px solid red; /* optional: add a red border */
    padding: 10px; /* optional: add some padding */
    /* add any other styling as needed */
}
Then this makes it pretty HTML:
pandoc callout.md -L luas/obsidian-callouts.lua -t markdown -s | pandoc --standalone -o some_test.html --css luas/callout-style.css
<div class="callout" data-callout="note" title="callout Title">
<p>callout content</p>
</div>
For PDF it’s more complex; will need a header file like the one below etc. later on. TODO
\usepackage{xcolor} % required for color definition
\newenvironment{callout}{
    \color{red} % sets the text color to red within the environment
    % add any other formatting commands here
}{}
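(Presumably to be passed to pandoc via --include-in-header / -H once I get to the PDF step.)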
--css /abs/tufte.css!
Copied executables to /home/sh/.local/bin/: aha, so that’s where you put your filters, inside $PATH.
Damn! Just had to replace index.md with my thesis, then make all, and it just …worked. Wow.
Apparently to make it not a sidenote I just have to add - to the footnote itself. Would be trivial to replace with an @ etc.; then I get my initial plan: citations as citations, and footnotes with my remarks as sidenotes.
I can add --from gfm --mathjax to the makefile command and it works with all my other requirements!
pandoc \
--katex \
--section-divs \
--from gfm \
--mathjax \
--filter pandoc-sidenote \
--to html5+smart \
--template=tufte \
--css tufte.css --css pandoc.css --css pandoc-solarized.css --css tufte-extra.css \
--output docs/tufte-md/index.html \
docs/tufte-md/index.md
I wonder if I can modify it to create latex-style sidenotes, it should be very easy: pandoc-sidenote/src/Text/Pandoc/SideNote.hs at master · jez/pandoc-sidenote
{#fig:label}
$$ math $$ {#eq:label}
Section {#sec:section}
TODO figure out, and latex as well.
TODO