In the middle of the desert you can say anything you want
Copypasting this (still draft version) here in full, before radically shortening it for my master thesis.
L’Ukraine a toujours aspiré à être libre
“Ukraine has always aspired to be free.” Voltaire, 1731 1
This section describes the bilingual nature of Ukraine’s society and the impact of historical state policies on the modern development of the language.
The ongoing Russian invasion is viewed by many as a continuation of a long-standing historical pattern, rather than an isolated incident.
This section doesn’t attempt to justify or challenge any particular position regarding the events described, nor is it meant to be a definitive account of the history of the language.
But I believe this perspective is important to understanding the current linguistic landscape in Ukraine, as well as the linguistic challenges and phenomena that have direct relevance to this thesis. (TODO mention how and which tasks are impacted by this)
In Ukraine itself, the status of Ukrainian (its only official language) varies widely, but for a large part of Ukrainians the question was never much in the foreground (until recently, that is).
A significant number of people in Ukraine are bilingual (Ukrainian and Russian languages), and almost everyone can understand both Russian and Ukrainian.2
The reasons for this include Ukraine’s geographical and cultural proximity to Russia, but it was to a large extent a result of consistent policy, first of the Russian Empire and then of the Soviet Union.
In the Russian Empire, the broader imperial ideology sought to assimilate various ethnicities into a single Russian identity (with Russian as the dominant language), and policies aimed at diminishing Ukrainian national self-consciousness were a facet of that3. TODO source
Ukrainian (then officially called the Little Russian language/малорусский язык) was stigmatized as an (uncultured town folks’) dialect of Russian, unsuited for ‘serious’ literature or poetry, as opposed to the Great Russian language (not editorializing, it was literally called that; this phrasing applied to the names of the ethnicities as well, with Russia as Great Russia and Ukraine as Little Russia; the extent to which this reflected broader cultural attitudes is a discussion outside the scope of this thesis). (TODO footnote to ‘War and Punishment’ for more on this)
The history of Ukrainian language bans is long enough to merit a Wikipedia page itemizing all the attempts, 4 with the more notable ones in the Russian Empire being the 1863 Valuev Circular (forbidding the use of Ukrainian in religious and educational printed literature) and the Ems Ukaz, a decree by Emperor Alexander II banning the use of the Ukrainian language in print (except for reprinting old documents), forbidding the import of Ukrainian publications and the staging of plays or lectures in Ukrainian (1876). (TODO sources for both)
The 1928 grammar reform (sometimes called Skrypnykivka after the minister of education Skrypnyk), passed during this period and drafted by a committee of prominent Ukrainian linguists, writers, and teachers, synthesized the different dialects into a single orthography to be used across the entire territory.
The Ukrainian writers and intellectuals of that period became known as “the executed Renaissance”: most of them were purged after the Soviet Union took a sharp turn towards Russification in the late 1920s and in the multiple waves of repression that followed. (Most prominent members of the committee behind Skrypnykivka were repressed as well; Skrypnyk himself committed suicide in 1933.)
A new ‘orthographic’ reform was drafted in 1933, with the stated goal of removing the alleged bourgeois influences of the previous one. Andriy Khvylia5, the chairman of the new Orthography Commission, described in his 1933 book “Eradicate, Destroy the Roots of Ukrainian Nationalism on the Linguistic Front” (TODO source) how the new reform eliminated all “deadly conservative norms established by nationalists” that “focused the Ukrainian language on the Polish and Czech bourgeois cultures (…) and set a barrier between the Ukrainian and Russian language”.
In practice the reform brought the Ukrainian language much closer to Russian in many ways:
Many Ukrainian writers, poets and dissidents kept using the ‘old’ orthography, as did the Ukrainian community outside the Soviet Union.
After the fall of the Soviet Union, there were many proposals for restoring the original orthography, but only the letter ґ was restored. In 2019 a new version of the Ukrainian orthography was approved, which restored some of the original rules as ’legal’ variants but without mandating any of them.
TODO format citation Debunking the myth of a divided Ukraine - Atlantic Council citing Oeuvres complètes de Voltaire - Voltaire - Google Books ↩︎
While the two languages are mutually intelligible to a large extent, knowing one doesn’t automatically mean understanding the other: most Russians can’t understand Ukrainian nearly as well as Ukrainians understand Russian, for example. ↩︎
(by no means the only one — but the stories of other victims of Russia’s imperialism are best told elsewhere, and for many ethnicities, especially ones deeper inside Russia’s borders, there’s no one left to tell the story) ↩︎
Later repressed for nationalism. ↩︎
Require/Ensure is basically Input/Output and can be renamed thus1:
\floatname{algorithm}{Procedure}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
\usepackage{algorithm}
\usepackage{algpseudocode}
% ...
\begin{algorithm}
\caption{Drop Rare Species per Country}
\label{alg:drop}
\begin{algorithmic}
\Require $D_0$: initial set of occurrences
\Ensure $D_1$: Set of occurrences after filtering rare species
\State $D_1 \gets \emptyset$
\For{each $c$ in Countries}
\For{each $s$ in Species}
\If {$|O_{c,s} \in D_0| \geq 10$} % keep the species if it has at least 10 observations in this country within D_0; |.| is set cardinality
\State{$D_1 \gets D_1 \cup O_{c,s}$}
\EndIf
\EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
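For comparison, a minimal Python sketch of the same filtering step. The field names `country` and `species` and the toy data are my assumptions for illustration, not the real dataset layout; the threshold of 10 is the one from the pseudocode.

```python
from collections import defaultdict

MIN_OCCURRENCES = 10  # threshold from the pseudocode above


def drop_rare_species(occurrences, min_count=MIN_OCCURRENCES):
    """Keep only occurrences whose (country, species) group has at least min_count entries."""
    groups = defaultdict(list)
    for occ in occurrences:
        # "country" and "species" are hypothetical field names
        groups[(occ["country"], occ["species"])].append(occ)
    kept = []
    for group in groups.values():
        if len(group) >= min_count:
            kept.extend(group)
    return kept


# Toy data: 12 oak records in DE (kept), 3 fir records in DE (dropped).
d0 = [{"country": "DE", "species": "oak"}] * 12 + [{"country": "DE", "species": "fir"}] * 3
d1 = drop_rare_species(d0)
print(len(d0), "->", len(d1))
```

Grouping once and filtering the groups avoids the nested loop over all countries and species, but the result should be the same set as in the procedure.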
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
INTERACTIVE_TABLES=False
USE_BLACK = True
# 100% width table
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
if INTERACTIVE_TABLES:
    from itables import init_notebook_mode
    init_notebook_mode(all_interactive=True, connected=True)
# black formatting
if USE_BLACK:
    %load_ext jupyter_black
# column/row limits removal
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', 100)
# default figure size (width, height) in inches
plt.rcParams["figure.figsize"] = (6, 8)
plt.rcParams["figure.dpi"] = 100
# CHANGEME
PATH_STR = "xxxxx/home/sh/hsa/plants/inat500k/gbif.metadata.csv"
PATH = Path(PATH_STR)
assert PATH.exists()
List of all map providers, not all included in geopandas and some paid, nevertheless really neat: https://xyzservices.readthedocs.io/en/stable/gallery.html
For the 231024-1704 Master thesis task CBT task of my 230928-1745 Masterarbeit draft, I’d like to create an ontology I can use to “seed” LMs to generate ungoogleable stories.
And it’s gonna be fascinating.
I don’t know what’s the difference between knowledge graph, ontology etc. at this point.
I want it to be highly abstract - I don’t care if it’s a forest, if it’s Cinderella etc., I want the relationships.
Let’s try. Cinderella is basically “Rags to riches”, so:
…
Or GPT3’s ideas from before:
"Entities": {
"Thief": {"Characteristics": ["Cunning", "Resourceful"], "Role": "Protagonist"},
"Fish": {"Characteristics": ["Valuable", "Symbolic"], "Role": "Object"},
"Owner": {"Characteristics": ["Victimized", "Unaware"], "Role": "Antagonist"}
},
"Goals": {
"Thief": "Steal Fish",
"Owner": "Protect Property"
},
"Challenges": {
"Thief": "Avoid Detection",
"Owner": "Secure Property"
},
"Interactions": {
("Thief", "Fish"): "Theft",
("Thief", "Owner"): "Avoidance",
("Owner", "Fish"): "Ownership"
},
"Outcomes": {
"Immediate": "Successful Theft",
"Long-term": "Loss of Trust"
},
"Moral Lessons": {
"Actions Have Consequences",
"Importance of Trust",
"Greed Leads to Loss"
}
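Since the structure above uses tuples as keys, it is closer to a Python dict than to JSON. A rough sketch of how the "I only want the relationships" idea could look: entity names are swapped for their roles, so only the abstract graph remains. The trimmed-down `ontology` and the helper name `to_abstract_edges` are my own illustration, not something GPT-3 produced.

```python
# Hypothetical, trimmed-down version of the ontology above.
ontology = {
    "Entities": {
        "Thief": {"Role": "Protagonist"},
        "Fish": {"Role": "Object"},
        "Owner": {"Role": "Antagonist"},
    },
    "Interactions": {
        ("Thief", "Fish"): "Theft",
        ("Thief", "Owner"): "Avoidance",
        ("Owner", "Fish"): "Ownership",
    },
}


def to_abstract_edges(onto):
    """Replace entity names by their roles, keeping only the relationships."""
    roles = {name: attrs["Role"] for name, attrs in onto["Entities"].items()}
    return [
        (roles[src], relation, roles[dst])
        for (src, dst), relation in onto["Interactions"].items()
    ]


edges = to_abstract_edges(ontology)
for edge in edges:
    print(edge)
```

The resulting triples no longer mention thieves or fish, which is roughly the level of abstraction I'm after.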
Here’s it generating an ontology based on the above graph: https://chat.openai.com/share/92ed18ce-88f9-4262-9dd9-f06a07d06acc
And more in UKR: https://chat.openai.com/share/846a5e85-353e-4bb5-adbe-6da7825c51ed
In bold bits I’m not sure of. In decreasing order of abstraction, with the first two being the most generic ones and the latter ones more fitting for concrete stories.
Characteristics
:
Role
: CHARACTER ROLE
Entity
: ENTITY
Goal
: main goal of entity in this context
SHORT-TERM
: plaintext description
LONG-TERM
: plaintext description

Remaining issues:
Here’s ChatGPT applying that to Shrek: https://chat.openai.com/share/d96d4be6-d42f-4096-a18f-03f786b802c6
Modifying its answers:
“Using this ontology for abstract fairy tale description, please create a generalized graph structure for THE FIRST HARRY POTTER MOVIE. Focus on the overarching themes and character roles without specific names or unique settings. The graph should include key plot points, character roles, entities, goals, interactions, outcomes, and moral lessons, all described in a manner that is broadly applicable to similar stories.”
<
Context:
> pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf
# unicode magic
Try running pandoc with --pdf-engine=xelatex.
# thank you
> pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex
# a volley of...
[WARNING] Missing character: There is no о (U+043E) in font [lmroman10-italic]:mapping=tex-text;!
Exporting Hugo to PDF | akos.ma looks nice.
build/pdf/%.pdf: content/posts/%/index.md
$(PANDOC) --write=pdf --pdf-engine=xelatex \
--variable=papersize:a4 --variable=links-as-notes \
--variable=mainfont:DejaVuSans \
--variable=monofont:DejaVuSansMono \
--resource-path=$$(dirname $<) --out=$@ $< 2> /dev/null
Let’s try:
pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex --variable=links-as-notes \
--variable=mainfont:DejaVuSans \
--variable=monofont:DejaVuSansMono
Better but not much; HTML is not parsed, lists count as lists only after a newline it seems.
pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex --variable=links-as-notes \
--variable=mainfont:DejaVuSans \
--variable=monofont:DejaVuSansMono \
--from=markdown+lists_without_preceding_blankline
Better, but quotes unsolved:
Markdown blockquote shouldn’t require a leading blank line · Issue #7069 · jgm/pandoc
pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.pdf --pdf-engine=xelatex --variable=links-as-notes \
--variable=mainfont:DejaVuSans \
--variable=monofont:DejaVuSansMono \
--from=markdown+lists_without_preceding_blankline
#+blank_before_blockquote
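Since the command keeps growing with each attempt, here is a small sketch of assembling it programmatically. It only builds the argument list from the flags already used above and runs nothing; the helper name `build_pandoc_cmd` is made up for this note.

```python
import shlex


def build_pandoc_cmd(source, output, extensions=(), fonts=None):
    """Return the pandoc argument list; nothing is executed here."""
    # markdown+ext1+ext2, as in the --from option above
    reader = "+".join(["markdown", *extensions])
    cmd = [
        "pandoc", source, "-o", output,
        "--pdf-engine=xelatex",
        "--variable=links-as-notes",
        f"--from={reader}",
    ]
    for key, font in (fonts or {}).items():
        cmd.append(f"--variable={key}:{font}")
    return cmd


cmd = build_pandoc_cmd(
    "230928-1745 Masterarbeit draft.md",
    "master_thesis.pdf",
    extensions=["lists_without_preceding_blankline"],
    fonts={"mainfont": "DejaVuSans", "monofont": "DejaVuSansMono"},
)
print(shlex.join(cmd))  # shell-quoted, ready to paste
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming pandoc and xelatex are installed.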
ACTUALLY, -f gfm (github-flavour) solves basically everything. commonmark doesn’t parse latex, commonmark_x (‘with many md extensions’) on first sight is similar to gfm.
I think HTML is the last one.
Raw HTML says it’s only for strict:
--from=markdown_strict+markdown_in_html_blocks
msword - Pandoc / Latex / Markdown - TeX - LaTeX Stack Exchange suggests going md to tex and then tex to pdf, interesting approach.
6.11 Write raw LaTeX code | R Markdown Cookbook says complex latex code may be too complex for markdown.
This means this except w/o backslashes:
\```{=latex}
$\underset{\text{NOUN-NOM}}{\overset{\text{man}}{\text{чоловік-}\varnothing}}$ $\underset{\text{PST}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{NOUN-ACC}}{\overset{\text{dog}}{\text{собак-у}}}$.
\```
Then commonmark_x can handle that.
EDIT: --standalone!
I don’t need HTML, I need <sub>.
pandoc md has a syntax for this: Pandoc - Pandoc User’s Guide
Options --from=markdown+lists_without_preceding_blankline+blank_before_blockquote? :(
ChatGPT tried to create a filter but nothing works, I’ll leave it for later: https://chat.openai.com/share/c94fffbe-1e90-4bc0-9e97-6027eeab281a
This produces the best HTML documents:
> pandoc 230928-1745\ Masterarbeit\ draft.md -o master_thesis.html \
--from=gfm --mathjax --standalone
NB If I add CSS, it should be an absolute path:
It’d be cool to wrap examples in the same environment!
https://forum.obsidian.md/t/rendering-callouts-similarly-in-pandoc/40020:
-- https://forum.obsidian.md/t/rendering-callouts-similarly-in-pandoc/40020/6
--
local stringify = (require "pandoc.utils").stringify
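function BlockQuote (el)
  local start = el.content[1]
  if (start and start.t == "Para" and start.content[1] and
      start.content[1].t == "Str" and
      start.content[1].text:match("^%[!%w+%][-+]?$")) then
    -- extract the callout type, e.g. "NOTE" from "[!NOTE]-"
    local _, _, ctype = start.content[1].text:find("%[!(%w+)%]")
    el.content:remove(1)    -- drop the header paragraph from the quote body
    start.content:remove(1) -- drop the "[!TYPE]" marker; the rest is the title
    local div = pandoc.Div(el.content, {class = "callout"})
    div.attributes["data-callout"] = ctype:lower()
    -- extra parentheses keep only gsub's first return value (the string)
    div.attributes["title"] = (stringify(start.content):gsub("^ ", ""))
    return div
  else
    return el
  end
end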
Makes:
> [!NOTE]- callout Title
>
> callout content
into
::: {.callout data-callout="note" title="callout Title"}
callout content
:::
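To make the matching logic easier to poke at without running pandoc, here is a plain-text Python sketch of the same transformation. It works on raw strings rather than on pandoc's AST, so it is only an approximation of the filter (for instance, `\w` also matches underscores, unlike Lua's `%w`); the function name `callout_to_div` is made up.

```python
import re

# Roughly the same pattern as in the Lua filter, written as a Python regex.
CALLOUT_RE = re.compile(r"^\[!(\w+)\][-+]?$")


def callout_to_div(blockquote: str) -> str:
    """Convert an Obsidian callout blockquote into a pandoc fenced div (plain-text sketch)."""
    # strip the leading "> " quote markers
    lines = [re.sub(r"^> ?", "", ln) for ln in blockquote.strip().splitlines()]
    marker, _, title = lines[0].partition(" ")
    match = CALLOUT_RE.match(marker)
    if not match:
        return blockquote  # not a callout: leave untouched
    body = "\n".join(lines[1:]).strip()
    ctype = match.group(1).lower()
    return f'::: {{.callout data-callout="{ctype}" title="{title.strip()}"}}\n{body}\n:::'


example = "> [!NOTE]- callout Title\n>\n> callout content"
print(callout_to_div(example))
```

For the example callout this prints the same fenced div as shown above.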
.callout {
color: red; /* Set text color to red */
border: 1px solid red; /* Optional: add a red border */
padding: 10px; /* Optional: add some padding */
/* Add any other styling as needed */
}
Then this makes it pretty HTML:
pandoc callout.md -L luas/obsidian-callouts.lua -t markdown -s | pandoc --standalone -o some_test.html --css luas/callout-style.css
<div class="callout" data-callout="note" title="callout Title">
<p>callout content</p>
</div>
For PDF: .. it’s more complex, will need such a header file etc. later on. TODO
\usepackage{xcolor} % Required for color definition
\newenvironment{callout}{
\color{red} % Sets the text color to red within the environment
% Add any other formatting commands here
}{}
--css /abs/tufte.css
!Copied executables to /home/sh/.local/bin/:
aha so that’s where you put your filters, inside $PATH
Damn! Just had to replace index.md with my thesis, then make all and it just …worked. Wow.
Apparently to make it not a sidenote I just have to add - to the footnote itself. Would be trivial to replace with an @ etc., then I get my initial plan - citations as citations and footnotes with my remarks as sidenotes.
I can add --from gfm --mathjax to the makefile command and it works with all my other requirements!
pandoc \
--katex \
--section-divs \
--from gfm \
--mathjax \
--filter pandoc-sidenote \
--to html5+smart \
--template=tufte \
--css tufte.css --css pandoc.css --css pandoc-solarized.css --css tufte-extra.css \
--output docs/tufte-md/index.html \
docs/tufte-md/index.md
I wonder if I can modify it to create latex-style sidenotes, it should be very easy: pandoc-sidenote/src/Text/Pandoc/SideNote.hs at master · jez/pandoc-sidenote
![Caption](file.ext){#fig:label}
$$ math $$ {#eq:label}
Section {#sec:section}
TODO figure out, and latex as well.
TODO
Current best:
eng: the man<sub>NOM.SG</sub> saw the dog<sub>NOM.SG</sub>
ukr: чоловік<sub>man-NOM.SG</sub> побачив<sub>saw-PST</sub> собаку<sub>dog-ACC.SG</sub>
I’d love to integrate the usual UD feats bits but they take a lot of space, and it’s either latex magic or one word per line.
ukr: чоловік(man): Case=Nom|Number=Sing побачив(saw) собаку(dog): Case=Acc|Number=Sing
$чоловік^{man}_{Case=Nom|Number=Sing}$
${\underset{man}{чоловік}}^{Case=Nom|Number=Sing}$
$\underset{Case=Nom|Number=Sing}{чоловік^{man}}$
$\underset{NOM.SG}{чоловік^{man}}$
${\underset{man}{чоловік}}^{Case=Nom|Number=Sing}$
${\underset{man}{чоловік}}^{NOM.SG}$
${\underset{man}{чоловік}}^{NOM.SG}$ ${\underset{saw}{побачив}}$ ${\underset{dog}{собаку}}^{ACC.SG}$
я I Case=Nom|Number=Sing
побачив saw
собаку dog Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing
ukr: чоловік<sub>man-NOM.SG</sub> побачив<sub>saw-PST</sub> собаку<sub>dog-ACC.SG</sub>
${\underset{man}{чоловік}}$ Case=Nom|Number=Sing ${\underset{man}{чоловік}}$ Case=Nom|Number=Sing
I think this is cool! But hell to write and parse:
$\underset{\text{NOUN.NOM}}{\overset{\text{man}}{\text{чоловік-}\varnothing}}$ $\underset{\text{PST}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{NOUN-ACC}}{\overset{\text{dog}}{\text{собак-у}}}$.
Let’s play more with it:
$\underset{\text{Case=Nom|Number=Sing}}{\overset{\text{man }}{\text{чоловік}}}$ $\underset{\text{}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{Case=Acc|Number=Sing}}{\overset{\text{dog}}{\text{собаку}}}$.
I can split it in diff lines: $\underset{\text{Case=Nom|Number=Sing}}{\overset{\text{man }}{\text{чоловік}}} \underset{\text{}}{\overset{\text{saw}}{\text{побачив}}} \underset{\text{Case=Acc|Number=Sing}}{\overset{\text{dog}}{\text{собаку}}}$.
$$\underset{\text{Case=Nom|Number=Sing}}{\overset{\text{man }}{\text{ЧОЛОВІК}}} \underset{\text{}}{\overset{\text{saw}}{\text{ПОБАЧИВ}}} \underset{\text{Case=Acc|Number=Sing}}{\overset{\text{dog}}{\text{СОБАКУ}}}$$
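Since this notation is hell to write by hand, a small generator helps. The sketch below builds the same `\underset`/`\overset` stacking from (word, translation, features) triples; the function names are made up for this note.

```python
def gloss_token(word: str, translation: str, features: str = "") -> str:
    """Stack the translation above and the features below a word, as in the formulas above."""
    return (
        r"\underset{\text{" + features + r"}}"
        + r"{\overset{\text{" + translation + r"}}"
        + r"{\text{" + word + r"}}}"
    )


def gloss_sentence(tokens) -> str:
    """Render every token as its own inline-math expression, separated by spaces."""
    return " ".join("$" + gloss_token(*tok) + "$" for tok in tokens) + "."


sentence = [
    ("чоловік", "man", "Case=Nom|Number=Sing"),
    ("побачив", "saw", ""),
    ("собаку", "dog", "Case=Acc|Number=Sing"),
]
print(gloss_sentence(sentence))
```

Keeping one `$...$` per token lets the line wrap between words; joining everything into a single `$$...$$` would give the display-math variant instead.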
ukr: використовуватимуться Aspect=Imp|Number=Plur|Person=3
1 використовуватимуться використовуватися VERB _ Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin 0 root _ SpaceAfter=No
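The FEATS column of a CoNLL-U line is just `Key=Value` pairs joined by `|`, so shortening it to the bits worth showing is easy to script. A minimal sketch; which features to keep is my arbitrary choice here, picked to match the short form above.

```python
def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats in ("", "_"):  # '_' marks an empty field in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def short_gloss(feats: dict, keep=("Aspect", "Number", "Person")) -> str:
    """Keep only a (hypothetical) subset of features to save space."""
    return "|".join(f"{k}={feats[k]}" for k in keep if k in feats)


feats = parse_feats("Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin")
print(short_gloss(feats))
```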
ukr: використовуватимуть-ся<sub>VERB-REFL</sub>
ukr: використовуватимуть<sub>VERB</sub> -ся<sub>REFL</sub>
$\underset{\text{NOM.SG}}{\overset{\text{man }}{\text{чоловік}}}$ $\underset{\text{PST}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{SG-ACC}}{\overset{\text{dog}}{\text{собак-у}}}$.
This is one of the cooler ones, I’ll use it if I ever need to: Examples — graphviz 0.20.1 documentation
It’s also supported by HackMD! How to use MathJax & UML - HackMD
I’ll need something like overleaf for my markdown thesis.
5 Best Collaborative Online Markdown Editors - TechWiser
Interlinear glosses are a way to annotate the grammatical parts of a language together with a translation: Interlinear gloss - Wikipedia
The Leipzig Glossing Rules are a set of rules to standardize interlinear glosses. They are focused less on understandability and more on consistency.
<span style="font-variant:small-caps;">Hello World</span>
[^1]I’m writing my thesis in Obsidian/Markdown, synced to Hugo, later I’ll use sth like pandoc to make it into a PDF, with or without a latex intermediate step.
EDIT: newer technical part lives now here 231226-1702 Ideas for annotating glosses in my Masterarbeit
cysouw/pandoc-ling: Pandoc Lua filter for linguistic examples
> pandoc --lua-filter=pandoc_ling.lua 231225-2240\ Glosses\ markdown\ magic.pandoc.md -o test.pdf
Error running filter pandoc_ling.lua:
pandoc_ling.lua:21: attempt to call a nil value (method 'must_be_at_least')
stack traceback:
pandoc_ling.lua:21: in main chunk
:::ex
| Dutch (Germanic)
| Deze zin is in het nederlands.
| DEM sentence AUX in DET dutch.
| This sentence is dutch.
:::
.. it was the pandoc version. Updated. No error, but no luck either.
Digging into the examples I think this is happening:
Code is code. Using that formatting without code makes it be interpreted as a line, and that doesn’t survive the obsidian’s pandoc extensions’ conversion to pandoc markdown.
The original docu generation had this script:
function addRealCopy (code)
  return { code, pandoc.RawBlock("markdown", code.text) }
end
return {
  { CodeBlock = addRealCopy }
}
It turns every code block into the code block itself followed by a raw-markdown copy of its content. The ::: block therefore ends up after the code as normal markdown text, and gets correctly converted by the pandoc-ling filter.
> pandoc 231225-2240\ Glosses\ markdown\ magic.pandoc.md -t markdown -L processVerbatim.lua -s --wrap=preserve | pandoc -L pandoc_ling.lua -o my.html
This works:
> pandoc "garden/it/231225-2240 Glosses markdown magic.md" -t markdown -L pandoc_ling.lua -s
> pandoc "garden/it/231225-2240 Glosses markdown magic.md" -L pandoc_ling.lua -o my.html
and is OK if my masterarbeit file will have no complexities at all.
(Can i add this as parameter to the existing bits?)
YES!
-L /home/sh/t/pandoc/pandoc_ling.lua
added as option to the pandoc plugins, together with “from markdown” (not HTML) option, works for getting this parsed right!
(Except that it’s ugly in the HTML view but I can live with that)
And Hugo. Exporting to Hugo through obyde is ugly as well.
I could write sth like this: A Pandoc Lua filter to convert Callout Blocks to Hugo admonitions (shortcode).
Mijyuoon/obsidian-ling-gloss: An Obsidian plugin for interlinear glosses used in linguistics texts.
Pandoc export from HTML visualizes them quite well.
\gla Péter-nek van egy macská-ja
\glb pe:tɛrnɛk vɒn ɛɟ mɒt͡ʃka:jɒ
\glc Peter-DAT exist INDEF cat-POSS.3SG
\ft Peter has a cat.
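To see how the levels line up word by word, here is a rough Python sketch that parses such a block. It is not how the plugin is implemented, just my reading of the format: `\gla`, `\glb` and `\glc` are whitespace-separated levels and `\ft` is the free translation.

```python
def parse_gloss_block(text: str) -> dict:
    """Split a gla/glb/glc/ft block into aligned tokens plus the free translation."""
    levels, free = {}, ""
    for line in text.strip().splitlines():
        command, _, rest = line.strip().partition(" ")
        if command == r"\ft":
            free = rest
        elif command in (r"\gla", r"\glb", r"\glc"):
            levels[command] = rest.split()
    # zip() pairs the n-th word of every level with each other
    tokens = list(zip(*(levels[key] for key in sorted(levels))))
    return {"tokens": tokens, "translation": free}


block = r"""
\gla Péter-nek van egy macská-ja
\glb pe:tɛrnɛk vɒn ɛɟ mɒt͡ʃka:jɒ
\glc Peter-DAT exist INDEF cat-POSS.3SG
\ft Peter has a cat.
"""
parsed = parse_gloss_block(block)
for token in parsed["tokens"]:
    print(token)
print(parsed["translation"])
```

Note that `zip()` silently truncates to the shortest level, so a missing gloss for one word would go unnoticed in this sketch.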
\set glastyle cjk
\ex 牆上掛著一幅畫 / 墙上挂着一幅画
\gl 牆 [墙] [qiáng] [wall] [^[TOP]
上 [上] [shàng] [on] [^]]
掛 [挂] [guà] [hang] [V]
著 [着] [zhe] [CONT] [ASP]
一 [一] [yì] [one] [^[S]
幅 [幅] [fú] [picture.CL] []
畫 [画] [huà] [picture] [^]]
\ft A picture is hanging on the wall.
function addRealCopy (code)
  -- return { code, pandoc.RawBlock("markdown", code.text) }
  if code.classes[1] == "mygloss" then
    return { pandoc.RawBlock("markdown", code.text) }
  else
    return { code }
  end
end
return {
  { CodeBlock = addRealCopy }
}
Should parse:
:::ex
| Dutch (Germanic)
| Deze zin is in het nederlands.
| DEM sentence AUX in DET dutch.
| This sentence is dutch.
:::
Should stay as code:
:::ex
| Dutch (Germanic)
| Deze zin is in het nederlands.
| DEM sentence AUX in DET dutch.
| This sentence is dutch.
:::
pandoc "/... arden/it/231225-2240 Glosses markdown magic.md" -L processVerbatim.lua -t markdown -s | pandoc -L pandoc_ling.lua -o my.html
It works!
But not this:
> pandoc "/home231225-2240 Glosses markdown magic.md" -L processVerbatim.lua -L pandoc_ling.lua -o my.html
Likely because both require markdown and the intermediate step seems to break.
Maybe I’m overcomplicating it and I can just use the UD features as superscripts!
The inflectional paradigm of Ukrainian admits free word order: in English the Subject-Verb-Object word order in “the man<sup>man-NOM.SG</sup> saw the dog<sup>dog-NOM.SG</sup>” (vs “the dog<sup>dog-NOM.SG</sup> saw the man<sup>man-NOM.SG</sup>”) determines who saw whom, while in Ukrainian (“чоловік<sup>man-NOM.SG</sup> побачив<sup>saw-PST</sup> собакУ<sup>dog-ACC.SG</sup>”) the ending -у of the last word (dog) marks the accusative case, and therefore the object.