In the middle of the desert you can say anything you want
Current best:
eng: the man<sup>NOM.SG</sup> saw the dog<sup>NOM.SG</sup>
ukr: чоловік<sup>man-NOM.SG</sup> побачив<sup>saw-PST</sup> собаку<sup>dog-ACC.SG</sup>
I’d love to integrate the usual UD feats bits but they take a lot of space, and it’s either latex magic or one word per line.
ukr: чоловік(man): Case=Nom|Number=Sing побачив(saw) собаку(dog): Case=Acc|Number=Sing
$чоловік^{man}_{Case=Nom|Number=Sing}$
${\underset{man}{чоловік}}^{Case=Nom|Number=Sing}$
$\underset{Case=Nom|Number=Sing}{чоловік^{man}}$
$\underset{NOM.SG}{чоловік^{man}}$
${\underset{man}{чоловік}}^{Case=Nom|Number=Sing}$
${\underset{man}{чоловік}}^{NOM.SG}$
${\underset{man}{чоловік}}^{NOM.SG}$ ${\underset{saw}{побачив}}$ ${\underset{dog}{собаку}}^{ACC.SG}$
я I Case=Nom|Number=Sing
побачив saw
собаку dog Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing
ukr: чоловік<sup>man-NOM.SG</sup> побачив<sup>saw-PST</sup> собаку<sup>dog-ACC.SG</sup>
${\underset{man}{чоловік}}$ Case=Nom|Number=Sing
I think this is cool! But hell to write and parse:
$\underset{\text{NOUN.NOM}}{\overset{\text{man}}{\text{чоловік-}\varnothing}}$ $\underset{\text{PST}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{NOUN-ACC}}{\overset{\text{dog}}{\text{собак-у}}}$.
Let’s play more with it:
$\underset{\text{Case=Nom|Number=Sing}}{\overset{\text{man }}{\text{чоловік}}}$ $\underset{\text{}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{Case=Acc|Number=Sing}}{\overset{\text{dog}}{\text{собаку}}}$.
I can split it in diff lines: $\underset{\text{Case=Nom|Number=Sing}}{\overset{\text{man }}{\text{чоловік}}} \underset{\text{}}{\overset{\text{saw}}{\text{побачив}}} \underset{\text{Case=Acc|Number=Sing}}{\overset{\text{dog}}{\text{собаку}}}$.
$$\underset{\text{Case=Nom|Number=Sing}}{\overset{\text{man }}{\text{ЧОЛОВІК}}} \underset{\text{}}{\overset{\text{saw}}{\text{ПОБАЧИВ}}} \underset{\text{Case=Acc|Number=Sing}}{\overset{\text{dog}}{\text{СОБАКУ}}}$$
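The \underset/\overset pattern repeats for every word, so a small macro could shorten it. A sketch (the \glw name is made up; MathJax accepts \newcommand inside a math block, while real LaTeX would want it in the preamble plus a font setup that handles Cyrillic in \text{}):

```latex
% hypothetical helper: \glw{word}{translation}{tags}
\newcommand{\glw}[3]{\underset{\text{#3}}{\overset{\text{#2}}{\text{#1}}}}
% the example sentence then becomes:
$\glw{чоловік}{man}{NOM.SG}\ \glw{побачив}{saw}{PST}\ \glw{собаку}{dog}{ACC.SG}$
```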
ukr: використовуватимуться Aspect=Imp|Number=Plur|Person=3
1 використовуватимуться використовуватися VERB _ Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin 0 root _ SpaceAfter=No
ukr: використовуватимуть-ся<sup>VERB-REFL</sup>
ukr: використовуватимуть<sup>VERB</sup> -ся<sup>REFL</sup>
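Generating those compact tags from the FEATS column instead of writing them by hand should be easy; a minimal Python sketch, where the UD-to-Leipzig mapping is my own ad-hoc subset rather than any official table:

```python
# Compress a UD FEATS string (e.g. from CoNLL-U) into a short Leipzig-style tag
# like "NOM.SG". The mapping below is an ad-hoc subset, not an official table.
FEAT_MAP = {
    ("Case", "Nom"): "NOM",
    ("Case", "Acc"): "ACC",
    ("Case", "Gen"): "GEN",
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Tense", "Past"): "PST",
    ("Tense", "Fut"): "FUT",
    ("Person", "3"): "3",
    ("Aspect", "Imp"): "IPFV",
}

def compress_feats(feats: str) -> str:
    """Turn 'Case=Nom|Number=Sing' into 'NOM.SG'; unknown feats are skipped."""
    tags = []
    for pair in feats.split("|"):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        tag = FEAT_MAP.get((key, value))
        if tag:
            tags.append(tag)
    return ".".join(tags)

print(compress_feats("Case=Nom|Number=Sing"))  # NOM.SG
print(compress_feats("Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin"))  # IPFV.PL.3.FUT
```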
$\underset{\text{NOM.SG}}{\overset{\text{man }}{\text{чоловік}}}$ $\underset{\text{PST}}{\overset{\text{saw}}{\text{побачив}}}$ $\underset{\text{SG-ACC}}{\overset{\text{dog}}{\text{собак-у}}}$.
This is one of the cooler ones, I’ll use it if I ever need to: Examples — graphviz 0.20.1 documentation
It’s also supported by HackMD! How to use MathJax & UML - HackMD
I’ll need something like overleaf for my markdown thesis.
5 Best Collaborative Online Markdown Editors - TechWiser
Interlinear glosses are a way to annotate the grammar bits of a language together with a translation: Interlinear gloss - Wikipedia
The Leipzig Glossing Rules are a set of rules to standardize interlinear glosses. They are focused less on understandability and more on consistency.
<span style="font-variant:small-caps;">Hello World</span>
[^1]: I’m writing my thesis in Obsidian/Markdown, synced to Hugo; later I’ll use something like pandoc to turn it into a PDF, with or without a LaTeX intermediate step.
EDIT: the newer technical part now lives here: 231226-1702 Ideas for annotating glosses in my Masterarbeit
cysouw/pandoc-ling: Pandoc Lua filter for linguistic examples
> pandoc --lua-filter=pandoc_ling.lua 231225-2240\ Glosses\ markdown\ magic.pandoc.md -o test.pdf
Error running filter pandoc_ling.lua:
pandoc_ling.lua:21: attempt to call a nil value (method 'must_be_at_least')
stack traceback:
pandoc_ling.lua:21: in main chunk
:::ex
| Dutch (Germanic)
| Deze zin is in het nederlands.
| DEM sentence AUX in DET dutch.
| This sentence is dutch.
:::
:::ex | Dutch (Germanic) | Deze zin is in het nederlands. | DEM sentence AUX in DET dutch. | This sentence is dutch. :::
.. it was the pandoc version. Updated. No error, but no luck either.
Digging into the examples I think this is happening:
Code is code. Using the :::ex formatting outside a code block makes it get interpreted as a single line, and that doesn’t survive the Obsidian pandoc extension’s conversion to pandoc markdown.
The original documentation generation had this script:
    -- Emit every code block twice: once as the code block itself, once as its raw
    -- text as markdown (so pandoc-ling can later see the :::ex blocks inside it).
    function addRealCopy(code)
      return { code, pandoc.RawBlock("markdown", code.text) }
    end

    return {
      { CodeBlock = addRealCopy }
    }
It replaces each code block with the code block itself plus its content as a raw markdown block. The :::ex block then ends up after the code as normal markdown text, and it gets correctly transformed by the pandoc-ling filter.
> pandoc 231225-2240\ Glosses\ markdown\ magic.pandoc.md -t markdown -L processVerbatim.lua -s --wrap=preserve | pandoc -L pandoc_ling.lua -o my.html
This works:
> pandoc "garden/it/231225-2240 Glosses markdown magic.md" -t markdown -L pandoc_ling.lua -s
> pandoc "garden/it/231225-2240 Glosses markdown magic.md" -L pandoc_ling.lua -o my.html
and is OK as long as my Masterarbeit file has no complexities at all.
(Can I add this as a parameter to the existing bits?)
YES!
-L /home/sh/t/pandoc/pandoc_ling.lua
Added as an option to the pandoc plugin, together with the “from markdown” (not HTML) option, it works for getting this parsed right!
(Except that it’s ugly in the HTML view but I can live with that)
And Hugo. Exporting to Hugo through obyde is ugly as well.
I could write something like this: A Pandoc Lua filter to convert Callout Blocks to Hugo admonitions (shortcode).
Mijyuoon/obsidian-ling-gloss: An Obsidian plugin for interlinear glosses used in linguistics texts.
Pandoc export from HTML visualizes them quite well.
\gla Péter-nek van egy macská-ja
\glb pe:tɛrnɛk vɒn ɛɟ mɒt͡ʃka:jɒ
\glc Peter-DAT exist INDEF cat-POSS.3SG
\ft Peter has a cat.
\set glastyle cjk
\ex 牆上掛著一幅畫 / 墙上挂着一幅画
\gl 牆 [墙] [qiáng] [wall] [^[TOP]
上 [上] [shàng] [on] [^]]
掛 [挂] [guà] [hang] [V]
著 [着] [zhe] [CONT] [ASP]
一 [一] [yì] [one] [^[S]
幅 [幅] [fú] [picture.CL] []
畫 [画] [huà] [picture] [^]]
\ft A picture is hanging on the wall.
    function addRealCopy(code)
      -- return { code, pandoc.RawBlock("markdown", code.text) }
      -- Only code blocks tagged with the "mygloss" class get replaced by their raw
      -- markdown text (so pandoc-ling parses them); everything else stays a code block.
      if code.classes[1] == "mygloss" then
        return { pandoc.RawBlock("markdown", code.text) }
      else
        return { code }
      end
    end

    return {
      { CodeBlock = addRealCopy }
    }
Should parse (a code block with the mygloss class):
:::ex
| Dutch (Germanic)
| Deze zin is in het nederlands.
| DEM sentence AUX in DET dutch.
| This sentence is dutch.
:::
Should stay as code (a plain code block):
:::ex
| Dutch (Germanic)
| Deze zin is in het nederlands.
| DEM sentence AUX in DET dutch.
| This sentence is dutch.
:::
pandoc "/... arden/it/231225-2240 Glosses markdown magic.md" -L processVerbatim.lua -t markdown -s | pandoc -L pandoc_ling.lua -o my.html
It works!
But not this:
> pandoc "/home231225-2240 Glosses markdown magic.md" -L processVerbatim.lua -L pandoc_ling.lua -o my.html
Likely because both filters require markdown input, and the intermediate step seems to break when they are chained.
Maybe I’m overcomplicating it and for the UD feats I can just use superscripts!
The inflectional paradigm of Ukrainian admits free word order: in English, the Subject-Verb-Object word order in “the man<sup>NOM.SG</sup> saw the dog<sup>NOM.SG</sup>” (vs. “the dog<sup>NOM.SG</sup> saw the man<sup>NOM.SG</sup>”) determines who saw whom, while in Ukrainian (“чоловік<sup>man-NOM.SG</sup> побачив<sup>saw-PST</sup> собакУ<sup>dog-ACC.SG</sup>”) the last letter of the object (dog) marks it as accusative, and therefore as the object.
Related: 231220-1232 GBIF iNaturalist plantNet duplicates
KilianB/JImageHash: Perceptual image hashing library used to match similar images; it hashes based on image content, not bytes (unlike SHA1 and friends).
Hashing Algorithms · KilianB/JImageHash Wiki is a cool visual explanation of the algos involved.
Kind of Like That - The Hacker Factor Blog is a benchmark thing, TL;DR
One of the comments suggests running a quick one with many FPs first and then a slower one on the problematic detected images.
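That two-stage idea could look roughly like this in Python with the imagehash library (not the Java JImageHash above); the thresholds and the aHash-then-pHash choice are my own placeholder assumptions:

```python
# Stage 1: cheap average hash with a generous threshold (many false positives),
# Stage 2: slower perceptual hash only on the flagged pairs, stricter threshold.
from pathlib import Path
from itertools import combinations

import imagehash
from PIL import Image

def find_duplicate_candidates(folder: str):
    images = sorted(Path(folder).glob("*.jpg"))
    ahashes = {p: imagehash.average_hash(Image.open(p)) for p in images}
    candidates = [
        (a, b) for a, b in combinations(images, 2)
        if ahashes[a] - ahashes[b] <= 10  # Hamming distance, loose cutoff
    ]
    duplicates = []
    for a, b in candidates:
        if imagehash.phash(Image.open(a)) - imagehash.phash(Image.open(b)) <= 4:
            duplicates.append((a, b))
    return duplicates

if __name__ == "__main__":
    for a, b in find_duplicate_candidates("observations/"):
        print(f"probable duplicate: {a} <-> {b}")
```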
Finally, you can totally upload the same picture multiple times if there’s multiple organisms in one picture that you would like identified - you want to have a separate observation for each organism. Usually if I do this I’ll make a note in the description of what I’m looking to have IDed.
https://www.inaturalist.org/posts/28325-tech-tip-tuesday-duplicating-observations: “perfectly okay to duplicate observations”, because species in the background etc.
https://forum.inaturalist.org/t/duplicate-obervations/18378/4
Random
just came across a user who repeatedly submits pairs of some robber fly photo weeks apart. (https://forum.inaturalist.org/t/create-a-flag-category-for-duplicate-observations/29647/42)
- remove shared queries (already present in observation dataset)
- remove duplicate session (keep the most recent query based on the session number)
associatedOccurrences
for kinda similar ones, incl. by parsing the descriptions for mentions etc.: Darwin Core Quick Reference Guide - Darwin Core
GBIF has clustering of records that appear to be similar:
matching similar entries in individual fields across different datasets
curl https://api.gbif.org/v1/occurrence/4011664186 | jq -C | less
The response has isInCluster but nothing more. GBIF also has associatedOccurrences.
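A quick way to poke at those fields from Python, using the same record as the curl above (whether associatedOccurrences is present depends on the record, hence the .get defaults):

```python
# Fetch one GBIF occurrence and print the duplicate-related fields mentioned above.
import requests

def check_occurrence(occurrence_id: int) -> None:
    url = f"https://api.gbif.org/v1/occurrence/{occurrence_id}"
    record = requests.get(url, timeout=30).json()
    print("isInCluster:", record.get("isInCluster"))
    print("associatedOccurrences:", record.get("associatedOccurrences", "<not present>"))

check_occurrence(4011664186)
```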
Darwin Core Resource Relationship – Extension: the Darwin Core documentation bit about relationships between records.
Duplicate occurrence records - Data Publishing - GBIF community forum
I’m not aware of any backend or external packages (e.g. in R or Python) that can tidy a Darwin Core dataset
Fortunately or unfortunately, Darwin Core datasets are complex beasts that don’t lend themselves to automated checking and fixing. For this reason people (not backend routines) are the best Darwin Core data cleaners. The code recipes I use are freely available on the Web and I (and now others) are happy to train others in their use.
Duplicate observations across datasets - GBIF community forum
Something that I often hear repeated on iNaturalist and BugGuide is that posting the same observation on both platforms results in the observation being ingested twice by GBIF.
Look for GBIF/iNat/plantNet repos on Github and look at their mentions of duplicates.
Core plugins -> Outline!
Usually spaCy models are installed with python -m spacy download de_core_news_sm.
For poetry: python - How to download spaCy models in a Poetry managed environment - Stack Overflow
TL;DR: spacy models are python packages!
Get direct link to model packages here: uk · Releases · explosion/spacy-models
Add to poetry tool dependencies in pyproject.toml:
[tool.poetry.dependencies]
python = "^3.10"
# ...
uk_core_news_sm = {url = "https://github.com/explosion/spacy-models/releases/download/uk_core_news_sm-3.7.0/uk_core_news_sm-3.7.0-py3-none-any.whl"}
Or add it through the poetry CLI:
poetry add https://github.com/explosion/spacy-models/releases/download/uk_core_news_sm-3.7.0/uk_core_news_sm-3.7.0-py3-none-any.whl
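After that the model imports like any other package; a quick sanity check (the example sentence is mine):

```python
# Load the poetry-installed Ukrainian model and print the morphological feats
# (the same UD bits used in the glosses note above).
import spacy

nlp = spacy.load("uk_core_news_sm")
doc = nlp("Чоловік побачив собаку.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.morph)
```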
I’d usually do ps aux.
The ps command can also do:
    ps -ef --forest
But the best one is pstree from the psmisc package:
    pstree
    # or
    pstree -i # for process ids
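If the same view is ever needed from inside Python, psutil can walk the process tree too; a sketch (psutil is a third-party dependency, unrelated to psmisc):

```python
# Recursively print pid and name for a process and its children.
import psutil

def print_tree(proc: psutil.Process, indent: int = 0) -> None:
    print(" " * indent + f"{proc.pid} {proc.name()}")
    for child in proc.children():
        print_tree(child, indent + 2)

# Start from PID 1 (init/systemd), like pstree does by default.
print_tree(psutil.Process(1))
```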
231213-1710 Ukrainska Pravda dataset#Can I also use this to generate tasks for the UA-CBT ( 231024-1704 Master thesis task CBT ) task?: both GPT-3.5 and GPT-4 use definitely Russian-inspired phrases during summarization:
In the news summarization bit, it magically changed Євген->Евген (https://chat.openai.com/share/2f6cf1f3-caf5-4e55-9c1b-3dbd6b73ba29)
Та подивись, баране, як я виглядаю з цим стильним сурдутом (“Just look, you ram, how I look in this stylish frock coat”)
Вертить хвостиком і крутить рогами. Цап робить враження. (“It wags its little tail and twists its horns. The billy goat makes an impression.”)
(from 230928-1630 Ideas for Ukrainian LM eval tasks)
On the semantic front, exploit polysemy and homonymy differences. Formulate sentences with words that have multiple meanings in Russian, but those meanings have distinct equivalents in Ukrainian. This will challenge the model to accurately discern the intended sense based on context.
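Related to the Russian-inspired output above: a crude character-level check can at least flag letters that exist in Russian but not in Ukrainian (ё, ъ, ы, э). It would not catch the Євген->Евген change, since the resulting “Е” is a perfectly valid Ukrainian letter, so lexicon- or NER-level checks would still be needed. A sketch:

```python
# Flag Cyrillic letters that exist in Russian but not in the Ukrainian alphabet.
# A very crude first-pass check for Russian-inspired output; lexical calques
# (like "робить враження") pass it untouched.
RUSSIAN_ONLY_LETTERS = set("ёъыэЁЪЫЭ")

def russian_only_letters(text: str) -> list[tuple[int, str]]:
    """Return (position, letter) pairs for Russian-only Cyrillic letters."""
    return [(i, ch) for i, ch in enumerate(text) if ch in RUSSIAN_ONLY_LETTERS]

print(russian_only_letters("Этот цап робить враження"))  # [(0, 'Э')]
print(russian_only_letters("Євген побачив собаку"))      # [] - Євген/Евген both pass
```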