serhii.net

In the middle of the desert you can say anything you want

28 Sep 2023

Random side quests about the Masterarbeit

UA-RU parallel corpus

pravda.com.ua¹ has articles in three languages:

The difference seems to be only in that one part of the URL!

Available per article: text, title, tags, date, author.

Then article title+classification might be one of the benchmark tasks!
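A minimal sketch of how the fields listed above could be turned into such a task, under one possible reading: predict an article's tags from its title. The `Article` class, its field names, and `title_classification_examples` are hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Article:
    """One scraped article; field names are hypothetical."""
    text: str
    title: str
    tags: List[str] = field(default_factory=list)
    published: Optional[date] = None
    author: Optional[str] = None


def title_classification_examples(
    articles: Iterable[Article],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (title, tags) pairs; articles without tags cannot be labelled and are skipped."""
    for article in articles:
        if article.tags:
            yield article.title, article.tags
```

Since an article can carry several tags, this would be a multi-label task; picking a single label per article would need an extra rule that the notes do not specify.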

Is there anything stopping me from scraping the hell out of all of it?

Google finds 50k articles in /eng/ and 483k in /rus/. Assumption: all English articles were also translated to Russian.

=> For each English article, try to get the Russian and Ukrainian versions from the URI.
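A rough sketch of that URL rewriting, with no network access. It assumes the language is the first path segment, `/eng/` or `/rus/` as seen above, and that the Ukrainian version has no language prefix at all; that last part is an assumption to check against the real site. `LANG_PREFIXES` and `language_variants` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: language = first path segment. "/eng/" and "/rus/" come from the
# notes; the empty prefix for Ukrainian is a guess that needs verifying.
LANG_PREFIXES = {"en": "eng", "ru": "rus", "uk": ""}


def language_variants(english_url: str) -> dict:
    """Map each language code to the candidate URL of that version of the article."""
    parts = urlsplit(english_url)
    segments = parts.path.split("/")  # "/eng/news/x/" -> ["", "eng", "news", "x", ""]
    if len(segments) < 2 or segments[1] != LANG_PREFIXES["en"]:
        raise ValueError(f"not an English article URL: {english_url}")
    rest = segments[2:]  # everything after the language segment
    variants = {}
    for lang, prefix in LANG_PREFIXES.items():
        new_segments = [""] + ([prefix] if prefix else []) + rest
        variants[lang] = urlunsplit(parts._replace(path="/".join(new_segments)))
    return variants
```

The returned URLs are only candidates: each would still have to be fetched, and an article with no translation would show up as a 404 or a redirect.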

Related: ua-datasets/ua_datasets/src/text_classification at main · fido-ai/ua-datasets

Related: facebook/flores · Datasets at Hugging Face, from Wikinews in infinite languages including UA!

Somehow magically use WikiData

How does alignment/censoring work with UA?

E.g. could other languages help with that?²
