In the middle of the desert you can say anything you want
Related: 231220-1232 GBIF iNaturalist plantNet duplicates
KilianB/JImageHash: Perceptual image hashing library used to match similar images; it hashes based on image content, not on bytes (à la SHA1 and friends)
Hashing Algorithms · KilianB/JImageHash Wiki is a cool visual explanation of the algos involved.
Kind of Like That - The Hacker Factor Blog is a benchmark/comparison of these perceptual hash algorithms; TL;DR:
One of the comments suggests running a fast hash first (accepting many false positives) and then a slower, more precise one on the flagged images.
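A minimal sketch of that two-stage idea in Python, assuming the imagehash package (the linked JImageHash library is Java; this is an equivalent, not its API):

```python
import imagehash
from PIL import Image

def probably_duplicates(path_a: str, path_b: str, threshold: int = 8) -> bool:
    # Fast perceptual hash with many false positives; small Hamming distance = likely same content.
    if imagehash.average_hash(Image.open(path_a)) - imagehash.average_hash(Image.open(path_b)) > threshold:
        return False
    # Slower, more precise hash only on the pairs flagged by the fast pass.
    return imagehash.phash(Image.open(path_a)) - imagehash.phash(Image.open(path_b)) <= threshold
```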
Finally, you can totally upload the same picture multiple times if there’s multiple organisms in one picture that you would like identified - you want to have a separate observation for each organism. Usually if I do this I’ll make a note in the description of what I’m looking to have IDed.
https://www.inaturalist.org/posts/28325-tech-tip-tuesday-duplicating-observations: “perfectly okay to duplicate observations”, because species in the background etc.
https://forum.inaturalist.org/t/duplicate-obervations/18378/4
Random
just came across a user who repeatedly submits pairs of some robber fly photo weeks apart. (https://forum.inaturalist.org/t/create-a-flag-category-for-duplicate-observations/29647/42)
- remove shared queries (already present in observation dataset)
- remove duplicate sessions (keep the most recent query based on the session number)
associatedOccurrences
for kinda similar ones, incl. by parsing the descriptions for mentions etc.: Darwin Core Quick Reference Guide - Darwin Core
GBIF has clustering of records that appear to be similar:
matching similar entries in individual fields across different datasets
curl https://api.gbif.org/v1/occurrence/4011664186 | jq -C | less
has isInCluster
but nothing more.
GBIF has associatedOccurrences
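A quick Python equivalent of the curl check above (field names as seen in that JSON response, nothing more assumed):

```python
import requests

# Fetch the same occurrence as in the curl example and look at the clustering-related fields.
record = requests.get("https://api.gbif.org/v1/occurrence/4011664186", timeout=30).json()
print(record.get("isInCluster"))
print(record.get("associatedOccurrences"))
```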
Darwin Core Resource Relationship – Extension: Darwin Core documentation about relationships between records
Duplicate occurrence records - Data Publishing - GBIF community forum
I’m not aware of any backend or external packages (e.g. in R or Python) that can tidy a Darwin Core dataset
Fortunately or unfortunately, Darwin Core datasets are complex beasts that don’t lend themselves to automated checking and fixing. For this reason people (not backend routines) are the best Darwin Core data cleaners. The code recipes I use are freely available on the Web and I (and now others) are happy to train others in their use.
Duplicate observations across datasets - GBIF community forum
Something that I often hear repeated on iNaturalist and BugGuide is that posting the same observation on both platforms results in the observation being ingested twice by GBIF.
Look for GBIF/iNat/plantNet repos on GitHub and look at their mentions of duplicates
Core plugins -> Outline!
Usually models are added as python -m spacy download de_core_news_sm
For poetry: python - How to download spaCy models in a Poetry managed environment - Stack Overflow
TL;DR: spacy models are python packages!
Get direct link to model packages here: uk · Releases · explosion/spacy-models
Add to poetry tool dependencies in pyproject.toml:
[tool.poetry.dependencies]
python = "^3.10"
# ...
uk_core_news_sm = {url = "https://github.com/explosion/spacy-models/releases/download/uk_core_news_sm-3.7.0/uk_core_news_sm-3.7.0-py3-none-any.whl"}
add through poetry CLI:
poetry add https://github.com/explosion/spacy-models/releases/download/uk_core_news_sm-3.7.0/uk_core_news_sm-3.7.0-py3-none-any.whl
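Either way, the model then loads like any regular installed package; a minimal check (assuming the uk_core_news_sm wheel from above is installed):

```python
import spacy

# The model is now an importable package, so spacy.load() finds it by name.
nlp = spacy.load("uk_core_news_sm")
doc = nlp("Київ - столиця України.")
print([(t.text, t.pos_) for t in doc])
```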
I’d usually do ps aux.
The ps command can also do:
ps -ef --forest
But the best one is pstree from the psmisc package.
pstree
# or
pstree -p # show PIDs
Used it 3 times already and it’s awesome.
— If I want your help with this in the future, which prompt can I use to describe the task I need and the output type, to get a graph in the format and complexity level of the one you just generated? How can I concisely describe it to you so that no clarifications are needed and you can just give me the answer?
— “Create an abstract graph structure for a story involving multiple characters with interconnected goals, challenges, outcomes, and a moral lesson. The graph should use nodes and relationships similar to the format of the ‘Adventurer and Guide’ mountain climbing story you previously created, with entities, goals, challenges, interactions, outcomes, and a moral lesson. The structure should reflect underlying themes rather than the literal narrative, similar to the complexity and abstraction level of the previous example.”
After more clarifications:
“Generate an abstract graph structure for a narrative involving multiple animate characters. The graph should include nodes for entities, goals, challenges, interactions, outcomes, and moral lessons. Each node should abstractly represent the core elements of the story, focusing on thematic and moral aspects rather than the literal narrative. The format should be similar to a semantic web ontology, emphasizing relationships and abstract concepts. Please provide the graph in a Python dictionary format, with complexity and depth akin to an advanced semantic network.”
Context: 231024-1704 Master thesis task CBT
231213-1710 Ukrainska Pravda dataset#Can I also use this to generate tasks for the UA-CBT ( 231024-1704 Master thesis task CBT ) task? : both GPT-3.5 and GPT-4 definitely use Russian-inspired phrases during summarization:
In the news summarization bit, it magically changed Євген->Евген (https://chat.openai.com/share/2f6cf1f3-caf5-4e55-9c1b-3dbd6b73ba29)
Та подивись, баране, як я виглядаю з цим стильним сурдутом (“Just look, you ram, how I look in this stylish frock coat”)
Вертить хвостиком і крутить рогами. Цап робить враження. (“Wags its tail and twists its horns. The billy goat makes an impression.”)
(from 230928-1630 Ideas for Ukrainian LM eval tasks)
On the semantic front, exploit polysemy and homonymy differences. Formulate sentences with words that have multiple meanings in Russian, but those meanings have distinct equivalents in Ukrainian. This will challenge the model to accurately discern the intended sense based on context.
This post describes the Ukrainska Pravda dataset I created as part of my Master’s Thesis. The contents of this blog post will be edited (esp. for brevity) and become part of the thesis (230928-1745 Masterarbeit draft).
A novel dataset created in the context of this Master’s Thesis is the Ukrainska Pravda multilingual dataset. The package written for this, UPCrawler, is released at https://github.com/pchr8/up_crawler under the MIT license.
The dataset is released on the HF Hub at https://huggingface.co/datasets/shamotskyi/ukr_pravda_2y / doi https://doi.org/10.57967/hf/1476 under the CC BY-NC 4.0 license.
Ukrainska Pravda (lit. “Ukrainian Truth”; https://www.pravda.com.ua/) is a Ukrainian online newspaper for a general readership, writing mostly about political and social topics.
In 2017, it was the eighth most-cited source in the Ukrainian Wikipedia1, and in 2020 it was the most visited online news website in Ukraine2 (TODO - better source). The Institute of Mass Information listed Ukrainska Pravda among the six online editions with the highest level of compliance with professional journalistic standards in 2021.3
UP (Ukrainska Pravda) publishes articles predominantly in Ukrainian, with some being translated to Russian and English. Each article can belong to zero or more “topics” (tags) that are mostly preserved across translations.
Each article has an article ID that is constant across translations.
The CLI interface expects a date range (using natural language, e.g. “last year”) and a target folder, where the pages are saved.
Initially, the package UPCrawler used the daily archive pages (e.g. https://www.pravda.com.ua/archives/date_27092023/) to get the URLs of articles published on a specific day, then for each article URL accessed the expected locations of the Russian and English translations to check if a translation exists. Later, I rewrote the code to use a much better solution: parsing the XML sitemaps (e.g. https://www.pravda.com.ua/sitemap/sitemap-2023-09.xml.gz) using the advertools Python package.
Sitemaps4 is an XML-based protocol used to inform search engines about the URLs available for crawling, as well as to provide additional information about them, such as when the page was last updated, how often the content changes, etc.
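A hedged sketch of that sitemap step with advertools (URL from the example above; only the column name "loc" is assumed, per advertools’ sitemap_to_df):

```python
import advertools as adv

# Download and parse one monthly UP sitemap into a DataFrame (one row per <url> entry).
sitemap_df = adv.sitemap_to_df("https://www.pravda.com.ua/sitemap/sitemap-2023-09.xml.gz")
print(sitemap_df["loc"].head())  # article URLs, to be parsed with the regex below
```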
The following regex (see https://regex101.com/r/dYlIiF/4 for an interactive analysis) is used to parse each URL to get the language of the article, the article ID, the section (news, podcasts, ..) etc.:
URI_REGEX_STR_EXT = r"(?P<uri>(?P<domain>.*\.com\.ua\/)(?P<lang>(eng)|(rus))?\/?(?P<kind>.*?)\/(?P<art_id>.*(?P<date_part>....\/..\/..?)\/(?P<id>.*)\/))"
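For illustration, applying the regex to a made-up article URL (the URL itself is hypothetical, the regex is the one above):

```python
import re

URI_REGEX_STR_EXT = r"(?P<uri>(?P<domain>.*\.com\.ua\/)(?P<lang>(eng)|(rus))?\/?(?P<kind>.*?)\/(?P<art_id>.*(?P<date_part>....\/..\/..?)\/(?P<id>.*)\/))"

# Hypothetical English-language news URL following the UP URL pattern.
match = re.match(URI_REGEX_STR_EXT, "https://www.pravda.com.ua/eng/news/2023/09/27/7421737/")
if match:
    print(match.group("lang"), match.group("kind"), match.group("id"))  # eng news 7421737
```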
Crawling the articles is done using the beautifulsoup4 library. I considered the alternative option of using the newspaper3k package, which was able to detect the article text, title and metadata from UP surprisingly well, but it incorrectly detected some fields (which would have required manual fixes anyway), so I decided to keep my from-scratch implementation.
For transparency and in the spirit of ethical crawling5, there were timeouts between requests, and the unique user agent contained a short explanation of my project as well as my email. At no point was I contacted or the crawler blocked.
The most challenging part was the tags. The URL of each tag contained a unique identifier that was consistent between translations.
The article text inside <article> was taken from each page. The content of the tags <p> and <li> was used to extract the plaintext while avoiding advertisements, infoboxes etc.
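A hedged sketch of that extraction with beautifulsoup4 (the <article>/<p>/<li> selectors are from the description above; the function itself is mine):

```python
from bs4 import BeautifulSoup

def extract_plaintext(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:
        return ""
    # Keep only <p> and <li> contents to skip ads, infoboxes etc.
    parts = [tag.get_text(" ", strip=True) for tag in article.find_all(["p", "li"])]
    return "\n".join(p for p in parts if p)
```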
Paragraphs matching some standard article endings like “follow us on Twitter” weren’t added to the plaintext, but not all such endings were filtered out.
The tags required special care because they presented two problems:
Since this was supposed to be a multilingual dataset, I wanted to have a list of tags for each article independent of the translations. The eventual solution was to crawl the Ukrainian and Russian tag pages to save the short unique ID and both translations, and to add English translations to the short IDs when they were seen in the English translations of articles.
An example tag and three translations:
{"ukr":["флот","/tags/flot/"],"rus":["флот","/rus/tags/flot/"],"eng":["naval fleet","/eng/tags/flot/"]}
The UPravda multilingual dataset contains in total XX individual translations of YY articles. X articles have a Ukrainian version, Y a Russian and Z an English one.
The dataset has X individual tags, of which the most common ones are shown in the table below: TODO
The dataset contains articles published from 01.01.2022 to X, since UP drastically increased the number of articles translated to English after the start of the full-scale invasion on 24.02.2022 7 (see picture below; TODO better x axis angle on plot).
A recent (2022) manual audit of available crawled multilingual datasets found surprisingly low amounts of in-language data and systematic issues in many of them. 8
Some issues raised in the paper in the context of this dataset:
According to Ukrainian law, newspaper-like articles aren’t subject to copyright. According to UP’s rules on the matter9, reprinting (..) in other online newspapers is free but requires a link to the UP article no later than in the second paragraph. Using the materials for commercial purposes is forbidden.
I believe releasing this dataset under the CC BY-NC 4.0 license (that allows sharing and adaptation only with attribution and for non-commercial use), with clear attribution to UP in the name and the description of the dataset, fulfills the applicable obligations both in letter and in spirit.
The dataset is released at https://huggingface.co/datasets/shamotskyi/ukr_pravda_2y
Some UP articles have short paragraphs in the style of “follow us on Twitter” at the end. They have little to do with the actual article, so they were removed from the article text in the dataset.
All paragraphs containing text matching any of the lines/regexes below were filtered out:
"Follow (us|Ukrainska Pravda) on Twitter",
"Support UP",
"become our patron",
"(читайте|слухайте|слушайте) (також|также)", # "read/listen also to", in Russian and Ukrainian
ChatGPT suggested this prompt to me (https://chat.openai.com/share/2f6cf1f3-caf5-4e55-9c1b-3dbd6b73ba29):
Будь ласка, перефразуйте цей текст, змінюючи порядок інформації та структуру повідомлення, уникаючи збігів слів та фразових конструкцій з оригіналом. Фокусуйтеся лише на ключових фактах, уникаючи зайвих деталей: (roughly: “Please rephrase this text, changing the order of the information and the structure of the message, avoiding overlaps in words and phrasing with the original. Focus only on the key facts, avoiding unnecessary details.”)
An improved version that seems to work ~better (https://chat.openai.com/share/14f12f87-50a8-438c-9d01-a0b076c3be12):
Будь ласка, перефразуйте цей текст, змінюючи порядок інформації та структуру повідомлення, максимально уникаючи збігів слів та фразових конструкцій з оригіналом. Довжина статті має бути приблизно такою ж, як довжина оригіналу. (roughly: “Please rephrase this text, changing the order of the information and the structure of the message, avoiding overlaps in words and phrasing with the original as much as possible. The length of the article should be approximately the same as the original.”)
GPT3.5 works just as well if not better than GPT4 (and is much faster): https://chat.openai.com/share/78927782-25fa-4047-b2a4-fd01ee9a7a54
Here GPT4 is much better than GPT3. Can’t share either link because “disabled by moderation”(???).
Interestingly, GPT3.5 definitely used Russian-inspired clichés, which I document in 231214-1251 Masterarbeit benchmark task for Russian-Ukrainian interference.
231010-1003 Masterarbeit Tagebuch#2023-12-15
<_(@inbook) “Analysis of references across wikipedia languages” (2017) / Włodzimierz Lewoniewski, Krzysztof Węcel, Witold Abramowicz: z / / 10.1007/978-3-319-67642-5_47 _> ↩︎
Рейтинг топсайтів України | Інститут масової інформації, linked on Українська правда — Вікіпедія ↩︎
Compliance with professional standards in online media. The fourth wave of monitoring in 2021 | Institute of Mass Information ↩︎
<_(@Schonfeld2009) “Sitemaps: Above and beyond the crawl of duty” (2009) / Uri Schonfeld, Narayanan Shivakumar: z / https://dl.acm.org/doi/10.1145/1526709.1526842 / 10.1145/1526709.1526842 _> ↩︎
Ethics in Web Scraping. We all scrape web data. Well, those of… | by James Densmore | Towards Data Science ↩︎
https://www.pravda.com.ua/tags/; https://www.pravda.com.ua/rus/tags/ ↩︎
<_(@10.1162/tacl_a_00447) “Quality at a glance: An audit of web-crawled multilingual datasets” (2022) / Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi: z / https://doi.org/10.1162/tacl_a_00447 / 10.1162/tacl_a_00447 _> ↩︎
Правила використання матеріалів сайтів Інтернет-холдингу ‘‘Українська правда’’ (Оновлено) | Українська правда ↩︎
Was looking for a way to do this but it’s part of the batteries included: Pluralsight Tech Blog | Python CLI Utilities with Poetry and Typer
If you define script entry points in the pyproject.toml:
[tool.poetry.scripts]
up_get_uris = "up_crawler.get_uris:main"
up_crawl_uris = "up_crawler.bs_oop:main"
up_run = "up_crawler.__main__:main"
up_convert = "up_crawler.up_reader:main"
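Each right-hand side is just module:function; e.g. up_run points at a main() roughly like this (a sketch, the real module contents aren’t shown here):

```python
# up_crawler/__main__.py (hypothetical contents)
def main() -> None:
    # Whatever the crawler actually does; the console script "up_run" calls this function.
    print("running the UP crawler")

if __name__ == "__main__":
    main()
```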
Then, once you install the package you built with poetry build elsewhere, these commands will be registered as CLI commands, and you’ll be able to just run up_run --help and it’ll work!
I come back to the topic every once in a while, but this time How To Use Pytest Logging And Print To Console And File (A Comprehensive Guide) | Pytest With Eric gave me the only solution I’ll ever need:
poetry run pytest --log-cli-level=INFO
which works as-is without any additional packages etc.
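For example, a test that logs via the standard logging module will show its message live in the console when run with the flag above (the test itself is just illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def test_logging_is_visible():
    # With --log-cli-level=INFO, this line appears in the pytest console output.
    logger.info("hello from the test")
    assert True
```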