Diensttagebuch

Day 1806 (11 Dec 2023)

Backing up a wordpress installation with wp-cli

Wordpress-Backups mit der WordPress-CLI – emsgold:

# check if upgrading is needed
wp core check-update 
wp plugin status 
wp theme status 

# DB
wp db export

# the entire website files
touch backupname.tar.gz
tar --exclude=backupname.tar.gz -vczf backupname.tar.gz .

poetry running scripts after building python package

Was looking for a way to do this but it’s part of the batteries included: Pluralsight Tech Blog | Python CLI Utilities with Poetry and Typer

If you define run points in the pyproject.toml

[tool.poetry.scripts]
up_get_uris = "up_crawler.get_uris:main"
up_crawl_uris = "up_crawler.bs_oop:main"
up_run = "up_crawler.__main__:main"
up_convert = "up_crawler.up_reader:main"

Then once you install the package you built with poetry build elsewhere, these commands will be registered as cli commands, and then you’ll be able to just run up_run --help and it’ll work!

Pytest logging output through CLI

I come back to the topic every once in awhile, but this time How To Use Pytest Logging And Print To Console And File (A Comprehensive Guide) | Pytest With Eric gave me the only solution I’ll ever need:

poetry run pytest --log-cli-level=INFO

which works as-is without any additional packages etc.

Day 1804 (09 Dec 2023)

Tenacity: a retrying library for python

jd/tenacity: Retrying library for Python¹:

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    before_sleep_log,
)

# ... 

@retry(
	stop=stop_after_attempt(10),  # Maximum number of retries
	wait=wait_exponential(multiplier=1, min=1, max=60),  # Exponential backoff
	before_sleep=before_sleep_log(logger, logging.INFO),
)
@staticmethod
def do_basic_uri_ops_when_crawling(
	# ...
	pass

This is much better than the various retrying mechanisms in requests (e.g. needing session adapters: Handling Retries in Python Requests – Majornetwork), and likely better than most reinvented wheels (231206-1722 Overengineered solution to retrying and exceptions in python).

Ty Python Requests: Retry Failed Requests [2023] - ZenRows ↩︎

Day 1802 (07 Dec 2023)

Notes to self and lessons learned, OOP and programming in general

I decided that I should go back to the digital garden roots of this, and use this note as a small journey of conceptual/high-level things that I believe would make me a better programmer.

And that I’ll re-read this every time I think of something to add here.

The master thesis has given me ample occasions to find out about these things, and will give me ample occasions to use them before it’s over. Just like with dashes (231205-1311 Notes from paper review#Hyphens vs dashes (em-dash, en-dash)), practiced enough it will stick.

OOP (2023-12-07)

(the post that started this page)

After refactoring my third program to use OOP this quarter, this be the wisdom:

If I'm starting a one-time simple project that looks like it doesn't need OOP - think hard, because often it does.

(Unless threads/parallelism, then it means think harder).

Crawling and converting and synchronicity (2023-12-08)

Context: UPCrawler & GBIF downloader

TL;DR: downloading bits and writing to disk each is sometimes better than to keep them in a dataframe-like-ish structure that gets written to disk in bulk. And the presence of a file on disk can be signal enough about its state, making separate data structures tracking that unneeded.

Background:

When downloading something big and of many parts, my first instinct is/was to put it into pretty dataclasses-like structures, (maybe serializable through JSONWizard), collect it and write it down.

If I think I need some intermediate results, I’d do checkpoints or something similar, usually in an ugly function of the dataframe class to do file handling etc.

Often one can download the individual bit and write it to disk, maybe inside a folder. Then a check of whether it has been downloaded would be literally a check if the file exists, making them self-documenting in a small way.

(And generally - previously I had this when writing certain converters and the second worst thing I have written in my life - I’d have dataclasses with kinds of data and separate boolean fields with has_X_data and stuff. I could have just used whether the data fields are None to signify if they are there or not instead of …that.)

Synchronicity and threads

Doing it like that makes they can happily be parallelized or whatever, downloaded separately.

In the UPCrawler, I was blocked by the need to add to each article a language-independent tag, that was an URI and one to two translations. I wanted to get the entire chunk, gather all translations of tags from them, label the chunks correctly, then serialize.

This is idiotic if I can just download the articles with the info I have to disk and then run a separate script to gather all tags from them and do this. (Or I can gather the tags in parallel while this is happening but don’t let the need to complete it block my download)

Shortcuts (2023-12-08)

Context: UPCrawler; a pattern I’ve been noticing.

Sitemaps instead of crawling archives

First I crawled and prased pages like Архив 26 сентября 2023 года | Украинская правда to get the URI of the articles published on that day, did permutations of the URI to get the other languages if any, and got the list of URIs of articles to crawl.

Yesterday I realized that UPravda has sitemaps: https://www.pravda.com.ua/sitemap/sitemap-2023-04.xml.gz, and that I can use something like advtools to nicely parse them, and advtools gave me back the data as a pandas DataFrame — leading me to the insight that I can analyze parse regex etc. the uris using pandas. Including things like groupby article ID to give me immediately the 1..3 translations of that article. Instead of me needing to track it inside a (guess what) datastructure based on dataclasses.

This inspired me to look for better solutions of another problem plaguing me - tags, with their UK and RU translations.

Lesson in all that

Make an effort — really, an effort - to look at the forest, and for each problem think if there’s an easier way to do that than the one I started implementing without thinking. Including whether there are already structures in place I know about but from other contexts.

I learned to look for solutions inside python stdlib, remembering about this at the right moments should be easy as well.

Я ускладнюю все, до чого торкаюсь (2023-12-08)

A lot of my code is more complex than needed, and to heavy for its own good/purpose. Connected to the above: think (draw? architect?) of a good design before I start writing the code. A sound structure from the beginning will remove many of the corner cases that end up in ugly code to maintain.

Use a real IDE as soon as needed (2024-01-19)

In the context of 240118-1516 RU interference masterarbeit task embeddings mapping, especially given that the models take a while to load.

A Jupyter notebook would have allowed me to experiment much better with the loaded models than a pdbpp interpreter/command line.
Pycharm would have allowed me to debug inside gensim and transmat, and therefore understand them, much better and earlier.

You can add notes to exceptions

8. Errors and Exceptions — Python 3.12.0 documentation:

try:
...     raise TypeError('bad type')
... except Exception as e:
...     e.add_note('Add some information')
...     e.add_note('Add some more information')
...     raise

A function deep down can raise the exception, then a function higher up can catch it and add more details (uri returns 404 -> when downloading image $image we got a 404).

This solves so many conceptual problems I’ve been having!

GBIF API bits

GBIF has media of four types, StillImage is only one of them: MediaType (GBIF Parsers 0.62-SNAPSHOT API)
GBIF can have images of not just image/jpeg mime type: https://data-blog.gbif.org/post/gbif-multimedia/

Couldn’t find a proper list, but one can always jq

> curl https://api.gbif.org/v1/occurrence/search\?taxonKey\=4\&limit\=300 | jq -C | grep "format.*image" | sort | uniq
          "format": "image/gif",
          "format": "image/jpeg",
          "format": "image/png",
            "http://purl.org/dc/terms/format": "image/gif",
            "http://purl.org/dc/terms/format": "image/jpeg",
            "http://purl.org/dc/terms/format": "image/png",

requests and urllib333 exceptions adventures

If I’m looking at this, the tenacity library post for retrying may be relevant as well (todo link).

So, TIL:

urllib3’s ConnectionError is now called ProtocolError, but aliased for backward compatibility
requests has also a ConnectionError! A totally different one I think

from requests.exceptions import ConnectionError, HTTPError
from urllib3.exceptions import ProtocolError, NameResolutionError, MaxRetryError

urllib3
- ProtocolError
  - RemoteDisconnected
- NameResolutionError - usually temporary
- MaxRetryError
requests:
- ConnectionError - catchall for all of the above?
- HTTPError - 404 and friends

requests exceptions docs: Developer Interface — Requests 2.31.0 documentation urllib3 exceptions: Exceptions and Warnings - urllib3 2.1.0 documentation

TODO: Interaction between MaxRetryError <-> other urllib3 errors - who raises whom, and what does requests do with it (I think raising a ConnectionError but am not sure)

Python filtering logging logs and warnings

zc
zc/it

Filtering logging messages in Python

class LoggingFilter(logging.Filter):
    def filter(self, record):
        if "Connection pool is full" in record.getMessage():
            return False

logger_cpool = logging.getLogger("urllib3.connectionpool")
logger_cpool.addFilter(LoggingFilter())

All filters are applied until one returns False, then the record is silenced, otherwise it gets logged normally.
Any magic can be done, incl. regex etc.!
LogRecord attributes¹ had a logrecord.message but I didn’t have it in my case (todo), but I found getMessage().

Getting the name of your logger to apply the filter to

How to list all existing loggers using python.logging module - Stack Overflow:

import logging

loggers = [logging.getLogger(name) for name in logging.root.manager.loggerDict]

Warnings

Temporarily supressing warnings

warnings — Warning control — Python 3.12.0 documentation:

import warnings

def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()

Dealing with warnings through logging

logging — Logging facility for Python — Python 3.12.0 documentation

# capture is True for enabling, False for disabling
logging.captureWarnings(capture)

# all warnings will become logs from logger 'py.warnings' with severity WARN

logging — Logging facility for Python — Python 3.10.13 documentation ↩︎

Day 1801 (06 Dec 2023)

Overengineered solution to retrying and exceptions in python

Goal: retry running function X times max Scenario: networking-ish issues

Solution: I came up with the thing below. It gets an optional list of acceptable exception types, and retries N times every time it gets one of them. As soon as it gets an unacceptable exception it passes it further. As soon as the function runs successfully it returns the function’s return value.

Can repeat infinite times and can consider all exceptions acceptable if both params are given empty or None.

from urllib3.exceptions import ProtocolError
from functools import partial
from itertools import count
from typing import Optional

def _try_n_times(fn, n_times: Optional[int]=3, acceptable_exceptions: Optional[tuple] =(ProtocolError, )):
    """ Try function X times before giving up.

    Concept:
    - retry N times if fn fails with an acceptable exception   
    - raise immediately any exceptions not inside acceptable_exceptions
    - if n_times is falsey will retry infinite times
    - if acceptable_exceptions is falsey, all exceptions are acceptable

    Returns:
        - after n<n_times retries the return value of the first successdful run of fn

    Raises:
        - first unacceptable exceptions if acceptable_exceptions is not empty 
        - last exception raised by fn after too many retries

    Args:
        fn: callable to run
        n_times: how many times, 0 means infinite
        acceptable_exceptions: iterable of exceptions classes after which retry
            empty/None means all exceptions are OK 

    TODO: if this works, integrate into load image/json as well (or increase 
        the number of retries organically) for e.g. NameResolutionErrors
        and similar networking/connection issues
    """
    last_exc = None
    for time in range(n_times) if n_times else count(0):
        try:
            # Try running the function and save output
            # break if it worked
            if time>0:
                logger.debug(f"Running fn {time=}")
            res = fn()
            break
        except Exception as e:
            # If there's an exception, raise bad ones otherwise continue the loop
            if acceptable_exceptions and e.__class__ not in acceptable_exceptions:
                logger.error(f"Caught {e} not in {acceptable_exceptions=},  so raising")
                raise
            logger.debug(f"Caught acceptable {e} our {time}'th time, continuing")
            last_exc = e
            continue
    else:
        # If loop went through without a single break it means fn always failed
        # we raise the last exception
        logger.error(f"Went through {time} acceptable exceptions, all failed, last exception was {last_exc}")
        raise last_exc

    # Return whatever fn returned on its first successful run
    return res

The main bit here was that I didn’t want to use any magic values that might conflict with whatever the function returns (if I get a None/False how can I know it wasn’t the function without ugly complex magic values?)

The main insight here is the else clause w/ break.

fn is run as fn() and partial is a good way to generate them

EDIT: (ty CH) you can also just declare a function, lol

serhii.net