In the middle of the desert you can say anything you want
python - Split / Explode a column of dictionaries into separate columns with pandas - Stack Overflow taught me about pandas.json_normalize — pandas 2.1.1 documentation:
In: something JSON-like (dict, list, ..); out: a pandas DataFrame!
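A minimal sketch of what that looks like, with toy data and a made-up dict-valued column:

```python
import pandas as pd

# A column whose cells are dicts (hypothetical data)
df = pd.DataFrame({
    "id": [1, 2],
    "meta": [{"a": 10, "b": 20}, {"a": 30, "b": 40}],
})

# Explode the dict column into separate columns, then re-join
flat = pd.json_normalize(df["meta"].tolist())
out = df.drop(columns="meta").join(flat)
```

The index alignment works here because both frames keep the default RangeIndex.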
eval-UA-tion
/ eval_UA_tion
-
if they end up in the name
eval-UA-tion
since it works both in color and as plain monospace text!
Some drafts I did in Inkscape:
And just for fun:
ChatGPT generated this:
Its internal prompt for the picture, found via inspect element, was:
alt="Logo design for 'eval-UA-tion', a benchmark for Ukrainian language models. Incorporate the word 'eval-UA-tion' in a stylish font, with a sunflower replacing the letter 'o'. Add elements that give a Ukrainian touch, such as traditional Ukrainian patterns or colors (blue and yellow). The design should be modern, clear, and professional, suitable for a technical and academic setting."
# 2 digits after the decimal point
pd.set_option("display.precision", 2)
# Suppress scientific notation
pd.options.display.float_format = "{:.0f}".format
# for more natural 100,233.23-like output
pd.options.display.float_format = "{:,.3f}".format
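A quick sanity check on toy data (the last option set wins):

```python
import pandas as pd

# Thousands separators and 3 digits after the decimal point
pd.options.display.float_format = "{:,.3f}".format

s = pd.Series([100233.23])
formatted = repr(s)  # the Series now prints the value as 100,233.230
```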
Setting it as a context:
with pd.option_context('display.float_format', lambda x: f'{x:,.3f}'):
display(df.describe())
Also: I can format a float column (‘temporarily’) not just the way I always did, but also in a much simpler way:
# before
ds["percent"].apply(lambda x: f"{x:.2%}")
# after
ds["percent"].apply("{:.2%}".format)
I forgot you can do "string".format(variable)!
Also TIL about display() for Jupyter notebooks, for when the value isn’t the cell’s return value (e.g. inside a context manager, df.describe() alone would not have shown the description).
One way to do it, if the same aggregations apply to all columns:
df.groupby("collection")[
["num_pages", "num_chars", "num_tokens", "num_sentences"]
].agg(
[
# "count",
"sum",
"mean",
# "std",
]
)
An even better way:
# ...
].agg(
num_documents=("num_pages", "count"),
num_pages=("num_pages", "sum"),
mean_pages=("num_pages", "mean"),
mean_tokens=("num_tokens", "mean"),
)
They are literally named tuples! Yay for Named Aggregation!
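A self-contained sketch of named aggregation on toy data (column names made up to mirror the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    "collection": ["a", "a", "b"],
    "num_pages": [10, 20, 5],
})

# Each keyword is the output column name; the value is a
# (source_column, aggregation) named tuple
agg = df.groupby("collection").agg(
    num_documents=("num_pages", "count"),
    num_pages=("num_pages", "sum"),
    mean_pages=("num_pages", "mean"),
)
```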
Draft.
Context: 230529-2208 Seaborn matplotlib labeling data points
Given: I need to make the axis limits larger to fit the label text; the last lines here:
data = df_pages.reset_index().sort_values("num_pages")
ax = sns.barplot(data, y="collection", x="num_pages")

# label points
for i in ax.axes.containers:
    ax.bar_label(
        i,
    )

# make the labels fit the limits
xlim = ax.axes.get_xlim()[1]
new_xlim = xlim + 14600
ax.axes.set_xlim(0, new_xlim)
Question: by how much?
Answer:
for i in ax.axes.containers:
    an = ax.bar_label(
        i,
    )

# `an` is a list of all Annotations
an[0].get_window_extent()
>>> Bbox([[88.66956472198585, 388.99999999999994], [123.66956472198585, 402.99999999999994]])

def get_text_size(anno):  # anno: Annotation
    """TODO: take an array of annos, find the leftmost one etc."""
    bbox = anno.get_window_extent()
    ext = bbox.bounds
    # > (91.43835300441604, 336.19999999999993, 35.0, 14.0)
    x = ext[2]  # width
    y = ext[3]  # height
    return x, y

"""
ano = an[1]
bbox = ano.get_window_extent()
bbox.bounds
> (91.43835300441604, 336.19999999999993, 35.0, 14.0)
"""

get_text_size(an[6])
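One hedged way to answer “by how much”: convert the labels’ pixel extents back into data coordinates with ax.transData.inverted() and set the limit just past the rightmost label. A sketch with synthetic bars (not the original df_pages):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so extents can be computed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
container = ax.barh(["a", "b"], [10, 14600])
labels = ax.bar_label(container)  # list of Annotations

fig.canvas.draw()  # realize the layout so window extents exist

# Rightmost label edge, converted from display pixels to data coordinates
inv = ax.transData.inverted()
right = max(inv.transform((a.get_window_extent().x1, 0))[0] for a in labels)
ax.set_xlim(0, right * 1.05)  # 5% padding instead of a magic +14600
```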
GitLab introduced tasks, and they get shown by default in the issue list. Type != task in the search leaves only the issues.
Can one save search templates?..
Is this needed, or can I just use one of the existing ones? I’ll use one of the existing ones!
Then this becomes notes about choosing one and adapting my own tasks to it.
First of all, I’d like the generator scripts to be runnable through Docker, especially the pravda crawler!
Related:
General:
Other / useful libraries:
exact match: true/false
multiple choice
lmentry/lmentry/predict.py at main · aviaefrat/lmentry contains the prediction code used to evaluate it with different kinds of models - I’ll need this.
SWAG seems the closest of the modern benchmarks to UA-CBT (one-word completions etc.); I should look into what exactly they do.
NarrativeQA!
Literature Review For Academic Outsiders: What, How, and Why — LessWrong
‘Literature review’ the process is a way to become familiar with what work has already been done in a particular field or subject by searching for and studying previous work
Every time I do research I perform a simple thought experiment: assuming somewhere in the world exists evidence that would prove or disprove my hypothesis, where is it?
Citations are a hierarchy of ideas
My old note about tenses in a bachelor thesis: Day 155 - serhii.net linking to the excellent Effective Writing | Learn Science at Scitable
The Leipzig Glossing Rules seem to be the key for me:
Markdown and python and stuff
Markdown
<span style="font-variant:small-caps;">Hello World</span>
Python
The online version has cool tests at the end!
Generally: a lot of it is about languages/power, indigenous languages etc. Might be interesting for me wrt. UA/RU and colonialism
https://peps.python.org/pep-0673/
from typing import Self

class Shape:
    def set_scale(self, scale: float) -> Self:
        self.scale = scale
        return self
Related: 220726-1638 Python typing classmethods return type
I remember writing about the typevar approach but cannot find it…
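For reference, the pre-PEP-673 TypeVar approach (which also works on Python < 3.11) looks roughly like this:

```python
from typing import TypeVar

TShape = TypeVar("TShape", bound="Shape")

class Shape:
    # Annotating `self` with a bound TypeVar makes the return type
    # follow the subclass: Circle().set_scale(2.0) checks as Circle
    def set_scale(self: TShape, scale: float) -> TShape:
        self.scale = scale
        return self

class Circle(Shape):
    pass

c = Circle().set_scale(2.0)
```

PEP 673’s Self is essentially shorthand for exactly this pattern.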