In the middle of the desert you can say anything you want
If you have an axis indexed from timestamps and want to draw stuff there, you need to convert between datetimes and coordinates.
SO1 led me to matplotlib.dates — Matplotlib 3.7.1 documentation.
from matplotlib.dates import date2num
coordinate_float_value = date2num(your_timestamp)
# there's also num2date, etc.
Then you can do things like
g=sns.histplot(...)
g.axes.bar(x=date2num(timestamp),height=5,width=0.01)
#or
Ref:
And, for later, gridlnes/dates fun I’ll document later maybe:
from matplotlib.dates import date2num, drange
from datetime import timedelta
import matplotlib.ticker as ticker
g = sns.lineplot(...)
# We create a grid located at midnight of each day
t_end = df_item['time.pull'].max().ceil(freq="D")
t_start = df_item['time.pull'].min().floor(freq="D")
dr_days = drange(t_start,t_end,delta=timedelta(days=1))
dr_hours = drange(t_start,t_end,delta=timedelta(hours=4))
g.axes.grid(True, axis='both',which='major')
g.axes.grid(True, axis='both',which='minor',linewidth=0.2,linestyle="--")
g.axes.xaxis.set_major_locator(ticker.FixedLocator(dr_days))
g.axes.xaxis.set_minor_locator(ticker.FixedLocator(dr_hours))
For titles I was using sns.histplot(..).set(title="My title")
, but I couldn’t find any documentation for that .set()
function in the seaborn docu.
Seaborn’s FAQ (“How can I can I change something about the figure?”) led me here: matplotlib.axes.Axes.set — Matplotlib 3.7.1 documentation
It’s actually a matplotlib function!
(TODO: understand much better how seaborn exposes matplotlib’s internals. Then I can google for matplotlib stuff too)
You can access the matplotlib Figure
through .fig
, then use
matplotlib.pyplot.suptitle — Matplotlib 3.7.1 documentation
for the main figure title!
x = sns.displot(
data=xxx,
x='items_available',
col="item.item_category",
).set_titles(col_template="{col_name}") # Title template for each facet
# Main figure title, through matplotlib Figure
x.fig.suptitle("Distribution of sums of all items_available per time.pull",va='bottom')
I was trying to do a join based on two columns, one of which is a pd Timestamp
.
What I learned: If you’re trying to join/merge two DataFrames not by their indexes,
pandas.DataFrame.merge
is better (yay precise language) than
pandas.DataFrame.join.
Or, for some reason I had issues with df.join(.. by=[col1,col2])
, even with df.set_index([col1,col2]).join(df2.set_index...)
, then it went out of memory and I gave up.
Then a SO answer1 said
use merge if you are not joining on the index
I tried it and df.merge(..., by=col2)
magically worked!
This is REALLY neat and seaborn is now officially the best thing since sliced bread (only having pie charts could make it better1).
seaborn.FacetGrid — seaborn 0.12.2 documentation:
relplot Combine a relational plot and a FacetGrid
displot Combine a distribution plot and a FacetGrid
catplot Combine a categorical plot and a FacetGrid
lmplot Combine a regression plot and a FacetGrid
sns.displot(
data=df_item[df_item['item.item_category']!="OTHER"].groupby(['item.item_category','time.pull']).sum(),
#y='item_active',
x='items_available',
hue="item.item_category",
col="item.item_category",
)
All of this takes row
/col
arguments that neatly create separate plots!
Obyde/obsidian internal link test: 230515-1855 Pie charts considered harmful ↩︎
Spent hours trying to understand what’s happening.
TL;DR categorical types inside groupbys get shown ALL, even if there are no instances of a specific type in the actual data.
# Shows all categories including OTHER
df_item[df_item['item.item_category']!="OTHER"].groupby(['item.item_category']).sum()
df_item['item.item_category'] = df_item['item.item_category'].astype(str)
# Shows three categories
df_item[df_item['item.item_category']!="OTHER"].groupby(['item.item_category']).sum()
Rel. thread: groupby with categorical type returns all combinations · Issue #17594 · pandas-dev/pandas
Both things below work! Seaborn is smart and parses pd groupby-s as-is
sns.histplot(data=gbc,
x='items_available',
hue="item.item_category",
)
sns.histplot(data=gbc.reset_index(),
x='items_available',
hue="item.item_category",
)
Note that seaborn doesn’t create pie charts, as seaborn’s author considers those to be unfit for statistical visualization. See e.g. Why you shouldn’t use pie charts – Johan 1
Why you shouldn’t use pie charts:
Pies and doughnuts fail because:
- Quantity is represented by slices; humans aren’t particularly good at estimating quantity from angles, which is the skill needed.
- Matching the labels and the slices can be hard work.
- Small percentages (which might be important) are tricky to show.
The world is interesting.
TL;DR
df.loc[row_indexer, col_indexer] = value
col_indexer
can be a non-existing-yet column! And row_indexer
can be anything, including based on a groupby
filter.
Below, the groupby filter has dropna=False
which would return also the rows that don’t match the filter, giving a Series with the same indexes as the main df
# E.g. this groupby filter - NB. dropna=False
df_item.groupby(['item.item_id']).filter(lambda x:x.items_available.max()>0, dropna=False)['item.item_id']
# Then we use that in the condition, nice arbitrary example with `item.item_id` not being the index of the DF
df_item.loc[df_item['item.item_id']==df_item.groupby(['item.item_id']).filter(lambda x:x.items_available.max()>0, dropna=False)['item.item_id'],'item_active'] = True
I’m not sure whether this is the “best” way to incorporate groupby results, but seems to work OK for now.
Esp. the remaining rows have nan
instead of False, can be worked around but is ugly:
df_item['item_active'] = df_item['item_active'].notna()
# For plotting purposes
sns.histplot(data=df_item.notna(), ... )
Pandas Filter by Column Value - Spark By {Examples} has more examples of conditions:
# From https://sparkbyexamples.com/pandas/pandas-filter-by-column-value/
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
mwouts/itables: Pandas DataFrames as Interactive DataTables:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
It kept “loading”. I set the notebook to ’trusted’ as per help, didn’t help.
But this did:
init_notebook_mode(all_interactive=True, connected=True)
(connected=True
makes it load libraries dynamically from the internet, and is not the default)
Allows more interesting interactive dynamical tables. Incl. things like sorting by column etc.
Courtesy of ChatGPT, after googling for 15 minutes didn’t bring anything immediately copy-pasteable.
I’ll deal with the existential crisis from this later.
To explode the cells containing dictionaries in a pandas DataFrame and append them as additional columns, you can use the apply() method along with the pd.Series() constructor. Here’s an example:
import pandas as pd
# create example DataFrame
df = pd.DataFrame({
'id': [1, 2],
'info': [
{'name': 'Alice', 'age': 25},
{'name': 'Bob', 'age': 30, 'city': 'New York'}
]
})
# use apply() with pd.Series() to explode the dictionaries and append them as columns
df = pd.concat([df.drop('info', axis=1), df['info'].apply(pd.Series)], axis=1)
print(df)
Output:
id name age city
0 1 Alice 25 NaN
1 2 Bob 30 New York
In this example, we first create a DataFrame with two rows, where the info column contains dictionaries. We then use the apply() method with pd.Series() to explode the dictionaries and create additional columns for each key in the dictionaries. Finally, we concatenate the exploded columns with the original DataFrame using pd.concat().