So I have recorded a lot of random tracks of my location through the OsmAnd Android app, and I wanted to see if I could visualize them. This turned into a deep dive into the whole topic of visualizing geo data.
I likely got some things wrong, but I think I now have a more or less complete picture of what exists, and I tried to use a lot of it.
OsmAnd gives me .gpx files.
(I specifically want to mention OsmAnd GPX | OsmAnd, which I should have opened MUCH earlier: it would have saved me an hour of trying to understand why the speeds looked strange. They were in meters per second.)
A GPX file gives you a set(?) of tracks composed of segments, which are lists of points, with some additional optional attributes/data attached to them.
I saw it mentioned the most often, so in my head it's an important format.
Among other things, it's one of the formats in which Stadt Leipzig provides data about the different parts of the city, and I used that data heavily: Geodaten der Leipziger Ortsteile und Stadtbezirke - Geodaten der Leipziger Ortsteile - Open Data-Portal der Stadt Leipzig
Other formats are out of scope here and I never used them, but if I had to look somewhere for a list, it'd be here: Input/output — GeoPandas 0.13.0+0.gaa5abc3.dirty documentation
gpxpy is really lightweight and is basically an interface to the XML content of such files.
```python
import gpxpy
import pandas as pd
import tqdm

# Assume gpx_tracks is a list of Paths to .gpx files
data = list()
for f in tqdm.tqdm(gpx_tracks):
    gpx_data = gpxpy.parse(f.read_text())
    # For each track inside that file
    for t in gpx_data.tracks:
        # For each segment of that track
        for s in t.segments:
            # We create a row
            el = {"file": f.name, "segment": s}
            data.append(el)
# Vanilla pd.DataFrame with those data
df = pd.DataFrame(data)
# We add more info about each segment
df["time_start"] = df["segment"].map(lambda x: x.get_time_bounds().start_time)
df["time_end"] = df["segment"].map(lambda x: x.get_time_bounds().end_time)
df["duration"] = df["segment"].map(lambda x: x.get_duration())
df
```
At the end of my notebook, I used a more complex function to do basically the same:
```python
import geopandas as gp
import gpxpy
import pandas as pd
import tqdm

WITH_OBJ = False
LIM = 30

data = list()
# Ugly way to enumerate files
fn = -1
for f in tqdm.tqdm(gpx_tracks[:LIM]):
    fn += 1
    gpx_data = gpxpy.parse(f.read_text())
    for tn, t in enumerate(gpx_data.tracks):
        for sn, s in enumerate(t.segments):
            # Same as above, but with one row per point now:
            for pn, p in enumerate(s.points):
                # We get the speed at each point
                point_speed = s.get_speed(pn)
                if point_speed:
                    # Multiply to convert m/s -> km/h
                    point_speed *= 3.6
                el = {
                    "file": f.name,
                    "file_n": fn,
                    "track_n": tn,
                    "seg_n": sn,
                    "point_n": pn,
                    "p_speed": point_speed,
                    "p_lat": p.latitude,
                    "p_lon": p.longitude,
                    "p_height": p.elevation,
                }
                # If needed, add the individual objects too
                if WITH_OBJ:
                    el.update(
                        {
                            "track": t,
                            "segm": s,
                            "point": p,
                        }
                    )
                data.append(el)
ft = pd.DataFrame(data)
gft = gp.GeoDataFrame(ft, geometry=gp.points_from_xy(ft.p_lon, ft.p_lat))
```
Then you can do groupbys etc. by file/track/segment/… and compute things like the mean speed per segment:

```python
gft["seg_speed"] = gft.groupby(["file_n", "track_n", "seg_n"]).p_speed.transform("mean")
```
GeoPandas 0.13.0 — GeoPandas 0.13.0+0.gaa5abc3.dirty documentation
It's a superset of pandas with more cool functionality built on top. I didn't find it intuitive at all, but then again the whole topic was new to me.
Here's an example of creating a GeoDataFrame:

```python
import geopandas as gp

# Assume df_points.lon is a pd.Series/list-like of longitudes etc.
pdf = gp.GeoDataFrame(
    df_points, geometry=gp.points_from_xy(df_points.lon, df_points.lat)
)
# We now have a valid GeoDataFrame, because we created a `geometry`
# column that gets semantically parsed as geo-data!
```
Theoretically you can also read .gpx files directly; adapted from GPS track (GPX-Datei) in Geopandas öffnen | Florian Neukirchen to use pd.concat:

```python
import geopandas as gp
import pandas as pd

gdft = gp.GeoDataFrame(columns=["name", "geometry"], geometry="geometry")
for file in gpx_tracks:
    try:
        f_gdf = gp.read_file(file, layer="tracks")
        gdft = pd.concat([gdft, f_gdf[["name", "geometry"]]])
    except Exception as e:
        pass
        # print("Error", file, e)
```
The problem with that: I got a GeoDataFrame with shapely.MultiLineStrings that I could plot, but couldn't easily do more interesting stuff with directly.
Under the hood, GeoPandas uses Shapely a lot. I love shapely, but I gave up on directly reading GPX files with geopandas and went with the gpxpy way described above.
Merging data — GeoPandas 0.13.0+0.gaa5abc3.dirty documentation, first found on python - Accelerating GeoPandas for selecting points inside polygon - Geographic Information Systems Stack Exchange
Spatial joins are really cool: you can merge two different GeoDataFrames based on whether things are inside/intersecting/outside other things, with the vanilla join analogy holding well otherwise.
For example, we have the data about Leipzig, ldf, from before. We can sjoin it with our pdf dataframe of points:

```python
pdf = pdf.sjoin(ldf, how="left", predicate="intersects")
```

It automatically takes the geometry from both, finds the points that are inside the different parts of Leipzig, and you get your points with the columns of the part of Leipzig they're located in!
Then you can do pretty graphs with the points colored based on that, etc.:
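A minimal sketch of such a plot, assuming ldf has a district-name column (I call it "Name" here, which is hypothetical):

```python
# Districts as outlines, points colored by the district column they got from the sjoin.
ax = ldf.plot(color="white", edgecolor="black", figsize=(10, 10))
pdf.plot(ax=ax, column="Name", markersize=2, legend=True)
```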
- Adding a background map to plots — GeoPandas 0.13.0+0.gaa5abc3.dirty documentation
- contextily/intro_guide.ipynb at main · geopandas/contextily · GitHub
- DenisCarriere/geocoder: Python Geocoder API Overview — geocoder 1.38.1 documentation
- Sampling Points — GeoPandas 0.13.0+0.gaa5abc3.dirty documentation
- Guide to the Place API — contextily 1.1.0 documentation
- Quickstart — Folium 0.14.0 documentation
- A short look into providers objects — contextily 1.1.0 documentation
- plugins — Folium 0.14.0 documentation
- masajid390/BeautifyMarker
- Plotting with Folium — GeoPandas 0.13.0+0.gaa5abc3.dirty documentation
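One folium bit worth showing: a minimal sketch (assuming gft, the points GeoDataFrame from above, and its seg_speed column) of getting an interactive map via geopandas' built-in explore():

```python
# Returns a folium.Map with the points over a background map,
# colored by the mean segment speed computed earlier.
m = gft.explore(column="seg_speed", tooltip="p_speed")
m.save("tracks.html")
```

And to embed the resulting map's HTML somewhere else: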
```python
html_code = m.get_root()._repr_html_()
```

The Radar chart and its caveats: "radar or spider or web chart" (c)
… are best done in plotly:
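A minimal sketch using px.line_polar and the wind example dataset bundled with plotly:

```python
import plotly.express as px

df = px.data.wind()
fig = px.line_polar(df, r="frequency", theta="direction", line_close=True)
fig.show()
```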
For a log axis:

```python
fig.update_layout(
    template=None,
    polar=dict(
        radialaxis=dict(type="log"),
    ),
)
```
EDIT: for removing excessive margins, use

```python
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
)
```
Trivial option: Label data points with Seaborn & Matplotlib | EasyTweaks.com
TL;DR:

```python
for i, label in enumerate(data_labels):
    ax.annotate(label, (x_position, y_position))
```
BUT! Overlapping texts are sad:
StackOverflow sent me to the library Home · Phlya/adjustText Wiki, and it's awesome:
```python
import matplotlib.pyplot as plt
import numpy as np
from adjustText import adjust_text

x, y = np.random.rand(2, 30)  # some example points to label
fig, ax = plt.subplots()
plt.plot(x, y, "bo")
texts = [plt.text(x[i], y[i], "Text%s" % i, ha="center", va="center") for i in range(len(x))]
# adjust_text(texts)
adjust_text(texts, arrowprops=dict(arrowstyle="->", color="red"))
```
Not perfect but MUCH cleaner:
More advanced tutorial: adjustText/Examples.ipynb at master · Phlya/adjustText · GitHub
PyPI doesn't have the latest version, which has:
- min_arrow_len
- expand

How To Apply Conditional Formatting Across An Entire Row
- $A$1 is a direct reference to A1 that won't move if the formula is applied to a range
- ISBLANK(..) means "cell is empty"
- AND(c1,c2,...,cN), OR(c1,c2,...,cN)
- =$U1=1 is "if U of the current row is equal to 1" (then you can color the entire row green or whatever)
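Combining these (a made-up example with hypothetical columns): applying the custom-formula rule =AND(NOT(ISBLANK($A1)), $U1=1) to the range A1:Z100 highlights every row that has something in column A and a 1 in column U.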
This contains the entire list of all datasets I care about RE [230529-1413 Plants datasets taxonomy] for 230507-2308 / 230507-1623 Plants paper notes:
- GBIF
- Pl@ntNet
- Flora-On: Flora de Portugal Interactiva. (2023). Sociedade Portuguesa de Botânica. www.flora-on.pt. Accessed 29-5-2023.
- @herediaLargeScalePlantClassification2017 (2017): iNaturalist-xxx
- @vanhornINaturalistSpeciesClassification2018 (2018): iNaturalist
- Flora Incognita: "we curated a partly crowd-sourced image dataset, comprising 50,500 images of 101 species."
- "2494 observations with 3199 images from 588 species, 365 genera and 89 families"
- PlantCLEF
- REALLY NICE OVERVIEW PAPER with a really good overview of the existing datasets! Frontiers | Plant recognition by AI: Deep neural nets, transformers, and kNN in deep embeddings
- Flavia
- Datasets | The Leaf Genie has a list of leaf datasets! TODO
- Herbarium 2021: @delutioHerbarium2021HalfEarth2021 (2021)

```
pip install jupyter-black
```
To load:

```python
%load_ext jupyter_black
```
It will automatically format all correct python code in the cells!
NB: it works much, much better in jupyterlab; in the classic notebook it first executes the cell, then runs black and hides the cell output. It does warn about that everywhere, though.
Old code I wrote for making ds.corr() more readable; I've looked for it three times already, ergo its place is here.
Basically: it removes all small correlations, and optionally plots a colorful heatmap of the result.
```python
import numpy as np
import pandas as pd
import seaborn as sns


def plot_corr(res: pd.DataFrame):
    sns.heatmap(res, annot=True, fmt=".1f", cmap="coolwarm")


def get_biggest_corr(
    ds_corr: pd.DataFrame,
    limit: float = 0.8,
    remove_diagonal=True,
    remove_nans=True,
    plot=False,
) -> pd.DataFrame:
    # Keep only correlations stronger than ±limit
    res = ds_corr[(ds_corr > limit) | (ds_corr < -limit)]
    if remove_diagonal:
        np.fill_diagonal(res.values, np.nan)
    if remove_nans:
        res = res.dropna(how="all", axis=0)
        res = res.dropna(how="all", axis=1)
    if plot:
        plot_corr(res)
    else:
        return res
```
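Usage sketch, assuming ds is some numeric pd.DataFrame:

```python
strong = get_biggest_corr(ds.corr(), limit=0.7)
get_biggest_corr(ds.corr(), limit=0.7, plot=True)  # heatmap of only the strong correlations
```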
I like seaborn but kept googling the same things and could never get any internal 'consistency' in it, which led to a lot of small unsystematic posts (listed at the end), but I felt I was going in circles. This post is an attempt to actually read the documentation and understand the underlying logic of it all.
I’ll be using the context of my “Informationsvisualisierung und Visual Analytics 2023” HSA course’s “Aufgabe 6: Visuelle Exploration multivariater Daten”, and the dataset given for that task: UCI Machine Learning Repository: Student Performance Data Set:
This data approaches student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and it was collected by using school reports and questionnaires.
Goal:
I’m not touching the seaborn.objects interface as the only place I’ve seen it mentioned is the official docu and I’m not sure it’s worth digging into for now.
An introduction to seaborn — seaborn 0.12.2 documentation

```python
import seaborn as sns

# sets the default theme that looks nice
# and is used in all pics of the tutorial
sns.set_theme()
```
Overview of seaborn plotting functions — seaborn 0.12.2 documentation:

Functions can be:
- axes-level: they plot onto a single matplotlib.axes.Axes object and return it
- figure-level: they interface with matplotlib through a seaborn object, usually a FacetGrid
  - the concrete plot is chosen through the kind=xxx parameter
  - they have col= and row= params that automatically create subplots!

"The axes-level functions are written to act like drop-in replacements for matplotlib functions. While they add axis labels and legends automatically, they don't modify anything beyond the axes that they are drawn into. That means they can be composed into arbitrarily-complex matplotlib figures with predictable results."

"The figure-level functions wrap their axes-level counterparts and pass the kind-specific keyword arguments (such as the bin size for a histogram) down to the underlying function. That means they are no less flexible, but there is a downside: the kind-specific parameters don't appear in the function signature or docstring"
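A sketch of the same histogram both ways, using seaborn's bundled tips dataset:

```python
import seaborn as sns

tips = sns.load_dataset("tips")
# Axes-level: draws into a single matplotlib Axes
sns.histplot(data=tips, x="total_bill")
# Figure-level equivalent; col= creates one subplot per value of "time"
sns.displot(data=tips, x="total_bill", kind="hist", col="time")
```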
Special cases:
- sns.jointplot() has one plot with distributions around it and is a JointGrid
- sns.pairplot() "visualizes every pairwise combination of variables simultaneously" and is a PairGrid

In the pic above, the figure-level functions are the blocks on top, their axes-level functions below. (TODO: my version of that pic with the kind=xxx bits added)
The returned seaborn.FacetGrid can be customized in some ways (all examples here from that documentation link).
```python
g.map_dataframe(sns.scatterplot, x="total_bill", y="tip")
g.set_axis_labels("Total bill ($)", "Tip ($)")
g.set_titles(col_template="{col_name} patrons", row_template="{row_name}")
g.set(xlim=(0, 60), ylim=(0, 12), xticks=[10, 30, 50], yticks=[2, 6, 10])
g.tight_layout()
g.savefig("facet_plot.png")
```
It's possible to access the underlying matplotlib axes:

```python
g = sns.FacetGrid(tips, col="sex", row="time", margin_titles=True, despine=False)
g.map_dataframe(sns.scatterplot, x="total_bill", y="tip")
g.figure.subplots_adjust(wspace=0, hspace=0)
for (row_val, col_val), ax in g.axes_dict.items():
    if row_val == "Lunch" and col_val == "Female":
        ax.set_facecolor(".95")
    else:
        ax.set_facecolor((0, 0, 0, 0))
```
And generally access matplotlib stuff:
- ax: the matplotlib.axes.Axes when no faceting variables are assigned
- axes: an array of the matplotlib.axes.Axes objects in the grid
- axes_dict: a mapping of facet names to corresponding matplotlib.axes.Axes
- figure: the matplotlib.figure.Figure object underlying the grid (formerly fig)
- legend: the matplotlib.legend.Legend object, if present

FacetGrid.set() (Previously: 230515-2257 seaborn setting titles etc. with matplotlib set)
FacetGrid.set() is used from time to time in the tutorial (e.g. .set(title="My title"), especially in Building structured multi-plot grids) but never explicitly explained; its documentation says only "Set attributes on each subplot Axes".

It sets attributes on each subplot's matplotlib.axes.Axes. Useful ones are:
- title for the plot title (set_title())
- xticks, yticks
- xlabel, ylabel (what set_xlabel()/set_ylabel() would set, but those can't be chained, as their return value is not the ax)
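A small sketch of both routes (g being any FacetGrid):

```python
# Via set(): applied to every subplot Axes
g.set(title="My title", xticks=[0, 5, 10], xlabel="bill")
# Or via the matplotlib methods on each Axes
for ax in g.axes.flat:
    ax.set_xlabel("bill")  # returns a Text object, not the ax, so no chaining
```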
Axes-level functions "can be composed into arbitrarily complex matplotlib figures". Practically:
```python
fig, axs = plt.subplots(2)
sns.heatmap(..., ax=axs[0])
sns.heatmap(..., ax=axs[1])
```
The documentation has an entire section on this; below I'm mostly rephrasing it and stealing its screenshots.
For axes-level functions, the size of the plot is determined by the size of the Figure it is part of and the axes layout in that figure. You basically do what you would do in matplotlib, most relevantly matplotlib.Figure.set_size_inches().

Figure-level functions' size works differently:
- TL;DR: they have FacetGrid's height= and aspect= (a ratio; aspect=0.75 means 5 cells high, 4 cells wide) params, which work per subplot
- height and aspect work like this: width = height * aspect
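A quick sketch of both, again with the tips dataset:

```python
import matplotlib.pyplot as plt

# Axes-level: the size comes from the matplotlib figure
fig, ax = plt.subplots(figsize=(8, 4))
sns.histplot(data=tips, x="total_bill", ax=ax)

# Figure-level: height/aspect are per subplot; width = height * aspect
sns.displot(data=tips, x="total_bill", col="time", height=4, aspect=0.75)
```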
Blocks doing similar kinds of plots, each with a figure-level function and multiple axes-level ones. Listed in the API reference.

For the relational block, for example:
- scatterplot() (kind="scatter"; the default)
- lineplot() (kind="line")

And again, the already mentioned special cases, now with pictures:
- sns.jointplot() has one plot with distributions around it and is a JointGrid:
- sns.pairplot() "visualizes every pairwise combination of variables simultaneously" and is a PairGrid:
The parameters for marks are described better in the tutorial than I ever could: Properties of Mark objects — seaborn 0.12.2 documentation:
TODO: my main remaining question is where/how do I set these? Can this be done outside the seaborn.objects interface I don't want to learn?
Marker size: pass s=30 to the plotting function (size= would be interpreted as a column name).
```python
from matplotlib.markers import MarkerStyle

sns.scatterplot(
    style="is_available",
    # marker=MarkerStyle("o", "left"),
    markers={True: MarkerStyle("o", "left"), False: MarkerStyle("o", "right")},
)
```
Controlling figure aesthetics — seaborn 0.12.2 documentation
There are five preset seaborn themes: dark, white, ticks, whitegrid, darkgrid. This picture contains the first four of the above in this order.
set_context() controls the scaling of plot elements for different presentation contexts.
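Both are set through the standard seaborn calls; a quick sketch:

```python
sns.set_style("whitegrid")  # one of: darkgrid, whitegrid, dark, white, ticks
sns.set_context("talk")     # one of: paper, notebook (the default), talk, poster
```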
The tutorial has this: Choosing color palettes — seaborn 0.12.2 documentation with both a theoretical basis about color and stuff, and the “how to set it in your plot”.
TL;DR sns.color_palette(PALETTE_NAME, NUM_COLORS, as_cmap=TRUE_IF_CONTINUOUS)
seaborn.color_palette() returns a list of colors or a continuous matplotlib ListedColormap colormap:
Accepts as palette, among other things: 'light:<color>', 'dark:<color>', 'blend:<color>,<color>'.

Parameters:
- n_colors: will truncate if it's less than the palette's colors, will extend/cycle the palette if it's more
- as_cmap: whether to return a continuous ListedColormap
- desat
You can do .as_hex() to get the list as hex colors.
You can use it as context manager: with sns.color_palette(...): to temporarily change the current defaults.
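Several of these in one sketch:

```python
pal = sns.color_palette("Paired", n_colors=3)  # truncated to 3 colors
pal.as_hex()  # the same palette as hex strings
cmap = sns.color_palette("light:#5A9", as_cmap=True)  # continuous colormap
with sns.color_palette("husl", 8):
    ...  # plots drawn here use the temporary palette
```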
Matplotlib colormap + _r (tab10_r).
I needed a colormap where male is blue and female is orange, tab10 has these colors but in reversed order. This is how I got a colormap with the first two colors but reversed:
```python
cm = sns.color_palette("tab10", 2)[::-1]
```
First I generated a color_palette of 2 colors, then reversed the list of tuples it returned.
histplot has different approaches for plotting multiple= distributions on the same plot:
multiple="fill"
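For example (multiple= can be "layer" (the default), "dodge", "stack", or "fill"):

```python
sns.histplot(data=tips, x="total_bill", hue="time", multiple="fill")
```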
- dodge=True
- errwidth=

"The error bars around an estimate of central tendency can show one of two general things: either the range of uncertainty about the estimate or the spread of the underlying data around it. These measures are related: given the same sample size, estimates will be more uncertain when data has a broader spread. But uncertainty will decrease as sample sizes grow, whereas spread will not."
- pd.sort_index()
- annot=True, fmt=".1f"
- vmin=/vmax=

Previously: small unsystematic posts about seaborn:
- Architecture-ish:
  - 230515-2257 seaborn setting titles etc. with matplotlib set
  - 230515-2016 seaborn things built on FacetGrid for easy multiple plots
- Small misc:
  - 230428-2042 Seaborn basics
  - 230524-2209 Seaborn visualizing distributions and KDE plots
Visualizing distributions of data — seaborn 0.12.2 documentation:
- common_norm=True (the default) applies the same normalization to the entire distribution; False scales each independently. This is critical in many cases, esp. with stat="probability".

Generally: I read the seaborn documentation, esp. the high-level architecture things, and a lot of things I've been asking myself since forever (e.g. 230515-2257 seaborn setting titles etc. with matplotlib set) have become much clearer - and will become its own post. I love seaborn and it's honestly worth learning to use well and systematically.
There’s cycler, a package:
It returns cycles of dicts, finite or infinite:
```python
import seaborn as sns
from cycler import cycler

# list of colors
pal = sns.color_palette("Paired")
# `cycler(color=pal)` iterates finitely; calling it, `cycler(color=pal)()`, cycles infinitely
cols = iter(cycler(color=pal)())
# every time you need a color (each item is a dict like {"color": ...})
my_color = next(cols)["color"]
```