Vaex as a faster pandas alternative
I have a larger-than-usual text-based dataset and need to analyze it, but pandas is slow (hell, even `wc -l` takes 50 seconds…)
Vaex: Pandas but 1000x faster - KDnuggets - that’s a way to catch one’s attention.
- `vx.from_csv()` reads a CSV into memory; kwargs get passed to pandas' `read_csv()`.
- `vx.open()` reads stuff lazily, but I can’t find a way to tell it that my `.txt` file is a CSV and, more critically, how to pass parsing params to it.
- `vx.from_ascii()` has a parameter called `seperator`?! (See the vaex 4.16.0 API documentation.)
- The first two support `convert=`, which converts stuff to things like HDF5, optionally with `chunk_size=`, the chunk size in lines. It’ll create $N/chunk_size$ chunks and concatenate them at the end.
- Ways to limit stuff:
  - `nrows=` is the number of rows to read; works with `convert` etc.
  - `usecols=` limits to columns by name, id, or callable; speeds stuff up too, and by a lot.
- I can do `df.export_hdf5()` in vaex, but pandas can’t read that. It may be related to the opposite problem: vaex can’t open pandas HDF5 files directly, because one saves them as rows, the other as columns. (See FAQ)
- When converting CSV to HDF5, it breaks if one of the columns was detected as an `object`; in my case it was a boolean. Objects are not supported [1], and booleans are objects. Not a trivial situation, because converting that to, say, int would have meant reading the entire file, which is just what I don’t want to do: I want to convert to HDF5 to make it manageable.
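A minimal sketch of the knobs from the list above, using plain pandas instead of vaex (the kwargs end up in pandas anyway). The toy CSV and its column names are made up, and pandas spells the chunking parameter `chunksize` rather than `chunk_size`. It reads a small sample to spot `object` columns before attempting a conversion, then shows that chunked reading yields the number of chunks rounded up. That a boolean column with missing values is what ends up as `object` is an assumption about my case, not something I verified on the real file.

```python
import io
import math

import pandas as pd

# Toy stand-in for the big text file; column names are hypothetical.
# Every 4th row has an empty "flag", the rest are True/False.
CSV_TEXT = "id,flag,value\n" + "".join(
    f"{i},{'' if i % 4 == 0 else i % 2 == 0},{i * 0.5}\n" for i in range(10)
)

# Read only the first 5 rows and 2 of the 3 columns.
sample = pd.read_csv(io.StringIO(CSV_TEXT), nrows=5, usecols=["id", "flag"])

# Columns that pandas could not give a plain numeric/bool dtype.
object_cols = [c for c in sample.columns if sample[c].dtype == object]

# Chunked reading: each chunk has at most chunk_size rows.
chunk_size = 4
chunks = list(pd.read_csv(io.StringIO(CSV_TEXT), chunksize=chunk_size))
full = pd.concat(chunks)

# The number of chunks is N / chunk_size, rounded up.
expected_chunks = math.ceil(len(full) / chunk_size)
```

Peeking at a few rows like this is cheap, so it is one way to find problematic columns without reading the entire file.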
The syntax is similar to pandas, but the documentation is somehow... I can’t put my finger on it, but I don’t enjoy it.
Stupid way to find columns that are all NA
l_desc = df.describe()
# Keep the column names whose NA count differs from the dataset length,
# i.e. the columns that are not entirely NA
not_empty_cols = list(l_desc.T[l_desc.T.NA != df.count()].T.columns)
# Filter the description by them
interesting_desc = l_desc[not_empty_cols]
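To see what the filter above does, here is a toy run on a hand-built pandas frame standing in for the output of `df.describe()` (one column per dataset column, one row per statistic). Only the `NA` row is taken from the snippet above; the `count` row, the column names, and `n_rows` are made up for illustration.

```python
import pandas as pd

n_rows = 4  # stands in for df.count(), the length of the dataset

# Fake describe() output: column "b" is entirely NA, "c" has one NA.
l_desc = pd.DataFrame(
    {"a": [4, 0], "b": [0, 4], "c": [3, 1]},
    index=["count", "NA"],
)

# Same filter as above: drop columns whose NA count equals the dataset length.
not_empty_cols = list(l_desc.T[l_desc.T.NA != n_rows].T.columns)
interesting_desc = l_desc[not_empty_cols]

# Equivalent selection without the double transpose.
not_empty_cols_alt = list(l_desc.columns[l_desc.loc["NA"] != n_rows])
```

The second variant picks the same columns by indexing the `NA` row directly, which may be slightly easier to read.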