In the middle of the desert you can say anything you want
Inspecting the importance of features when running Random Forest:
feature_importances = pd.DataFrame(
    rf.feature_importances_,
    index=X_train.columns,
    columns=['importance'],
).sort_values('importance', ascending=False)
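As a self-contained sketch of the whole thing, with toy data (the column names and labels here are made up; the real `X_train`/`y_train` come from the tweet dataset):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training data standing in for the real features.
X_train = pd.DataFrame({
    "char_count": [120, 80, 240, 150],
    "token_count": [20, 12, 40, 25],
    "hashtag_count": [1, 0, 3, 2],
})
y_train = [0, 1, 0, 1]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# One row per feature, most important first.
feature_importances = pd.DataFrame(
    rf.feature_importances_,
    index=X_train.columns,
    columns=["importance"],
).sort_values("importance", ascending=False)
print(feature_importances)
```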
df.sample(frac=1)
shuffles the rows (pandas DataFrames have no shuffle() method; sample with frac=1 is the usual idiom for this). It’s kinda logical that I need it, because if I group stuff, it gets saved in that same order.
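A minimal sketch with made-up data; resetting the index afterwards is optional but usual:

```python
import pandas as pd

df = pd.DataFrame({"co": ["uk", "in", "jp", "it"], "n": [1, 2, 3, 4]})

# sample(frac=1) returns all rows in random order;
# reset_index(drop=True) discards the old, now-scrambled index.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```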
What would happen if I actually used them as one of my features, leaving the non-stopwords text alone? Here’s a long list
sklearn.preprocessing.LabelEncoder
for converting categorical data to a numerical format.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])
Can I use some of the insights/methods/ideas from stylometry for this? (After reading this article about Beowulf.)
Quotes will become a problem. I can just remove all tweets containing any quote symbols (' or ") after checking how many there are.
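A minimal pandas sketch of that check-then-drop, with made-up tweets and column name:

```python
import pandas as pd

tweets = pd.DataFrame({"text": [
    "plain tweet",
    "tweet with a 'single quote'",
    'tweet with a "double quote"',
]})

# Boolean mask: does the tweet contain either quote character?
has_quotes = tweets["text"].str.contains("['\"]", regex=True)
print(has_quotes.sum())   # how many tweets would be dropped

# Keep only the tweets without quotes.
clean = tweets[~has_quotes]
```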
Get things out of your head and into a system that you fully trust. Everything you do should have positive value – it’s either improving you (I put self care and genuine leisure time in here, but not time wasting), improving a relationship, making money, or making one of those other things more efficient. Do high energy and high focus things when you actually have energy and focus; do mindless things when you feel mindless. Do not skimp on self-care, which includes genuine leisure time, good healthy food, exercise, good personal relationships, and adequate sleep. Aim for the “flow state” in everything you do, because you’ll never be better than when you’re so engaged that you lose track of time and place and just get lost in the moment. (How I get things done)
I find that forcing myself to think about those things at the pace of my handwriting brings a ton of clarity to the ideas I’m struggling with or the life issues I’m trying to figure out. (same source)
it’s easy to sleep well when you get up early and work hard. (same source)
“No more yes. It’s either HELL YEAH! or no.” — Derek Sivers
I need a system to consistently track things I’m trying to optimize in my life. Today I already read N articles about excellent things I can do with my life, and usually it would end there. Probably the first in line would be reinforcement and mental contrasting.
On a certain level we actually bump against the infinitely familiar thing about not knowing what I want.
I’m familiar with most of this, but since I find myself googling it every time, I’ll just write it here, so I’ll know where to look.
Scipy Lecture Notes seems like a very interesting place.
pd.concat([d, dd])
concatenates them vertically, keeping the same columns (and the original indices).
pd.concat([d, dd], ignore_index=True)
does the same but resets the index, so the result has one continuous index.
pd.concat([d, dd], axis=1)
merges them horizontally, that is, the result has all the columns from both input dataframes.
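A toy demonstration of the three variants (d and dd made up):

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
dd = pd.DataFrame({"a": [5], "b": [6]})

stacked = pd.concat([d, dd])                        # index: 0, 1, 0
renumbered = pd.concat([d, dd], ignore_index=True)  # index: 0, 1, 2
side_by_side = pd.concat([d, dd], axis=1)           # columns: a, b, a, b
print(side_by_side)
```

Note that with `axis=1` the frames are aligned on the index, so `dd`'s missing row shows up as NaN.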
Apparently sns.plt
working at all was a bug, which has since been fixed. Nice. Regardless, the correct way is import matplotlib.pyplot as plt; plt....
dsa[(dsa.char_count > 190) & (dsa.char_count < 220)]
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
inside a cell (SO)
I have my semi-final dataset; today I’ll clean it, analyze it, and output it to some clean.csv
file, along with creating a script that cleans the data, for all the repetitive things I’ll have to do.
0418-analysis-of-final-dataset.
token_count != pos_count. An example:
{%raw%}’@FragrantFrog @BourgeoisViews @SimonHowell7 @Mr_Bo_Jangles_1 @Joysetruth @Caesar2207 @NancyParks8 @thetruthnessie @carmarsutra @Esjabe1 @DavidHuddo @rob22_re @lindale70139487 @anotherviv @AndyFish19 @Jules1602xx @EricaCantona7 @grand___wazoo @PollyGraph69 @CruftMs @ZaneZeleti @McCannFacts @ditsy_chick @Andreamariapre2 @barragirl49 @MancunianMEDlC @rambojambo9 @MrDelorean2 @Nadalena @LoverandomIeigh @cattywhites2 @Millsyj73 @strackers74 @may_shazzy @JBLittlemore @Tassie666 @justjulescolson @regretkay @Chinado59513358 @Louise42368296 @TypRussell @Anvil161Anvil16 @DuskatChristie @McCannCaseTweet @noseybugger1 @HilaryDean15 @DesireeLWiggin1 @M47Jakeman @crocodi11276514 @jonj85014 If it was in the Scenic several weeks after she was reported missing.Her body must have been put there.!\nWho by ?The people who hired the Scenic ! How hard is that to understand ?\nThis algorithmic software gives a probability of the identity of each contributer to the sample !\n😏’{%endraw%}
Now playing: The Godfather II Soundtrack
Add search to this blog via this simple js
To watch: Hacking democracy with theater
It was a small Army Security Agency Station in Southeast Asia that I was doing some work for. They had a shrink and he pulled me aside. In just 10 minutes or so he taught me “breathing”. It wasn’t until the internet that I learned the term mindful breathing. Subsequently I figured out it was some sort of meditation. [..]
He said I was ‘wrapped to tight’. What ever that means. Those guys were all spooks, but I did not have the same clearances. I was an outsider in that regard, but I did eat with them when at their place. I guess he was bored.
He took my blood pressure and then taught me to breathe. Then he took it again. I was surprised at the drop. It hooked me on mindful breathing. It was probably a parlor trick, but it worked. He improved my lifetime health. For that I thank him.
(from reddit)
Okular can fill and save PDF forms. Zathura can open already filled forms.
Converting a PDF to images:
pdftoppm input.pdf outputname -png
and for a single page:
pdftoppm input.pdf outputname -png -f {page} -singlefile
It works much better than convert.
timew continue
continues the last tracked thing.
Even though stylistically questionable (PEP8 favours regular # comments), one possibility is to use """ mycomment """
; when they are not a docstring they are ignored (source). They have to be indented correctly though, and feel kinda wrong.
Additionally:
triple-quotes are a way to insert text that doesn’t do anything (I believe you could do this with regular single-quoted strings too), but they aren’t comments - the interpreter does actually execute the line (but the line doesn’t do anything). That’s why the indentation of a triple-quoted ‘comment’ is important. – Demis Jun 9 ‘15 at 18:35
This is an excellent paper about Reddit, more focused on orthographic errors. Will read next! And this is an awesome annotated dataset, exactly the kind I need.
SSH has escape sequences for controlling a running session.
From the blog post above: <Enter>~.
SSH parses escape sequences sent after a newline and ~; ~. is the one that terminates the session.
In ~/.ssh/config:

Host host1
    HostName ssh.example.com
    User myuser
    IdentityFile ~/.ssh/id_rsa

allows one to just do ssh host1.
… Still amazed by Linux and the number of such things. If I ever planned to do Linux much more professionally, I would just sit and read through all the man pages of the typical tools, systematically.
I need to make this Diensttagebuch searchable from the website, not just locally with :Ag
.
t id!=123
works with everything.
For unicode strings, do "unicode string".encode('utf-8')
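A tiny Python 3 sketch of the round trip (the string is made up):

```python
s = "naïve café"              # a str is already unicode in Python 3
b = s.encode("utf-8")         # str -> bytes
back = b.decode("utf-8")     # bytes -> str
print(type(b), type(back))
```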
I looked again at the confusion matrix, after having made a copy. It’s quite interesting:
array([[29, 14, 28, 26],
       [38, 57, 36, 27],
       [52, 18, 58, 28],
       [18, 14, 18, 39]])
This is a simple SVM, using extremely simple features, and 2000 examples per class. The columns/rows are: ar, jp, lib, it, in that order. The first source of error: Arabic and the countries around Libya are, in my world, linguistically quite similar, and we can see that they get confused quite often, in both directions. Italy and Japan do much better.
Still, I find this very promising, and definitely better than chance. And logically it makes sense. I’ll continue.
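To make "better than chance" concrete: assuming rows are true labels and columns predictions (sklearn's convention), per-class recall is the diagonal over the row sums, and chance here would be 0.25. A small numpy sketch:

```python
import numpy as np

cm = np.array([[29, 14, 28, 26],
               [38, 57, 36, 27],
               [52, 18, 58, 28],
               [18, 14, 18, 39]])
classes = ["ar", "jp", "lib", "it"]

# Fraction of each true class that was predicted correctly.
recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(classes, recall):
    print(f"{name}: {r:.2f}")
```

All four values come out above 0.25, i.e. above chance for four balanced classes.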
The list. I’ll stick to Japan, UK, SA, Brazil, India, quite different from each other, geographically and linguistically. I leave the US alone, too mixed.
This is the picker. The DublinCore format is in the same order as Twitter wants!
d[d.co.isin(['uk','in'])]
leaves the rows where co=='uk' or co=='in'.
For multiple conditions: df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
TODO: Why is .loc used here?
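Both styles in one runnable sketch (the frame and column names are made up). For pure row selection a plain boolean mask and .loc behave the same, as far as I can tell; .loc becomes necessary when assigning back into the frame:

```python
import pandas as pd

d = pd.DataFrame({"co": ["uk", "in", "us", "jp"],
                  "char_count": [100, 200, 150, 300]})

# isin: membership test over a column.
by_country = d[d.co.isin(["uk", "in"])]

# Multiple conditions, combined with & and wrapped in parentheses.
by_range = d.loc[(d["char_count"] >= 100) & (d["char_count"] <= 200)]

# .loc is required when writing back, e.g. zeroing one country's counts:
d.loc[d.co == "us", "char_count"] = 0
```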
Has a config file! This opened a new universe for me too.
The key needs to be added from the panel, adding it to the user folder as usual does not work.