July 2015 – Alfredo Motta

Cross Validation done wrong

Cross validation is an essential tool in statistical learning for estimating the accuracy of your algorithm. Despite its power, it carries some fundamental risks: done wrong, it can badly bias your accuracy estimate. In this blog post I’ll demonstrate – using the Python scikit-learn framework – how to avoid the biggest and…

Data manipulation primitives in R and Python

Both R and Python are excellent tools for manipulating data, and their integration is becoming increasingly important. The latest tool for data manipulation in R is dplyr, whilst Python relies on pandas. In this blog post I’ll show you the fundamental primitives for manipulating dataframes in both libraries, highlighting their major advantages…