Tools for High Performance Python
24 Jan 2021 by dzlab
In this article we will see how to profile a Python program and some of the tools at hand to improve its performance. As a toy example we will consider the case where we have a Pandas DataFrame with many columns and we want to apply a function to each row to do some heavy calculations.
pip install line-profiler
pip install bulwark
pip install swifter
pip install numba
pip install dask
Let’s create a toy DataFrame of 100k rows and 14 columns:
import numpy as np
import pandas as pd

data = np.random.randint(0, 100, (100000, 14))
df = pd.DataFrame(data).astype("float64")
As a toy function, we fit a linear regression to each row of the DataFrame and calculate the slope of the resulting line.
A first solution is to train sklearn’s LinearRegression
on every row, defining X as the positions within the row and y as the actual values. After training we take the first coefficient of the linear regression as the output.
from sklearn.linear_model import LinearRegression

def ols_sklearn(row):
    """Solve OLS using scikit-learn's LinearRegression"""
    est = LinearRegression()
    X = np.arange(row.shape[0]).reshape(-1, 1)  # shape (14, 1)
    y = row.values                              # shape (14,)
    # note that the intercept is built inside LinearRegression
    est.fit(X, y)
    m = est.coef_[0]  # note c is est.intercept_
    return m
We could also try another implementation that we expect to outperform the first one, in this case using numpy’s least-squares solver for linear matrix equations, lstsq.
import numpy as np

def ols_lstsq(row):
    """Solve OLS using numpy.linalg.lstsq"""
    # build X values for [0, 13]
    X = np.arange(row.shape[0])   # shape (14,)
    ones = np.ones(row.shape[0])  # constant column used to fit the intercept
    A = np.vstack((X, ones)).T    # shape (14, 2)
    # lstsq returns the slope and intercept as its first result,
    # followed by the residuals and other items
    m, c = np.linalg.lstsq(A, row.values, rcond=-1)[0]
    return m
We first make sure both implementations output similar results using numpy’s assert_array_almost_equal:
from numpy.testing import assert_array_almost_equal
results_sklearn = df.apply(ols_sklearn, axis=1)
results_lstsq = df.apply(ols_lstsq, axis=1)
assert_array_almost_equal(results_sklearn, results_lstsq)
After making sure that our two solutions to the problem behave the same, we can compare their performance with timeit:
>>> %timeit ols_sklearn(df.iloc[0])
The slowest run took 56.09 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 452 µs per loop
>>> %timeit ols_lstsq(df.iloc[0])
The slowest run took 30.11 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 175 µs per loop
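The per-call gap compounds once the function is applied to every row. To get a feel for the end-to-end cost, we could also time the full apply over all 100k rows for both implementations (the exact numbers will depend on the machine):
>>> %timeit df.apply(ols_sklearn, axis=1)
>>> %timeit df.apply(ols_lstsq, axis=1)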
At first glance, and unsurprisingly, the numpy solution outperformed the sklearn version. This is because even though sklearn uses numpy’s lstsq
under the hood, it also performs a lot of additional safety checks (e.g. against division by zero) that add overhead.
To identify where exactly sklearn’s overhead is introduced, we use a Python profiling tool called line_profiler:
from line_profiler import LineProfiler
row = df.iloc[0]
est = LinearRegression()
X = np.arange(row.shape[0]).reshape(-1, 1)
lp = LineProfiler(est.fit)
print("Run on a single row")
lp.run("est.fit(X, row.values)")
lp.print_stats()
The output looks as follows: for each line in the fit
method we get the time it took, the percentage of the total time, as well as the number of hits.
Run on a single row
Timer unit: 1e-06 s
Total time: 0.001564 s
File: /usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_base.py
Function: fit at line 467
Line # Hits Time Per Hit % Time Line Contents
==============================================================
467 def fit(self, X, y, sample_weight=None):
468 """
469 Fit linear model.
470
471 Parameters
472 ----------
473 X : {array-like, sparse matrix} of shape (n_samples, n_features)
474 Training data
475
476 y : array-like of shape (n_samples,) or (n_samples, n_targets)
477 Target values. Will be cast to X's dtype if necessary
478
479 sample_weight : array-like of shape (n_samples,), default=None
480 Individual weights for each sample
481
482 .. versionadded:: 0.17
483 parameter *sample_weight* support to LinearRegression.
484
485 Returns
486 -------
487 self : returns an instance of self.
488 """
489
490 1 5.0 5.0 0.3 n_jobs_ = self.n_jobs
491 1 3.0 3.0 0.2 X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
492 1 736.0 736.0 47.1 y_numeric=True, multi_output=True)
493
494 1 3.0 3.0 0.2 if sample_weight is not None:
495 sample_weight = _check_sample_weight(sample_weight, X,
496 dtype=X.dtype)
497
498 1 5.0 5.0 0.3 X, y, X_offset, y_offset, X_scale = self._preprocess_data(
499 1 3.0 3.0 0.2 X, y, fit_intercept=self.fit_intercept, normalize=self.normalize,
500 1 2.0 2.0 0.1 copy=self.copy_X, sample_weight=sample_weight,
501 1 517.0 517.0 33.1 return_mean=True)
502
503 1 4.0 4.0 0.3 if sample_weight is not None:
504 # Sample weight can be implemented via a simple rescaling.
505 X, y = _rescale_data(X, y, sample_weight)
506
507 1 4.0 4.0 0.3 if sp.issparse(X):
508 X_offset_scale = X_offset / X_scale
509
510 def matvec(b):
511 return X.dot(b) - b.dot(X_offset_scale)
512
513 def rmatvec(b):
514 return X.T.dot(b) - X_offset_scale * np.sum(b)
515
516 X_centered = sparse.linalg.LinearOperator(shape=X.shape,
517 matvec=matvec,
518 rmatvec=rmatvec)
519
520 if y.ndim < 2:
521 out = sparse_lsqr(X_centered, y)
522 self.coef_ = out[0]
523 self._residues = out[3]
524 else:
525 # sparse_lstsq cannot handle y with shape (M, K)
526 outs = Parallel(n_jobs=n_jobs_)(
527 delayed(sparse_lsqr)(X_centered, y[:, j].ravel())
528 for j in range(y.shape[1]))
529 self.coef_ = np.vstack([out[0] for out in outs])
530 self._residues = np.vstack([out[3] for out in outs])
531 else:
532 self.coef_, self._residues, self.rank_, self.singular_ = \
533 1 240.0 240.0 15.3 linalg.lstsq(X, y)
534 1 3.0 3.0 0.2 self.coef_ = self.coef_.T
535
536 1 1.0 1.0 0.1 if y.ndim == 1:
537 1 15.0 15.0 1.0 self.coef_ = np.ravel(self.coef_)
538 1 22.0 22.0 1.4 self._set_intercept(X_offset, y_offset, X_scale)
539 1 1.0 1.0 0.1 return self
From the output we can clearly see that most of the time in fit
was spent in either safety checks or data preprocessing, all before calling numpy’s lstsq.
Here is the profiling output for safety checks
491 1 3.0 3.0 0.2 X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
492 1 736.0 736.0 47.1 y_numeric=True, multi_output=True)
Here is the profiling output of the data preprocessing
498 1 5.0 5.0 0.3 X, y, X_offset, y_offset, X_scale = self._preprocess_data(
499 1 3.0 3.0 0.2 X, y, fit_intercept=self.fit_intercept, normalize=self.normalize,
500 1 2.0 2.0 0.1 copy=self.copy_X, sample_weight=sample_weight,
501 1 517.0 517.0 33.1 return_mean=True)
Here is the profiling output for the actual training with lstsq
532 self.coef_, self._residues, self.rank_, self.singular_ = \
533 1 240.0 240.0 15.3 linalg.lstsq(X, y)
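To put these numbers in perspective (they include some profiler overhead, so treat them as indicative): check_X_y costs roughly 736 µs and _preprocess_data roughly 517 µs, versus about 240 µs for the lstsq call itself. That is more than 1.2 ms of overhead per fit, so naively applying ols_sklearn to all 100,000 rows would spend on the order of two minutes in checks and preprocessing alone.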