Data Science Simplified: Logistic Regression: A Beginner's Visual Guide

Logistic regression is a supervised learning technique applied to classification problems.

In this post, let us explore:

Logistic Regression model
Advantages
Disadvantages
Example
Hyperparemeters and Tuning

Logistic Regression model

Logistic functions capture the exponential growth when resources are limited (read more here and here).

Sigmoid function is a special case of Logistic function as shown in the picture below (link). In the following pictures, I have shown how to derive the log of odds ratio from the sigmoid function.

Use sigmoid function to model the probability of dependent variable being 1 or 0 (binary classification).

Now we will see how to derive the log of odds ratio [p/(1-p)].

As shown in the pictures above, log of odds ratio (logit) is linear function of independent variables.

Advantage

We can interpret the coefficients (white box model)

We can interpret the results based on the size and sign of the regression coefficients. This is useful to identify the relationship between independent and dependent variables.

Disadvantage

Requires that each data point be independent of other data points (source)

Example

The data used for demonstrating the logistic regression is from the Titanic dataset. For simplicity I have used only three features (Age, fare and pclass).

And I have performed 5-fold cross-validation (cv=5) after dividing the data into training (80%) and testing (20%) datasets. I have calculated accuracy using both cv and also on test dataset.

#Ravel
y=y.values.ravel()

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#Splitting
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#LR model
logisticRegr = LogisticRegression()

#Fit
logisticRegr.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

The above output shows the default hyperparemeters used in sklearn.

Let us now perform the 5-fold cross-validation and find out the accuracy.

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

#CV scores, 5 fold CV
scores = cross_val_score(logisticRegr, x_train, y_train, cv=5)

#Prediction and accuracy
y_pred = logisticRegr.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)

#Print the summary
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print ("Test Accuracy: %0.2f" % (accuracy_test))

Accuracy: 0.68 (+/- 0.04)
Test Accuracy: 0.70

# Score
score = logisticRegr.score(x_test, y_test)
print ("Accuracy: %0.2f" % (score))

#Prediction
predictions = logisticRegr.predict(x_test)
predictions

To get the confusion matrix, we can use the following code. We can also visualize the confusion matrix for easier understanding.

from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

#confusion matrix
cm = metrics.confusion_matrix(y_test, predictions)
print("Confusion Matrix: \n", cm)

#Visualize the confusion matrix
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual Class');
plt.xlabel('Predicted Class');
all_sample_title = "Test Accuracy: %0.2f" % (score)
plt.title(all_sample_title, size = 15);

Hyperparameters

Let us look at the important hyperparameters of Logistic Regression one by one in the order of sklearn's fit output. The following output shows the default hyperparemeters used in sklearn.

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

It is the inverse of regularization strength. Lower the value of C, higher the regularization and hence lower is the chance of overfitting. Default is C=1.

Class weight

By default, sklearn uses 'balanced' mode by automatically adjusting the class weights to inverse of frequencies in each class. This is useful while handling class imbalance in dataset.

Fit Intercept

By default, intercept is added to the logistic regression model.

Multiclass

In this we have three options: ovr’, ‘multinomial’, ‘auto'.

'ovr' corresponds to One-vs-Rest. Auto selects 'ovr' when problem is binary classification, otherwise 'multinomial'.

These are commonly tuned hyperparameters. For more about these read sklearn's manual.

Tuning the Hyperparameters

Now let us tune the hyperparameters using GridSearch.

#import GridseachCV
from sklearn.model_selection import GridSearchCV

#Instantiate
clf = LogisticRegression()

#Grid
parameter_grid = {'C': [0.01, 0.1, 1, 2, 10, 100], 'penalty': ['l1', 'l2']}

#Gridsearch
gridsearch = GridSearchCV(clf, parameter_grid)
gridsearch.fit(x_train, y_train);

#Get best hyperparameters
gridsearch.best_params_

{'C': 2, 'penalty': 'l2'}

Then you can use these values in the pipeline for fitting the model (for more).

We can also use randomized search for finding the best parameters. Advantages of randomized search is that it is faster and also we can search the hyperparameters such as C over distribution of values.

References

https://machinelearning-blog.com/2018/04/23/logistic-regression-101/

Data Science Simplified

Logistic Regression: A Beginner's Visual Guide