In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Fix the seed so the shuffle below is reproducible (the value itself is arbitrary)
np.random.seed(42)

# Scikit-Learn is a fast-changing library and generates API warnings all over
# the place - disable for now so it doesn't clutter our workspace
import warnings
warnings.filterwarnings('ignore')

## The MNIST Dataset

The MNIST dataset is a set of 70,000 28x28 images of handwritten digits (0-9), with the first 60,000 images as the training set and the remaining 10,000 as the test set (i.e. the dataset has already been split for us).

Let's download the dataset, pick a random sample and take a look at it.

In [None]:
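from sklearn.datasets import fetch_openml

# fetch_mldata was removed from Scikit-Learn (0.22+); fetch_openml serves the same data.
# as_frame=False returns NumPy arrays; OpenML stores the labels as strings, so cast them.
mnist = fetch_openml('mnist_784', version=1, as_frame=False, data_home='/scratch/scikit_learn_data')
X, y = mnist["data"], mnist["target"].astype(np.uint8)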
print(X.shape)
print(y.shape)

some_digit = X[12345]

plt.imshow(some_digit.reshape(28,28), cmap=matplotlib.cm.binary)

## Split into Training/Testing Sets

For this classification task, we'll try to classify each digit as a 5/not-5. Before doing anything else, set some test data aside.

In [None]:
# No. of training samples
m = 60000

X_train, X_test, y_train, y_test = X[:m], X[m:], y[:m], y[m:]
shuffle_index = np.random.permutation(m)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

## Stochastic Gradient Descent Classifier

Train an SGDClassifier on the training data.

In [None]:
from sklearn.linear_model import SGDClassifier

classifier = SGDClassifier(random_state=33)
classifier.fit(X_train, y_train_5)

In [None]:
# Let's see how it did on our randomly chosen digit
classifier.predict([some_digit])

## Evaluate the Model by doing a K-fold cross-validation

Let's see how the model does in terms of accuracy.

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, X_train, y_train_5, cv=3, scoring="accuracy")
print('Accuracy score = ', np.mean(scores))
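
To make concrete what `cv=3` does, here is a minimal sketch of 3-fold splitting on a toy index range, using only NumPy. The helper name `k_fold_indices` is hypothetical, and unlike Scikit-Learn (which uses stratified folds for classifiers) it simply cuts the index range into contiguous chunks.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (hypothetical helper)."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # Every fold except the i-th one is used for training
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(9, 3))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in the validation fold exactly once, so the model is always scored on data it was not trained on.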

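Accuracy alone can be misleading here. Only about 10% of the images are 5s, so a classifier that always predicts not-5 would already reach roughly 90% accuracy. Let's look at the Confusion Matrix to get a little more insight.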

![Confusion Matrix](https://raw.githubusercontent.com/vineetbansal/MLLandscape/master/confusion_matrix.png "Confusion Matrix")

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(classifier, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)

In [None]:
from sklearn.metrics import precision_score, recall_score
print(precision_score(y_train_5, y_train_pred))
print(recall_score(y_train_5, y_train_pred))
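
As a worked illustration of how these two metrics relate to the confusion matrix, the sketch below computes them by hand from the four cell counts. The counts are made-up toy values, not results from the classifier above, and the layout assumed is Scikit-Learn's (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Toy 2x2 confusion matrix (made-up counts): rows = actual, columns = predicted,
# ordered [negative, positive] as confusion_matrix does for boolean labels.
cm = np.array([[900, 50],
               [ 20, 30]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of everything actually positive, how much was found

print(accuracy, precision, recall)
```

In this toy case accuracy is 93%, yet only 60% of the positives are found and fewer than half of the positive predictions are correct.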

## Decision Thresholds

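Internally, the SGDClassifier computes a decision score for each instance and predicts the positive class when that score exceeds a threshold (zero by default). We can ask for those scores instead of the final predictions.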

In [None]:
y_scores = cross_val_predict(classifier, X_train, y_train_5, cv=3, method="decision_function")
y_scores
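
A minimal sketch of how the threshold shapes the outcome, using made-up scores and labels (the arrays below are illustrative, not output from the model, and `precision_recall_at` is a hypothetical helper). Raising the threshold here trades recall for precision.

```python
import numpy as np

toy_scores = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0])
toy_labels = np.array([False, False, True, True, False, True, True])

def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for scores above the threshold."""
    predicted = scores > threshold
    tp = np.sum(predicted & labels)
    precision = tp / predicted.sum()
    recall = tp / labels.sum()
    return precision, recall

p0, r0 = precision_recall_at(0.0, toy_scores, toy_labels)   # default-style threshold
p1, r1 = precision_recall_at(1.5, toy_scores, toy_labels)   # stricter threshold
print(p0, r0)
print(p1, r1)
```

With the stricter threshold every positive prediction is correct, but half of the true positives are now missed.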

Given the true labels and the decision scores, we can ask Scikit-Learn to compute precision/recall values for **all** possible thresholds on the scores.

### Precision-Recall curve

In [None]:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

### ROC Curve

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

print(roc_auc_score(y_train_5, y_scores))
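
One way to read the AUC value: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counting one half). The sketch below checks that on made-up toy data with a brute-force pairwise comparison; `pairwise_auc` is a hypothetical helper, not a Scikit-Learn function.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[labels]
    neg = scores[~labels]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

toy_labels = np.array([False, False, True, True, False, True])
toy_scores = np.array([-2.0, -0.5, -1.0, 0.5, 1.0, 3.0])
auc = pairwise_auc(toy_labels, toy_scores)
print(auc)
```

A perfect ranking gives 1.0, while scores unrelated to the labels hover around 0.5.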

## Multiclass Classification

Scikit-Learn detects when you try to use a binary classification algorithm for a multi-class classification task, and it automatically runs a *one-versus-all* classifier: in this case, it creates 10 binary classifiers (one per digit), gets their decision scores, and selects the class with the highest score.

In [None]:
classifier.fit(X_train, y_train)
classifier.predict([some_digit])
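
The one-versus-all decision rule amounts to an argmax over per-class scores. Here is a minimal sketch with made-up scores for a single image; the values are illustrative assumptions, not output from the model.

```python
import numpy as np

# Hypothetical decision scores from 10 binary "digit k vs. rest" classifiers
# for a single image (made-up values for illustration).
class_labels = np.arange(10)
scores_per_class = np.array([-8.1, -12.4, -3.0, -1.2, -9.7, 2.3, -6.5, -11.0, -0.4, -5.8])

# The predicted class is the one whose binary classifier is most confident
predicted_digit = class_labels[np.argmax(scores_per_class)]
print(predicted_digit)
```

With the fitted model, `classifier.decision_function([some_digit])` should return one such score per class, and `classifier.classes_` holds the matching labels.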

In [None]:
cross_val_score(classifier, X_train, y_train, cv=3, scoring="accuracy")

### Feature Scaling
Let's apply feature scaling (subtract the mean and divide by the standard deviation) to each feature, i.e. each pixel column, across the training samples. *Why would it make a difference in the MNIST case?*

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype('float'))
cross_val_score(classifier, X_train_scaled, y_train, cv=3, scoring="accuracy")
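
As a rough sketch of what standardization does, the NumPy snippet below scales a made-up 3x3 matrix column by column. Leaving constant columns unscaled (dividing by 1 instead of 0) mirrors how `StandardScaler` treats zero-variance features, which matters for MNIST because many border pixels are zero in every image.

```python
import numpy as np

# Made-up data: column 0 is constant, like an always-blank border pixel
toy_X = np.array([[0.0, 10.0,   0.0],
                  [0.0, 20.0, 128.0],
                  [0.0, 30.0, 255.0]])

mean = toy_X.mean(axis=0)
std = toy_X.std(axis=0)
# Avoid dividing by zero for constant columns
std_safe = np.where(std == 0, 1.0, std)
toy_X_scaled = (toy_X - mean) / std_safe

print(toy_X_scaled.mean(axis=0))
print(toy_X_scaled.std(axis=0))
```

After scaling, every non-constant column has zero mean and unit standard deviation, so no single pixel dominates the gradient updates just because of its raw range.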

### Confusion Matrix

In [None]:
y_train_pred = cross_val_predict(classifier, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

Let's visualize the confusion matrix to see where most of the misclassifications are happening.

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
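
To see what this normalization does, here is the same computation on a made-up 3-class confusion matrix (toy counts, not results from the model).

```python
import numpy as np

# Rows = actual class, columns = predicted class (made-up counts)
toy_conf = np.array([[90,  5,  5],
                     [10, 30, 10],
                     [ 0,  2, 18]])

# Divide each row by the number of samples in that class to get rates
row_sums = toy_conf.sum(axis=1, keepdims=True)
error_rates = toy_conf / row_sums
# Zero the diagonal so only the misclassifications remain visible
np.fill_diagonal(error_rates, 0)
print(error_rates)
```

Dividing by the row sums makes classes of different sizes comparable: the 10 errors in a cell of the second row correspond to a 20% error rate, while the 5 errors in a cell of the first row are only 5%.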

## Accuracy on Test Set

In [None]:
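from sklearn.metrics import accuracy_score
# cross_val_score/cross_val_predict only fit clones, so refit on the scaled training data
classifier.fit(X_train_scaled, y_train)
# Reuse the training-set statistics: transform (not fit_transform) the test set
X_test_scaled = scaler.transform(X_test.astype('float'))
y_pred_test = classifier.predict(X_test_scaled)
accuracy_score(y_test, y_pred_test)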
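
A point worth illustrating when scaling held-out data: the test set should be transformed with the mean and standard deviation learned from the training set, rather than with statistics recomputed on the test set. The sketch below uses made-up one-feature arrays to show how the two approaches differ.

```python
import numpy as np

toy_train = np.array([[0.0], [10.0], [20.0]])
toy_test = np.array([[10.0], [30.0]])

# Statistics learned from the training data only
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)

# Consistent: the test data is placed on the scale the model was trained on
test_scaled_ok = (toy_test - train_mean) / train_std
# Inconsistent: re-fitting on the test data shifts and stretches it differently
test_scaled_refit = (toy_test - toy_test.mean(axis=0)) / toy_test.std(axis=0)

print(test_scaled_ok.ravel())
print(test_scaled_refit.ravel())
```

The raw value 10.0 maps to 0 under the training statistics but to -1 when the scaler is re-fitted on the test set, so the model would see the same pixel intensity as two different inputs.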