Minimum Viable Data Product Development

2018-09-03 13:50:05.468625

Minimum Viable (Data) Product Development

1) Introduction

The amount of data we produce daily is truly mind-boggling. At our current pace, 2.5 quintillion bytes of data are created each day, and that pace is only accelerating. With such growth, more and more companies recognize the importance of incorporating artificial intelligence into their business models. In order to succeed, many companies, especially small ones (start-ups), are adopting the minimum viable (data) product development approach (abbreviated: MVP).

The MVP is arguably the standard methodology for product development nowadays. Its central idea is that by building products or services iteratively and constantly integrating customer feedback, we can significantly reduce the risk that a product or service will fail (build-measure-learn).

In [3]:

from IPython.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/picture_1.png", width=1000, height=600)

Out[3]:

For example, if your final product idea is an electric car, then according to the MVP approach, you start with the smallest effort needed to deliver the car. In this case, you would perhaps start with just two wheels, a board and a simple electric engine. Then you ship this to the market and collect feedback to continuously improve your product by adding more complexity to it. By starting with a small subset of the features of your final product, you can get to a solution sooner, even if it's not perfect.

In [4]:

from IPython.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/picture_2.png", width=1000, height=600)

Out[4]:

The minimum viable data product development is based on this concept. First, we define the key business question; next, we acquire the data needed to answer it; finally, we implement a test, using the simplest appropriate statistical model, to answer that question. We then deliver the product and build on the feedback that we receive from the market.

The key is to find the minimum viable data. The data doesn't need to be perfect; it just has to inform a decision on whether to stick or twist, and enable you to measure the impact of that decision. The machine learning model doesn't have to be perfect either: a baseline model is enough, as long as it provides a meaningful reference point for comparison.

In this blog post, we will try to solve a text classification problem following the minimum viable product development approach. Suppose that our goal is to develop a machine learning system for a start-up that can classify an email as spam or ham. We will start with a simple bag-of-words model as a useful basis for modeling our prediction problem, and will end up with a sophisticated convolutional neural network built in TensorFlow.

In terms of the electric car analogy, the bag-of-words model is the prototype with two wheels and a simple electric engine, while the convolutional neural network is the final electric car, ready for delivery.

We will use a minimal annotated dataset and analyse how learning curves help us answer the following question:

'Is our data enough, or should we get more?'

In other words: 'How much data is enough data?'

2) Data collection

The first step in the process of developing our machine learning model is to collect data relevant to our task. As previously mentioned, we will use the minimum viable data approach: the least amount of data needed to make an effective decision.

To set the wheels in motion, we first have to find a minimal dataset to start testing our assumptions. Such a dataset is the SMSSpamCollection corpus, which can be downloaded from: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The file contains a collection of more than five thousand SMS messages, manually annotated as spam or ham. The dataset is an excellent source on which we can start building our statistical model.

Generally, with less data:

The costs are smaller

It is easier to build on the basics

It is quicker to analyse

It is easier to iterate

It is less time-consuming to label data

First, we will import all the libraries used throughout this blog post. Once we do that, we load our minimal SMSSpamCollection dataset into a pandas dataframe.

In [239]:

#import libraries
import pandas as pd
import numpy as np
import logging
import itertools

import time
import datetime
import shutil
import os
import re
import sys
import json
import csv
import pickle
from collections import defaultdict, Counter
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

from textblob import TextBlob
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, learning_curve
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix

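# TensorFlow is only needed for the neural network discussed at the end;
# tensorflow.contrib exists only in TensorFlow 1.x (it was removed in 2.0)
import tensorflow as tf
from tensorflow.contrib import learn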

In [56]:

#load dataset into a Pandas Dataframe
df = pd.read_csv('//Users//vlad//Desktop//MVP//smsspamcollection//SMSSpamCollection',
                 sep='\t',
                 skip_blank_lines = True,
                 quoting=csv.QUOTE_NONE,
                 names=["label", "sms"])

In [57]:

#display the first 10 rows
df.head(n = 10)

Out[57]:

	label	sms
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...
5	spam	FreeMsg Hey there darling it's been 3 week's n...
6	ham	Even my brother is not like to speak with me. ...
7	ham	As per your request 'Melle Melle (Oru Minnamin...
8	spam	WINNER!! As a valued network customer you have...
9	spam	Had your mobile 11 months or more? U R entitle...

As the cell below shows, the corpus contains 5,574 text messages, labeled as spam or ham.

In [58]:

print('The dataframe contains {:,} messages.'.format(df.shape[0]))

The dataframe contains 5,574 messages.

With the file loaded into our Jupyter Notebook, we ask ourselves a couple of questions:

Is the underlying class distribution balanced or not?

Is there any difference in message length between the spam and the ham class?

We ask ourselves some NLP questions as well:

Do capital letters carry information?

Does distinguishing inflected forms ("goes" vs. "go") carry information?

Do interjections and determiners carry information as well?

Don't forget: we follow the minimum viable data development approach, and we want results fast! That's why we will not spend valuable time answering all our questions now; this will be done later. We pick the first two and move on!

In the cells below, we will aggregate some basic statistics per group and answer the question whether our dataset is balanced or not.

In [59]:

#aggregate basic statistics per category
df.groupby(['label']).describe()

Out[59]:

	sms
	count	unique	top	freq
label
ham	4827	4518	Sorry, I'll call later	30
spam	747	653	Please call our customer service representativ...	4

In [60]:

#create the class_info function
def class_info(classes):
    counts = Counter(classes)
    total = sum(counts.values())
    for cls in counts.keys():
        print("%6s: % 7d  =  % 5.1f%%" % (cls, counts[cls], counts[cls]/total*100))

In [61]:

class_info(df['label'])

   ham:    4827  =   86.6%
  spam:     747  =   13.4%

In [62]:

colors = ['#66b3ff','orange']

fig, ax = plt.subplots(figsize = (10,10))
labels = ['Spam','Ham']
sizes = [13.4,86.6]
explode = [0.1,0]
patches, texts, autotexts = ax.pie(sizes,
       labels = labels,
       explode = explode,
       shadow = True,
       autopct='%1.1f%%',
       startangle = 180,
        colors = colors)
ax.axis('equal')

for i in range(2):
    texts[i].set_fontsize(18)
    
for j in range(2):
    autotexts[j].set_fontsize(18)
    
ax.set_title('% of Spam and Ham classes', size = 20, y = 1.009)

Out[62]:

Text(0.5,1.009,'% of Spam and Ham classes')

From the pie chart, it's clear that the dataset is unbalanced, with the spam class comprising 13.4% and the ham class 86.6% of the total. At this point we would normally resample our data to balance the classes. However, in a minimum viable data development context this step can be skipped, at least for now.
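For reference, the resampling step we are skipping could look roughly like the sketch below. It is a minimal illustration of random undersampling written with the standard library only; the `undersample` helper and the `seed` argument are our own assumptions, not part of the original notebook, and other strategies (oversampling the spam class, class weights) would be equally valid.

```python
import random
from collections import Counter

def undersample(labels, seed=0):
    """Return sorted row indices with every class cut down to the size of the rarest class.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rng = random.Random(seed)
    rows_by_class = {}
    for idx, label in enumerate(labels):
        rows_by_class.setdefault(label, []).append(idx)
    n_min = min(len(rows) for rows in rows_by_class.values())
    kept = []
    for rows in rows_by_class.values():
        # draw n_min rows without replacement from each class
        kept.extend(rng.sample(rows, n_min))
    return sorted(kept)

# Class counts taken from the statistics above: 4,827 ham and 747 spam
labels = ['ham'] * 4827 + ['spam'] * 747
balanced_idx = undersample(labels)
balanced_counts = Counter(labels[i] for i in balanced_idx)
print(balanced_counts)
```

With the dataframe from this post, something like `df.iloc[undersample(df['label'])]` would select the balanced subset, at the price of throwing away most of the ham messages.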

Next we will check the number of words each message contains and plot the message length per class.

In [86]:

#create a new column length with the number of words for each message
df['length'] = df['sms'].str.split().str.len()

In [87]:

#plot the length distribution of each text message
fig, ax = plt.subplots(figsize = (10,10))

df['length'].plot(kind = 'hist',bins = 100 ,edgecolor = 'red', ax = ax)

ax.set_title('Number of words in each message', size = 20, y = 1.009)
ax.set_ylabel('Frequency', size = 15)
ax.grid(True)

In [101]:

#round the entries in the frame below to 2 decimals
df[['length']].describe().apply(lambda e: round(e,2))

Out[101]:

	length
count	5574.00
mean	15.59
std	11.39
min	1.00
25%	7.00
50%	12.00
75%	23.00
max	171.00

In [136]:

#plot the length distribution for each class
#(let pandas create the figure: passing a single ax together with by= makes pandas clear that figure and emit a warning)
ax1, ax2 = df.hist(column='length', by='label', bins=20, edgecolor='red', figsize=(10, 10))

ax1.grid(True)
ax2.grid(True)

for tick in ax1.get_xticklabels():
    tick.set_rotation(0)

for tick in ax2.get_xticklabels():
    tick.set_rotation(0)

Looking at the two histograms, it's clear that the spam messages are much longer. Perhaps we could use this information to create a new feature and make our classifier smarter. But not for now: remember, we want to build a minimal baseline model which can provide us with a meaningful reference point.

3) Data preprocessing

In this section we will convert the raw text messages into n-dimensional vectors. First, we will use a split_into_tokens function to split each message into individual words, and then normalize the words into their base form (lemmas) using the split_into_lemmas function.

In [142]:

#split messages into their individual words
def split_into_tokens(text):
    text = str(text)
    return TextBlob(text).words

In [145]:

#apply the function on the first 10 rows of the dataframe
df['sms'].head(n = 10).apply(split_into_tokens)

Out[145]:

0    [Go, until, jurong, point, crazy, Available, o...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, U, c, already, t...
4    [Nah, I, do, n't, think, he, goes, to, usf, he...
5    [FreeMsg, Hey, there, darling, it, 's, been, 3...
6    [Even, my, brother, is, not, like, to, speak, ...
7    [As, per, your, request, 'Melle, Melle, Oru, M...
8    [WINNER, As, a, valued, network, customer, you...
9    [Had, your, mobile, 11, months, or, more, U, R...
Name: sms, dtype: object

In [146]:

#normalize words into their base form (lemmas) 
def split_into_lemmas(message):
    message = str(message).lower()
    words = TextBlob(message).words
    return [word.lemma for word in words]

In [147]:

#apply the function on the first 10 rows of the dataframe
df['sms'].head(n =10).apply(split_into_lemmas)

Out[147]:

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, do, n't, think, he, go, to, usf, he, ...
5    [freemsg, hey, there, darling, it, 's, been, 3...
6    [even, my, brother, is, not, like, to, speak, ...
7    [a, per, your, request, 'melle, melle, oru, mi...
8    [winner, a, a, valued, network, customer, you,...
9    [had, your, mobile, 11, month, or, more, u, r,...
Name: sms, dtype: object
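To make the step from token lists to vectors concrete, here is a minimal bag-of-words sketch using only the standard library. The helpers `build_vocabulary` and `to_count_vector` are hypothetical names introduced for illustration; in the rest of this post the same job is done by scikit-learn's CountVectorizer. The two token lists are row 1 of the output above and the visible (truncated) prefix of row 3.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in alphabetical order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: col for col, tok in enumerate(vocab)}

def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in one message."""
    counts = Counter(tokens)
    # dicts keep insertion order, so columns follow the alphabetical vocabulary
    return [counts.get(tok, 0) for tok in vocabulary]

messages = [
    ['ok', 'lar', 'joking', 'wif', 'u', 'oni'],                          # row 1
    ['u', 'dun', 'say', 'so', 'early', 'hor', 'u', 'c', 'already'],      # visible prefix of row 3
]
vocabulary = build_vocabulary(messages)
vectors = [to_count_vector(tokens, vocabulary) for tokens in messages]

print(list(vocabulary))
for vec in vectors:
    print(vec)
```

Each message becomes a vector with one dimension per vocabulary word; word order is discarded, which is exactly the simplification that makes the bag-of-words model a cheap first prototype.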

4) Experiment with the baseline machine learning model

Now we are ready to implement our baseline machine learning models. We will first use a simple Multinomial Naive Bayes Classifier and next will experiment with a Logistic Regression technique (since we work on a binary classification problem).

In the picture below we can see the pipeline of the Supervised Learning Model that we are trying to build.

In [2]:

from IPython.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/diagram_24.png", width=1000, height=600)

Out[2]:

In [156]:

#Split data into train and test. Keep 20 percent from the initial dataset for test.
msg_train, msg_test, label_train, label_test = train_test_split(df['sms'], df['label'], test_size=0.2)

#print(len(msg_train), len(msg_test), len(label_train) + len(label_test))
print('msg train: {:,}'.format(len(msg_train)))
print('msg test: {:,}'.format(len(msg_test)))
print('label train: {:,}'.format(len(label_train)))
print('label test: {:,}'.format(len(label_test)))

msg train: 4,459
msg test: 1,115
label train: 4,459
label test: 1,115
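Because the classes are unbalanced, a plain random split can leave the test set with a somewhat different spam share than the full corpus. The sketch below shows one way to split each class separately so that the proportions are preserved. It is a hand-rolled illustration (the `stratified_split` helper is our own assumption), not what the cell above does; with scikit-learn the same effect is normally obtained by passing `stratify=df['label']` (and a `random_state` for reproducibility) to `train_test_split`.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train/test so each class keeps roughly the same share.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    rows_by_class = {}
    for idx, label in enumerate(labels):
        rows_by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for rows in rows_by_class.values():
        rows = rows[:]                      # copy before shuffling
        rng.shuffle(rows)
        n_test = round(test_fraction * len(rows))
        test_idx.extend(rows[:n_test])
        train_idx.extend(rows[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class counts of the corpus: 4,827 ham and 747 spam
labels = ['ham'] * 4827 + ['spam'] * 747
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
spam_share = test_counts['spam'] / len(test_idx)
print(test_counts, 'spam share in test: {:.1%}'.format(spam_share))
```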

4.1 Multinomial Naive Bayes

In [157]:

#Build a pipeline
pipeline_MultinomialNB = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
    ])
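To see what the 'tfidf' step does to the counts, here is a small standard-library sketch of the weighting. It assumes the documented default settings of TfidfTransformer (smoothed idf, i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row); the `tfidf_rows` helper and the toy count matrix are made up for illustration.

```python
import math

def tfidf_rows(count_rows):
    """Weight raw term counts by smoothed idf and L2-normalise each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: in how many messages does each term occur?
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted = []
    for row in count_rows:
        scores = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(s * s for s in scores))
        weighted.append([s / norm if norm else 0.0 for s in scores])
    return weighted

# Toy counts (hypothetical): 3 messages, columns = ['free', 'call', 'u']
counts = [[2, 1, 0],
          [0, 1, 1],
          [0, 1, 0]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

A term that appears in every message ('call' here) gets the smallest idf, so rarer and more discriminative terms dominate the vector.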

In [159]:

#cross validation
scores_MultinomialNB =  cross_val_score(pipeline_MultinomialNB,  # steps to convert raw messages into models
                                        msg_train,  # training data
                                        label_train,  # training labels
                                        cv = 10,  # 10-fold cross validation (stratified for classifiers): 9 parts for training, 1 for scoring
                                        scoring = 'accuracy',  # which scoring metric?
                                        n_jobs = -1,  # -1 = use all cores = faster
                                        )
print(scores_MultinomialNB)

[0.95525727 0.95749441 0.94854586 0.9573991  0.94170404 0.94394619
 0.95730337 0.93932584 0.94382022 0.94606742]

In [164]:

#print overall mean and standard deviation (accuracy as a fraction between 0 and 1)
print('Overall mean score for Multinomial Naive Bayes: {}'.format(round(scores_MultinomialNB.mean(),2)))
print('Overall std score for Multinomial Naive Bayes: {}'.format(round(scores_MultinomialNB.std(),2)))

Overall mean score for Multinomial Naive Bayes: 0.95
Overall std score for Multinomial Naive Bayes: 0.01
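The fold scores are fractions of correctly classified messages, so a mean of 0.95 corresponds to roughly 95% accuracy. As a quick sanity check, the summary can be recomputed from the ten scores printed above with the standard library alone (`pstdev` is the population standard deviation, which is also what NumPy's `std` computes by default):

```python
import statistics

# The ten cross-validation scores printed above
fold_scores = [0.95525727, 0.95749441, 0.94854586, 0.9573991, 0.94170404,
               0.94394619, 0.95730337, 0.93932584, 0.94382022, 0.94606742]

mean_acc = statistics.mean(fold_scores)
std_acc = statistics.pstdev(fold_scores)

# format the fractions as percentages
print('mean accuracy: {:.1%}'.format(mean_acc))
print('std across folds: {:.1%}'.format(std_acc))
```

The spread between folds is small compared with the mean, which suggests the estimate does not depend much on how the training data happens to be split.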

In the next step, we will search for the best parameters of our model using scikit-learn's GridSearchCV.

In [172]:

print('The possible parameters for the Multinomial Naive Bayes classifier:\n \n \n', pipeline_MultinomialNB.get_params())

The possible parameters for the Multinomial Naive Bayes classifier:
 
 
 {'memory': None, 'steps': [('bow', CountVectorizer(analyzer=<function split_into_lemmas at 0x1a2ba647b8>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)), ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))], 'bow': CountVectorizer(analyzer=<function split_into_lemmas at 0x1a2ba647b8>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None), 'tfidf': TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True), 'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True), 'bow__analyzer': <function split_into_lemmas at 0x1a2ba647b8>, 'bow__binary': False, 'bow__decode_error': 'strict', 'bow__dtype': <class 'numpy.int64'>, 'bow__encoding': 'utf-8', 'bow__input': 'content', 'bow__lowercase': True, 'bow__max_df': 1.0, 'bow__max_features': None, 'bow__min_df': 1, 'bow__ngram_range': (1, 1), 'bow__preprocessor': None, 'bow__stop_words': None, 'bow__strip_accents': None, 'bow__token_pattern': '(?u)\\b\\w\\w+\\b', 'bow__tokenizer': None, 'bow__vocabulary': None, 'tfidf__norm': 'l2', 'tfidf__smooth_idf': True, 'tfidf__sublinear_tf': False, 'tfidf__use_idf': True, 'classifier__alpha': 1.0, 'classifier__class_prior': None, 'classifier__fit_prior': True}

In [173]:

#search for best parameters
params = {
    'tfidf__use_idf': (True, False),
    'bow__analyzer': (split_into_lemmas, split_into_tokens)
}

grid_MultinomialNB = GridSearchCV(
    pipeline_MultinomialNB,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(n_splits = 5))  # what type of cross validation to use
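It helps to know how much work this search implies before running it. The sketch below enumerates the candidate combinations the way a grid search does (the Cartesian product of the values, with parameter names sorted) and counts the model fits. Strings stand in for the two analyzer functions so that the snippet runs on its own; the fit count assumes 5 folds plus one final refit, as configured above.

```python
from itertools import product

# Same grid as above; the analyzer names are placeholders for the functions
param_grid = {
    'tfidf__use_idf': (True, False),
    'bow__analyzer': ('split_into_lemmas', 'split_into_tokens'),
}

names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

n_splits = 5
n_fits = len(candidates) * n_splits + 1   # +1 for the final refit on all training data

for candidate in candidates:
    print(candidate)
print('total fits:', n_fits)
```

Four candidates are cheap, but the number of fits grows multiplicatively with every parameter added to the grid, which is worth keeping in mind when working under a minimum viable budget.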

In [174]:

#fit the model
%time MultinomialNB_detector = grid_MultinomialNB.fit(msg_train, label_train)

CPU times: user 2.13 s, sys: 72 ms, total: 2.21 s
Wall time: 37.1 s

In [176]:

print('The results of the search:\n \n \n',MultinomialNB_detector.cv_results_)

The results of the search:
 
 
 {'mean_fit_time': array([3.74130955, 3.36253433, 2.52802529, 2.495186  ]), 'std_fit_time': array([0.15595202, 0.08672026, 0.03931961, 0.01723119]), 'mean_score_time': array([0.97282844, 0.97809682, 0.61902804, 0.61772242]), 'std_score_time': array([0.01933412, 0.15491758, 0.00656946, 0.00508204]), 'param_bow__analyzer': masked_array(data=[<function split_into_lemmas at 0x1a2ba647b8>,
                   <function split_into_lemmas at 0x1a2ba647b8>,
                   <function split_into_tokens at 0x1a2ba64950>,
                   <function split_into_tokens at 0x1a2ba64950>],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object), 'param_tfidf__use_idf': masked_array(data=[True, False, True, False],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'bow__analyzer': <function split_into_lemmas at 0x1a2ba647b8>, 'tfidf__use_idf': True}, {'bow__analyzer': <function split_into_lemmas at 0x1a2ba647b8>, 'tfidf__use_idf': False}, {'bow__analyzer': <function split_into_tokens at 0x1a2ba64950>, 'tfidf__use_idf': True}, {'bow__analyzer': <function split_into_tokens at 0x1a2ba64950>, 'tfidf__use_idf': False}], 'split0_test_score': array([0.94736842, 0.93281075, 0.95632699, 0.93281075]), 'split1_test_score': array([0.94843049, 0.93161435, 0.94730942, 0.93721973]), 'split2_test_score': array([0.94394619, 0.92825112, 0.94282511, 0.9293722 ]), 'split3_test_score': array([0.94163861, 0.93041526, 0.94276094, 0.92592593]), 'split4_test_score': array([0.94276094, 0.92480359, 0.94276094, 0.92255892]), 'mean_test_score': array([0.94483068, 0.92958062, 0.94640054, 0.92958062]), 'std_test_score': array([0.0026326 , 0.00282306, 0.00526781, 0.00512767]), 'rank_test_score': array([2, 3, 1, 3], dtype=int32), 'split0_train_score': array([0.96214246, 0.94054964, 0.96634885, 0.94251262]), 'split1_train_score': array([0.96439585, 0.94308943, 0.96776002, 0.94533221]), 'split2_train_score': array([0.96551724, 0.94365013, 0.96972246, 0.94757499]), 'split3_train_score': array([0.96776906, 0.94394619, 0.9728139 , 0.94899103]), 'split4_train_score': array([0.96552691, 0.9456278 , 0.97029148, 0.94786996]), 'mean_train_score': array([0.9650703 , 0.94337264, 0.96938734, 0.94645616]), 'std_train_score': array([0.00182859, 0.00164568, 0.00221593, 0.00230178])}

In [177]:

print('The best score: {}'.format(round(grid_MultinomialNB.best_score_,2)))

The best score: 0.95

In [178]:

#generate predictions
%time predictions_MultinomialNB = MultinomialNB_detector.predict(msg_test)

CPU times: user 459 ms, sys: 5.01 ms, total: 464 ms
Wall time: 464 ms

In [180]:

print('Confusion Matrix using Multinomial Naive Bayes classifier: \n \n',confusion_matrix(label_test, predictions_MultinomialNB))

Confusion Matrix using Multinomial Naive Bayes classifier: 
 
 [[974   0]
 [ 53  88]]

In [181]:

print('Classification report for Multinomial Naive Bayes: \n \n ',classification_report(label_test, predictions_MultinomialNB))

Classification report for Multinomial Naive Bayes: 
 
               precision    recall  f1-score   support

        ham       0.95      1.00      0.97       974
       spam       1.00      0.62      0.77       141

avg / total       0.95      0.95      0.95      1115
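The numbers in the report follow directly from the confusion matrix (rows are the true classes, columns the predicted ones). The sketch below recomputes them with a small hypothetical helper, `report_from_confusion`, to make the definitions of precision, recall and F1 explicit:

```python
def report_from_confusion(matrix, labels):
    """Compute accuracy and per-class (precision, recall, f1) from a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column sum: predicted as this class
        actual = sum(matrix[i])                     # row sum: truly this class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return correct / total, report

# Confusion matrix of the Multinomial Naive Bayes classifier from above
nb_matrix = [[974, 0],
             [53, 88]]
accuracy, report = report_from_confusion(nb_matrix, ['ham', 'spam'])

print('accuracy: {:.2f}'.format(accuracy))
for label, (precision, recall, f1) in report.items():
    print('{:>5}: precision {:.2f}  recall {:.2f}  f1 {:.2f}'.format(label, precision, recall, f1))
```

The spam recall of 0.62 is the figure to watch: no ham message was flagged as spam, but 53 of the 141 spam messages in the test set slipped through, something the overall accuracy of 0.95 hides.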

Now, we are ready to plot the learning curves for our Multinomial Naive Bayes classifier.

In [192]:

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds.
        Specific cross-validation objects can be passed, see
        sklearn.model_selection module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default -1: use all cores).

    train_sizes : array-like, optional
        Fractions of the training set used to generate the learning curve.
    """
    plt.figure(figsize=(10,10))
    plt.title(title, size = 20, y = 1.008)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples", size = 15)
    plt.ylabel("Score", size = 15)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [194]:

%time plot_learning_curve(pipeline_MultinomialNB, "Learning Curve for Multinomial Naive Bayes", msg_train, label_train, cv=3)

CPU times: user 293 ms, sys: 73.1 ms, total: 366 ms
Wall time: 25.8 s

Out[194]:

<module 'matplotlib.pyplot' from '/Users/vlad/anaconda/lib/python3.6/site-packages/matplotlib/pyplot.py'>

Since both the training and the cross-validation scores keep growing as we add examples, we see that our model is not complex/flexible enough to capture all the nuance in the little data we have. In this particular case the effect is not very pronounced, since the accuracies are high anyway.

To solve this problem, we have two options:

Use more training data, to overcome low model complexity, or
Use a more complex (lower bias) model to start with, to get more out of the existing data

In a minimum viable data development context, the first approach is more feasible: we want the simplest and easiest-to-implement algorithm.

We need to find more data!

Once we have additional data, we add it to the existing training set and re-run our models, repeating this over and over again until we approach the ideal learning curve (see the picture below).

In [198]:

from IPython.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/learning_curves.png", width=800, height=800)

Out[198]:

The ideal learning curve:

Represents a model that generalizes to new data

Testing and training learning curves converge at similar values

The smaller the gap, the better our model generalizes
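As a rough numerical counterpart to the picture, we can look at the gap between the mean training score and the mean cross-validation score that the grid search reported for the four Naive Bayes candidates. The values below are copied from the `mean_train_score` and `mean_test_score` entries printed earlier; treating this gap as a proxy for how well the curves have converged is our own simplification.

```python
# Values copied from the grid search results of the Naive Bayes pipeline
mean_train_score = [0.9650703, 0.94337264, 0.96938734, 0.94645616]
mean_test_score = [0.94483068, 0.92958062, 0.94640054, 0.92958062]

# gap between training and cross-validation accuracy for each candidate
gaps = [round(train - test, 4)
        for train, test in zip(mean_train_score, mean_test_score)]

for i, gap in enumerate(gaps):
    print('candidate {}: train/validation gap = {}'.format(i, gap))
```

All four gaps are around one to two percentage points, so the two curves are already close; what remains to be gained is mostly the overall level of the scores.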

4.2 Logistic Regression

Similarly, we experiment with a simple Logistic Regression.

In [199]:

#Build the logistic regression pipeline
pipeline_LogisticRegression = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', LogisticRegression()),  # train on TF-IDF vectors w/ Logistic Regression classifier
    ])

In [200]:

scores_LogisticRegression =  cross_val_score(pipeline_LogisticRegression,  # steps to convert raw messages into models
                                        msg_train,  # training data
                                        label_train,  # training labels
                                        cv=10,  # 10-fold cross validation (stratified for classifiers): 9 parts for training, 1 for scoring
                                        scoring='accuracy',  # which scoring metric?
                                        n_jobs=-1,  # -1 = use all cores = faster
                                        )
print(scores_LogisticRegression)

[0.97762864 0.97986577 0.97091723 0.97533632 0.97309417 0.97085202
 0.96853933 0.97078652 0.96853933 0.96404494]

In [201]:

#print overall mean and standard deviation (accuracy as a fraction between 0 and 1)
print('Overall mean score for Logistic Regression: {}'.format(round(scores_LogisticRegression.mean(),2)))
print('Overall std score for Logistic Regression: {}'.format(round(scores_LogisticRegression.std(),2)))

Overall mean score for Logistic Regression: 0.97
Overall std score for Logistic Regression: 0.0

In [203]:

print('The possible parameters for the Logistic Regressor:\n \n \n', pipeline_LogisticRegression.get_params())

In [206]:

#search for best parameters
params = {
    'tfidf__use_idf': (True, False),
    'bow__analyzer': (split_into_lemmas, split_into_tokens)
}

grid_LogisticRegression = GridSearchCV(
    pipeline_LogisticRegression,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(n_splits = 5))  # what type of cross validation to use

In [207]:

#fit the model
%time LogisticRegression_detector = grid_LogisticRegression.fit(msg_train, label_train)

CPU times: user 2.65 s, sys: 76.3 ms, total: 2.73 s
Wall time: 36.4 s

In [208]:

print('The results of the search:\n \n \n',LogisticRegression_detector.cv_results_)

The results of the search:
 
 
 {'mean_fit_time': array([3.49843407, 3.36551366, 2.51372495, 2.46109805]), 'std_fit_time': array([0.01610993, 0.09751291, 0.02174387, 0.01807802]), 'mean_score_time': array([0.96247916, 1.0925004 , 0.61095219, 0.60495777]), 'std_score_time': array([0.15345559, 0.23055426, 0.00475557, 0.01771899]), 'param_bow__analyzer': masked_array(data=[<function split_into_lemmas at 0x1a2ba647b8>,
                   <function split_into_lemmas at 0x1a2ba647b8>,
                   <function split_into_tokens at 0x1a2ba64950>,
                   <function split_into_tokens at 0x1a2ba64950>],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object), 'param_tfidf__use_idf': masked_array(data=[True, False, True, False],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'bow__analyzer': <function split_into_lemmas at 0x1a2ba647b8>, 'tfidf__use_idf': True}, {'bow__analyzer': <function split_into_lemmas at 0x1a2ba647b8>, 'tfidf__use_idf': False}, {'bow__analyzer': <function split_into_tokens at 0x1a2ba64950>, 'tfidf__use_idf': True}, {'bow__analyzer': <function split_into_tokens at 0x1a2ba64950>, 'tfidf__use_idf': False}], 'split0_test_score': array([0.97760358, 0.97648376, 0.96976484, 0.9731243 ]), 'split1_test_score': array([0.96636771, 0.97085202, 0.96076233, 0.96524664]), 'split2_test_score': array([0.97309417, 0.97533632, 0.96300448, 0.96524664]), 'split3_test_score': array([0.96520763, 0.9674523 , 0.95847363, 0.96184063]), 'split4_test_score': array([0.96520763, 0.96632997, 0.94837262, 0.95286195]), 'mean_test_score': array([0.96949989, 0.97129401, 0.96008074, 0.96366898]), 'std_test_score': array([0.00500613, 0.00407122, 0.00696363, 0.00655026]), 'rank_test_score': array([2, 1, 4, 3], dtype=int32), 'split0_train_score': array([0.97335951, 0.97504206, 0.96494672, 0.9669097 ]), 'split1_train_score': array([0.97336698, 0.97504906, 0.96663863, 0.97196524]), 'split2_train_score': array([0.97252593, 0.9758901 , 0.9646762 , 0.96972246]), 'split3_train_score': array([0.97589686, 0.97729821, 0.96580717, 0.97197309]), 'split4_train_score': array([0.97561659, 0.97841928, 0.96889013, 0.97141256]), 'mean_train_score': array([0.97415317, 0.97633974, 0.96619177, 0.97039661]), 'std_train_score': array([0.00134744, 0.00132628, 0.00151498, 0.00192827])}

In [209]:

print('The best score: {}'.format(round(grid_LogisticRegression.best_score_,2)))

The best score: 0.97
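
To connect this number to the `cv_results_` dictionary printed above: the best score is the largest entry of `mean_test_score`, and its position identifies the winning parameter combination. The following is a minimal sketch, with the scores copied from the output above. The string labels in `params` are only illustrative stand-ins for the analyzer function objects.

```python
# Mean cross-validated test scores, copied from cv_results_['mean_test_score'] above
mean_test_score = [0.96949989, 0.97129401, 0.96008074, 0.96366898]

# Parameter combinations in the same order as cv_results_['params'];
# the strings are illustrative stand-ins for the analyzer functions
params = [
    ("split_into_lemmas", True),
    ("split_into_lemmas", False),
    ("split_into_tokens", True),
    ("split_into_tokens", False),
]

# Position of the highest mean test score
best_index = max(range(len(mean_test_score)), key=mean_test_score.__getitem__)
best_score = mean_test_score[best_index]
best_analyzer, best_use_idf = params[best_index]

print("best score:", round(best_score, 2))
print("best params: analyzer={}, use_idf={}".format(best_analyzer, best_use_idf))
```

The selected entry is the one with rank 1 in `rank_test_score`: lemmatized tokens without IDF weighting.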

In [210]:

#generate predictions
%time predictions_LogisticRegression = LogisticRegression_detector.predict(msg_test)

CPU times: user 575 ms, sys: 4.97 ms, total: 580 ms
Wall time: 580 ms

In [212]:

print('Confusion Matrix using Logistic Regressor classifier: \n \n',confusion_matrix(label_test, predictions_LogisticRegression))

Confusion Matrix using Logistic Regressor classifier: 
 
 [[973   1]
 [ 28 113]]

In [213]:

print('Classification report for Logistic Regressor: \n \n ',classification_report(label_test, predictions_LogisticRegression))

Classification report for Logistic Regressor: 
 
               precision    recall  f1-score   support

        ham       0.97      1.00      0.99       974
       spam       0.99      0.80      0.89       141

avg / total       0.97      0.97      0.97      1115
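
The spam row of this report can be recomputed by hand from the confusion matrix shown earlier. The sketch below assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes, in the order ham, spam); the variable names are ours.

```python
# Confusion matrix from the output above (rows: true class, columns: predicted class)
ham_as_ham, ham_as_spam = 973, 1
spam_as_ham, spam_as_spam = 28, 113

# Precision: of all messages flagged as spam, how many really are spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of all real spam messages, how many were flagged
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
# F1: harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
# Accuracy: share of all messages classified correctly
total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

print("spam precision: {:.2f}".format(spam_precision))
print("spam recall:    {:.2f}".format(spam_recall))
print("spam f1-score:  {:.2f}".format(spam_f1))
print("accuracy:       {:.2f}".format(accuracy))
```

The recall of 0.80 reflects the 28 spam messages that were let through as ham, while only one legitimate message was flagged as spam.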

In [214]:

%time plot_learning_curve(pipeline_LogisticRegression, "Learning Curve for Logistic Regression", msg_train, label_train, cv=3)

CPU times: user 293 ms, sys: 63.1 ms, total: 356 ms
Wall time: 20 s

Out[214]:

<module 'matplotlib.pyplot' from '/Users/vlad/anaconda/lib/python3.6/site-packages/matplotlib/pyplot.py'>

After several iterations, once we are confident in the model's predictive capabilities, we can deploy it to production, ship the product, collect feedback, and then improve it by adding more complexity. Following this approach, we not only delivered a minimum viable data product at the lowest possible cost, but also solved the problem on our local machine, thereby avoiding a large investment in potentially costly infrastructure.

5) Experiment with a sophisticated convolutional neural network (perhaps the final data product)

Once we have a decent working prototype, we can move on to more complex and sophisticated models. A possible approach to our text classification problem would be to implement a convolutional neural network. We could start with the simplest possible architecture, train and evaluate the model, and then, depending on the results, keep improving it. Perhaps we would add another hidden layer and repeat the same modelling exercise. We might want to use word embeddings as well.

Such an approach, however, has several disadvantages. A major drawback of deep learning is its limited interpretability: depending on the kind of network you use, it is generally hard to interpret the results. Moreover, you can easily spend a huge amount of time tuning the parameters of a neural network for only a negligible gain in model performance.

In this part of our blog post we will show how we could solve the text classification problem within a deep learning framework. We will not dive into any details regarding the implementation of our neural network.

In [216]:

from IPython.core.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/convolutional_neural_network.png", width=1000, height=1000)

Out[216]:

In [218]:

#display the first 10 rows of the dataframe
df.head(n = 10)

Out[218]:

	label	sms	length
0	ham	Go until jurong point, crazy.. Available only ...	20
1	ham	Ok lar... Joking wif u oni...	6
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...	28
3	ham	U dun say so early hor... U c already then say...	11
4	ham	Nah I don't think he goes to usf, he lives aro...	13
5	spam	FreeMsg Hey there darling it's been 3 week's n...	32
6	ham	Even my brother is not like to speak with me. ...	16
7	ham	As per your request 'Melle Melle (Oru Minnamin...	26
8	spam	WINNER!! As a valued network customer you have...	26
9	spam	Had your mobile 11 months or more? U R entitle...	29

In [219]:

#drop the length column 
df = df.drop('length', axis = 1)

5.1 Create the output signal

In [221]:

#get the labels
labels = sorted(set(df['label'].unique()))

In [225]:

print('The classes are: \n \n {0} and {1}'.format(labels[0], labels[1]))

The classes are: 
 
 ham and spam

In [227]:

num_classes = len(sorted(set(df['label'])))
print('The number of classes are equal to: {}'.format(num_classes))

The number of classes are equal to: 2

In [228]:

#create a 2 by 2 matrix of zeros
one_hot = np.zeros((num_classes, num_classes), dtype = int)

In [229]:

#fill the diagonal values of the matrix with one's
np.fill_diagonal(one_hot, 1)

In [231]:

#print the one_hot array
print('The final one-hot array: \n \n  {}'.format(one_hot))

The final one-hot array: 
 
  [[1 0]
 [0 1]]

In [232]:

#assign a one-hot representation to each label

#label_dictionary = dict(zip(labels, one_hot))
label_dictionary = {label:one_hot for label, one_hot in zip(labels, one_hot)}

#print the label_dictionary
print('One-hot representation for each class: \n {}'.format(label_dictionary))

One-hot representation for each class: 
 {'ham': array([1, 0]), 'spam': array([0, 1])}

In [233]:

#create the output signal
y_row = df['label'].apply(lambda e: label_dictionary[e]).tolist()
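
The one-hot steps above can also be written more compactly. The following is only a sketch of the same idea, using `np.eye` to build the identity matrix in one call; the `sample` labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np

labels = ["ham", "spam"]  # sorted class names, as in the cells above

# Row i of the identity matrix is the one-hot vector of class i
identity = np.eye(len(labels), dtype=int)
label_to_one_hot = {label: identity[i] for i, label in enumerate(labels)}

# Illustrative labels (not from the dataset) mapped to their one-hot rows
sample = ["ham", "spam", "ham"]
y_sample = np.array([label_to_one_hot[label] for label in sample])

print(y_sample)
```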

5.2 Create the input signal

In [236]:

#create the input signal (which is a nested list of lists)
x_row = df['sms'].apply(lambda e: e.split()).tolist()

5.3 Create the padding function

In [237]:

#define the padding function

def pad_sentences(sentences, padding_word="<PAD/>", forced_sequence_length=None):
    """Pad sentences during training or prediction"""
    if forced_sequence_length is None: # Train
        sequence_length = max(len(x) for x in sentences)
    else: # Prediction
        logging.critical('This is prediction, reading the trained sequence length')
        sequence_length = forced_sequence_length
    logging.critical('The maximum length is {}'.format(sequence_length))

    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)

        if num_padding < 0: # Prediction: cut off the sentence if it is longer than the sequence length
            logging.info('This sentence has to be cut off because it is longer than trained sequence length')
            padded_sentence = sentence[0:sequence_length]
        else:
            padded_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(padded_sentence)
    return padded_sentences

In [240]:

#apply padding on each sentence
x_row = pad_sentences(x_row)

CRITICAL:root:The maximum length is 171
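
To make the two branches of the padding function concrete, here is a compact sketch of the same rule applied to a single tokenized sentence. The helper name `pad_or_truncate` is hypothetical and is not part of the notebook.

```python
def pad_or_truncate(tokens, length, padding_word="<PAD/>"):
    """Return exactly `length` tokens: cut long sentences, pad short ones."""
    kept = tokens[:length]                      # truncation (prediction case)
    num_padding = max(0, length - len(tokens))  # padding (training case)
    return kept + [padding_word] * num_padding


short = pad_or_truncate(["Ok", "lar..."], 4)
long = pad_or_truncate(["Free", "entry", "in", "2", "a"], 3)

print(short)
print(long)
```

Every sentence comes out with the same number of tokens, which is what later allows the input to be stored as a rectangular array.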

5.3 Build vocabulary

In [241]:

#vocabulary_inv : most_frequent
def build_vocab(sentences):
    word_counts = Counter(itertools.chain(*sentences))
    vocabulary_inv = [word[0] for word in word_counts.most_common()]
    vocabulary = {word: index for index, word in enumerate(vocabulary_inv)}
    return vocabulary, vocabulary_inv #vocabulary_inv : most_frequent

In [243]:

#call the build_vocab
vocabulary, vocabulary_inv = build_vocab(x_row)

In [244]:

print('The vocabulary contains {:,} words.'.format(len(vocabulary.keys())))

The vocabulary contains 15,734 words.

In [245]:

print('The vocabulary_inv contains {:,} words.'.format(len(vocabulary_inv)))

The vocabulary_inv contains 15,734 words.

In [247]:

#create the final input array
x = np.array([[vocabulary[word] for word in sentence] for sentence in x_row])

print('The final INPUT array has {:,} rows and {:,} columns.'.format(x.shape[0], x.shape[1]))

The final INPUT array has 5,574 rows and 171 columns.
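
On a toy input, the vocabulary and the index encoding look as follows. This is a minimal sketch of the same idea as `build_vocab`; the two sentences are invented for illustration.

```python
from collections import Counter
from itertools import chain

# Two already padded toy sentences (invented for illustration)
sentences = [["free", "entry", "<PAD/>"], ["free", "<PAD/>", "<PAD/>"]]

# Count every token, then order the words from most to least frequent
word_counts = Counter(chain.from_iterable(sentences))
vocabulary_inv = [word for word, _ in word_counts.most_common()]

# Map each word to its position in that ordering
vocabulary = {word: index for index, word in enumerate(vocabulary_inv)}

# Replace every word by its index
encoded = [[vocabulary[word] for word in sentence] for sentence in sentences]

print(vocabulary_inv)
print(encoded)
```

Because the padding token is the most frequent token here, it receives index 0, and frequent words get small indices.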

In [248]:

#create the final output array
y = np.array(y_row)

print('The final OUTPUT array has {:,} rows and {:,} columns (= #classes).'.format(y.shape[0], y.shape[1]))

The final OUTPUT array has 5,574 rows and 2 columns (= #classes).

5.4 Create the embedding matrix

In [249]:

#initialize embeddings
def load_embeddings(vocabulary):
    word_embeddings = {}
    for word in vocabulary:
        word_embeddings[word] = np.random.uniform(-0.25, 0.25, 300)
    return word_embeddings

In [250]:

#load embeddings
word_embeddings = load_embeddings(vocabulary)

In [251]:

#build the embedding matrix
embedding_mat = [word_embeddings[word] for index, word in enumerate(vocabulary_inv)]

In [252]:

#Create 300 dimensional embeddings
embedding_mat = np.array(embedding_mat, dtype = np.float32)

In [253]:

print('The embedding matrix has {:,} rows and {:,} columns.'.format(embedding_mat.shape[0], embedding_mat.shape[1]))

The embedding matrix has 15,734 rows and 300 columns.
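
Row i of the embedding matrix holds the vector of the word with index i, so looking up a batch of index sequences amounts to row indexing. The sketch below shows this with plain NumPy on toy sizes (the sizes and the seed are assumptions chosen for illustration); it mirrors conceptually what an embedding lookup layer does.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded only to make the sketch reproducible
vocab_size, embedding_size = 5, 4  # toy sizes; the post uses 15,734 and 300

# One random vector per word index, drawn from the same range as above
toy_embedding_mat = rng.uniform(-0.25, 0.25, size=(vocab_size, embedding_size)).astype(np.float32)

# Two toy "sentences" of three word indices each
toy_x = np.array([[1, 2, 0], [3, 0, 0]])

# Integer-array indexing replaces each index by its embedding row
embedded = toy_embedding_mat[toy_x]

print(embedded.shape)
```

The result has one axis more than the input: (sentences, words per sentence, embedding size).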

5.5 Create the batch iterator function

In [254]:

#create batches
def batch_iter(data, batch_size, num_epochs, shuffle=True):
    data = np.array(data)
    data_size = len(data)
    #ceiling division: avoids an empty final batch when data_size is a multiple of batch_size
    num_batches_per_epoch = (data_size - 1) // batch_size + 1
    
    for epoch in range(num_epochs):
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            shuffled_data = data[shuffle_indices]
        else:
            shuffled_data = data
            
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield shuffled_data[start_index:end_index]
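
The slicing logic of a batch iterator can be checked on a small list. The following pure-Python sketch (the helper `iter_batches` is hypothetical, without shuffling or epochs) uses a ceiling division for the number of batches, so that no empty batch is produced when the data size is an exact multiple of the batch size.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    num_batches = (len(items) + batch_size - 1) // batch_size  # ceiling division
    for batch_num in range(num_batches):
        start = batch_num * batch_size
        yield items[start:start + batch_size]


even = list(iter_batches(list(range(10)), batch_size=5))   # 10 is a multiple of 5
uneven = list(iter_batches(list(range(7)), batch_size=3))  # last batch is smaller

print(even)
print(uneven)
```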

5.6 Train-test split

In [255]:

# Split the original dataset into train set and test set
x_, x_test, y_, y_test = train_test_split(x, y, test_size=0.3)

In [256]:

# Split the train set into train set and dev set
x_train, x_dev, y_train, y_dev = train_test_split(x_, y_, test_size=0.3)

In [257]:

print('x_train: {:,}, x_dev: {:,}, x_test: {:,}'.format(len(x_train), len(x_dev), len(x_test)))
print('y_train: {:,}, y_dev: {:,}, y_test: {:,}'.format(len(y_train), len(y_dev), len(y_test)))

x_train: 2,730, x_dev: 1,171, x_test: 1,673
y_train: 2,730, y_dev: 1,171, y_test: 1,673
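
The three set sizes follow from applying the 30% split twice. The sketch below reproduces the arithmetic, assuming that the test share is rounded up to a whole number of samples (which is consistent with the sizes printed above); `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_held_out) for a fractional test_size."""
    n_held_out = math.ceil(test_size * n_samples)  # held-out share, rounded up
    return n_samples - n_held_out, n_held_out


# First split: full dataset -> (train + dev) and test
n_rest, n_test = split_sizes(5574, 0.3)
# Second split: (train + dev) -> train and dev
n_train, n_dev = split_sizes(n_rest, 0.3)

print("train: {:,}, dev: {:,}, test: {:,}".format(n_train, n_dev, n_test))
```

In effect, roughly 49% of the messages are used for training, 21% for the dev set and 30% for the test set.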

5.7 Convolutional Neural Network

In [258]:

class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(self, sequence_length, num_classes, vocab_size, embedding_size, filter_sizes, num_filters, l2_reg_lambda =0.0):
        # Placeholders for input, output and dropout
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        
        l2_loss = tf.constant(0.0)
        
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            # Trainable embedding matrix, initialized from the embedding_mat built in section 5.4
            W = tf.Variable(embedding_mat, name="W")
            self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
        
        pooled_outputs = []
        
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution Layer
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(self.embedded_chars_expanded,W,strides=[1, 1, 1, 1],padding="VALID",name="conv")
                # Apply nonlinearity
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # Max-pooling over the outputs
                pooled = tf.nn.max_pool(h,ksize=[1, sequence_length - filter_size + 1, 1, 1],strides=[1, 1, 1, 1],padding='VALID',name="pool")
            pooled_outputs.append(pooled)
 
        # Combine all the pooled features
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) #self.h_pool_flat
        
        #Add dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
        
        
        with tf.name_scope("output"):
            W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")
            
        # Calculate mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y, name = "loss")
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
            
        # Calculate Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
            
        with tf.name_scope('num_correct'):
            correct = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.num_correct = tf.reduce_sum(tf.cast(correct, 'float'))
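
To follow the tensor shapes through the network: with "VALID" padding, a filter spanning `filter_size` words produces `sequence_length - filter_size + 1` positions, and max-pooling over all of them leaves one value per filter. The sketch below only does this shape arithmetic in plain Python; `conv_feature_shapes` is a hypothetical helper, and `num_filters=128` is an assumed value chosen for illustration.

```python
def conv_feature_shapes(sequence_length, filter_sizes, num_filters):
    """Per-example output shapes of each conv/max-pool branch (batch axis omitted)."""
    shapes = {}
    for filter_size in filter_sizes:
        # "VALID" convolution: the filter must fit entirely inside the sentence
        conv_height = sequence_length - filter_size + 1
        shapes[filter_size] = {
            "conv": (conv_height, 1, num_filters),
            "pooled": (1, 1, num_filters),  # max over all conv_height positions
        }
    # Concatenating the pooled branches gives the flat feature vector length
    return shapes, num_filters * len(filter_sizes)


shapes, num_filters_total = conv_feature_shapes(171, [3, 4, 5], num_filters=128)

for filter_size, shape in shapes.items():
    print(filter_size, shape)
print("flat feature vector length:", num_filters_total)
```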

5.8 Train the convolutional neural network

In [ ]:

with tf.Graph().as_default():
    with tf.Session(config= tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)) as sess:
        #sess.run(tf.global_variables_initializer())
        
        cnn = TextCNN(sequence_length=x.shape[1],
                      num_classes=2,
                      vocab_size=embedding_mat.shape[0],
                      embedding_size=embedding_mat.shape[1],
                      filter_sizes=[3,4,5],
                      num_filters=x.shape[0],
                      l2_reg_lambda = 0.0)
        
        global_step = tf.Variable(0, name="global_step", trainable=False)
        #optimizer = tf.train.AdamOptimizer(learning_rate = 1e-4)
        optimizer = tf.train.RMSPropOptimizer(1e-2, decay=0.9)
        
        #LEARNING RATE THAT CHANGES
        #starter_learning_rate = 0.1
        #learning_rate_test = tf.train.exponential_decay(starter_learning_rate, global_step, 100, 0.96, staircase=True)
        
        
        #optimizer = tf.train.GradientDescentOptimizer(learning_rate= 0.2)
        grads_and_vars = optimizer.compute_gradients(cnn.loss)
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
        
        # Keep track of gradient values and sparsity (optional)
        #grad_summaries = []
        #for g, v in grads_and_vars:
         #   if g is not None:
          #      grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
           #     sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
           #     grad_summaries.append(grad_hist_summary)
            #    grad_summaries.append(sparsity_summary)
        #grad_summaries_merged = tf.summary.merge(grad_summaries)
        
        def train_step(x_batch, y_batch):
            _ , step ,summaries, loss_val, accuracy_val = sess.run([train_op, global_step,train_summary_op, cnn.loss, cnn.accuracy],
                                                       feed_dict = {cnn.input_x: x_batch,
                                                                    cnn.input_y: y_batch,
                                                                    cnn.dropout_keep_prob: 0.5})
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss_val, accuracy_val))
            train_summary_writer.add_summary(summaries, step)
            
            
        def dev_step(x_batch, y_batch, writer=None):
            step ,summaries, loss, accuracy, num_correct, predictions = sess.run([global_step,dev_summary_op,
                                                            cnn.loss,
                                                            cnn.accuracy,
                                                            cnn.num_correct,
                                                            cnn.predictions],
                                                feed_dict = {cnn.input_x: x_batch,
                                                             cnn.input_y: y_batch,
                                                             cnn.dropout_keep_prob: 1.0})
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str,step,loss,accuracy))
            if writer:
                writer.add_summary(summaries, step)
            return accuracy, loss, num_correct, predictions
            
        
        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        print("Writing to {}\n".format(out_dir))
        
        # Summaries for loss and accuracy
        loss_summary = tf.summary.scalar("loss", cnn.loss)
        acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
        
        # Train Summaries
        train_summary_op = tf.summary.merge([loss_summary, acc_summary]) #grad_summaries_merged
        train_summary_dir = os.path.join(out_dir, "summaries", "train")
        train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
        
        # Dev summaries
        dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
        dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
        dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
        
        
        # Checkpointing
        checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
        checkpoint_prefix = os.path.join(checkpoint_dir, "model")
        # Tensorflow assumes this directory already exists so we need to create it
        if not os.path.exists(checkpoint_dir):
            os.makedirs(checkpoint_dir)
        saver = tf.train.Saver(tf.global_variables())  
        
        #start training
        sess.run(tf.global_variables_initializer())
        
        # Train the model with x_train and y_train
        train_batches = batch_iter(list(zip(x_train, y_train)), batch_size = 5,  num_epochs = 10)
        best_accuracy, best_at_step = 0, 0

        for train_batch in train_batches:
            x_train_batch, y_train_batch = zip(*train_batch)
            #batch_xs = np.vstack([np.expand_dims(x, 0) for x in x_train_batch])
            #batch_ys = np.vstack([np.expand_dims(y, 0) for y in y_train_batch])
            train_step(x_train_batch,y_train_batch)
            current_step = tf.train.global_step(sess, global_step)
            
            #evaluate_every = 100, Evaluate model on dev set after this many steps (default: 100)
            #checkpoint_every = 100, Save model after this many steps (default: 100)
            evaluate_every = 100
            checkpoint_every = 100
            
            #if current_step % evaluate_every == 0:
            #    print("\nEvaluation:")
            #   dev_step(x_dev, y_dev, writer=dev_summary_writer)
            #    print("")
            #if current_step % checkpoint_every == 0:
            #    path = saver.save(sess, checkpoint_prefix, global_step=current_step)
            #    print("Saved model checkpoint to {}\n".format(path))
            #open TensorBoard with: tensorboard --logdir= 
            
            # Evaluate the model with x_dev and y_dev
            if current_step % evaluate_every == 0:
                #a single pass over the dev set, so that every example is counted exactly once
                dev_batches = batch_iter(list(zip(x_dev, y_dev)), batch_size = 5, num_epochs = 1, shuffle = False)
                total_dev_correct = 0
                
                for dev_batch in dev_batches:
                    x_dev_batch, y_dev_batch = zip(*dev_batch)
                    acc, loss, num_dev_correct, predictions = dev_step(x_dev_batch, y_dev_batch)
                    total_dev_correct += num_dev_correct
                accuracy = float(total_dev_correct) / len(y_dev)
                print('Accuracy on dev set: {}'.format(accuracy))
                    
                if accuracy >= best_accuracy:
                    best_accuracy, best_at_step = accuracy, current_step
                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                    print('Saved model {} at step {}'.format(path, best_at_step))
                    print('Best accuracy {} at step {}'.format(best_accuracy, best_at_step))
                    
        print('Training is complete, testing the best model on x_test and y_test')
        
        # Evaluate x_test and y_test   
        saver.restore(sess, checkpoint_prefix + '-' + str(best_at_step))
        test_batches = batch_iter(list(zip(x_test, y_test)), 100, 1, shuffle=False)
        total_test_correct = 0
        for test_batch in test_batches:
            x_test_batch, y_test_batch = zip(*test_batch)
            acc, loss, num_test_correct, predictions = dev_step(x_test_batch, y_test_batch)
            total_test_correct += int(num_test_correct)
        #report the accuracy once, after all test batches have been evaluated
        print('Accuracy on test set: {}'.format(float(total_test_correct) / len(y_test)))
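
The evaluation above sums the number of correct predictions over all batches and divides by the number of examples, rather than averaging the per-batch accuracies. The sketch below, with made-up numbers, shows why this matters when the last batch is smaller than the others.

```python
# (correct predictions, batch size) per batch; the numbers are made up for illustration
batch_results = [(97, 100), (73, 73)]

total_correct = sum(correct for correct, _ in batch_results)
total_examples = sum(size for _, size in batch_results)

# Accuracy over all examples: every example carries the same weight
overall_accuracy = total_correct / total_examples

# Averaging per-batch accuracies gives the small last batch too much weight
mean_of_batch_accuracies = sum(correct / size for correct, size in batch_results) / len(batch_results)

print("overall accuracy:         {:.4f}".format(overall_accuracy))
print("mean of batch accuracies: {:.4f}".format(mean_of_batch_accuracies))
```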

As we already mentioned, we could ship the prototype to the market, get feedback and add new features to our model based on that feedback.

6 Conclusions

In this blog post, we provided some useful insights into what MVP means for machine learning. In essence, it is just about starting small and then iterating.

We tried, using this concept, to solve a simple spam-ham text classification problem. We started with a simplistic Bag-of-Words model and then gradually moved to a more sophisticated convolutional neural network architecture.

MVP is a simple approach that significantly reduces the risk that a product/service will fail. Just because it is simple does not mean that you don't need to have a plan in place.

There are three key areas you should focus on: