The amount of data we produce daily is truly mind-boggling. There are 2.5 quintillion bytes of data created each day at our current pace, and that pace is only accelerating. With such growth, more and more companies recognize the importance of incorporating artificial intelligence into their business models. In order to succeed, many companies, especially small ones (start-ups), are taking the approach of minimum viable (data) product development (abbreviated: MVP).
The MVP is essentially the state-of-the-art methodology for product development nowadays. Its central idea is that by building products or services iteratively, constantly integrating customer feedback, we can significantly reduce the risk that a product or service will fail (build-measure-learn).
from IPython.core.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/picture_1.png", width=1000, height=600)
For example, if your final product idea is an electric car, then according to the MVP approach, you start with the smallest effort needed to deliver a car. In this case, you will perhaps start with just two wheels, a board and a simple electric engine. Then you ship this to the market and gather feedback, continuously improving your product by adding more complexity to it. By starting with a small subset of the features of your final product, you can get to a solution sooner, even if it's not perfect.
from IPython.core.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/picture_2.png", width=1000, height=600)
Minimum viable data product development is based on this concept. First, we define the key business question; next, we acquire the data needed to answer it; and finally, we implement a test, using the simplest appropriate statistical model, to answer that question. We deliver the product and build on the feedback that we receive from the market.
The key is to find the minimum viable data. It doesn't need to be perfect data, just enough to inform a decision on whether to stick or twist, and to enable you to measure the impact of that decision. The machine learning model doesn't have to be perfect either, just a baseline model which can provide you with a meaningful reference point to compare against.
In this blog post, we will try to solve a text classification problem, following the minimum viable product development approach. Suppose that our goal is to develop a machine learning system for a start-up that can classify an email as spam or ham. We will start with a simple bag-of-words model, as a useful basis for modeling our prediction problem, and will end up with a sophisticated convolutional neural network built in TensorFlow.
In terms of the electric car analogy, the bag-of-words model is the two-wheels-and-a-simple-engine prototype, while the convolutional neural network is the final, ready-for-delivery electric car.
We will use a minimal annotated dataset and analyse how learning curves help us answer the following question: how much annotated data do we need for our model to perform well? In other words: 'How much data is enough data?'
The first step in the process of developing our machine learning model is to collect data relevant to our task. As previously mentioned, we will use the minimum viable data approach: the least amount of data needed to make an effective decision.
To set the wheels in motion, we first have to find a minimal dataset to start testing our assumptions. Such a dataset is the SMSSpamCollection corpus, which can be downloaded from: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The file contains a collection of more than five thousand SMS phone messages, manually annotated as spam or ham. The dataset is an excellent source on which to start building our statistical model.
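If you prefer to fetch the corpus programmatically, a minimal sketch along the following lines should work; the direct download URL below is assumed from the UCI repository layout and may change over time.
import io
import urllib.request
import zipfile
#assumed direct link to the zipped corpus in the UCI repository (dataset #228)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
with urllib.request.urlopen(url) as response:
    payload = io.BytesIO(response.read())
#the archive contains the tab-separated file named 'SMSSpamCollection'
with zipfile.ZipFile(payload) as archive:
    archive.extract('SMSSpamCollection', path='smsspamcollection')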
Generally, with less data:
First, we will import all the libraries used throughout this blog post. Once we do that, we load our minimal SMSSpamCollection dataset into a pandas dataframe.
#import libraries
import pandas as pd
import numpy as np
import logging
import itertools
import time
import datetime
import shutil
import os
import re
import sys
import json
import csv
import pickle
from collections import defaultdict, Counter
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from textblob import TextBlob
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, learning_curve
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
import tensorflow as tf
from tensorflow.contrib import learn
#load dataset into a Pandas Dataframe
df = pd.read_csv('//Users//vlad//Desktop//MVP//smsspamcollection//SMSSpamCollection',
sep='\t',
skip_blank_lines = True,
quoting=csv.QUOTE_NONE,
names=["label", "sms"])
#display the first 10 rows
df.head(n = 10)
As we can see in the cell below, the corpus contains 5,574 text messages, labeled as spam or ham.
print('The dataframe contains {:,} messages.'.format(df.shape[0]))
With the file loaded into our Jupyter Notebook, we ask ourselves a couple of questions: is the dataset balanced, and how long are the messages?
We ask ourselves some NLP questions as well: how should we tokenize the messages and normalize the words into their base form (lemmas)?
Don't forget: we follow the minimum viable data development approach, so we want to get results, and fast! That's why we will not spend valuable time answering all our questions now; that will be done later. We pick the first two and move on!
In the cells below, we will aggregate some basic statistics per group and answer the question whether our dataset is balanced or not.
#aggregate basic statistics per category
df.groupby(['label']).describe()
#create the class_info function
def class_info(classes):
    counts = Counter(classes)
    total = sum(counts.values())
    for cls in counts.keys():
        print("%6s: % 7d = % 5.1f%%" % (cls, counts[cls], counts[cls]/total*100))
class_info(df['label'])
colors = ['#66b3ff','orange']
fig, ax = plt.subplots(figsize = (10,10))
labels = ['Spam','Ham']
sizes = [13.4,86.6]
explode = [0.1,0]
patches, texts, autotexts = ax.pie(sizes,
labels = labels,
explode = explode,
shadow = True,
autopct='%1.1f%%',
startangle = 180,
colors = colors)
ax.axis('equal')
for i in range(2):
    texts[i].set_fontsize(18)
for j in range(2):
    autotexts[j].set_fontsize(18)
ax.set_title('% of Spam and Ham classes', size = 20, y = 1.009)
From the pie chart, it's clear that the dataset is imbalanced, with the spam class comprising 13.4% and the ham class 86.6% of the total. At this point we would normally resample our data to obtain balanced classes. However, in a minimum viable data development context this step can be skipped, at least for now.
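If we did decide to balance the classes at this stage, a minimal sketch could simply undersample the majority class; the plain undersampling strategy and the random_state below are illustrative assumptions (libraries such as imbalanced-learn offer more principled options).
#hypothetical sketch: undersample the ham class so both classes are equally represented
spam = df[df['label'] == 'spam']
ham = df[df['label'] == 'ham'].sample(n=len(spam), random_state=42)
df_balanced = pd.concat([spam, ham]).sample(frac=1, random_state=42).reset_index(drop=True)
print(df_balanced['label'].value_counts())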
Next we will check the number of words each message contains and plot the message length per class.
#create a new column length with the number of words for each message
df['length'] = df['sms'].apply(lambda e: e.split()).apply(lambda e: len(e))
#plot the length distribution of each text message
fig, ax = plt.subplots(figsize = (10,10))
df['length'].plot(kind = 'hist',bins = 100 ,edgecolor = 'red', ax = ax)
ax.set_title('Number of words in each message', size = 20, y = 1.009)
ax.set_ylabel('Frequency', size = 15)
ax.grid(True)
#round the entries in the frame below to 2 decimals
df[['length']].describe().apply(lambda e: round(e,2))
#plot the length distribution for each class
fig, ax = plt.subplots(figsize = (10,10))
ax1, ax2 = df.hist(column='length', by='label', bins= 20,edgecolor = 'red', ax = ax)
ax1.grid(True)
ax2.grid(True)
for tick_label in ax1.get_xticklabels():
    tick_label.set_rotation(0)
for tick_label in ax2.get_xticklabels():
    tick_label.set_rotation(0)
Looking at the two histograms, it's clear that the spam messages tend to be much longer. Perhaps we could use this information to create a new feature and make our classifier smarter (see the sketch below). But not for now; remember, we want to build a minimal baseline model which can provide us a meaningful reference point.
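As a hedged illustration only (we will not use it in what follows), the message length could later be added alongside the bag-of-words counts via a FeatureUnion; the MessageLength transformer below is a hypothetical helper, not part of the pipelines we build next.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer returning the word count of each message as a single numeric feature."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(str(text).split())] for text in X])
#combine token counts with the length feature; any downstream classifier stays unchanged
combined_features = FeatureUnion([
    ('bow', CountVectorizer()),
    ('length', MessageLength()),
])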
In this section we will convert the raw text messages into n-dimensional vectors. First, we will use a split_into_tokens function to split each message into individual words, and then normalize the words into their base form (lemmas) using the split_into_lemmas function.
#split messages into their individual words
def split_into_tokens(text):
    text = str(text)
    return TextBlob(text).words
#apply the function on the first 10 rows of the dataframe
df['sms'].head(n = 10).apply(split_into_tokens)
#normalize words into their base form (lemmas)
def split_into_lemmas(message):
    message = str(message).lower()
    words = TextBlob(message).words
    return [word.lemma for word in words]
#apply the function on the first 10 rows of the dataframe
df['sms'].head(n =10).apply(split_into_lemmas)
Now we are ready to implement our baseline machine learning models. We will first use a simple Multinomial Naive Bayes classifier and then experiment with Logistic Regression (a natural choice, since we are working on a binary classification problem).
The picture below shows the pipeline of the supervised learning model we are trying to build.
from IPython.core.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/diagram_24.png", width=1000, height=600)
#Split data into train and test. Keep 20 percent from the initial dataset for test.
msg_train, msg_test, label_train, label_test = train_test_split(df['sms'], df['label'], test_size=0.2)
#print(len(msg_train), len(msg_test), len(label_train) + len(label_test))
print('msg train: {:,}'.format(len(msg_train)))
print('msg test: {:,}'.format(len(msg_test)))
print('label train: {:,}'.format(len(label_train)))
print('label test: {:,}'.format(len(label_test)))
#Build a pipeline
pipeline_MultinomialNB = Pipeline([
('bow', CountVectorizer(analyzer=split_into_lemmas)), # strings to token integer counts
('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores
('classifier', MultinomialNB()), # train on TF-IDF vectors w/ Naive Bayes classifier
])
#cross validation
scores_MultinomialNB = cross_val_score(pipeline_MultinomialNB, # steps to convert raw messages into models
msg_train, # training data
label_train, # training labels
cv = 10, # split data randomly into 10 parts: 9 for training, 1 for scoring
scoring = 'accuracy', # which scoring metric?
n_jobs = -1, # -1 = use all cores = faster
)
print(scores_MultinomialNB)
#print overall mean and standard deviation
print('Overall mean score for Multinomial Naive Bayes: {}%'.format(round(scores_MultinomialNB.mean()*100, 2)))
print('Overall std score for Multinomial Naive Bayes: {}%'.format(round(scores_MultinomialNB.std()*100, 2)))
In the next step, we will search for the best parameters of our model using scikit-learn's GridSearchCV.
print('The possible parameters for the Multinomial Naive Bayes classifier:\n \n \n', pipeline_MultinomialNB.get_params())
#search for best parameters
params = {
'tfidf__use_idf': (True, False),
'bow__analyzer': (split_into_lemmas, split_into_tokens)
}
grid_MultinomialNB = GridSearchCV(
pipeline_MultinomialNB, # pipeline from above
params, # parameters to tune via cross validation
refit=True, # fit using all available data at the end, on the best found param combination
n_jobs=-1, # number of cores to use for parallelization; -1 for "all cores"
scoring='accuracy', # what score are we optimizing?
cv=StratifiedKFold(n_splits = 5)) # what type of cross validation to use
#fit the model
%time MultinomialNB_detector = grid_MultinomialNB.fit(msg_train, label_train)
print('The results of the search:\n \n \n',MultinomialNB_detector.cv_results_)
print('The best score: {}'.format(round(grid_MultinomialNB.best_score_,2)))
#generate predictions
%time predictions_MultinomialNB = MultinomialNB_detector.predict(msg_test)
print('Confusion Matrix using Multinomial Naive Bayes classifier: \n \n',confusion_matrix(label_test, predictions_MultinomialNB))
print('Classification report for Multinomial Naive Bayes: \n \n ',classification_report(label_test, predictions_MultinomialNB))
Now, we are ready to plot the learning curves for our Multinomial Naive Bayes classifier.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.model_selection module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default -1, i.e. all cores).
    """
    plt.figure(figsize=(10, 10))
    plt.title(title, size=20, y=1.008)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples", size=15)
    plt.ylabel("Score", size=15)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
%time plot_learning_curve(pipeline_MultinomialNB, "Learning Curve for Multinomial Naive Bayes", msg_train, label_train, cv=3)
Since performance keeps growing for both the training and cross-validation scores, we can see that our model is not complex/flexible enough to capture all the nuance, given so little data. In this particular case the effect is not very pronounced, since the accuracies are high anyway.
To solve this problem, we have two options: use more training data, or use a more complex (lower-bias) model.
In a minimum viable data development context, the first approach is the more feasible one: we want the simplest and easiest-to-implement algorithm.
We need to find more data!
Once we have obtained additional data, for every batch of data points we add to the existing training set we run our models over and over again, until we achieve the ideal learning curve (see the picture below).
from IPython.core.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/learning_curves.png", width=800, height=800)
The ideal learning curve is one where the training and cross-validation scores converge towards a high plateau as more training examples are added.
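To make the "add data and re-run" loop concrete, here is a minimal sketch that tracks the mean cross-validation accuracy as the training set grows; the growing slices of msg_train simply stand in for newly acquired data, and the step sizes are arbitrary assumptions.
#hypothetical sketch: re-score the baseline pipeline on growing slices of the training data
for frac in np.linspace(0.2, 1.0, 5):
    n = int(frac * len(msg_train))
    subset_msg = msg_train.iloc[:n]
    subset_label = label_train.iloc[:n]
    mean_accuracy = cross_val_score(pipeline_MultinomialNB, subset_msg, subset_label,
                                    cv=3, scoring='accuracy', n_jobs=-1).mean()
    print('{:>5,} training messages -> mean CV accuracy {:.3f}'.format(n, mean_accuracy))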
Similarly, we experiment with a simple Logistic Regression classifier.
#Build the logistic regression pipeline
pipeline_LogisticRegression = Pipeline([
('bow', CountVectorizer(analyzer=split_into_lemmas)), # strings to token integer counts
('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores
('classifier', LogisticRegression()), # train on TF-IDF vectors w/ Logistic Regression classifier
])
scores_LogisticRegression = cross_val_score(pipeline_LogisticRegression, # steps to convert raw messages into models
msg_train, # training data
label_train, # training labels
cv=10, # split data randomly into 10 parts: 9 for training, 1 for scoring
scoring='accuracy', # which scoring metric?
n_jobs=-1, # -1 = use all cores = faster
)
print(scores_LogisticRegression)
#print overall mean and standard deviation
print('Overall mean score for Logistic Regression: {}%'.format(round(scores_LogisticRegression.mean()*100, 2)))
print('Overall std score for Logistic Regression: {}%'.format(round(scores_LogisticRegression.std()*100, 2)))
print('The possible parameters for the Logistic Regression classifier:\n \n \n', pipeline_LogisticRegression.get_params())
#search for best parameters
params = {
'tfidf__use_idf': (True, False),
'bow__analyzer': (split_into_lemmas, split_into_tokens)
}
grid_LogisticRegression = GridSearchCV(
pipeline_LogisticRegression, # pipeline from above
params, # parameters to tune via cross validation
refit=True, # fit using all available data at the end, on the best found param combination
n_jobs=-1, # number of cores to use for parallelization; -1 for "all cores"
scoring='accuracy', # what score are we optimizing?
cv=StratifiedKFold(n_splits = 5)) # what type of cross validation to use
#fit the model
%time LogisticRegression_detector = grid_LogisticRegression.fit(msg_train, label_train)
print('The results of the search:\n \n \n',LogisticRegression_detector.cv_results_)
print('The best score: {}'.format(round(grid_LogisticRegression.best_score_,2)))
#generate predictions
%time predictions_LogisticRegression = LogisticRegression_detector.predict(msg_test)
print('Confusion Matrix using Logistic Regression classifier: \n \n',confusion_matrix(label_test, predictions_LogisticRegression))
print('Classification report for Logistic Regression: \n \n ',classification_report(label_test, predictions_LogisticRegression))
%time plot_learning_curve(pipeline_LogisticRegression, "Learning Curve for Logistic Regression", msg_train, label_train, cv=3)
After many iterations, and once we feel confident in its predictive capabilities, we can deploy the model to production, ship the product, get feedback, and then improve it by adding more complexity. Following this approach, we were not only able to deliver a minimum viable data product at the lowest possible cost, but we also solved the problem on our local machine, avoiding a potentially huge investment in costly infrastructure.
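As a hedged sketch of what shipping the baseline might look like, the fitted grid-search model can be persisted with pickle and reloaded by a serving process; the file name spam_detector.pkl is an arbitrary assumption, and the custom analyzer functions (split_into_lemmas, split_into_tokens) must be importable wherever the model is unpickled.
#persist the fitted model so a serving process can load it
with open('spam_detector.pkl', 'wb') as f:
    pickle.dump(MultinomialNB_detector, f)
#later, in the serving process (with the analyzer functions defined)
with open('spam_detector.pkl', 'rb') as f:
    spam_detector = pickle.load(f)
print(spam_detector.predict(['WINNER! Claim your free prize now!']))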
Once we have a decent working prototype, we can move to more complex and sophisticated models. A possible approach to our text classification problem would be to implement a convolutional neural network. We could start with the simplest possible architecture, train and evaluate the model, and then, depending on the results, keep improving it. Perhaps we would add another hidden layer and then repeat the same modelling exercise. We might want to use word embeddings as well.
Such an approach, however, has many disadvantages. A major drawback of deep learning in general is interpretability: it is hard to interpret the results of a neural network, depending on what kind of network you use. Moreover, you can easily spend a huge amount of time tuning the parameters of a neural network for a negligible gain in model performance.
In this part of our blog post we will show how we could solve the text classification problem within a deep learning framework. We will not dive into any details regarding the implementation of our neural network.
from IPython.core.display import Image
Image(filename= "/Users/vlad/Desktop/MVP/convolutional_neural_network.png", width=1000, height=1000)
#display the first 10 rows of the dataframe
df.head(n = 10)
#drop the length column
df = df.drop('length', axis = 1)
#get the labels
labels = sorted(set(df['label'].unique()))
print('The classes are: \n \n {0} and {1}'.format(labels[0], labels[1]))
num_classes = len(sorted(set(df['label'])))
print('The number of classes are equal to: {}'.format(num_classes))
#create a 2 by 2 matrix of zeros
one_hot = np.zeros((num_classes, num_classes), dtype = np.int)
#fill the diagonal values of the matrix with ones
np.fill_diagonal(one_hot, 1)
#print the one_hot array
print('The final one-hot array: \n \n {}'.format(one_hot))
#assign a one-hot representation to each label
#label_dictionary = dict(zip(labels, one_hot))
label_dictionary = {label:one_hot for label, one_hot in zip(labels, one_hot)}
#print the label_dictionary
print('One-hot representation for each class: \n {}'.format(label_dictionary))
#create the output signal
y_row = df['label'].apply(lambda e: label_dictionary[e]).tolist()
#create the input signal (which is a nested list of lists)
x_row = df['sms'].apply(lambda e: e.split()).tolist()
#define the padding function
def pad_sentences(sentences, padding_word="<PAD/>", forced_sequence_length=None):
    """Pad sentences during training or prediction"""
    if forced_sequence_length is None: # Train
        sequence_length = max(len(x) for x in sentences)
    else: # Prediction
        logging.critical('This is prediction, reading the trained sequence length')
        sequence_length = forced_sequence_length
    logging.critical('The maximum length is {}'.format(sequence_length))
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        if num_padding < 0: # Prediction: cut off the sentence if it is longer than the sequence length
            logging.info('This sentence has to be cut off because it is longer than trained sequence length')
            padded_sentence = sentence[0:sequence_length]
        else:
            padded_sentence = sentence + [padding_word] * num_padding
        padded_sentences.append(padded_sentence)
    return padded_sentences
#apply padding on each sentence
x_row = pad_sentences(x_row)
#vocabulary_inv : most_frequent
def build_vocab(sentences):
    word_counts = Counter(itertools.chain(*sentences))
    vocabulary_inv = [word[0] for word in word_counts.most_common()]
    vocabulary = {word: index for index, word in enumerate(vocabulary_inv)}
    return vocabulary, vocabulary_inv #vocabulary_inv : most_frequent
#call the build_vocab
vocabulary, vocabulary_inv = build_vocab(x_row)
print('The vocabulary contains {:,} words.'.format(len(vocabulary.keys())))
print('The vocabulary_inv contains {:,} words.'.format(len(vocabulary_inv)))
#create the final input array
x = np.array([[vocabulary[word] for word in sentence] for sentence in x_row])
print('The final INPUT array has {:,} rows and {:,} columns.'.format(x.shape[0], x.shape[1]))
#create the final output array
y = np.array(y_row)
print('The final OUTPUT array has {:,} rows and {:,} columns (= #classes).'.format(y.shape[0], y.shape[1]))
#initialize embeddings
def load_embeddings(vocabulary):
    word_embeddings = {}
    for word in vocabulary:
        word_embeddings[word] = np.random.uniform(-0.25, 0.25, 300)
    return word_embeddings
#load embeddings
word_embeddings = load_embeddings(vocabulary)
#build the embedding matrix
embedding_mat = [word_embeddings[word] for index, word in enumerate(vocabulary_inv)]
#create the 300-dimensional embedding matrix
embedding_mat = np.array(embedding_mat, dtype = np.float32)
print('The embedding matrix has {:,} rows and {:,} columns.'.format(embedding_mat.shape[0], embedding_mat.shape[1]))
#create batches
def batch_iter(data, batch_size, num_epochs, shuffle=True):
    data = np.array(data)
    data_size = len(data)
    num_batches_per_epoch = int(data_size / batch_size) + 1
    for epoch in range(num_epochs):
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            shuffled_data = data[shuffle_indices]
        else:
            shuffled_data = data
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield shuffled_data[start_index:end_index]
# Split the original dataset into train set and test set
x_, x_test, y_, y_test = train_test_split(x, y, test_size=0.3)
# Split the train set into train set and dev set
x_train, x_dev, y_train, y_dev = train_test_split(x_, y_, test_size=0.3)
print('x_train: {:,}, x_dev: {:,}, x_test: {:,}'.format(len(x_train), len(x_dev), len(x_test)))
print('y_train: {:,}, y_dev: {:,}, y_test: {:,}'.format(len(y_train), len(y_dev), len(y_test)))
class TextCNN(object):
"""
A CNN for text classification.
Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
"""
def __init__(self, sequence_length, num_classes, vocab_size, embedding_size, filter_sizes, num_filters, l2_reg_lambda =0.0):
# Placeholders for input, output and dropout
self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
l2_loss = tf.constant(0.0)
with tf.device('/cpu:0'), tf.name_scope("embedding"):
W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),name="W")
self.embedded_chars = tf.nn.embedding_lookup(embedding_mat, self.input_x)
self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
with tf.name_scope("conv-maxpool-%s" % filter_size):
# Convolution Layer
filter_shape = [filter_size, embedding_size, 1, num_filters]
W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
conv = tf.nn.conv2d(self.embedded_chars_expanded,W,strides=[1, 1, 1, 1],padding="VALID",name="conv")
# Apply nonlinearity
h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
# Max-pooling over the outputs
pooled = tf.nn.max_pool(h,ksize=[1, sequence_length - filter_size + 1, 1, 1],strides=[1, 1, 1, 1],padding='VALID',name="pool")
pooled_outputs.append(pooled)
# Combine all the pooled features
num_filters_total = num_filters * len(filter_sizes)
self.h_pool = tf.concat(pooled_outputs, 3)
self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) #self.h_pool_flat
#Add dropout
with tf.name_scope("dropout"):
self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
with tf.name_scope("output"):
W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
l2_loss += tf.nn.l2_loss(W)
l2_loss += tf.nn.l2_loss(b)
self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
self.predictions = tf.argmax(self.scores, 1, name="predictions")
# Calculate mean cross-entropy loss
with tf.name_scope("loss"):
losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y, name = "loss")
self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
# Calculate Accuracy
with tf.name_scope("accuracy"):
correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
with tf.name_scope('num_correct'):
correct = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
self.num_correct = tf.reduce_sum(tf.cast(correct, 'float'))
with tf.Graph().as_default():
with tf.Session(config= tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)) as sess:
#sess.run(tf.global_variables_initializer())
cnn = TextCNN(sequence_length=171,
num_classes=2,
vocab_size=embedding_mat.shape[0],
embedding_size=embedding_mat.shape[1],
filter_sizes=[3,4,5],
num_filters=x.shape[0],
l2_reg_lambda = 0.0)
global_step = tf.Variable(0, name="global_step", trainable=False)
#optimizer = tf.train.AdamOptimizer(learning_rate = 1e-4)
optimizer = tf.train.RMSPropOptimizer(1e-2, decay=0.9)
#LEARNING RATE THAT CHANGES
#starter_learning_rate = 0.1
#learning_rate_test = tf.train.exponential_decay(starter_learning_rate, global_step, 100, 0.96, staircase=True)
#optimizer = tf.train.GradientDescentOptimizer(learning_rate= 0.2)
grads_and_vars = optimizer.compute_gradients(cnn.loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
# Keep track of gradient values and sparsity (optional)
#grad_summaries = []
#for g, v in grads_and_vars:
# if g is not None:
# grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
# sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
# grad_summaries.append(grad_hist_summary)
# grad_summaries.append(sparsity_summary)
#grad_summaries_merged = tf.summary.merge(grad_summaries)
def train_step(x_batch, y_batch):
_ , step ,summaries, loss_val, accuracy_val = sess.run([train_op, global_step,train_summary_op, cnn.loss, cnn.accuracy],
feed_dict = {cnn.input_x: x_batch,
cnn.input_y: y_batch,
cnn.dropout_keep_prob: 0.5})
time_str = datetime.datetime.now().isoformat()
print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss_val, accuracy_val))
train_summary_writer.add_summary(summaries, step)
def dev_step(x_batch, y_batch, writer=None):
step ,summaries, loss, accuracy, num_correct, predictions = sess.run([global_step,dev_summary_op,
cnn.loss,
cnn.accuracy,
cnn.num_correct,
cnn.predictions],
feed_dict = {cnn.input_x: x_batch,
cnn.input_y: y_batch,
cnn.dropout_keep_prob: 1.0})
time_str = datetime.datetime.now().isoformat()
print("{}: step {}, loss {:g}, acc {:g}".format(time_str,step,loss,accuracy))
if writer:
writer.add_summary(summaries, step)
return accuracy, loss, num_correct, predictions
# Output directory for models and summaries
timestamp = str(int(time.time()))
out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
print("Writing to {}\n".format(out_dir))
# Summaries for loss and accuracy
loss_summary = tf.summary.scalar("loss", cnn.loss)
acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
# Train Summaries
train_summary_op = tf.summary.merge([loss_summary, acc_summary]) #grad_summaries_merged
train_summary_dir = os.path.join(out_dir, "summaries", "train")
train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
# Dev summaries
dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
# Checkpointing
checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
checkpoint_prefix = os.path.join(checkpoint_dir, "model")
# Tensorflow assumes this directory already exists so we need to create it
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
saver = tf.train.Saver(tf.global_variables())
#start training
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver(tf.global_variables())
# Train the model with x_train and y_train
train_batches = batch_iter(list(zip(x_train, y_train)), batch_size = 5, num_epochs = 10)
best_accuracy, best_at_step = 0, 0
for train_batch in train_batches:
x_train_batch, y_train_batch = zip(*train_batch)
#batch_xs = np.vstack([np.expand_dims(x, 0) for x in x_train_batch])
#batch_ys = np.vstack([np.expand_dims(y, 0) for y in y_train_batch])
train_step(x_train_batch,y_train_batch)
current_step = tf.train.global_step(sess, global_step)
#evaluate_every = 100, Evaluate model on dev set after this many steps (default: 100)
#checkpoint_every = 100, Save model after this many steps (default: 100)
evaluate_every = 100
checkpoint_every = 100
#if current_step % evaluate_every == 0:
# print("\nEvaluation:")
# dev_step(x_dev, y_dev, writer=dev_summary_writer)
# print("")
#if current_step % checkpoint_every == 0:
# path = saver.save(sess, checkpoint_prefix, global_step=current_step)
# print("Saved model checkpoint to {}\n".format(path))
#open TensorBoard with: tensorboard --logdir=
# Evaluate the model with x_dev and y_dev
if current_step % evaluate_every == 0:
dev_batches = batch_iter(list(zip(x_dev, y_dev)), batch_size = 5, num_epochs = 10)
total_dev_correct = 0
for dev_batch in dev_batches:
x_dev_batch, y_dev_batch = zip(*dev_batch)
acc, loss, num_dev_correct, predictions = dev_step(x_dev_batch, y_dev_batch)
total_dev_correct += num_dev_correct
accuracy = float(total_dev_correct) / len(y_dev)
print('Accuracy on dev set: {}'.format(accuracy))
if accuracy >= best_accuracy:
best_accuracy, best_at_step = accuracy, current_step
path = saver.save(sess, checkpoint_prefix, global_step=current_step)
print('Saved model {} at step {}'.format(path, best_at_step))
print('Best accuracy {} at step {}'.format(best_accuracy, best_at_step))
print('Training is complete, testing the best model on x_test and y_test')
# Evaluate x_test and y_test
saver.restore(sess, checkpoint_prefix + '-' + str(best_at_step))
test_batches = batch_iter(list(zip(x_test, y_test)), 100, 1, shuffle=False)
total_test_correct = 0
for test_batch in test_batches:
x_test_batch, y_test_batch = zip(*test_batch)
acc, loss, num_test_correct, predictions = dev_step(x_test_batch, y_test_batch)
total_test_correct += int(num_test_correct)
print('Accuracy on test set: {}'.format(float(total_test_correct) / len(y_test)))
As we already mentioned, we could ship the prototype to the market, get feedback and add new features to our model based on that feedback.
In this blog post, we provided some useful insights into what MVP means for machine learning. In essence, it is just about starting small and then iterating.
Using this concept, we tackled a simple spam/ham text classification problem. We started with a simplistic bag-of-words model and then gradually moved to a more sophisticated convolutional neural network architecture.
MVP is a simple approach that significantly reduces the risk that a product or service will fail. But just because it is simple doesn't mean you don't need a plan in place.
There are three key areas you should focus on: the key business question you want to answer, the minimum viable data needed to answer it, and the simplest model that can give you a meaningful reference point.