Module 01 : What is Machine Learning (ML)?

Machine learning is a way to use standard algorithms to derive predictive insights from data and make repeated decisions.

  • Business intelligence is about understanding the past
  • Machine learning uses past data to predict the future

Stage 1 : Train an ML model with examples.

Example : input + label (the input plus what it is, e.g. ‘cat’ or ‘dog’, in the case of a classification problem with pictures).

Stage 2 : Predict with a trained model

-> present new, unlabelled images and see if the model can give you the label. Once you have the model you can use it on any picture.

  • Label : true answer
  • Input : Predictor variables, what you can use to predict the label
  • Example : Input + corresponding label
  • Model : Math function that takes input variables and creates an approximation of the label
  • Training : Adjusting model to minimize error
  • Predictions : Using model on unlabeled data

Two of the most common classes of machine learning models are supervised and unsupervised ML models. The key difference is that with supervised models, we have labels, or in other words, the correct answers to whatever it is that we want to learn to predict. In unsupervised learning, the data does not have labels.

Within supervised ML there are two types of problems: regression and classification. To explain them, let’s dive a little deeper into this data.

This course focuses on supervised learning.

Tons of models can be imagined. You can predict the repair cost of a car at an auction based on the car's condition, or the number of gestation weeks for a mother based on various parameters, etc. These are regression problems, since you are trying to predict a continuous number. The model is fed information collected about mothers and their gestation weeks as training examples; once trained, it can predict the value for a new case.

The simplest model is a simple linear regression :

A classification model can be used to detect whether email is spam or not.

Everything that goes into an ML model has to be numeric. Even for images, we can take each pixel and use its RGB values. An image is a tensor (an N-dimensional array of pixel values).

For words we also have to vectorize them into some kind of numeric vector, for instance with a word2vec model that converts your text input to vectors.

Machine learning is used in lots of industries :

Playing with ML

We draw the separation line between spam and non-spam emails so as to minimize the error rate on the training data set.

You draw a line between the different classes identified in the classification problem. The equation for a line is quite simple :
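
With two input features x1 and x2 the line is simply:

$y = w_1 x_1 + w_2 x_2 + b$

where w1 and w2 are the weights and b is the bias.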

Now how do we find good values for w1, w2 and b ? We use Gradient descent.

Gradient descent is iterative: we take repeated steps to reduce the error. You can speed up or slow down the process with the learning rate, a hyperparameter.

The ideal learning rate (or alpha, as we usually call it :)) varies from problem to problem.
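
Below is a minimal sketch of gradient descent for the two-weight line above, written with NumPy; the dataset, learning rate and number of steps are made up for illustration.

import numpy as np

# made-up training data: two input features per example, one label
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([5.0, 4.0, 7.5])

w = np.zeros(2)        # weights w1, w2
b = 0.0                # bias
learning_rate = 0.01   # the hyperparameter discussed above

for step in range(1000):
    pred = X.dot(w) + b                     # model output
    error = pred - y                        # divergence between prediction and label
    grad_w = 2 * X.T.dot(error) / len(y)    # gradient of the MSE with respect to the weights
    grad_b = 2 * error.mean()               # gradient of the MSE with respect to the bias
    w = w - learning_rate * grad_w          # step against the gradient
    b = b - learning_rate * grad_b          # bigger learning rate -> bigger steps

print(w, b)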

Here is the process for ML :

New terms :

  • Weights : parameters we optimize
  • Batch size : the amount of data we compute error on
  • Epoch : One pass through entire dataset
  • Gradient descent : process of reducing error
  • Evaluation : Is the model good enough ? Has to be done on full dataset
  • Training : Process of optimizing the weights; includes gradient descent + evaluation

For images, how does it work ? To recognize digits, for instance, we use the grey-scale value of each pixel in the image :

For digits there are more classes than in the spam exercise: 10 different outputs (one per digit) compared to the 2 of the spam exercise (spam or not spam).

Softmax : we normalize the output to be between 0 and 1 so we can have a clear benchmark on the output probability. (Sigmoid function anyone ? 🙂 )
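
For reference, softmax turns a vector of raw scores z into probabilities that sum to 1:

$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$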

When the separation in a classification problem is not a simple line, we have to either use polynomial features or simply add neurons to our neural network; the separation then becomes more adaptable and more flexible.

In the TensorFlow Playground :

More neurons -> more input combinations (features); every neuron is a combination of input features.

You can also add new layers, this will help for more complex data sets.

A neural network is only as good as the data provided.

As said before, we can also add polynomial features instead of using more nodes. We engineer some extra features; this works well too.

  • Neurons : one unit of combining inputs
  • Hidden layer : set of neurons that operate on the same set of inputs
  • Inputs : what you feed into a neuron
  • Features : transformations of inputs, such as x^2
  • Feature engineering : coming up with what transformations to include

As discussed, in image models we use the pixels' grey-scale or color values to do recognition.

In reality ML is :

  • Collect data
  • Organize data
  • Create model
  • Use machines to flesh out the model from data
  • Deploy fleshed out model

What makes a good dataset ?

  • The dataset should cover all cases.
  • You also need negative examples and near misses: for example, add flowers, sheep, or anything that looks like a cloud but is not a cloud when the algorithm's purpose is cloud-type recognition.
  • You need to explore the data you have and fix problems: find outliers, decide whether to normalize, understand why the problems exist, then fix them, collect more data, and so on.

Calculating regression error :

Get the error (the divergence between prediction and reality), square it, and calculate the mean over all your cases.

MSE (mean squared error) = $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the prediction and $y_i$ the true value.

For classification problems it is a bit different :

As the prediction and the label are both between 0 and 1, when the expected result is 0 you consider only the right side of the sum, and when the expected result is 1 you consider only the left side of the sum.
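
Written out, this is the binary cross-entropy loss, where $y_i$ is the true label (0 or 1) and $\hat{y}_i$ the predicted probability:

$CE = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$

When $y_i = 0$ only the right-hand term contributes; when $y_i = 1$ only the left-hand term does.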

Accuracy, Precision and Recall

Confusion Matrix :

Accuracy gives you the fraction of predictions that your classifier got right.

But Accuracy is not really a good metric when the datasets are unbalanced :

So we can also use Precision and Recall :

Precision :

For the parking problem, the precision is 100% ! So maybe we have to look at something else..

Recall :

For the parking, 10 empty spaces and we only found 1 : the recall is 10% !

So we see that we have to improve the recall.
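
In terms of the confusion matrix counts (TP = true positives, FP = false positives, TN = true negatives, FN = false negatives):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$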

Here are some more definitions:

  • MSE : the loss measure for regression problems
  • Cross-entropy : the loss measure for classification problems
  • Accuracy : A more intuitive measure of skill for classifiers (when data is balanced)
  • Precision : Accuracy when the classifier says ‘yes’ (unbalanced data, with lots of positives in the dataset)
  • Recall : Accuracy when the truth is ‘yes’ (unbalanced data, with few positives in the dataset)

Creating Machine Learning Datasets

If we have distance traveled and fare amount data for taxi rides, we can fit a linear regression model and use RMSE to evaluate it.

Now we can also see another model that makes 0 mistakes on the training set but will not generalize very well: it has high variance.

This means that you cannot really use it to predict new points: it makes too many errors on a new data set.

This is the reason why we should take the original data set and split it :

In reality you need 3 data sets : training, cross validation and test data.

Digression : we will use Datalab, which is like Jupyter notebooks. It's a Python execution environment where you can mix explanations and code, like an article. Easy collaboration as well.

Lab: Create Machine Learning Datasets

We will split some data in this lab into the training + validation + testing data sets.

In this lab we want to predict the estimated taxi fare based on various parameters.

We first go to AI Platform -> Notebooks and create a new instance and a new Jupyter notebook.

In the Jupyter notebook we open the terminal and clone the git repository.

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Open the jupyter notebook :

And execute the cells one by one.

Module 02 : TensorFlow

  • Scalar : a single number, e.g. 1 or 4.5
  • Vector : one-dimensional array
  • Rank 2 tensor : matrix (two-dimensional array)
  • Rank 3 : three-dimensional tensor
  • Rank 4 : four-dimensional tensor

A tensor is an N-dimensional array of data.

TensorFlow is a library that lets you do numerical computation that uses directed graphs.

Nodes represent mathematical operations and edges represent arrays of data :

It is a way of writing code that works on different kinds of hardware. The core is implemented in C++ to provide this portability, but we will use Core TensorFlow through the Python API.

You can use high level code with Python and then execute it on GPUs.

Python API for TensorFlow :

#First you build the graph in TensorFlow:
c=tf.add(a,b)

#then you execute it:
session=tf.Session()
numpy_c=session.run(c,feed_dict=...)

Here a and b are tensors and add is an operation. Programming in TensorFlow is a two-step process: first build the graph, then execute the graph. As such it is a bit different from NumPy.

TensorFlow has the ability to execute parts of the graph on different machines, and to add debug nodes, compile nodes, etc.

Lab : Getting Started with TensorFlow

Open AI Platform, create a new instance of a Jupyter notebook and open it.

Then clone the repo within your AI Platform Notebook instance via the terminal in the Jupyter notebook.

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Then open the ipynb file and follow the Notebook. Here is the code for the whole lab :

import tensorflow as tf
import numpy as np

print(tf.__version__)

#Adding two tensors
a = np.array([5, 3, 8])
b = np.array([3, -1, 2])
c = np.add(a, b)
print(c)

#The equivalent code in TensorFlow consists of two steps
#Step 1: Build the graph

a = tf.constant([5, 3, 8])
b = tf.constant([3, -1, 2])
c = tf.add(a, b)
print(c)
#this will print the fact that c is a tensor, that it will have 
#3 elements, and those will be int32

#Step 2: Run the graph

with tf.Session() as sess:
  result = sess.run(c)
  print(result)

#Using a feed_dict
#Same graph, but without hardcoding inputs at build stage

a = tf.placeholder(dtype=tf.int32, shape=(None,))  # batchsize x scalar
b = tf.placeholder(dtype=tf.int32, shape=(None,))
c = tf.add(a, b)
with tf.Session() as sess:
  result = sess.run(c, feed_dict={
      a: [3, 4, 5],
      b: [-1, 2, 3]
    })
  print(result)
  
#Heron's Formula in TensorFlow 

def compute_area(sides):
  # slice the input to get the sides
  a = sides[:,0]  # 5.0, 2.3
  b = sides[:,1]  # 3.0, 4.1
  c = sides[:,2]  # 7.1, 4.8
  
  # Heron's formula
  s = (a + b + c) * 0.5   # (a + b) is a short-cut to tf.add(a, b)
  areasq = s * (s - a) * (s - b) * (s - c) # (a * b) is a short-cut to tf.multiply(a, b), not tf.matmul(a, b)
  return tf.sqrt(areasq)

with tf.Session() as sess:
  # pass in two triangles
  area = compute_area(tf.constant([
      [5.0, 3.0, 7.1],
      [2.3, 4.1, 4.8]
    ]))
  result = sess.run(area)
  print(result)
  
  
#Placeholder and feed_dict
#More common is to define the input to a program as a 
#placeholder and then to feed in the inputs. The difference between 
#the code below and the code above is whether the "area" graph is 
#coded up with the input values or whether the "area" graph is coded 
#up with a placeholder through which inputs will be passed in at 
#run-time.

with tf.Session() as sess:
  sides = tf.placeholder(tf.float32, shape=(None, 3))  # batchsize number of triangles, 3 sides
  area = compute_area(sides)
  result = sess.run(area, feed_dict = {
      sides: [
        [5.0, 3.0, 7.1],
        [2.3, 4.1, 4.8]
      ]
    })
  print(result)

#tf.eager
#tf.eager allows you to avoid the build-then-run stages. However, 
#most production code will follow the lazy evaluation paradigm 
#because the lazy evaluation paradigm is what allows for
#multi-device support and distribution.
#One thing you could do is to develop using tf.eager and then 
#comment out the eager execution and add in the session management 
#code.

import tensorflow as tf
from tensorflow.contrib.eager.python import tfe

tfe.enable_eager_execution()

def compute_area(sides):
  # slice the input to get the sides
  a = sides[:,0]  # 5.0, 2.3
  b = sides[:,1]  # 3.0, 4.1
  c = sides[:,2]  # 7.1, 4.8
  
  # Heron's formula
  s = (a + b + c) * 0.5   # (a + b) is a short-cut to tf.add(a, b)
  areasq = s * (s - a) * (s - b) * (s - c) # (a * b) is a short-cut to tf.multiply(a, b), not tf.matmul(a, b)
  return tf.sqrt(areasq)

area = compute_area(tf.constant([
      [5.0, 3.0, 7.1],
      [2.3, 4.1, 4.8]
    ]))

print(area)

How to use TensorFlow for Machine learning ?

1: Setup machine learning model

  • Regression or classification
  • What is the label
  • What are the features

2 : Carry out ML steps

  • Train the model
  • Evaluate the model
  • Predict with the model

So what are the steps to define an Estimator API model :

  1. Set up feature column
  2. Create a model, passing in the feature columns
  3. Write input_fn (returns features, labels); features is a dict
  4. Train the model
  5. Use the trained model to predict

Lab Machine learning using tf.estimator

As usual, create notebook instance etc.

import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

print(tf.__version__)

#Read data created in the previous chapter.

# In CSV, the label is the first column, followed by the features, then the key
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[0]

df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMNS)
df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMNS)

#Input function to read from Pandas Dataframe into tf.constant

def make_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    x = df,
    y = df[LABEL],
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )
  
#Create feature columns for estimator
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
  return input_columns
 
 
#Linear Regression with tf.estimator framework 
tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)

model.train(input_fn = make_input_fn(df_train, num_epochs = 10)) 

#Evaluate on the validation data

def print_rmse(model, name, df):
  metrics = model.evaluate(input_fn = make_input_fn(df, 1))
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))
print_rmse(model, 'validation', df_valid)

import itertools
# Read saved model and use it for prediction
model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
preds_iter = model.predict(input_fn = make_input_fn(df_valid, 1))
print([pred['predictions'][0] for pred in list(itertools.islice(preds_iter, 5))])


#Deep Neural Network regression 

tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_input_fn(df_train, num_epochs = 100));
print_rmse(model, 'validation', df_valid)

We get quite bad results with this code because we are not yet using TensorFlow properly; that is the subject of the next part of the course.

Gaining more Flexibility

We need to refactor our Estimator model :

  • Big data : Read data that does not fit in memory
  • Feature engineering : Add new features easily
  • Model architectures : Evaluate as part of training

To read sharded files we create a TextLineDataset, giving it a function to decode the CSV into features and labels.

Repeat the data and send it along in chunks.

Lab : Refactoring to add batching and feature creation

As usual, create notebook instance etc.

import tensorflow as tf
import numpy as np
import shutil
print(tf.__version__)

#Refactor the input
#Read data created in Lab1a, but this time make it more general and 
#performant. Instead of using Pandas, we will use TensorFlow's 
#Dataset API.

CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
      features = dict(zip(CSV_COLUMNS, columns))
      label = features.pop(LABEL_COLUMN)
      return features, label

    # Create list of files that match pattern
    file_list = tf.gfile.Glob(filename)

    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # indefinitely
        dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    else:
        num_epochs = 1 # end-of-input after this

    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn
    

def get_train():
  return read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN)

def get_valid():
  return read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL)

def get_test():
  return read_dataset('./taxi-test.csv', mode = tf.estimator.ModeKeys.EVAL)


#Refactor the way features are created
#For now, pass these through (same as previous lab). However, 
#refactoring this way will enable us to break the one-to-one 
#relationship between inputs and features.

INPUT_COLUMNS = [
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
]

def add_more_features(feats):
  # Nothing to add (yet!)
  return feats

feature_cols = add_more_features(INPUT_COLUMNS)

#Create and train the model
tf.logging.set_verbosity(tf.logging.INFO)
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.LinearRegressor(
      feature_columns = feature_cols, model_dir = OUTDIR)
model.train(input_fn = get_train(), steps = 100);  # TODO: change the name of input_fn as needed

#Evaluate model 
def print_rmse(model, name, input_fn):
  metrics = model.evaluate(input_fn = input_fn, steps = 1)
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))
print_rmse(model, 'validation', get_valid())

Train and Evaluate

Shuffling the data is important for distributed training.

The TrainSpec consists of the things that used to be passed into the train() method

The EvalSpec controls the evaluation and the checkpointing of the model since they happen at the same time

You can also tell TensorFlow to display INFO messages to get additional information on the current run, and use TensorBoard to monitor the training.

Lab : Distributed training and monitoring

Launch AI Platform Notebooks

Navigate to cloud shell and launch :

export IMAGE_FAMILY="tf-1-14-cpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="tf-tensorboard-1"
export INSTANCE_TYPE="n1-standard-4"
gcloud compute instances create "${INSTANCE_NAME}" \
        --zone="${ZONE}" \
        --image-family="${IMAGE_FAMILY}" \
        --image-project=deeplearning-platform-release \
        --machine-type="${INSTANCE_TYPE}" \
        --boot-disk-size=200GB \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --metadata="proxy-mode=project_editors"

Then open the jupyter notebook

import tensorflow as tf
import numpy as np
import shutil
print(tf.__version__)

#Read data created in Lab1a, but this time make it more general, 
#so that we are reading in batches. Instead of using Pandas, 
#we will use Datasets.

CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
      features = dict(zip(CSV_COLUMNS, columns))
      label = features.pop(LABEL_COLUMN)
      return features, label
    
    # Create list of files that match pattern
    file_list = tf.gfile.Glob(filename)

    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # indefinitely
        dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    else:
        num_epochs = 1 # end-of-input after this
 
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn
  
#Create features out of input data

INPUT_COLUMNS = [
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
]

def add_more_features(feats):
  # Nothing to add (yet!)
  return feats

feature_cols = add_more_features(INPUT_COLUMNS)


#train_and_evaluate
def serving_input_fn():
  feature_placeholders = {
    'pickuplon' : tf.placeholder(tf.float32, [None]),
    'pickuplat' : tf.placeholder(tf.float32, [None]),
    'dropofflat' : tf.placeholder(tf.float32, [None]),
    'dropofflon' : tf.placeholder(tf.float32, [None]),
    'passengers' : tf.placeholder(tf.float32, [None]),
  }
  features = {
      key: tf.expand_dims(tensor, -1)
      for key, tensor in feature_placeholders.items()
  }
  return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)
  
  
def train_and_evaluate(output_dir, num_train_steps):
  estimator = tf.estimator.LinearRegressor(
                       model_dir = output_dir,
                       feature_columns = feature_cols)
  train_spec=tf.estimator.TrainSpec(
                       input_fn = read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN),
                       max_steps = num_train_steps)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec=tf.estimator.EvalSpec(
                       input_fn = read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL),
                       steps = None,
                       start_delay_secs = 1, # start evaluating after N seconds
                       throttle_secs = 10,  # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec) 
  
  
# Run training    
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
train_and_evaluate(OUTDIR, num_train_steps = 5000)

Open the TensorBoard and explore :

Module 3 : Scaling ML Models with Cloud ML Engine

An OK model on LARGE amounts of data is better than a GREAT model on SMALL amounts of data.

When we deal with huge amounts of data we have to iterate through it and distribute the training across many machines, i.e. scale out.

Cloud ML will scale with your needs

There are 3 steps in training your model with Cloud ML Engine :

  • Step 1 : Use TensorFlow to create the computation graph and training application
  • Step 2 : Package your trainer application
  • Step 3 : Configure and start a Cloud ML Engine job

Store your data online! Cloud Storage, for instance, can be used.

Create task.py to parse command-line parameters and pass them along to train_and_evaluate.

model.py contains the ML model in TensorFlow (Estimator API).

Package up the TensorFlow model as a Python package : Python packages need to contain an __init__.py in every folder.
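
As a rough sketch, the package layout used in the taxifare lab below looks like this (model.py and task.py are the files referenced in the lab):

taxifare/
    trainer/
        __init__.py
        model.py
        task.py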

Verify that the model works as a Python package

Then use the gcloud command to submit the training job, either locally or in the cloud.

We will need an extra input function that maps between the JSON from the REST API and the features expected by the model: this is the serving_input_fn function.

Once we have that, we can deploy the model as a microservice, for instance from the shell.

The model is then ready to receive REST API requests: create an endpoint URL and send JSON to the endpoint.

Lab : Scaling up ML with Cloud ML Engine

  • Package up the code
  • Find absolute paths to data
  • Run the Python module from the console
  • Run Locally using gcloud
  • Submit training job using gcloud
  • Deploy model
  • Prediction
  • Train on a larger dataset
  • 1-million row dataset

Create Bucket -> Create Notebook -> clone repo

#Environment variables for project and bucket
import os
PROJECT = 'qwiklabs-gcp-03-8da6a5ea6e2a' # REPLACE WITH YOUR PROJECT ID
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.
BUCKET = 'qwiklabs-gcp-03-8da6a5ea6e2a' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.


# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.4'  # Tensorflow version

%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

%%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")

echo "Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET

#Packaging up the code
!find taxifare
!cat taxifare/trainer/model.py

%%bash
echo $PWD
rm -rf $PWD/taxi_trained
cp $PWD/../tensorflow/taxi-train.csv .
cp $PWD/../tensorflow/taxi-valid.csv .
head -1 $PWD/taxi-train.csv
head -1 $PWD/taxi-valid.csv


#Running the Python module from the command-line
%%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
   --train_data_paths="${PWD}/taxi-train*" \
   --eval_data_paths=${PWD}/taxi-valid.csv  \
   --output_dir=${PWD}/taxi_trained \
   --train_steps=1000 --job-dir=./tmp
   
%%bash
ls $PWD/taxi_trained/export/exporter/   
   
%%writefile ./test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

#Running locally using gcloud
%%bash
rm -rf taxifare.tar.gz taxi_trained
gcloud ai-platform local train \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   -- \
   --train_data_paths=${PWD}/taxi-train.csv \
   --eval_data_paths=${PWD}/taxi-valid.csv  \
   --train_steps=1000 \
   --output_dir=${PWD}/taxi_trained 

!ls $PWD/taxi_trained


#Submit training job using gcloud
%%bash
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/taxifare/smallinput/
gsutil -m cp ${PWD}/*.csv gs://${BUCKET}/taxifare/smallinput/

%%bash
OUTDIR=gs://${BUCKET}/taxifare/smallinput/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=10000

#Deploy model
%%bash
gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/exporter

%%bash
MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/exporter | tail -1)
echo "Run these commands one-by-one (the very first time, you'll create a model and then create a version)"
#gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ai-platform models delete ${MODEL_NAME}
gcloud ai-platform models create ${MODEL_NAME} --regions $REGION
gcloud ai-platform versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version $TFVERSION

Prediction
%%bash
gcloud ai-platform predict --model=taxifare --version=v1 --json-instances=./test.json

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

request_data = {'instances':
  [
      {
        'pickuplon': -73.885262,
        'pickuplat': 40.773008,
        'dropofflon': -73.987232,
        'dropofflat': 40.732403,
        'passengers': 2,
      }
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))

#Train on larger dataset
#IN GOOGLE BIG QUERY
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  'nokeyindata' AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  AND ABS(HASH(pickup_datetime)) % 1000 == 1
  
#Run Cloud training on 1-million row dataset
%%bash

#XXXXX  this takes 60 minutes. if you are sure you want to run it, then remove this line.

OUTDIR=gs://${BUCKET}/taxifare/ch3/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
CRS_BUCKET=cloud-training-demos # use the already exported data
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=STANDARD_1 \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/train.csv" \
   --eval_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/valid.csv"  \
   --output_dir=$OUTDIR \
   --train_steps=100000
  





This lab is difficult...

Kubeflow Pipelines

Kubeflow is an open-source project that packages machine learning code for Kubernetes. Kubeflow Pipelines is a platform for composing, deploying, and managing end-to-end machine learning workflows.

The reusability of Kubeflow Pipelines helps separate the work, which enables people to specialize. In this example a Machine Learning Engineer can focus on feature engineering, linear regression modeling and hyperparameter tuning. Their solution is bundled up into a Kubeflow Pipeline which a Data Engineer can use as part of a data engineering solution. That solution then appears as a service that a Data Analyst uses to derive business insights.

Module 4 : Feature Engineering

How to get better features to get better results ?

What makes a good feature :

  • Should be related to the objective
  • Should be known at production-time
  • Has to be numeric with a meaningful magnitude (for words you need to use, for instance, word2vec)
  • Has enough examples
  • Brings human insight to problem

Categorical features

If you want to use the employee ID, you can instead use a vector of 0s and 1s with a 1 at the position of the employee ID of your example (one-hot encoding). If you have 5 employees in your store, you then need 5 columns -> a sparse column in TensorFlow.

What if you don’t know the keys (the employee IDs)? You have to create a vocabulary of keys to identify all the different values, and you need to do that before you train your model. The mapping needs to be identical at prediction time.

What happens if you hire a new employee? Plan ahead and decide what to do when a new key appears: this is the cold-start problem.
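
A minimal sketch of this with the tf.feature_column API; the 'employeeId' column name and the vocabulary values are made up for illustration.

import tensorflow as tf

# the vocabulary of known keys has to be built before training
employee = tf.feature_column.categorical_column_with_vocabulary_list(
    key = 'employeeId',
    vocabulary_list = ['8345', '72345', '87654', '98723', '23451'],
    num_oov_buckets = 1)  # reserve a bucket for employees hired later (cold start)

# one-hot representation that a model can consume
employee_one_hot = tf.feature_column.indicator_column(employee)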

A customer rating can be used as a continuous value or as a one-hot encoded value.

You also need to make sure you know what to do if you have missing data, for example if you did not receive a rating from the client. Don’t mix magic numbers with the data; you can, for instance, add a column with 0 or 1 depending on whether a rating is present or not.

Feature crosses

Is the car a taxi? Car color + city alone are not enough (yellow cars in NY are usually taxis, but not in Rome). You can add a new crossed column CarXCity to bring human insight: it avoids giving too much weight to New York alone in taxi recognition, while still weighting the combination ‘yellow car in New York’ highly.
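
A sketch of the CarXCity cross with tf.feature_column; the column names and vocabularies are invented for illustration.

import tensorflow as tf

color = tf.feature_column.categorical_column_with_vocabulary_list(
    'car_color', ['yellow', 'white', 'black', 'red'])
city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['new_york', 'rome'])

# the cross lets the model learn a weight for 'yellow AND new_york' specifically
car_x_city = tf.feature_column.crossed_column([color, city], hash_bucket_size = 100)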

Bucketizing

If very small changes in a feature produce big differences in outcome (like house prices depending on latitude, where a tiny change can put you in a different neighborhood or city and have a huge impact), then it is better to group the feature into buckets.
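
For example (a sketch with arbitrary boundaries), a pickup latitude can be bucketized so the model sees a discrete, neighborhood-like bin instead of the raw continuous value:

import tensorflow as tf

lat = tf.feature_column.numeric_column('pickuplat')
# every 0.1 degree of latitude becomes its own bucket
lat_buckets = tf.feature_column.bucketized_column(
    lat, boundaries = [40.5, 40.6, 40.7, 40.8, 40.9])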

Model Architecture

Two types of features : dense and sparse

  • Price is easy, one column -> dense
  • If you have lots of employees you need one column per employee (e.g. 25 columns for 25 employees) -> sparse

Deep neural networks are good for dense, highly correlated features, like the pixels in an image.

You would rather use linear models for sparse, independent features.

You can use both at the same time if you need to, which is what a wide-and-deep network does: tf.estimator.DNNLinearCombinedClassifier.
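
A minimal sketch of a wide-and-deep estimator; the feature columns are invented placeholders, with sparse/crossed features on the linear (wide) side and dense numeric features on the DNN (deep) side.

import tensorflow as tf

# wide side: sparse, crossed categorical features
color = tf.feature_column.categorical_column_with_vocabulary_list('car_color', ['yellow', 'white'])
city = tf.feature_column.categorical_column_with_vocabulary_list('city', ['new_york', 'rome'])
wide_columns = [tf.feature_column.crossed_column([color, city], hash_bucket_size = 100)]

# deep side: dense, continuous numeric features
deep_columns = [tf.feature_column.numeric_column('price')]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns = wide_columns,
    dnn_feature_columns = deep_columns,
    dnn_hidden_units = [64, 16])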

Lab Feature Engineering

#Feature Engineering
%%bash
sudo pip install httplib2==0.12.0 apache-beam[gcp]==2.16.0

import tensorflow as tf
import apache_beam as beam
import shutil
print(tf.__version__)

#Environment variables for project and bucket 
import os
PROJECT = 'cloud-training-demos'    # CHANGE THIS
BUCKET = 'cloud-training-demos' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.

# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8' 

## ensure we're using python3 env
os.environ['CLOUDSDK_PYTHON'] = 'python3'

%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

## ensure we predict locally with our current Python environment
gcloud config set ml_engine/local_python `which python`

#Specifying query to pull the data
def create_query(phase, EVERY_N):
  if EVERY_N == None:
    EVERY_N = 4 #use full dataset
    
  #select and pre-process fields
  base_query = """
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  DAYOFWEEK(pickup_datetime) AS dayofweek,
  HOUR(pickup_datetime) AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """
  
  #add subsampling criteria by modding with hashkey
  if phase == 'train': 
    query = "{} AND ABS(HASH(pickup_datetime)) % {} < 2".format(base_query,EVERY_N)
  elif phase == 'valid': 
    query = "{} AND ABS(HASH(pickup_datetime)) % {} == 2".format(base_query,EVERY_N)
  elif phase == 'test':
    query = "{} AND ABS(HASH(pickup_datetime)) % {} == 3".format(base_query,EVERY_N)
  return query
    
print(create_query('valid', 100)) #example query using 1% of data

#Preprocessing Dataflow job from BigQuery
%%bash
if gsutil ls | grep -q gs://${BUCKET}/taxifare/ch4/taxi_preproc/; then
  gsutil -m rm -rf gs://$BUCKET/taxifare/ch4/taxi_preproc/
fi

import datetime

####
# Arguments:
#   -rowdict: Dictionary. The beam bigquery reader returns a PCollection in
#     which each row is represented as a python dictionary
# Returns:
#   -rowstring: a comma separated string representation of the record with dayofweek
#     converted from int to string (e.g. 3 --> Tue)
####
def to_csv(rowdict):
  days = ['null', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
  CSV_COLUMNS = 'fare_amount,dayofweek,hourofday,pickuplon,pickuplat,dropofflon,dropofflat,passengers,key'.split(',')
  rowdict['dayofweek'] = days[rowdict['dayofweek']]
  rowstring = ','.join([str(rowdict[k]) for k in CSV_COLUMNS])
  return rowstring


####
# Arguments:
#   -EVERY_N: Integer. Sample one out of every N rows from the full dataset.
#     Larger values will yield smaller sample
#   -RUNNER: 'DirectRunner' or 'DataflowRunner'. Specify whether to run the pipeline
#     locally or on Google Cloud respectively. 
# Side-effects:
#   -Creates and executes dataflow pipeline. 
#     See https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline
####
def preprocess(EVERY_N, RUNNER):
  job_name = 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')
  print('Launching Dataflow job {} ... hang on'.format(job_name))
  OUTPUT_DIR = 'gs://{0}/taxifare/ch4/taxi_preproc/'.format(BUCKET)

  #dictionary of pipeline options
  options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
    'project': PROJECT,
    'runner': RUNNER,
    'num_workers' : 4,
    'max_num_workers' : 5
  }
  #instantiate PipelineOptions object using options dictionary
  opts = beam.pipeline.PipelineOptions(flags=[], **options)
  #instantantiate Pipeline object using PipelineOptions
  with beam.Pipeline(options=opts) as p:
      for phase in ['train', 'valid']:
        query = create_query(phase, EVERY_N) 
        outfile = os.path.join(OUTPUT_DIR, '{}.csv'.format(phase))
        (
          p | 'read_{}'.format(phase) >> beam.io.Read(beam.io.BigQuerySource(query=query))
            | 'tocsv_{}'.format(phase) >> beam.Map(to_csv)
            | 'write_{}'.format(phase) >> beam.io.Write(beam.io.WriteToText(outfile))
        )
  print("Done")
  
  
preprocess(50*10000, 'DirectRunner') 

%%bash
gsutil ls gs://$BUCKET/taxifare/ch4/taxi_preproc/

#Run Beam pipeline on Cloud Dataflow
%%bash
if gsutil ls | grep -q gs://${BUCKET}/taxifare/ch4/taxi_preproc/; then
  gsutil -m rm -rf gs://$BUCKET/taxifare/ch4/taxi_preproc/
fi

preprocess(50*100, 'DataflowRunner') 
#change first arg to None to preprocess full dataset

%%bash
gsutil ls -l gs://$BUCKET/taxifare/ch4/taxi_preproc/

%%bash
#print first 10 lines of first shard of train.csv
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/train.csv-00000-of-*" | head


#Develop model with new inputs
#Download the first shard of the preprocessed data to enable local development.


%%bash
if [ -d sample ]; then
  rm -rf sample
fi
mkdir sample
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/train.csv-00000-of-*" > sample/train.csv
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/valid.csv-00000-of-*" > sample/valid.csv

#We have two new inputs in the INPUT_COLUMNS, three engineered features, and the estimator involves bucketization and feature crosses.
%%bash
grep -A 20 "INPUT_COLUMNS =" taxifare/trainer/model.py

%%bash
grep -A 50 "build_estimator" taxifare/trainer/model.py

%%bash
grep -A 15 "add_engineered(" taxifare/trainer/model.py

%%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
  --train_data_paths=${PWD}/sample/train.csv \
  --eval_data_paths=${PWD}/sample/valid.csv  \
  --output_dir=${PWD}/taxi_trained \
  --train_steps=10 \
  --job-dir=/tmp

%%bash
ls taxi_trained/export/exporter/

%%bash
model_dir=$(ls ${PWD}/taxi_trained/export/exporter | tail -1)
saved_model_cli show --dir ${PWD}/taxi_trained/export/exporter/${model_dir} --all

%%writefile /tmp/test.json
{"dayofweek": "Sun", "hourofday": 17, "pickuplon": -73.885262, "pickuplat": 40.773008, "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2}

%%bash
model_dir=$(ls ${PWD}/taxi_trained/export/exporter)
gcloud ml-engine local predict \
  --model-dir=${PWD}/taxi_trained/export/exporter/${model_dir} \
  --json-instances=/tmp/test.json
  

#Train on cloud
%%bash
OUTDIR=gs://${BUCKET}/taxifare/ch4/taxi_trained
JOBNAME=lab4a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=${PWD}/taxifare/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC \
  --runtime-version=$TFVERSION \
  -- \
  --train_data_paths="gs://$BUCKET/taxifare/ch4/taxi_preproc/train*" \
  --eval_data_paths="gs://${BUCKET}/taxifare/ch4/taxi_preproc/valid*"  \
  --train_steps=5000 \
  --output_dir=$OUTDIR
  
  


%%bash
gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1

%%bash
model_dir=$(gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1)
saved_model_cli show --dir ${model_dir} --all

%%bash
model_dir=$(gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1)
gcloud ml-engine local predict \
  --model-dir=${model_dir} \
  --json-instances=/tmp/test.json
  
#Optional: deploy model to cloud
%%bash
MODEL_NAME="feateng"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1)
echo "Run these commands one-by-one (the very first time, you'll create a model and then create a version)"
#gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ai-platform models delete ${MODEL_NAME}
gcloud ai-platform models create ${MODEL_NAME} --regions $REGION
gcloud ai-platform versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version $TFVERSION


%%bash
gcloud ai-platform predict --model=feateng --version=v1 --json-instances=/tmp/test.json







What we did : we added two engineered features plus the day of the week and ran the model on the cloud. We got a result that was a little bit better.

Hyperparameter Tuning

We can basically play with parameters in a YAML file that controls the learning rate, the number of nodes and layers, and the batch size, in order to find the optimal combination.

Accuracy improves through feature tuning, hyperparameter tuning, and using larger datasets.

Going forward

Cloud Speech-to-Text converts audio to text for data processing. Cloud Natural Language API recognizes parts of speech called entities and sentiment. Cloud Translation converts text in one language to another. Dialogflow Enterprise Edition is used to build chatbots to conduct conversations. Cloud Text-to-Speech converts text into high quality voice audio. Cloud Vision API is for working with and recognizing content in still images. And Cloud Video Intelligence API is for recognizing motion and action in video.

You can also use BigQuery directly :