Module 01 : What is Machine Learning (ML)?
Machine learning is a way to use standard algorithms to derive predictive insights from data and make repeated decisions.
- Business intelligence is about understanding the past
- Machine learning uses past data to predict the future
Stage 1 : Train a model with examples.
Example : input + label (the input plus what it is, e.g. 'cat' or 'dog' in a picture-classification problem).
Stage 2 : Predict with a trained model
-> present new, unlabelled images and see whether the model can produce the correct label. Once you have the model you can use it on any picture.
- Label : true answer
- Input : Predictor variables, what you can use to predict the label
- Example : Input + corresponding label
- Model : Math function that takes input variables and creates approximation to label
- Training : Adjusting model to minimize error
- Predictions : Using model on unlabeled data
Two of the most common classes of machine learning models are supervised and unsupervised ML models. The key difference is that with supervised models, we have labels, or in other words, the correct answers to whatever it is that we want to learn to predict. In unsupervised learning, the data does not have labels.
Within supervised ML there are two types of problems: regression and classification. To explain them, let’s dive a little deeper into this data.
This course will be about supervised learning
Tons of models can be imagined. You can predict the cost of repairing a car at an auction based on the state of the car, or predict the gestation weeks for a mother based on various parameters. These are regression problems, since you are trying to predict a continuous number. The model is fed the information collected about the mothers and their gestation weeks as training examples; once trained, it can predict the value for a new case.
The simplest model is a simple linear regression: y = w·x + b.
A classification model can be used to detect whether email is spam or not.
Everything that goes into an ML model has to be numeric. Even for images, we can take each pixel and use its RGB values. An image is a tensor (an N-dimensional array of pixel values).
For words we also have to vectorize them into some kind of numeric vector, e.g. with a word2vec model that converts your text input to vectors.
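To make "everything is numeric" concrete, here is a tiny sketch (illustrative only; the array shapes and the toy vocabulary are made up):

import numpy as np

# A 2x2 RGB image is just a rank-3 tensor of pixel values: (height, width, channels)
image = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
print(image.shape)  # (2, 2, 3)

# Words become vectors via an embedding lookup (a word2vec-style model would learn these values)
toy_embeddings = {"cat": np.array([0.2, 0.9]), "dog": np.array([0.3, 0.8])}
print(toy_embeddings["cat"])  # numeric input the model can consume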
Machine learning is used in lots of industries.
Playing with ML
On the training dataset, we draw a separation line between spam and non-spam emails so as to minimize the error rate on the training data.
You draw a line between the different classes identified in the classification problem. The equation for a line is quite simple: y = w1·x1 + w2·x2 + b.
Now how do we find good values for w1, w2 and b ? We use Gradient descent.
Gradient descent is iterative: we start from some initial weights and keep adjusting them to reduce the error. You can speed the process up or slow it down with the learning rate, a hyperparameter.
This ideal learning rate (or alpha, as we usually call it :)) varies from problem to problem.
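As a rough illustration of what the training loop does, here is a minimal NumPy sketch (not the course code; the data is random and the true weights are made up):

import numpy as np

np.random.seed(0)
x = np.random.rand(100, 2)                 # two input features
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 1.0    # "true" labels we want to recover
w = np.zeros(2)
b = 0.0
learning_rate = 0.1                        # the hyperparameter discussed above

for step in range(500):
    pred = x.dot(w) + b                    # model: w1*x1 + w2*x2 + b
    error = pred - y
    # gradients of the mean squared error with respect to w and b
    grad_w = 2 * x.T.dot(error) / len(y)
    grad_b = 2 * error.mean()
    w -= learning_rate * grad_w            # take a small step downhill
    b -= learning_rate * grad_b

print(w, b)   # should approach [3, -2] and 1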
Here is the process for ML :
New terms :
- Weights : parameters we optimize
- Batch size : the amount of data we compute error on
- Epoch : One pass through entire dataset
- Gradient descent : process of reducing error
- Evaluation : Is the model good enough ? Has to be done on full dataset
- Training : Process of optimizing the weights; includes gradient descent + evaluation
What about images? How does it work? To recognize digits, for instance, we use the grey-scale value of each pixel in the image:
For digits there are more classes than in the spam exercise: 10 different outputs (one per digit), compared to the 2 of the spam exercise (spam or not spam).
Softmax : we normalize the outputs to be between 0 and 1 (and to sum to 1) so we can read them as class probabilities. (Sigmoid function, anyone? 🙂)
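A quick sketch of what softmax does to raw class scores (illustrative values for 3 classes; for digits there would be 10):

import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw model outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())           # values between 0 and 1 that sum to 1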
When the separation in a classification problem is not a simple line, we have to either use polynomial features or simply add neurons to our neural network; the decision boundary then becomes more adaptable and flexible.
In the TensorFlow Playground :
More neurons -> more input combinations (features); every neuron is a combination of input features.
You can also add new layers, this will help for more complex data sets.
A neural network is only as good as the data provided.
As said before, we can also add polynomial features instead of using more nodes. We engineer some extra features, and this works well too (see the sketch after the definitions below).
- Neurons : one unit of combining inputs
- Hidden layer : set of neurons that operate on the same set of inputs
- Inputs : what you feed into a neuron
- Features : transformations of inputs, such as x^2
- Feature engineering : coming up with what transformations to include
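A tiny sketch of that kind of feature engineering (hypothetical 1-D data; the transformation is the x² mentioned above):

import numpy as np

x = np.linspace(-2, 2, 5).reshape(-1, 1)   # original input
features = np.hstack([x, x ** 2])          # engineered feature: x squared
print(features)                            # a linear model on [x, x^2] can now fit a parabola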
As discussed, image models use the pixels' grey-scale or color values to do recognition.
In reality ML is :
- Collect data
- Organize data
- Create model
- Use machines to flesh out the model from data
- Deploy fleshed out model
What makes a good dataset ?
- The dataset should cover all cases.
- You also need negative examples and near misses: for an algorithm whose purpose is cloud-type recognition, add flowers, sheep, or whatever looks like a cloud but is not a cloud.
- You need to explore the data you have and fix problems: find outliers, decide whether to normalize, etc. You want to understand why the problems exist and then fix them, collect more data, and so on.
Calculating regression error :
Take the error (the divergence between prediction and reality), square it, and compute the mean over all your cases.
MSE (mean squared error) = (1/n) · Σᵢ (predictionᵢ − actualᵢ)²
For classification problems it is a bit different :
The loss here is the cross-entropy: −(1/n) · Σᵢ [ yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ) ]. Since the label y is either 0 or 1 and the prediction ŷ is between 0 and 1, when the expected result is 0 only the right-hand term of the sum contributes, and when the expected result is 1 only the left-hand term contributes.
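A small NumPy sketch of both loss measures (made-up predictions and labels):

import numpy as np

# Regression: mean squared error (and its square root, RMSE)
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.5, 2.0])
mse = np.mean((y_pred - y_true) ** 2)
rmse = np.sqrt(mse)

# Classification: cross-entropy (labels are 0 or 1, predictions are probabilities)
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])
cross_entropy = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse, rmse, cross_entropy)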
Accuracy, Precision and Recall
Confusion Matrix :
Accuracy is the fraction of all predictions that are correct.
But accuracy is not a good metric when the dataset is unbalanced:
So we can also use Precision and Recall :
Precision :
For the parking problem, the precision is 100% ! So maybe we have to look at something else..
Recall :
For the parking example, there are 10 empty spaces and we only found 1: the recall is 10%!
So we see that we have to improve the recall.
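Using the parking numbers above (10 truly empty spaces; the model flags 1 space and it really is empty), a sketch of the two metrics:

true_positives = 1    # spaces flagged as empty that really are empty
false_positives = 0   # spaces flagged as empty that are actually occupied
false_negatives = 9   # empty spaces the model missed

precision = true_positives / (true_positives + false_positives)   # 1.0 -> 100%
recall = true_positives / (true_positives + false_negatives)      # 0.1 -> 10%
print(precision, recall)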
Here are some more definitions:
- MSE : the loss measure for regression problems
- Cross-entropy : the loss measure for classification problems
- Accuracy : A more intuitive measure of skill for classifiers (when data is balanced)
- Precision : accuracy when the classifier says 'yes' (useful for an unbalanced dataset with lots of positives)
- Recall : accuracy when the truth is 'yes' (useful for an unbalanced dataset with few positives)
Creating Machine Learning Datasets
If we have distance-traveled and fare-amount data for taxi rides, we can fit a linear regression and measure it with RMSE.
We can also imagine another model that makes 0 mistakes on the training set but will not generalize very well: it has high variance.
That means you cannot really use it to predict new points: too many errors on a new dataset.
This is the reason why we should take the original data set and split it :
In reality you need 3 data sets : training, cross validation and test data.
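One possible way to carve out the three datasets (a hedged sketch with pandas; the labs actually do a repeatable split with hashing in BigQuery, and the 80/10/10 ratio and file name here are just illustrative):

import pandas as pd

df = pd.read_csv('taxi.csv')               # hypothetical file name
df = df.sample(frac=1.0, random_state=42)  # shuffle once

n = len(df)
train = df.iloc[:int(0.8 * n)]             # used to fit the model
valid = df.iloc[int(0.8 * n):int(0.9 * n)] # used to tune/compare models
test  = df.iloc[int(0.9 * n):]             # touched only once, at the very end
print(len(train), len(valid), len(test))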
Digression : we will use Datalab; it's like Jupyter notebooks. It's a Python execution environment where you can mix code with explanations, like an article. Easy collaboration as well.
Lab: Create Machine Learning Datasets
We will split some data in this lab into the training + validation + testing data sets.
In this lab we want to predict the estimate Taxi Fare based on various parameters.
We first go to AI Platform -> Notebooks, create a new instance, and open a new Jupyter notebook.
In the Jupyter notebook we open the terminal and we clone the git folder.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
Open the jupyter notebook :
And execute the cells one by one.
Module 02 : TensorFlow
- Scalar : a single number, e.g. 1 or 4.5
- Vector : one-dimensional array
- Rank 2 tensor : matrix (two-dimensional array)
- Rank 3 tensor : three-dimensional array
- Rank 4 tensor : four-dimensional array, and so on
A tensor is an N-dimensional array of data.
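A quick sketch of the ranks listed above in TensorFlow (TF 1.x style, matching the labs below; the values are arbitrary):

import tensorflow as tf

scalar = tf.constant(4.5)                               # rank 0
vector = tf.constant([1.0, 2.0, 3.0])                   # rank 1
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])          # rank 2
cube   = tf.constant([[[1.0], [2.0]], [[3.0], [4.0]]])  # rank 3

for t in (scalar, vector, matrix, cube):
    print(t.shape)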
TensorFlow is a library that lets you do numerical computation that uses directed graphs.
Nodes represent mathematical operations and edges represent arrays of data :
It is a way of writing code that works on different kinds of hardware. The core is implemented in C++ to provide this portability, but we will use Core TensorFlow in Python.
You can use high level code with Python and then execute it on GPUs.
Python API for tensorflow :
#You build first in tensorflow :
c = tf.add(a, b)
#you execute then :
session = tf.Session()
numpy_c = session.run(c, feed_dict=...)
a and b are tensors, and add is an operation. Programming in TensorFlow is a two-step process: first build the graph, then execute the graph. In that respect it is a bit different from NumPy.
TensorFlow can execute parts of the graph on different machines, and can add debug nodes, compile nodes, etc.
Lab : Getting Started with TensorFlow
Open AI Platform, create a new instance of a Jupyter notebook and open it.
Then clone the repo within your AI Platform Notebook instance via the terminal in the Jupyter notebook.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
Then open the ipynb file and follow the Notebook. Here is the code for the whole lab :
import tensorflow as tf
import numpy as np

print(tf.__version__)

#Adding two tensors
a = np.array([5, 3, 8])
b = np.array([3, -1, 2])
c = np.add(a, b)
print(c)

#The equivalent code in TensorFlow consists of two steps
#Step 1: Build the graph
a = tf.constant([5, 3, 8])
b = tf.constant([3, -1, 2])
c = tf.add(a, b)
print(c)
#this will print the fact that c is a tensor, that it will have
#3 elements, and those will be int32

#Step 2: Run the graph
with tf.Session() as sess:
  result = sess.run(c)
  print(result)

#Using a feed_dict
#Same graph, but without hardcoding inputs at build stage
a = tf.placeholder(dtype=tf.int32, shape=(None,))  # batchsize x scalar
b = tf.placeholder(dtype=tf.int32, shape=(None,))
c = tf.add(a, b)
with tf.Session() as sess:
  result = sess.run(c, feed_dict={
      a: [3, 4, 5],
      b: [-1, 2, 3]
    })
  print(result)

#Heron's Formula in TensorFlow
def compute_area(sides):
  # slice the input to get the sides
  a = sides[:,0]  # 5.0, 2.3
  b = sides[:,1]  # 3.0, 4.1
  c = sides[:,2]  # 7.1, 4.8

  # Heron's formula
  s = (a + b + c) * 0.5   # (a + b) is a short-cut to tf.add(a, b)
  areasq = s * (s - a) * (s - b) * (s - c)  # (a * b) is a short-cut to tf.multiply(a, b), not tf.matmul(a, b)
  return tf.sqrt(areasq)

with tf.Session() as sess:
  # pass in two triangles
  area = compute_area(tf.constant([
      [5.0, 3.0, 7.1],
      [2.3, 4.1, 4.8]
    ]))
  result = sess.run(area)
  print(result)

#Placeholder and feed_dict
#More common is to define the input to a program as a placeholder and then
#to feed in the inputs. The difference between the code below and the code
#above is whether the "area" graph is coded up with the input values or
#whether the "area" graph is coded up with a placeholder through which
#inputs will be passed in at run-time.
with tf.Session() as sess:
  sides = tf.placeholder(tf.float32, shape=(None, 3))  # batchsize number of triangles, 3 sides
  area = compute_area(sides)
  result = sess.run(area, feed_dict = {
      sides: [
        [5.0, 3.0, 7.1],
        [2.3, 4.1, 4.8]
      ]
    })
  print(result)

#tf.eager
#tf.eager allows you to avoid the build-then-run stages. However, most
#production code will follow the lazy evaluation paradigm because the lazy
#evaluation paradigm is what allows for multi-device support and distribution.
#One thing you could do is to develop using tf.eager and then comment out the
#eager execution and add in the session management code.
import tensorflow as tf
from tensorflow.contrib.eager.python import tfe

tfe.enable_eager_execution()

def compute_area(sides):
  # slice the input to get the sides
  a = sides[:,0]  # 5.0, 2.3
  b = sides[:,1]  # 3.0, 4.1
  c = sides[:,2]  # 7.1, 4.8

  # Heron's formula
  s = (a + b + c) * 0.5   # (a + b) is a short-cut to tf.add(a, b)
  areasq = s * (s - a) * (s - b) * (s - c)  # (a * b) is a short-cut to tf.multiply(a, b), not tf.matmul(a, b)
  return tf.sqrt(areasq)

area = compute_area(tf.constant([
    [5.0, 3.0, 7.1],
    [2.3, 4.1, 4.8]
  ]))

print(area)
How to use TensorFlow for Machine learning ?
1: Setup machine learning model
- Regression or classification
- What is the label
- What are the features
2 : Carry out ML steps
- Train the model
- Evaluate the model
- Predict with the model
So what are the steps to define an Estimator API model :
- Set up feature column
- Create a model, passing in the feature columns
- Write an input_fn (returns features, labels); features is a dict
- Train the model
- Use the trained model to predict (see the sketch below)
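A minimal sketch of those five steps with the tf.estimator API (TF 1.x; the feature name and the input values are made up — the real lab code follows below):

import tensorflow as tf

# 1. Feature columns
featcols = [tf.feature_column.numeric_column('sq_footage')]

# 2. Model, passing in the feature columns
model = tf.estimator.LinearRegressor(feature_columns=featcols)

# 3. Input function returning (features dict, labels)
def train_input_fn():
    features = {'sq_footage': tf.constant([1000.0, 2000.0, 3000.0])}
    labels = tf.constant([100.0, 200.0, 300.0])
    return features, labels

# 4. Train
model.train(train_input_fn, steps=100)

# 5. Predict with the trained model
def predict_input_fn():
    return {'sq_footage': tf.constant([1500.0])}
preds = model.predict(predict_input_fn)
print(next(preds))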
Lab : Machine learning using tf.estimator
As usual, create notebook instance etc.
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

print(tf.__version__)

#Read data created in the previous chapter.
# In CSV, label is the first column, after the features, followed by the key
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[0]

df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMNS)
df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMNS)

#Input function to read from Pandas Dataframe into tf.constant
def make_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    x = df,
    y = df[LABEL],
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )

#Create feature columns for estimator
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
  return input_columns

#Linear Regression with tf.estimator framework
tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_input_fn(df_train, num_epochs = 10))

#Evaluate on the validation data
def print_rmse(model, name, df):
  metrics = model.evaluate(input_fn = make_input_fn(df, 1))
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))
print_rmse(model, 'validation', df_valid)

import itertools
# Read saved model and use it for prediction
model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
preds_iter = model.predict(input_fn = make_input_fn(df_valid, 1))
print([pred['predictions'][0] for pred in list(itertools.islice(preds_iter, 5))])

#Deep Neural Network regression
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_input_fn(df_train, num_epochs = 100));
print_rmse(model, 'validation', df_valid)
We get quite bad results with this code because we are not yet using TensorFlow properly; that is the subject of the next part of the course.
Gaining more Flexibility
We need to refactor our Estimator model :
- Big data : read out-of-memory data
- Feature engineering : add new features easily
- Model architectures : evaluate as part of training
To read sharded files, we create a TextLineDataset, giving it a function to decode the CSV into features and labels.
We repeat the data and send it along in chunks (batches).
Lab : Refactoring to add batching and feature creation
As usual, create notebook instance etc.
import tensorflow as tf import numpy as np import shutil print(tf.__version__) #Refactor the input #Read data created in Lab1a, but this time make it more general and #performant. Instead of using Pandas, we will use TensorFlow's #Dataset API. CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key'] LABEL_COLUMN = 'fare_amount' DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']] def read_dataset(filename, mode, batch_size = 512): def _input_fn(): def decode_csv(value_column): columns = tf.decode_csv(value_column, record_defaults = DEFAULTS) features = dict(zip(CSV_COLUMNS, columns)) label = features.pop(LABEL_COLUMN) return features, label # Create list of files that match pattern file_list = tf.gfile.Glob(filename) # Create dataset from file list dataset = tf.data.TextLineDataset(file_list).map(decode_csv) if mode == tf.estimator.ModeKeys.TRAIN: num_epochs = None # indefinitely dataset = dataset.shuffle(buffer_size = 10 * batch_size) else: num_epochs = 1 # end-of-input after this dataset = dataset.repeat(num_epochs).batch(batch_size) return dataset.make_one_shot_iterator().get_next() return _input_fn def get_train(): return read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN) def get_valid(): return read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL) def get_test(): return read_dataset('./taxi-test.csv', mode = tf.estimator.ModeKeys.EVAL) #Refactor the way features are created #For now, pass these through (same as previous lab). However, #refactoring this way will enable us to break the one-to-one #relationship between inputs and features. INPUT_COLUMNS = [ tf.feature_column.numeric_column('pickuplon'), tf.feature_column.numeric_column('pickuplat'), tf.feature_column.numeric_column('dropofflat'), tf.feature_column.numeric_column('dropofflon'), tf.feature_column.numeric_column('passengers'), ] def add_more_features(feats): # Nothing to add (yet!) return feats feature_cols = add_more_features(INPUT_COLUMNS) #Create and train the model tf.logging.set_verbosity(tf.logging.INFO) OUTDIR = 'taxi_trained' shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time model = tf.estimator.LinearRegressor( feature_columns = feature_cols, model_dir = OUTDIR) model.train(input_fn = get_train(), steps = 100); # TODO: change the name of input_fn as needed #Evaluate model def print_rmse(model, name, input_fn): metrics = model.evaluate(input_fn = input_fn, steps = 1) print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss']))) print_rmse(model, 'validation', get_valid())
Train and Evaluate
Shuffling the data is important for distributed training.
The TrainSpec consists of the things that used to be passed into the train() method
The EvalSpec controls the evaluation and the checkpointing of the model since they happen at the same time
You can also tell TensorFlow to display INFO messages to get additional information on the current run, and use TensorBoard to monitor the training.
Lab : Distributed training and monitoring
Launch AI Platform Notebooks
Navigate to cloud shell and launch :
export IMAGE_FAMILY="tf-1-14-cpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="tf-tensorboard-1"
export INSTANCE_TYPE="n1-standard-4"
gcloud compute instances create "${INSTANCE_NAME}" \
  --zone="${ZONE}" \
  --image-family="${IMAGE_FAMILY}" \
  --image-project=deeplearning-platform-release \
  --machine-type="${INSTANCE_TYPE}" \
  --boot-disk-size=200GB \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --metadata="proxy-mode=project_editors"
Then open the jupyter notebook
import tensorflow as tf
import numpy as np
import shutil

print(tf.__version__)

#Read data created in Lab1a, but this time make it more general,
#so that we are reading in batches. Instead of using Pandas,
#we will use Datasets.
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(value_column):
      columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
      features = dict(zip(CSV_COLUMNS, columns))
      label = features.pop(LABEL_COLUMN)
      return features, label

    # Create list of files that match pattern
    file_list = tf.gfile.Glob(filename)

    # Create dataset from file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    if mode == tf.estimator.ModeKeys.TRAIN:
      num_epochs = None # indefinitely
      dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    else:
      num_epochs = 1 # end-of-input after this
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
  return _input_fn

#Create features out of input data
INPUT_COLUMNS = [
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
]

def add_more_features(feats):
  # Nothing to add (yet!)
  return feats

feature_cols = add_more_features(INPUT_COLUMNS)

#train_and_evaluate
def serving_input_fn():
  feature_placeholders = {
    'pickuplon' : tf.placeholder(tf.float32, [None]),
    'pickuplat' : tf.placeholder(tf.float32, [None]),
    'dropofflat' : tf.placeholder(tf.float32, [None]),
    'dropofflon' : tf.placeholder(tf.float32, [None]),
    'passengers' : tf.placeholder(tf.float32, [None]),
  }
  features = {
      key: tf.expand_dims(tensor, -1)
      for key, tensor in feature_placeholders.items()
  }
  return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)

def train_and_evaluate(output_dir, num_train_steps):
  estimator = tf.estimator.LinearRegressor(
                       model_dir = output_dir,
                       feature_columns = feature_cols)
  train_spec = tf.estimator.TrainSpec(
                       input_fn = read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN),
                       max_steps = num_train_steps)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec = tf.estimator.EvalSpec(
                       input_fn = read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL),
                       steps = None,
                       start_delay_secs = 1, # start evaluating after N seconds
                       throttle_secs = 10,   # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

# Run training
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
train_and_evaluate(OUTDIR, num_train_steps = 5000)
Open the TensorBoard and explore :
Module 3 : Scaling ML Models with Cloud ML Engine
An OK model on LARGE amounts of data is better than a GREAT model on SMALL amounts of data.
When we deal with huge amounts of data, we have to iterate through it and distribute the training among many machines: scaling out.
Cloud ML Engine will scale with your needs.
There are 3 steps to training your model with Cloud ML Engine :
- Step 1 : Use TensorFlow to create the computation graph and training application
- Step 2 : Package your trainer application
- Step 3 : Configure and start a Cloud ML Engine job
Store your data online! Cloud Storage, for instance, can be used.
Create task.py to parse command-line parameters and send along to train_and_evaluate.
model.py contains the ML model in TensorFlow (Estimator API).
Package up the TensorFlow model as a Python package : Python packages need to contain an __init__.py in every folder.
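For reference, a hedged sketch of what the package might look like (the folder names follow the taxifare example used in the lab below; the flag names come from the commands shown later, and model.train_and_evaluate(args) is a hypothetical helper exposed by model.py):

# Package layout (each folder needs an __init__.py):
#   taxifare/
#     trainer/
#       __init__.py
#       model.py    <- Estimator model, input functions, serving_input_fn
#       task.py     <- command-line entry point (sketched below)
#
# task.py: parse flags and hand off to the model code.
# Run it as a module, e.g. `python -m trainer.task`, so the relative import works.
import argparse
from . import model  # hypothetical: model.py exposes train_and_evaluate(args)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data_paths', required=True)
    parser.add_argument('--eval_data_paths', required=True)
    parser.add_argument('--output_dir', required=True)
    parser.add_argument('--train_steps', type=int, default=1000)
    parser.add_argument('--job-dir', dest='job_dir', default='./tmp')
    args = parser.parse_args()
    model.train_and_evaluate(vars(args))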
Verify that the model works as a Python package
Then use the gcloud command to submit the training job, either locally or to the cloud.
We also need an extra input function that maps between the JSON coming from the REST API and the features expected by the model: this is the serving_input_fn function.
Once we have that, we can deploy the model as a microservice, for instance from the shell.
The model is then ready to receive REST API requests: create the endpoint URL and send JSON to the endpoint.
Lab : Scaling up ML with Cloud ML Engine
- Package up the code
- Find absolute paths to data
- Run the Python module from the console
- Run Locally using gcloud
- Submit training job using gcloud
- Deploy model
- Prediction
- Train on a larger dataset
- 1-million row dataset
Create Bucket -> Create Notebook -> clone repo
#Environment variables for project and bucket
import os
PROJECT = 'qwiklabs-gcp-03-8da6a5ea6e2a' # REPLACE WITH YOUR PROJECT ID
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.
BUCKET = 'qwiklabs-gcp-03-8da6a5ea6e2a' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.

# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.4' # Tensorflow version

%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

%%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")

echo "Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET

#Packaging up the code
!find taxifare
!cat taxifare/trainer/model.py

%%bash
echo $PWD
rm -rf $PWD/taxi_trained
cp $PWD/../tensorflow/taxi-train.csv .
cp $PWD/../tensorflow/taxi-valid.csv .
head -1 $PWD/taxi-train.csv
head -1 $PWD/taxi-valid.csv

#Running the Python module from the command-line
%%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
  --train_data_paths="${PWD}/taxi-train*" \
  --eval_data_paths=${PWD}/taxi-valid.csv \
  --output_dir=${PWD}/taxi_trained \
  --train_steps=1000 --job-dir=./tmp

%%bash
ls $PWD/taxi_trained/export/exporter/

%%writefile ./test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

#Running locally using gcloud
%%bash
rm -rf taxifare.tar.gz taxi_trained
gcloud ai-platform local train \
  --module-name=trainer.task \
  --package-path=${PWD}/taxifare/trainer \
  -- \
  --train_data_paths=${PWD}/taxi-train.csv \
  --eval_data_paths=${PWD}/taxi-valid.csv \
  --train_steps=1000 \
  --output_dir=${PWD}/taxi_trained

!ls $PWD/taxi_trained

#Submit training job using gcloud
%%bash
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/taxifare/smallinput/
gsutil -m cp ${PWD}/*.csv gs://${BUCKET}/taxifare/smallinput/

%%bash
OUTDIR=gs://${BUCKET}/taxifare/smallinput/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=${PWD}/taxifare/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC \
  --runtime-version=$TFVERSION \
  -- \
  --train_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-train*" \
  --eval_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-valid*" \
  --output_dir=$OUTDIR \
  --train_steps=10000

#Deploy model
%%bash
gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/exporter

%%bash
MODEL_NAME="taxifare"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/smallinput/taxi_trained/export/exporter | tail -1)
echo "Run these commands one-by-one (the very first time, you'll create a model and then create a version)"
#gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ai-platform models delete ${MODEL_NAME}
gcloud ai-platform models create ${MODEL_NAME} --regions $REGION
gcloud ai-platform versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version $TFVERSION

#Prediction
%%bash
gcloud ai-platform predict --model=taxifare --version=v1 --json-instances=./test.json

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
      discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

request_data = {'instances':
  [
    {
      'pickuplon': -73.885262,
      'pickuplat': 40.773008,
      'dropofflon': -73.987232,
      'dropofflat': 40.732403,
      'passengers': 2,
    }
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))

#Train on larger dataset
#In Google BigQuery:
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  'nokeyindata' AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  AND ABS(HASH(pickup_datetime)) % 1000 == 1

#Run Cloud training on 1-million row dataset
%%bash
#XXXXX this takes 60 minutes. if you are sure you want to run it, then remove this line.
OUTDIR=gs://${BUCKET}/taxifare/ch3/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
CRS_BUCKET=cloud-training-demos # use the already exported data
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=${PWD}/taxifare/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --train_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/train.csv" \
  --eval_data_paths="gs://${CRS_BUCKET}/taxifare/ch3/valid.csv" \
  --output_dir=$OUTDIR \
  --train_steps=100000
Lab is difficult..
Kubeflow Pipelines
Kubeflow is an open project that packages machine learning code for Kubernetes. Kubeflow Pipelines is a platform for composing, deploying, and managing end-to-end machine learning workflows.
The reusability of Kubeflow Pipelines helps separate the work, which lets people specialize. In this example, a machine learning engineer can focus on feature engineering, linear regression modeling, and hyperparameter tuning. Their solution is bundled up into a Kubeflow pipeline that a data engineer can use as part of a data engineering solution; the result appears as a service that a data analyst uses to derive business insights.
Module 4 : Feature Engineering
How to get better features to get better results ?
What makes a good feature :
- Should be related to the objective
- Should be known at production-time
- Has to be numeric with meaningful magnitude (for words you need to use, for instance, word2vec)
- Has enough examples
- Brings human insight to problem
Categorical features
If you want to use the employee ID, you can one-hot encode it: a vector of 0s with a 1 at the position of the employee ID of your example. If you have 5 employees in your store, you then need 5 columns -> a sparse (categorical) column in TensorFlow.
What if you don't know the keys (employee IDs)? You have to build a vocabulary of keys that identifies all the different keys, and you need to do that before you train your model. The mapping needs to be identical at prediction time.
What happens if you hire a new employee? Plan ahead and decide what to do when a new key shows up (the cold-start problem).
Customer rating can be used as continuous or as one-hot encoded value.
You also need to know what to do when data is missing, for example if you never received a rating from the client! Don't mix magic numbers with data; you can for instance add a column with 0 or 1 depending on whether a rating is present or not.
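A hedged sketch of both ideas with tf.feature_column (TF 1.x; the employee IDs and the rating columns are invented for illustration):

import tensorflow as tf

# Known keys: build the vocabulary before training; unknown (new) employees fall
# into an extra "out of vocabulary" bucket instead of breaking the model.
employee_col = tf.feature_column.categorical_column_with_vocabulary_list(
    'employee_id', vocabulary_list=['emp1', 'emp2', 'emp3', 'emp4', 'emp5'],
    num_oov_buckets=1)
employee_onehot = tf.feature_column.indicator_column(employee_col)

# Missing data: keep the rating itself and add a 0/1 column saying whether a
# rating was actually present, instead of hiding a magic number in the data.
rating = tf.feature_column.numeric_column('rating')          # e.g. 0 when missing
has_rating = tf.feature_column.numeric_column('has_rating')  # 0 or 1 indicator
feature_columns = [employee_onehot, rating, has_rating]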
Feature crosses
Is a car a taxi? Car color + city alone are not enough (yellow cars in New York are usually taxis, but not in Rome). You can add a new crossed column, car color x city, to bring in this human insight: it avoids giving too much weight to New York alone, while allowing a high weight when the car is yellow AND in New York.
Bucketizing
If small changes in a feature cause big differences in the outcome (like house prices depending on latitude, where a tiny change can put you in a different neighbourhood or city and have a huge impact), then it is better to group the feature into buckets.
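A sketch of bucketizing and crossing with tf.feature_column (the boundaries, vocabularies, and bucket sizes below are arbitrary):

import tensorflow as tf

# Bucketize latitude so that nearby-but-different neighbourhoods land in
# different buckets instead of being treated as one smooth number.
latitude = tf.feature_column.numeric_column('pickuplat')
lat_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=[40.5, 40.6, 40.7, 40.8, 40.9])

# Feature cross, e.g. car color x city from the taxi example above.
color = tf.feature_column.categorical_column_with_vocabulary_list(
    'car_color', ['yellow', 'white', 'black'])
city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['new_york', 'rome'])
color_x_city = tf.feature_column.crossed_column([color, city], hash_bucket_size=20)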
Model Architecture
Two types of features : dense and sparse.
- Price is easy, one column -> dense
- If you have lots of employees, you need one column per employee (e.g. 25 columns) -> sparse
Deep neural networks are good for dense, highly correlated inputs, like the pixels of an image.
You would rather use a linear model for sparse, independent features.
You can use both at the same time if you need to; that is what a wide-and-deep network does : tf.estimator.DNNLinearCombinedClassifier (see the sketch below).
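A hedged sketch of a wide-and-deep model (the column names and hidden-unit sizes are illustrative, not from the course):

import tensorflow as tf

# Sparse / categorical columns go to the linear ("wide") part,
# dense numeric columns go to the DNN ("deep") part.
employee = tf.feature_column.categorical_column_with_hash_bucket('employee_id', 100)
wide_columns = [employee]
deep_columns = [tf.feature_column.numeric_column('price'),
                tf.feature_column.numeric_column('passengers')]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[64, 16, 4])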
Lab : Feature Engineering
#Feature Engineering
%%bash
sudo pip install httplib2==0.12.0 apache-beam[gcp]==2.16.0

import tensorflow as tf
import apache_beam as beam
import shutil
print(tf.__version__)

#Environment variables for project and bucket
import os
PROJECT = 'cloud-training-demos' # CHANGE THIS
BUCKET = 'cloud-training-demos' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.

# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

## ensure we're using python3 env
os.environ['CLOUDSDK_PYTHON'] = 'python3'

%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION
## ensure we predict locally with our current Python environment
gcloud config set ml_engine/local_python `which python`

#Specifying query to pull the data
def create_query(phase, EVERY_N):
  if EVERY_N == None:
    EVERY_N = 4 #use full dataset

  #select and pre-process fields
  base_query = """
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  DAYOFWEEK(pickup_datetime) AS dayofweek,
  HOUR(pickup_datetime) AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key
FROM
  [nyc-tlc:yellow.trips]
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """

  #add subsampling criteria by modding with hashkey
  if phase == 'train':
    query = "{} AND ABS(HASH(pickup_datetime)) % {} < 2".format(base_query, EVERY_N)
  elif phase == 'valid':
    query = "{} AND ABS(HASH(pickup_datetime)) % {} == 2".format(base_query, EVERY_N)
  elif phase == 'test':
    query = "{} AND ABS(HASH(pickup_datetime)) % {} == 3".format(base_query, EVERY_N)
  return query

print(create_query('valid', 100)) #example query using 1% of data

#Preprocessing Dataflow job from BigQuery
%%bash
if gsutil ls | grep -q gs://${BUCKET}/taxifare/ch4/taxi_preproc/; then
  gsutil -m rm -rf gs://$BUCKET/taxifare/ch4/taxi_preproc/
fi

import datetime

####
# Arguments:
#   -rowdict: Dictionary. The beam bigquery reader returns a PCollection in
#     which each row is represented as a python dictionary
# Returns:
#   -rowstring: a comma separated string representation of the record with dayofweek
#     converted from int to string (e.g. 3 --> Tue)
####
def to_csv(rowdict):
  days = ['null', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
  CSV_COLUMNS = 'fare_amount,dayofweek,hourofday,pickuplon,pickuplat,dropofflon,dropofflat,passengers,key'.split(',')
  rowdict['dayofweek'] = days[rowdict['dayofweek']]
  rowstring = ','.join([str(rowdict[k]) for k in CSV_COLUMNS])
  return rowstring

####
# Arguments:
#   -EVERY_N: Integer. Sample one out of every N rows from the full dataset.
#     Larger values will yield smaller sample
#   -RUNNER: 'DirectRunner' or 'DataflowRunner'. Specfy to run the pipeline
#     locally or on Google Cloud respectively.
# Side-effects:
#   -Creates and executes dataflow pipeline.
#     See https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline
####
def preprocess(EVERY_N, RUNNER):
  job_name = 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')
  print('Launching Dataflow job {} ... hang on'.format(job_name))
  OUTPUT_DIR = 'gs://{0}/taxifare/ch4/taxi_preproc/'.format(BUCKET)

  #dictionary of pipeline options
  options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': 'preprocess-taxifeatures' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S'),
    'project': PROJECT,
    'runner': RUNNER,
    'num_workers' : 4,
    'max_num_workers' : 5
  }
  #instantiate PipelineOptions object using options dictionary
  opts = beam.pipeline.PipelineOptions(flags=[], **options)
  #instantantiate Pipeline object using PipelineOptions
  with beam.Pipeline(options=opts) as p:
    for phase in ['train', 'valid']:
      query = create_query(phase, EVERY_N)
      outfile = os.path.join(OUTPUT_DIR, '{}.csv'.format(phase))
      (
        p | 'read_{}'.format(phase) >> beam.io.Read(beam.io.BigQuerySource(query=query))
          | 'tocsv_{}'.format(phase) >> beam.Map(to_csv)
          | 'write_{}'.format(phase) >> beam.io.Write(beam.io.WriteToText(outfile))
      )
  print("Done")

preprocess(50*10000, 'DirectRunner')

%%bash
gsutil ls gs://$BUCKET/taxifare/ch4/taxi_preproc/

#Run Beam pipeline on Cloud Dataflow
%%bash
if gsutil ls | grep -q gs://${BUCKET}/taxifare/ch4/taxi_preproc/; then
  gsutil -m rm -rf gs://$BUCKET/taxifare/ch4/taxi_preproc/
fi

preprocess(50*100, 'DataflowRunner') #change first arg to None to preprocess full dataset

%%bash
gsutil ls -l gs://$BUCKET/taxifare/ch4/taxi_preproc/

%%bash
#print first 10 lines of first shard of train.csv
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/train.csv-00000-of-*" | head

#Develop model with new inputs
#Download the first shard of the preprocessed data to enable local development.
%%bash
if [ -d sample ]; then
  rm -rf sample
fi
mkdir sample
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/train.csv-00000-of-*" > sample/train.csv
gsutil cat "gs://$BUCKET/taxifare/ch4/taxi_preproc/valid.csv-00000-of-*" > sample/valid.csv

#We have two new inputs in the INPUT_COLUMNS, three engineered features, and
#the estimator involves bucketization and feature crosses.
%%bash
grep -A 20 "INPUT_COLUMNS =" taxifare/trainer/model.py

%%bash
grep -A 50 "build_estimator" taxifare/trainer/model.py

%%bash
grep -A 15 "add_engineered(" taxifare/trainer/model.py

%%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
  --train_data_paths=${PWD}/sample/train.csv \
  --eval_data_paths=${PWD}/sample/valid.csv \
  --output_dir=${PWD}/taxi_trained \
  --train_steps=10 \
  --job-dir=/tmp

%%bash
ls taxi_trained/export/exporter/

%%bash
model_dir=$(ls ${PWD}/taxi_trained/export/exporter | tail -1)
saved_model_cli show --dir ${PWD}/taxi_trained/export/exporter/${model_dir} --all

%%writefile /tmp/test.json
{"dayofweek": "Sun", "hourofday": 17, "pickuplon": -73.885262, "pickuplat": 40.773008, "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2}

%%bash
model_dir=$(ls ${PWD}/taxi_trained/export/exporter)
gcloud ml-engine local predict \
  --model-dir=${PWD}/taxi_trained/export/exporter/${model_dir} \
  --json-instances=/tmp/test.json

#Train on cloud
%%bash
OUTDIR=gs://${BUCKET}/taxifare/ch4/taxi_trained
JOBNAME=lab4a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=${PWD}/taxifare/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC \
  --runtime-version=$TFVERSION \
  -- \
  --train_data_paths="gs://$BUCKET/taxifare/ch4/taxi_preproc/train*" \
  --eval_data_paths="gs://${BUCKET}/taxifare/ch4/taxi_preproc/valid*" \
  --train_steps=5000 \
  --output_dir=$OUTDIR

%%bash
gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1

%%bash
model_dir=$(gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1)
saved_model_cli show --dir ${model_dir} --all

%%bash
model_dir=$(gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1)
gcloud ml-engine local predict \
  --model-dir=${model_dir} \
  --json-instances=/tmp/test.json

#Optional: deploy model to cloud
%%bash
MODEL_NAME="feateng"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/ch4/taxi_trained/export/exporter | tail -1)
echo "Run these commands one-by-one (the very first time, you'll create a model and then create a version)"
#gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ai-platform delete ${MODEL_NAME}
gcloud ai-platform models create ${MODEL_NAME} --regions $REGION
gcloud ai-platform versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version $TFVERSION

%%bash
gcloud ai-platform predict --model=feateng --version=v1 --json-instances=/tmp/test.json
What we did : we added two engineered features plus the day of the week, and ran the model on the cloud. We got a slightly better result.
Hyperparameter Tuning
We can basically play with parameters in a YAML file that drives the learning rate, the number of nodes and layers, and the batch size, letting the service search for the optimal combination.
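A heavily hedged sketch of what such a YAML config could look like for a Cloud ML Engine training job (the parameter names must match flags your trainer actually parses; all values and names below are illustrative assumptions):

trainingInput:
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: rmse     # metric reported by the trainer (assumed name)
    maxTrials: 30
    maxParallelTrials: 3
    params:
    - parameterName: learning_rate    # hypothetical flag in task.py
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
    - parameterName: batch_size
      type: INTEGER
      minValue: 64
      maxValue: 512
      scaleType: UNIT_LINEAR_SCALE

A file like this would typically be passed to the training job via the --config flag of gcloud ai-platform jobs submit training.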
Accuracy improves through feature tuning, hyperparameter tuning, and using larger datasets.
Going forward
Cloud Speech-to-Text converts audio to text for data processing. Cloud Natural Language API recognizes parts of speech called entities and sentiment. Cloud Translation converts text in one language to another. Dialogflow Enterprise Edition is used to build chatbots to conduct conversations. Cloud Text-to-Speech converts text into high quality voice audio. Cloud Vision API is for working with and recognizing content in still images. And Cloud Video Intelligence API is for recognizing motion and action in video.
You can also use BigQuery directly.