What is ML?

ML is a way to get predictive insights from data in order to make repeated decisions. You can train the software to estimate the amount of taxes that you owe, or train that same software to estimate how long it will take to get you home. Once trained on your specific use case, the ML software is called a model.

Whatever the domain, ML modeling requires lots of training examples. We'll train the model to estimate taxes by showing it many, many examples of prior-year tax returns. An example consists of an input and the correct answer for that input, which is called the label.

So why do we say these algorithms are standard? Well, the algorithms exist independently of your use case. Even though detecting manufacturing defects in images of parts and detecting diseased leaves in images of trees are two very different use cases, the same algorithm, an image classification network, works for both. Similarly, there are standard algorithms for predicting the future value of a time series or for transcribing human speech to text. When you use the same algorithm on different datasets, the features, or inputs, differ according to the use case. So even though we start with the same standard algorithm, after training, the trained model that classifies leaves is different from the trained model that classifies manufacturing parts.

The main thing to know is that in machine learning, your model will only be as good as your data.

AI refers to machines that are capable of acting autonomously, machines that think. AI has to do with the theory and methods to build machines that can solve problems by thinking and acting like humans. Machine Learning is a toolset within AI, the way Newton's laws are a toolset within mechanics. Deep Learning is a type of Machine Learning that works even when the data is unstructured: images, speech, video, natural language text, and so on. The basic difference between Machine Learning and other techniques in AI is that in Machine Learning, machines learn. They don't start out intelligent, they become intelligent.

Much of the hype around ML now is because the barriers to entry to building these models have fallen dramatically. This is due to the convergence of a number of critical factors: the increasing availability of data, the increasing maturity and sophistication of the ML algorithms you can choose from, and the increasing power and availability of computing hardware and software through things like cloud computing.

In GCP:

Prebuilt ML model APIs

When we say unstructured data, we are referring to data that comes in the form of audio, video, images, free-form text, etc.

Let's give a few real-world examples of businesses using unstructured data in their products. Can you differentiate between snow and cloud cover in these two images? One of GCP's customers, Airbus Defense and Space, works with satellite imagery such as this, and it's very important that they can detect and correct imperfections in the images, such as the presence of cloud formations. Historically, this imperfection-correction process was time-consuming, error-prone, and not scalable. Airbus solved these issues with machine learning.

The Natural Language API provides many features with which text analytics can be performed. The first feature is Syntactic Analysis. Syntactic Analysis first breaks up text into a series of tokens, which are generally words and sentences, and provides information about each token's internal structure and its role in the sentence. It can label a token as a noun or a verb, singular or plural, first person or second person, masculine, feminine, or neuter, and provides grammatical information such as case, tense, mood, and voice.
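As a rough sketch, here is what calling Syntactic Analysis might look like with the Python client library (the same pre-2.0 google-cloud-language version used in the lab code further down); the sample sentence is made up:

from google.cloud import language

nl_client = language.LanguageServiceClient()

# Analyze the syntax of a short sample sentence
document = language.types.Document(
    content='The quick brown fox jumps over the lazy dog.',
    type=language.enums.Document.Type.PLAIN_TEXT)
response = nl_client.analyze_syntax(document=document)

# Print each token with its part-of-speech tag (NOUN, VERB, ...)
for token in response.tokens:
    tag = language.enums.PartOfSpeech.Tag(token.part_of_speech.tag)
    print(token.text.content, tag.name)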

Lab: Using the Natural Language API to classify unstructured text

Task 1: Confirm that the Cloud Natural Language API is enabled

Task 2: Create an API Key

Task 3: Classify a news article

We create a JSON file named request.json in the editor with the following content:

{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"A Smoky Lobster Salad With a Tapa Twist. This spin on the Spanish pulpo a la gallega skips the octopus, but keeps the sea salt, olive oil, pimentón and boiled potatoes."
  }
}

Then we send this text to the Natural Language API's classifyText method with the following curl command:

curl "https://language.googleapis.com/v1/documents:classifyText?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json

Output:

{ categories:
  [
    {
      name: '/Food & Drink/Cooking & Recipes',
      confidence: 0.85
    },
    {
      name: '/Food & Drink/Food/Meat & Seafood',
      confidence: 0.63
    }
  ]
}

Task 4: Classifying a large text dataset

Task 5: Creating a BigQuery table for our categorized text data

Create a dataset named news_classification_dataset and a new table named article_data with three columns (article text, category, confidence) that will be filled by our code.
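If you'd rather do this step programmatically than in the UI, a minimal sketch with the Python client might look like this; the column names are assumptions matching the tuples the script below inserts:

from google.cloud import bigquery

bq_client = bigquery.Client(project='YOUR_PROJECT')  # replace with your project id

# Create the dataset and the three-column table the script writes into
dataset_ref = bq_client.dataset('news_classification_dataset')
bq_client.create_dataset(bigquery.Dataset(dataset_ref))

schema = [
    bigquery.SchemaField('article_text', 'STRING'),
    bigquery.SchemaField('category', 'STRING'),
    bigquery.SchemaField('confidence', 'STRING'),
]
bq_client.create_table(bigquery.Table(dataset_ref.table('article_data'), schema=schema))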

Task 6: Classifying news data and storing the result in BigQuery

Create a service account:

gcloud iam service-accounts create my-account --display-name my-account
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:my-account@$PROJECT.iam.gserviceaccount.com --role=roles/bigquery.admin
gcloud iam service-accounts keys create key.json --iam-account=my-account@$PROJECT.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=key.json

The Python code:

from google.cloud import storage, language, bigquery

# Set up our GCS, NL, and BigQuery clients
storage_client = storage.Client()
nl_client = language.LanguageServiceClient()
# TODO: replace YOUR_PROJECT with your project id below
bq_client = bigquery.Client(project='YOUR_PROJECT')

dataset_ref = bq_client.dataset('news_classification_dataset')
dataset = bigquery.Dataset(dataset_ref)
table_ref = dataset.table('article_data') # Update this if you used a different table name
table = bq_client.get_table(table_ref)

# Send article text to the NL API's classifyText method
def classify_text(article):
    response = nl_client.classify_text(
        document=language.types.Document(
            content=article,
            type=language.enums.Document.Type.PLAIN_TEXT
        )
    )
    return response

rows_for_bq = []
files = storage_client.bucket('text-classification-codelab').list_blobs()
print("Got article files from GCS, sending them to the NL API (this will take ~2 minutes)...")

# Send files to the NL API and save the result to send to BigQuery
for file in files:
    if file.name.endswith('txt'):
        article_text = file.download_as_string()
        nl_response = classify_text(article_text)
        if len(nl_response.categories) > 0:
            rows_for_bq.append((str(article_text),
                                str(nl_response.categories[0].name),
                                str(nl_response.categories[0].confidence)))

print("Writing NL API article data to BigQuery...")
# Write article text + category data to BQ
errors = bq_client.insert_rows(table, rows_for_bq)
assert errors == []

Run the script:

python3 classify-text.py

Task 7: Analyzing categorized news data in BigQuery

We go to BigQuery and query the data in the table we created.
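For example, a query along these lines counts articles per category (sketched here with the Python client; in the lab you would run the SQL in the BigQuery console):

from google.cloud import bigquery

bq_client = bigquery.Client(project='YOUR_PROJECT')  # replace with your project id

# Count how many articles the NL API assigned to each category
query = """
    SELECT category, COUNT(*) AS article_count
    FROM `news_classification_dataset.article_data`
    GROUP BY category
    ORDER BY article_count DESC
"""
for row in bq_client.query(query).result():
    print(row.category, row.article_count)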

Cloud AI Platform notebooks

Standard software development tools are not very efficient for writing code for data analysis and machine learning. Data analysis and machine learning coding often involve looking at plots, repeatedly executing small chunks of code with minor changes, and frequently printing output. Iteratively running whole scripts for this is burdensome. These were some of the issues that motivated the development of notebooks. Notebook environments seamlessly integrate commentary, plots, and code.

Notebooks are collaborative, come with preinstalled libraries, and support step-by-step algorithm creation and annotation.

Magic functions allow you to execute system commands from within notebook cells. This is useful for checking query correctness and output. The BigQuery magic function allows you to save the query output to a Pandas DataFrame so that you can manipulate it further (see the %%bigquery df cell in the lab below).

Lab: BigQuery in JupyterLab on AI Platform

Start a JupyterLab Notebook Instance

Notebooks for AI Platform -> NEW INSTANCE -> Python -> OPEN JUPYTERLAB -> Python 3

First cell:

%%bigquery df
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 10) AS arrival_delay_deciles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay ASC

View the DataFrame:

df.head()

Additional setup of the data and the final plot:

Enter the following code in a new cell to convert the list of arrival_delay_deciles into a Pandas Series object. Since we want to relate departure delay times to arrival delay times, we have to concatenate our percentiles table to the departure_delay field in our original DataFrame. Before plotting the contents of our DataFrame, we'll want to drop the extreme values stored in the 0% and 100% fields.

import pandas as pd

percentiles = df['arrival_delay_deciles'].apply(pd.Series)
percentiles.rename(columns = lambda x : '{0}%'.format(x*10), inplace=True)
percentiles.head()

df = pd.concat([df['departure_delay'], percentiles], axis=1)
df.head()

df.drop(labels=['0%', '100%'], axis=1, inplace=True)
df.plot(x='departure_delay', xlim=(-30,50), ylim=(-50,50));

Documentation on pandas : https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Productionizing custom ML models

You may want a more tailored model trained on your own data. For that, we need a custom model. There are a few ways of doing custom model development, training, and serving. We will focus on four: Kubeflow, AI Hub, BigQuery ML, and Cloud AutoML.

Kubeflow


The capabilities provided by Kubeflow Pipelines can largely be put into three buckets: ML workflow orchestration; sharing, reusing, and composing pipelines; and rapid, reliable experimentation.
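To make the orchestration idea concrete, here is a minimal sketch of a two-step pipeline using the kfp v1 SDK; the container images, step names, and paths are hypothetical placeholders, not part of the lab:

import kfp
from kfp import dsl

@dsl.pipeline(name='train-and-serve', description='Toy two-step ML workflow')
def train_and_serve_pipeline(data_path: str = 'gs://my-bucket/data'):
    # Each step runs a container; the images below are hypothetical
    train = dsl.ContainerOp(
        name='train',
        image='gcr.io/my-project/trainer:latest',
        arguments=['--data', data_path])
    serve = dsl.ContainerOp(
        name='serve',
        image='gcr.io/my-project/server:latest',
        arguments=['--model-dir', '/model'])
    serve.after(train)  # orchestration: serving waits for training

# Compile to a package you can upload to the Kubeflow Pipelines UI
kfp.compiler.Compiler().compile(train_and_serve_pipeline, 'pipeline.tar.gz')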

AI Hub

AI Hub is a repository for ML components. Don't reinvent the wheel: avoid building a component when someone else has already built it, and most likely has already optimized it. Among the assets stored on AI Hub are entire Kubeflow Pipelines, Jupyter notebooks, TensorFlow modules, fully trained models, services, and VM images.

Lab: Running AI models on Kubeflow

Kubernetes is a mature, production-ready platform that gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware. Using Kubernetes, computational resources can be added or removed as desired, and the same cluster can be used to both train and serve ML models. This lab serves as an introduction to Kubeflow, an open-source project which aims to make running ML workloads on Kubernetes simple, portable, and scalable.

What You’ll Learn

  • How to set up a Kubeflow cluster on GCP
  • How to package a TensorFlow program in a container, and upload it to Google Container Registry
  • How to submit a TensorFlow training job, and save the resulting model to Google Cloud Storage
  • How to serve and interact with a trained model

My god... easily the longest and most difficult lab.

You successfully set up Kubeflow on a Kubernetes Engine cluster, used Kubeflow to deploy a TensorFlow model training service, and deployed an interactive API based on the trained model.

BigQuery ML

BigQuery Machine Learning allows you to build machine learning models using SQL syntax.

First, we must write a query on data stored in BigQuery to extract our training data. Then, we can create a model where we specify a model type and other hyperparameters. After the model is trained, we can evaluate the model and verify that it meets our requirements. Finally, we can make predictions using our model on data extracted from BigQuery. 

First, we have an example of linear classification, or what is also called logistic regression. In this example, we're using a flight arrivals dataset to predict whether a flight will be on time or not, which is a binary outcome. For logistic regression, we simply need to specify the type of model and the label we're trying to predict. For more advanced users, there are other options, such as whether you want your model to use regularization. While logistic regression models are the Swiss army knife of machine learning, deep neural networks, or DNNs, allow you to better model nonlinear relationships in your data. An example of a nonlinear relationship is car depreciation: a car loses the vast majority of its value within the first few years, after which the value more or less stabilizes.

Here are the model options available in BigQuery ML to forecast things like numeric values. So far, we've only talked about classification; you can also use BQML to do regression. In this example, we're fitting a linear regression model to predict taxi fare based on features such as the hour of day, pick-up and drop-off location, and the day of the week. If you need a model more complex than linear regression, you can use a DNN regressor in BQML.

Let's wrap up with unsupervised machine learning. Supervised learning is when you have labeled training data, meaning you know the answer in the past; unsupervised learning uses no such labels.
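As a rough sketch of what the OPTIONS clauses look like for the model types just discussed (run here through the Python client; all dataset, table, and column names are hypothetical placeholders):

from google.cloud import bigquery

bq_client = bigquery.Client(project='YOUR_PROJECT')  # replace with your project id

# Binary classification with logistic regression: predict on-time arrival
bq_client.query("""
    CREATE OR REPLACE MODEL mydataset.ontime_model
    OPTIONS (model_type='logistic_reg', input_label_cols=['on_time']) AS
    SELECT on_time, departure_delay, carrier FROM mydataset.flights
""").result()

# Nonlinear regression with a deep neural network
bq_client.query("""
    CREATE OR REPLACE MODEL mydataset.fare_model
    OPTIONS (model_type='dnn_regressor', hidden_units=[32, 8],
             input_label_cols=['fare_amount']) AS
    SELECT fare_amount, hour_of_day, pickup_zone, day_of_week FROM mydataset.taxi_trips
""").result()

# Unsupervised clustering with k-means: note there is no label column
bq_client.query("""
    CREATE OR REPLACE MODEL mydataset.station_clusters
    OPTIONS (model_type='kmeans', num_clusters=4) AS
    SELECT latitude, longitude FROM mydataset.stations
""").result()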

Lab: Predict Bike Trip Duration with a Regression Model in BQML

Objectives

In this lab, you learn to perform the following tasks:

  • Query and explore the London bicycles dataset for feature engineering
  • Create a linear regression model in BQML
  • Evaluate the performance of your machine learning model
  • Extract your model weights

Basic model:

CREATE OR REPLACE MODEL
  bike_model.model
OPTIONS
  (input_label_cols=['duration'],
    model_type='linear_reg') AS
SELECT
  duration,
  start_station_name,
  CAST(EXTRACT(dayofweek
    FROM
      start_date) AS STRING) AS dayofweek,
  CAST(EXTRACT(hour
    FROM
      start_date) AS STRING) AS hourofday
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire

To see the metrics of the model:

SELECT * FROM ML.EVALUATE(MODEL `bike_model.model`)

Final model with some transformations and data aggregation:

CREATE OR REPLACE MODEL
  bike_model.model_bucketized
OPTIONS
  (input_label_cols=['duration'],
    model_type='linear_reg') AS
SELECT
  duration,
  start_station_name,
IF
  (EXTRACT(dayofweek
    FROM
      start_date) BETWEEN 2 AND 6,
    'weekday',
    'weekend') AS dayofweek,
  ML.BUCKETIZE(EXTRACT(hour
    FROM
      start_date),
    [5, 10, 17]) AS hourofday
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire

Our best model contains several data transformations. Wouldn’t it be nice if BigQuery could remember the sets of transformations we did at the time of training and automatically apply them at the time of prediction? It can, using the TRANSFORM clause!

In this case, the resulting model requires just the start_station_name and start_date to predict the duration. The transformations are saved and carried out on the provided raw data to create input features for the model. The main advantage of placing all preprocessing functions inside the TRANSFORM clause is that clients of the model do not have to know what kind of preprocessing has been carried out.

CREATE OR REPLACE MODEL
  bike_model.model_bucketized TRANSFORM(* EXCEPT(start_date),
  IF
    (EXTRACT(dayofweek
      FROM
        start_date) BETWEEN 2 AND 6,
      'weekday',
      'weekend') AS dayofweek,
    ML.BUCKETIZE(EXTRACT(HOUR
      FROM
        start_date),
      [5, 10, 17]) AS hourofday )
OPTIONS
  (input_label_cols=['duration'],
    model_type='linear_reg') AS
SELECT
  duration,
  start_station_name,
  start_date
FROM
  `bigquery-public-data`.london_bicycles.cycle_hire

With the TRANSFORM clause in place, enter this query to predict the duration of a rental from Park Lane right now:

SELECT
  *
FROM
  ML.PREDICT(MODEL bike_model.model_bucketized,
    (
    SELECT
      'Park Lane , Hyde Park' AS start_station_name,
      CURRENT_TIMESTAMP() AS start_date) )

To check the model weights:

SELECT * FROM ML.WEIGHTS(MODEL bike_model.model_bucketized)

Lab: Movie Recommendations in BigQuery ML

Objectives

In this lab, you learn to perform the following tasks:

  • Train a recommendation model in BigQuery
  • Make product predictions for both single users and batch users

Task 1: Get MovieLens Data

Make a BigQuery dataset and load the MovieLens data:

bq --location=EU mk --dataset movies

curl -O 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
unzip ml-20m.zip

bq --location=EU load --source_format=CSV \
   --autodetect movies.movielens_ratings ml-20m/ratings.csv
bq --location=EU load --source_format=CSV \
   --autodetect movies.movielens_movies_raw ml-20m/movies.csv

Task 2: Explore the Data

SELECT
  COUNT(DISTINCT userId) numUsers,
  COUNT(DISTINCT movieId) numMovies,
  COUNT(*) totalRatings
FROM
  movies.movielens_ratings
  
SELECT
  *
FROM
  movies.movielens_movies_raw
WHERE
  movieId < 5

We can see that the genres column is a formatted string. Parse the genres into an array and rewrite the results into a table named movielens_movies.

CREATE OR REPLACE TABLE
  movies.movielens_movies AS
SELECT
  * REPLACE(SPLIT(genres, "|") AS genres)
FROM
  movies.movielens_movies_raw

Task 3: Collaborative Filtering

Creation of the model and evaluation

CREATE OR REPLACE MODEL
  movies.movie_recommender
OPTIONS
  (model_type='matrix_factorization',
    user_col='userId',
    item_col='movieId',
    rating_col='rating',
    l2_reg=0.2,
    num_factors=16) AS
SELECT
  userId,
  movieId,
  rating
FROM
  movies.movielens_ratings
  
To evaluate, query a pre-trained copy of the model hosted in the cloud-training-prod-bucket project:

SELECT * FROM ML.EVALUATE(MODEL `cloud-training-prod-bucket.movies.movie_recommender`)

Task 4: Making Recommendations

Let’s find the best comedy movies to recommend to the user whose userId is 903. Enter the query below:

SELECT
  *
FROM
  ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
    (
    SELECT
      movieId,
      title,
      903 AS userId
    FROM
      `movies.movielens_movies`,
      UNNEST(genres) g
    WHERE
      g = 'Comedy' ))
ORDER BY
  predicted_rating DESC
LIMIT
  5  

This result includes movies the user has already seen and rated in the past. Let’s remove them:

SELECT
  *
FROM
  ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
    (
    WITH
      seen AS (
      SELECT
        ARRAY_AGG(movieId) AS movies
      FROM
        movies.movielens_ratings
      WHERE
        userId = 903 )
    SELECT
      movieId,
      title,
      903 AS userId
    FROM
      movies.movielens_movies,
      UNNEST(genres) g,
      seen
    WHERE
      g = 'Comedy'
      AND movieId NOT IN UNNEST(seen.movies) ))
ORDER BY
  predicted_rating DESC
LIMIT
  5

Task 5: Customer Targeting

We wish to get more reviews for movieId=96481, which has only one rating, and we wish to send coupons to the 100 users who are likely to rate it the highest. Identify those users using:

SELECT
  *
FROM
  ML.PREDICT(MODEL `cloud-training-prod-bucket.movies.movie_recommender`,
    (
    WITH
      allUsers AS (
      SELECT
        DISTINCT userId
      FROM
        movies.movielens_ratings )
    SELECT
      96481 AS movieId,
      (
      SELECT
        title
      FROM
        movies.movielens_movies
      WHERE
        movieId=96481) title,
      userId
    FROM
      allUsers ))
ORDER BY
  predicted_rating DESC
LIMIT
  100

The result gives us 100 users to target.

Task 6: Batch predictions for all users and movies

Enter the following query to obtain batch predictions:

SELECT
  *
FROM
  ML.RECOMMEND(MODEL `cloud-training-prod-bucket.movies.movie_recommender`)
LIMIT 
  100000

Cloud AutoML

Cloud AutoML is a service on Google Cloud Platform that allows you to build powerful machine learning models with minimal effort and machine learning expertise.

Cloud AutoML uses a prepared dataset to train a custom model. You can make small prepared datasets for experimentation directly in the web UI, but it is more common to assemble the information in a CSV (comma-separated values) file. The CSV file must be UTF-8 encoded and located in the same Cloud Storage bucket as the source files. For example, a hypothetical AutoML Vision row pairs an image URI with its label, such as gs://my-bucket/flowers/img1.jpg,daisy. You can also create and manage prepared datasets programmatically in Python, Java, or Node.js.

With Cloud AutoML you can create smaller more specialized custom models and use them programmatically.

Now we’ll describe AutoML Vision. This is a Cloud AutoML product for image data. Cloud AutoML Vision specializes in training models for image classification.

Cloud AutoML Natural Language specializes in training models for text data. For example, if you have a set of newspaper articles, you can use the AutoML Natural Language service to classify whether a given article is about sports or politics.

While AutoML Vision and Natural Language are for unstructured data, AutoML Tables is for structured data. The development of AutoML Tables was a collaboration with the Google Brain team. While the technical details of the project haven't been released to the public yet, the team basically took the architecture-search capability used for image classification and translation problems and found a way to apply it to tabular data.
