Preparing for Data

Designing data processing systems

System availability is important to pipeline processing but not to data representation, and capacity is important to processing but not to the abstract pipeline or the representation. Think about data engineering and Google Cloud as a platform consisting of components that can be assembled into solutions. Let’s review the elements of GCP that form the data engineering platform.

For example, a Cloud SQL database is very good at storing consistent individual transactions, but it’s not really optimized for storing large amounts of unstructured data like video files. Database services perform minimal operations on the data within the context of the access method, for example, SQL queries can aggregate, accumulate, count, and summarize results of a search query. 

Here’s an exam tip, know the differences between Cloud SQL and Cloud Spanner and when to use each. 

An exam tip: know how to identify technologies backwards from their properties. For example, which data technology offers the fastest ingest of data? Which one might you use to ingest streaming data? Managed services are ones where you can see the individual instance or cluster.

Exam tip: managed services still have some IT overhead. They don’t completely eliminate the overhead or manual procedures, but they minimize them compared with on-prem solutions. Serverless services remove more of the IT responsibility, so managing the underlying servers is not part of your overhead and the individual instances are not visible.

Cloud Firestore is a NoSQL document database built for automatic scaling. It offers high performance and ease of application development, and it includes a Datastore compatibility mode.

So where do you get these resources? You could use any of these computing platforms to write your own application, or parts of an application, that use storage or database services. You could install open-source software such as MySQL, an open-source database, or Hadoop, an open-source data processing platform, on Compute Engine. Build-your-own solutions are driven mostly by business requirements.

Data processing services combine storage and compute and automate the storage and compute aspects of data processing through abstractions. For example, in Cloud Dataproc, the data abstraction with Spark is a resilient distributed dataset, or RDD, and the processing abstraction is a directed acyclic graph, DAG. In BigQuery, the abstractions are table and query, and in Dataflow, the abstractions are PCollection and pipeline.

Your exam tip is to understand the array of machine learning technologies offered on GCP, and when you might want to use each. A data engineering solution involves data ingest, management during processing, analysis, and visualization. These elements can be critical to the business requirements. Here are a few services that you should be generally familiar with.

Data transfer services operate online to synchronize data in the Cloud with an external source, while the Transfer Appliance is a shippable device used to move large amounts of data when transferring over the network isn’t practical.

Data Studio is used for visualization of data after it has been processed.

Cloud Dataprep is used to prepare or condition data and to prepare pipelines before processing the data.

Cloud Datalab is a notebook that is a self-contained workspace that holds code, executes the code, and displays results.

Dialogflow is a service for creating chatbots. It uses AI to provide a method for direct human interaction with data.

Cloud Pub/Sub, a messaging service, features in virtually all live or streaming data solutions because it decouples data arrival from data ingest. Cloud VPN, Partner Interconnect, or Dedicated Interconnect play a role whenever there’s data on premises that must be transmitted to services in the Cloud. Cloud IAM, firewall rules, and key management are critical to some verticals, such as the health care and financial industries. Every solution needs to be monitored and managed, which usually involves panels displayed in Cloud Console and data sent to Stackdriver Monitoring.

However, the fact that one standard instance type has higher IOPS than another, or that one standard instance costs more than another, are not concepts that you would need to know as a data engineer.

Designing flexible data representations

There are different abstractions for storing data. If you store data in one abstraction instead of another, it makes different processes easier or faster.

For example, if you store data in a file system, it makes it easier to retrieve that data by name. If you store data in a database, it makes it easier to find data by logic such as SQL. If you store data in a processing system it makes it easier and faster to transform the data not just retrieve it. 

For example, if a problem is described using the terms rows and columns, since those concepts are used in SQL, you might be thinking about a SQL database such as Cloud SQL or Cloud Spanner. If an exam question describes an entity and a kind which are concepts used in Cloud Datastore, and you don’t know what they are, you’ll have a difficult time answering the question.

An exam tip: it’s good to know how data is stored, and what purpose or use case the storage or database is optimized for.

Flat serialized data is easy to work with but it lacks structure and therefore meaning. If you want to represent data that has meaningful relationships, you need a method that not only represents the data but also the relationships. 

CSV, which stands for comma-separated values is a simple file format used to store tabular data. XML, which stands for eXtensible Markup Language was designed to store and transport data and was designed to be self-descriptive. JSON, which stands for JavaScript Object Notation is a lightweight data interchange format based on name-value pairs and an ordered list of values, which maps easily to common objects in many programming languages.
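To make the contrast concrete, here’s a minimal Python sketch, using only the standard library, that writes the same made-up record as CSV and as JSON; the field names are invented for illustration.

```python
import csv
import io
import json

# A hypothetical record; the field names are invented for illustration.
record = {"id": 42, "name": "sensor-7", "reading": 3.14}

# CSV: flat and tabular; column order matters and there is no nesting.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "reading"])
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())      # id,name,reading  then  42,sensor-7,3.14

# JSON: self-describing name-value pairs that support nesting and lists.
print(json.dumps({"device": record, "tags": ["test", "demo"]}, indent=2))
```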

Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols and serializes data in a compact binary format. 

For example, there’s a data type in modern SQL called NUMERIC. NUMERIC is similar to floating point; however, it provides 38 digits of precision, with nine of those digits reserved for values after the decimal point. NUMERIC is very good at storing the common fractions associated with money, and it avoids the rounding error that occurs in a full floating-point representation. So, it’s used primarily for financial transactions.
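To see why that matters for money, here’s a minimal Python sketch contrasting binary floating point with the standard library’s Decimal type, which plays a role similar to NUMERIC here; this is only an illustration, not the BigQuery implementation.

```python
from decimal import Decimal

# Binary floating point accumulates rounding error on decimal fractions.
total_float = 0.10 + 0.10 + 0.10
print(total_float)        # 0.30000000000000004

# A decimal type keeps exact cents, which is what NUMERIC is for.
total_decimal = Decimal("0.10") + Decimal("0.10") + Decimal("0.10")
print(total_decimal)      # 0.30
```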

You should already know that every resource in GCP exists inside a project, and besides security and access control, a project is what links usage of a resource to a credit card; it’s what makes a resource billable. Then, in BigQuery, data is stored inside datasets, datasets contain tables, and tables contain columns. When you process the data, BigQuery creates a job. Often the job runs a SQL query, although there are some update and maintenance activities supported using Data Manipulation Language, or DML.

BigQuery is called a columnar store, meaning that it’s designed for processing columns, not rows. Column processing is very cheap and fast in BigQuery, and row processing is slow and expensive.

You can stream (append) data easily to BigQuery tables, but you can’t easily change existing values. Replicating the data three times also helps the system determine optimal compute nodes to do filtering, mixing, and so forth.

You treat your data in Cloud Dataproc and Spark as a single entity, but Spark knows the truth: your data is stored in Resilient Distributed Datasets, or RDDs. RDDs are an abstraction that hides the complicated details of how data is located and replicated in a cluster. Spark partitions data in memory across the cluster and knows how to recover the data through an RDD’s lineage should anything go wrong. Data partitioning, data replication, data recovery, and pipelining of processing are all automated by Spark, so you don’t have to worry about them.

Here’s an exam tip: you should know how different services store data and how each method is optimized for specific use cases, as previously mentioned, but also understand the key value of the approach.

 In this case RDDs hide complexity and allow Spark to make decisions on your behalf.

There are a number of concepts that you should know about Cloud Dataflow. Your data in Dataflow is represented in PCollections. The pipeline shown in this example reads data from BigQuery, does a bunch of processing, and writes its output to Cloud Storage.

For development there’s a local runner, and for production there’s a Cloud Runner. When the pipeline is running on the Cloud each step, each transform, is applied to a PCollection and results in a PCollection.

The idea is to write Python or Java code and deploy it to Cloud Dataflow, which then executes the pipeline in a scalable, serverless context. Unlike Cloud Dataproc, there’s no need to launch a cluster or scale the cluster; that’s handled automatically.

Here are some key concepts from Dataflow, that a data engineer should know: in a Cloud Dataflow pipeline, all the data is stored in a PCollection. The input data is a PCollection. Transformations make changes to a PCollection and then output another PCollection. A PCollection is immutable. That means you don’t modify it. That’s one of the secrets of its speed. Every time you pass data through a transformation it creates another PCollection. 
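Here’s a minimal sketch of those concepts using the Apache Beam Python SDK with the local runner; the input values and transform names are made up for illustration, and a production job would instead pass pipeline options selecting the Dataflow runner, a project, and staging locations.

```python
import apache_beam as beam

# Each transform reads one immutable PCollection and produces a new PCollection.
with beam.Pipeline() as p:   # defaults to the local DirectRunner
    lines = p | "Create" >> beam.Create(["alpha", "beta", "gamma"])
    upper = lines | "ToUpper" >> beam.Map(lambda s: s.upper())
    short = upper | "KeepShort" >> beam.Filter(lambda s: len(s) <= 5)
    short | "Print" >> beam.Map(print)
```

The transforms themselves don’t change when you move from the local runner to Cloud Dataflow; only the pipeline options do.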

Cloud Dataflow is designed to use the same pipeline, the same operations, the same code for both batch and stream processing. Remember that batch data is also called bounded data and it’s usually a file. Batch data has a finite end. Streaming data is also called unbounded data and it might be dynamically generated. For example, it might be generated by sensors or by sales transactions. Streaming data just keeps going. Day after day, year after year with no defined end. 

Many Hadoop workloads can be run more easily and are easier to maintain with Cloud Dataflow. But PCollections and RDDs are not identical. So, existing code has to be redesigned and adapted to run in the Cloud Dataflow pipeline.

The flow is a pipeline, just like we discussed with Cloud Dataflow, but the data object in TensorFlow is not a PCollection but something called a tensor. A tensor is a special mathematical object that unifies scalars, vectors, and matrices. A rank 0 tensor is just a single value, a scalar. A rank 1 tensor is a vector, having direction and magnitude. A rank 2 tensor is a matrix. A rank 3 tensor is a cube shape. Tensors are very good at representing certain kinds of math functions, such as coefficients in an equation, and TensorFlow makes it possible to work with tensor data objects of any dimension.
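A small sketch of those ranks, assuming TensorFlow 2.x where constants can be inspected immediately:

```python
import tensorflow as tf

scalar = tf.constant(3.0)                        # rank 0: a single value
vector = tf.constant([1.0, 2.0, 3.0])            # rank 1: direction and magnitude
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # rank 2: rows and columns
cube = tf.zeros([2, 2, 2])                       # rank 3: a cube shape

for t in (scalar, vector, matrix, cube):
    print(t.shape)    # (), (3,), (2, 2), (2, 2, 2)
```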

Preparing for Pipelines

Design data pipelines

A pipeline is some kind of sequence of actions or operations to be performed on the data representation.

Cloud Dataproc is a managed Hadoop service, and there are a number of things you should know, including the standard software in the Hadoop ecosystem and the components of Hadoop.

However, the main thing you should know about Cloud Dataproc is how to use it differently from standard Hadoop. If you store your data external to the cluster, keeping HDFS-type data in Cloud Storage and HBase-type data in Cloud Bigtable, then you can shut your cluster down when you’re not actually processing a job. That’s very important.

What are the two problems with Hadoop? First, trying to tweak all of its settings so it can run efficiently with multiple different kinds of jobs, and second, trying to cost justify utilization. So you search for users to increase your utilization, and that means tuning the cluster. And then if you succeed in making it efficient, it’s probably time to grow the cluster.

When you have a stateless Cloud Dataproc cluster, it typically takes only about 90 seconds for the cluster to start up and become active. Cloud Dataproc supports Hadoop, Pig, Hive, and Spark.

One exam tip: Spark is important because it does part of its pipeline processing in memory rather than copying from disk. For some applications, this makes Spark extremely fast. With a Spark pipeline, you have two different kinds of operations: transforms and actions. Spark builds its pipeline using an abstraction called a directed graph. Each transform builds additional nodes into the graph, but Spark doesn’t execute the pipeline until it sees an action.

Tip: Spark can wait until all the requests are in before applying resources. Very simply, Spark waits until it has the whole story, all the information. This allows Spark to choose the best way to distribute the work and run the pipeline.

For a transformation, the input is an RDD and the output is an RDD. When Spark sees a transformation, it registers it in the directed graph and then it waits. An action triggers Spark to process the pipeline, the output is usually a result format, such as a text file, rather than an RDD.

Transformations and actions are API calls that reference the functions you want them to perform. Anonymous functions in Python, lambda functions, are commonly used to make the API calls.
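As a rough PySpark sketch of that lazy behavior (the values and app name here are just illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))               # an RDD; nothing runs yet
squared = rdd.map(lambda x: x * x)            # transformation: a node added to the graph
evens = squared.filter(lambda x: x % 2 == 0)  # transformation: still no execution

print(evens.collect())                        # action: Spark now plans and runs the pipeline
spark.stop()
```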

An interesting and opposite approach where the system tries to process the data as soon as it’s received is called, eager execution.

You can use Cloud Dataproc and BigQuery together in several ways. BigQuery is great at running SQL queries, but what it isn’t built for is modifying data, real data-processing work. So if you need to do some kind of analysis that’s really hard to accomplish in SQL, sometimes the answer is to extract the data from BigQuery into Cloud Dataproc and let Spark run the analysis. Also, if you need to alter or process the data, you might read from BigQuery into Cloud Dataproc, process the data, and write it back out to another dataset in BigQuery. Here’s another tip: if the situation you’re analyzing has data in BigQuery, and perhaps the business logic is better expressed in terms of functional code rather than SQL, you may want to run a Spark job on the data.

Cloud Dataproc has connectors to all kinds of GCP resources. You can read from GCP sources, and write to GCP sources, and use Cloud Dataproc as the interconnecting glue.

You can also run open source software from the Hadoop ecosystem on a cluster. It would be wise to be at least familiar with the most popular Hadoop software and to know whether alternative services exist in the cloud. For example, Kafka is a messaging service, and the alternative on GCP would be Cloud Pub/Sub. Do you know what the GCP alternative is to the open-source HBase? That’s right, it’s Cloud Bigtable; and the alternative to HDFS is Cloud Storage. Installing and running Hadoop open source software on a Cloud Dataproc cluster is also an option.

Here is a tip about modifying the Cloud Dataproc cluster, if you need to modify the cluster, consider whether you have the right data-processing solution. There are so many services available on Google Cloud, you might be able to use a service rather than hosting your own on the cluster.

If you’re migrating data center Hadoop to Cloud Dataproc, you may already have customized Hadoop settings that you would like to apply to the cluster. You may want to customize some cluster configurations so that it works similarly. That’s supported in a limited way by cluster properties. Security in Cloud Dataproc is controlled by access to the cluster as a resource.

Cloud Dataflow pipelines

You can write pipeline code in Java or Python.

You can use the open source Apache Beam API to define the pipeline and submit it to Cloud Dataflow. Then Cloud Dataflow provides the execution framework. 

Parallel tasks are automatically scaled by the framework, and the same code does real-time streaming and batch processing. One great thing about Cloud Dataflow is that you can get input from many sources and write output to many sinks, but the pipeline code in between remains the same. Cloud Dataflow supports side inputs. That’s where you can take data and transform it in one way and transform it in a different way in parallel, so that the two can be used together in the same pipeline. Security in Cloud Dataflow is based on assigning roles that limit access to the Cloud Dataflow resources. So, your exam tip is: for Cloud Dataflow users, use roles to limit access to only Dataflow resources, not the whole project.

The dataflow pipeline not only appears in code, but also is displayed in the GCP Console as a diagram.

GroupByKey, for one, could consume resources on big data. This is one reason you might want to test your pipeline a few times on sample data to make sure you know how it scales before executing it at production scale.

A pipeline is a more maintainable way to organize data processing code than for example, an application running on an instance.

Templates create a single step of indirection that allows the two classes of users to have different access. Dataflow templates enable a new development and execution workflow: templates help separate the development activities and the developers from the execution activities and the users.

Dataflow Templates open up new options for separation of work. That means better security and resource accountability.

Preparing for infrastructure solutions

BigQuery and Cloud Dataflow Solutions

BigQuery is two services: a front-end service that does analysis, and a back-end service that does storage. It offers near real-time analysis of massive datasets. The data storage is durable and inexpensive, and you can connect and work with different datasets to drive new insights and business value. BigQuery uses SQL for queries, so it’s immediately usable by many data analysts. BigQuery is fast, but how fast is fast? Well, if you’re using it with structured data for analytics, it can take a few seconds. BigQuery connects to many services for flexible ingest and output, and it supports nested and repeated fields for efficiency and user-defined functions for extensibility. Exam tip: access control in BigQuery is at the project and the dataset level.

Here is a major design tip: separating compute and processing from storage and database enables serverless operations. BigQuery has its own analytic SQL query front end, available in the console and from the command line with bq.

Cloud Storage replaces HDFS; Cloud Bigtable replaces HBase.

Cloud Dataproc and Cloud Dataflow can output separate files as CSV files in Cloud Storage. In other words, you can have a distributed set of nodes or servers processing the data in parallel and writing the results out in separate small files. This is an easy way to accumulate distributed results for later collating. Access any storage service from any data processing service. Cloud Dataflow is an excellent ETL solution for BigQuery. Use Cloud Dataflow to aggregate data in support of common queries.

Design data processing infrastructure

Example:

 You can load data into BigQuery from the GCP console. You can stream data using Cloud Dataflow, and from Cloud Logging, or you can use POST calls from a program. And it’s very convenient that BigQuery can automatically detect CSV and JSON format files.
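As a sketch of the programmatic route, here’s a hedged example using the google-cloud-bigquery Python client to load a CSV with schema auto-detection; the project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the project and credentials from your environment

# Placeholder names for illustration only.
table_id = "my-project.my_dataset.my_table"
uri = "gs://my-bucket/my-data.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # let BigQuery infer the schema from the CSV
    skip_leading_rows=1,    # skip the header row
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()           # wait for the load job to complete
print(client.get_table(table_id).num_rows)
```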

Cloud Pub/Sub

The Cloud Pub/Sub message broker enables complete ingest solutions. It provides loose coupling between systems and long-lived connections between systems. Exam tip. You need to know how long Cloud Pub/Sub holds messages. It’s up to seven days.

Cloud Pub/Sub provides at-least-once delivery, and Cloud Dataflow handles deduplication, ordering, and windowing. This separation of duties enables a scalable solution that surpasses bottlenecks in competing messaging systems.

Designing Data Processing Systems

Tip: Be familiar with the common use cases and qualities of the different storage options. Each storage system or database is optimized for different things; some are best at atomically updating the data for transactions. Some are optimized for speed of data retrieval, but not for updates or changes. Some are very fast and inexpensive for simple retrieval, but slow for complex queries

Tip: An important element in designing the data processing pipeline is starting with selecting the appropriate service or collection of services.
Tip: Cloud Datalab, Google Data Studio, and BigQuery all have interactive interfaces. Do you know when to use each?

Tip: Cloud Pub/Sub and Cloud Dataflow together provide exactly-once, in-order processing of possibly delayed or repeated streaming data.
Tip: Be familiar with the common assemblies of services and how they are often used together: Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Storage, and Cloud Pub/Sub.

Tip: Technologically, Cloud Dataproc is superior to open-source Hadoop, and Cloud Dataflow is superior to Cloud Dataproc. However, this does not mean that the most advanced technology is always the best solution; you need to consider the business requirements. The client might want to first migrate from the data center to the cloud, make sure everything is working (validate it), and only after they are confident with that solution, consider improving or modernizing.

Practice questions :

Question 1:

You need a storage solution for CSV files. Analysts will run ANSI SQL queries. You need to support complex aggregate queries and reuse existing I/O-intensive custom Apache Spark transformations. How should you transform the input data?

The correct answer is B: use BigQuery for the storage solution and Cloud Dataproc for the processing solution. Cloud Dataproc is correct because the question states you plan to reuse Apache Spark code. The CSV files could be in Cloud Storage, or could be ingested into BigQuery. In this case, you need to support complex SQL queries, so it’s best to use BigQuery for storage. This was not one of those straightforward cases where you might consider just keeping the data in Cloud Storage.

Question 2 :

You are selecting a streaming service for log messages that must include final result message ordering as part of building a data pipeline on Google Cloud. You want to stream input for five days, and be able to query the most recent message value. You’ll be storing the data in a searchable repository. How should you set up the input messages?

We can figure that Apache Kafka is not the recommended solution in this scenario because you would have to set it up and maintain it. That could be a lot of work. Why not just use the Cloud Pub/Sub service and eliminate the overhead? You need a timestamp to implement the rest of the solution, so applying it at ingest in the publisher is a good, consistent way to get the timestamp that’s required.

Data representations, pipelines, and processing infrastructure

Preparing for Building and Operationalizing Solutions

Building data processing systems

The first area of data processing we’ll look at is building and maintaining structures and databases.

BigQuery is recommended as a data warehouse, and it is the default storage for tabular data. Use Cloud SQL or Cloud Spanner if you need transactions; use Cloud Bigtable if you want low latency and high throughput.

Here’s some concrete advice on flexible data representation. You want the data divided up in a way that makes the most sense for your given use case.

 In Cloud Datastore, there are only two APIs that provide a strongly consistent view for reading entity values and indexes. One, the lookup by key method and two the ancestor query. 
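A small sketch of both strongly consistent read paths using the google-cloud-datastore Python client; the kind names and key values are invented for illustration.

```python
from google.cloud import datastore

client = datastore.Client()

# 1. Lookup by key: strongly consistent.
task_key = client.key("Task", "task-001")          # hypothetical kind and key name
task = client.get(task_key)

# 2. Ancestor query: strongly consistent within the entity group.
parent_key = client.key("TaskList", "default")     # hypothetical parent entity
query = client.query(kind="Task", ancestor=parent_key)
tasks = list(query.fetch())
```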

Cloud Storage, Cloud SQL, Cloud Bigtable

Cloud Storage is persistent. It has storage classes: Nearline, Coldline, Regional, and Multi-Regional. There’s granular access control. You should be familiar with all the methods of access control, including IAM roles and signed URLs. Cloud Storage has a ton of features that people often miss; they end up trying to duplicate the function in code when, in fact, all they need to do is use the capability that’s already available.

For example, you can change storage classes, you can stream data to Cloud Storage. Cloud Storage supports a kind of versioning, and there are multiple encryption options to meet different needs. Also, you can automate some of these features using Lifecycle management. For example, you could change the class of storage for an object or delete that object after a period of time.
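For instance, here’s a sketch of setting lifecycle rules with the google-cloud-storage Python client; the bucket name and the age thresholds are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")    # placeholder bucket name

# Move objects to a colder storage class after 90 days and delete them after 365.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()                                     # apply the lifecycle configuration
```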

Cloud Bigtable is meant for high-throughput data. It has millisecond latency, so it’s much faster than BigQuery, for example. It’s NoSQL, so this is a good wide-column store. When would you want to select SSD for the machines in the cluster rather than HDD? If you need faster performance.

Cloud Spanner is strongly typed and globally consistent. The two characteristics that distinguish it from Cloud SQL are globally consistent transactions and size; Cloud Spanner can work with much larger databases than Cloud SQL.

Cloud SQL is fine if you can get by with a single database. But if your needs are such that you need multiple databases, Cloud Spanner is a great choice.

Distributing MySQL is hard. However, Spanner distributes easily even globally and provides consistent performance to support more throughput by just adding more nodes. 

Cloud Datastore is a NoSQL solution that used to be private to App Engine. It offers many features that are mainly useful to applications such as persisting state information. It’s now available to clients besides App Engine.

For example, if the exam question mentions a data warehouse, you should be thinking of BigQuery as a candidate. If the case says something about large media files, you should immediately be thinking of Cloud Storage.

Building and operationalizing pipelines


Cloud Dataflow is Apache Beam as a service, a fully managed auto-scaling service that runs Beam pipelines. Continuous data can arrive out of order. Simple windowing can separate related events into independent windows.

Windowing creates individual results for different slices of event time. Windowing divides a PCollection up into finite chunks based on the event time of each message; it can be useful in many contexts, but it’s required when aggregating over infinite data. Remember to study side inputs.
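Here’s a rough Apache Beam Python sketch of fixed windows over keyed data; the events, their timestamps, and the 60-second window size are all made up for illustration.

```python
import apache_beam as beam
from apache_beam.transforms import window

# Hypothetical (key, value, event_time_in_seconds) tuples standing in for a stream.
events = [("sensor-1", 3, 0.0), ("sensor-1", 5, 30.0), ("sensor-1", 2, 75.0)]

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(events)
        | "Timestamp" >> beam.Map(
            lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
        | "SumPerKey" >> beam.CombinePerKey(sum)   # aggregates within each window
        | "Print" >> beam.Map(print)
    )
```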

Building and operationalizing processing infrastructure

You can stream unbounded data into BigQuery, but it maxes out at 100,000 rows per table per second. Cloud Pub/Sub guarantees delivery but might deliver the messages out of order. If you have a timestamp, then Cloud Dataflow can remove duplicates and work out the order of messages.

Practice Exam Questions 2

Question 1

An application that relies on Cloud SQL to read infrequently changing data is predicted to grow dramatically. How can you increase capacity for more read-only clients? 

The clue is that the clients are read-only, and the challenge is scale. Read replicas increase capacity for simultaneous reads. Note that a high availability configuration wouldn’t help in this scenario because it would not necessarily increase throughput. 

Question 2 :

A BigQuery data set was located near Tokyo. For efficiency reasons, the company wants the dataset duplicated in Germany

BigQuery imports and exports data to regional or multi-regional buckets in the same location. So, you need to use Cloud Storage as an intermediary.

Question 3:

A transactionally consistent global relational repository where you can monitor and adjust node count for unpredictable traffic spikes.

B is correct because of the requirement for globally scalable transactions, so use Cloud Spanner. CPU utilization is the recommended metric for scaling, per Google best practices.

Building and Operationalizing Data Processing Systems

Tip: Data management is often influenced by business requirements. After the data has been used for the “live” application, is it collected for reporting, for backup and recovery, for audits, or for legal compliance? What are the changing business purposes of the data in different time frames?

Tip: “Effective use of managed services”: Choose the right service and the correct settings/features for specific use cases.

Consider costs, performance, and effective use cases (key features) for these:
● Cloud Bigtable
● Cloud Spanner
● Cloud SQL
● BigQuery
● Cloud Storage
● Cloud Datastore
● Cloud Memorystore

Tip: What is Data cleansing? Data cleansing is improving the data quality through consistency. You could use Cloud Dataprep to Extract, Transform, or Load (ETL). You could run a data transformation job on Cloud Dataproc.

Tip: Batch and streaming together? You should already be thinking “Cloud Dataflow.”

Tip: Integrating with new data sources? You should be familiar with the connectors available between services in the cloud and common import/acquisition configurations.

Tip: Testing and quality control and monitoring: You should be familiar with the common approaches to testing that are used in production environments, such as A/B testing, and other rollout scenarios. Similarly, there are operational and administrative monitoring elements for most services in the GCP Console, and statistical and log monitoring in Stackdriver. Do you know how to enable and use Stackdriver with common services?

Case Study 1

We initially considered using Dataproc or Dataflow. However, the customer already had analysts that were familiar with BigQuery and SQL. So if we developed in BigQuery it was going to make the solution more maintainable and usable to the group. If we developed in Dataproc, for example, they would have had to rely on another team that had Spark programmers. So this is an example where the technical solution was influenced by the business context.

To make this solution work, we needed some automation. And for that we chose Apache Airflow. In the original design we ran Airflow on a Compute Engine instance. You might be familiar with the Google service called Cloud Composer, which provides a managed Apache Airflow service. Cloud Composer was not yet available when we began the design.

-> Common data warehouse in BigQuery. Apache Airflow to automate query dependencies

One of the time-sinks in their original process had to do with that 30 hour start-to-finish window. What they would do is start processing jobs and sometimes they would fail because the data from a previous dependency wasn’t yet available. And they had a manual process for restarting those jobs. We were able to automate away that toil and the re-work by implementing the logic in Apache Airflow.

Analyzing and Modeling

What are the three modes of the Natural Language API? The answer is sentiment analysis, entities, and syntax.

Deploying an ML pipeline

One of the basic concepts of machine learning is correctable error. If you can make a guess about something like a value or a state and if you know whether that guess was right or not and especially if you know how far off the guess was, you can correct it repeat that hundreds and thousands of times and it becomes possible to improve the guessing algorithm until the error is acceptable for your application. 

Concepts like fast failure, life cycle and iterations become important in developing and refining a model.

Each time you run through the training data, it’s called an epoch, and you would change some parameters to help the model develop more predictive accuracy. 

Preparing for Machine Learning

On GCP, we can use the logging APIs, Cloud Pub/Sub, and other real-time streaming services to collect the data; BigQuery, Cloud Dataflow, and ML preprocessing SDKs to organize the data using different types of organization; TensorFlow to create the model; and Cloud ML Engine to train and deploy the model.

The way TensorFlow works is that you create a Directed Graph, a DG, to represent your computation.

TensorFlow does lazy evaluation: you write a directed graph, or DG, then you run the DG in the context of a session to get the results. TensorFlow can also run in eager mode, where the evaluation is immediate rather than lazy.

The difference, however, is in execution. NumPy executes immediately; TensorFlow runs in stages. The build stage builds the directed graph, and the run stage executes the graph and produces the results.
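A rough sketch of that build-then-run pattern, shown with the TensorFlow 1.x style API (exposed as tf.compat.v1 in TensorFlow 2) next to NumPy’s immediate execution:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()   # use the lazy, graph-based behavior described above

# NumPy executes immediately and returns a concrete value.
print(np.add(3, 5))            # 8

# TensorFlow build stage: these calls only add nodes to the directed graph.
a = tf.constant(3)
b = tf.constant(5)
c = tf.add(a, b)
print(c)                       # a Tensor handle, not the number 8

# TensorFlow run stage: the session executes the graph and produces the result.
with tf.Session() as sess:
    print(sess.run(c))         # 8
```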

Because developing ML models is so processor intensive, it’s important to get the model right before scaling up. Otherwise, the models can become expensive.

ML and Unstructured Data

Working with unstructured data:

 It’s important to recognize that machine learning has two stages, training and inference.  If you have an ML question that refers to labels, it is a question about supervised learning. If the question is about regression or classification, it’s using supervised machine learning. 

Mean square error:

Gradient descent

One reason for using the root of the mean square error rather than the mean square error itself is that the RMSE is in the units of the measurement, making it easier to read and understand the significance of the value. Categorizing produces discrete values, and regression produces continuous values; each uses different methods. Is the result you’re looking for like deciding whether an instance is in category A or category B? If so, it’s a discrete value and therefore uses classification. Is the result you’re looking for more like a number, such as the current value of a house? If so, it’s a continuous value and therefore uses regression. If the question describes cross entropy, it’s a classification ML problem.
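To connect those pieces, here’s a minimal NumPy sketch of gradient descent minimizing mean squared error for a one-variable linear regression; the data points and learning rate are made up for illustration.

```python
import numpy as np

# Made-up training data roughly following y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

w, b = 0.0, 0.0     # initial guesses
lr = 0.02           # learning rate

for epoch in range(500):             # each full pass over the data is an epoch
    pred = w * x + b
    error = pred - y
    mse = np.mean(error ** 2)        # mean squared error: the correctable error
    grad_w = 2 * np.mean(error * x)  # gradient of MSE with respect to w
    grad_b = 2 * np.mean(error)      # gradient of MSE with respect to b
    w -= lr * grad_w                 # step against the gradient
    b -= lr * grad_b

# w approaches 2 and b approaches 1; RMSE is reported in the units of y.
print(round(w, 2), round(b, 2), round(float(np.sqrt(mse)), 3))
```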

Training and Validating

You can use the evaluation dataset to determine if the model parameters are leading to overfitting. Overfitting, or memorizing your training dataset, can be far worse than having a model that only adequately fits your data. If someone said they had a machine learning model that recognizes new instances and categorizes them correctly 100% of the time, it would be an indicator that the validation data somehow got mixed up with the training data, and that the data is no longer a good measure of how well the model’s working.

Recap :

Practice questions

Question 1

Quickly and inexpensively develop an application that sorts product reviews by most favorable to least favorable.

Use sentiment analysis to sort the reviews. The story here is to use a pre-trained model if it will do. Creating models is expensive and time consuming. Use a pre-trained model whenever possible. In this case, the natural language API with sentiment analysis returns score and magnitude of sentiment.

Question 2 :

Maximize speed and minimize cost of deploying a TensorFlow machine-learning model on GCP.

A is correct because it follows Google’s recommended practices. Best practice is to use each tool for the purpose for which it was designed and built. So a tip here is to note when recommended best practices are called out, because those might be on the exam. B is incorrect because Kubernetes isn’t the right tool for this circumstance. And C and D are not correct because, in this situation, you don’t need to export two copies of the trained model.

Outline review :

Operationalizing Machine Learning Models

Tip: There are a few elements here. The first is building systems that use these services. The second is using additional services to augment, improve, or enhance the base functionality.


Study these:
● Cloud Vision API
● Cloud Text-to-speech API
● Cloud Speech-to-text API
● Cloud AutoML Vision
● Cloud AutoML Natural Language
● Cloud AutoML Translation
● Dialogflow

Tip: You need to know how to deploy existing models to Cloud Machine Learning Engine and to maintain them, which might involve retraining.
Tip: Continuous evaluation is setting up continuous evaluation of the machine learning model so that steps can be taken to improve it.

Study these:
● Kubeflow
● Cloud Machine Learning Engine
● Spark ML
● BigQuery ML

Tip: Edge computing is the design of distributing processing in a strategic way so that model processing is pushed closer to the inputs; for example, in IoT, doing machine learning processing closer to the IoT sensors by performing work in nearby data centers or regions is edge computing.

Study these:
● GPU
● TPU

Tip: One common source of error is accidental inclusion of biased data in the data being used for model training or validation.


Do you know these terms in a machine learning context?
● Features
● Labels
● Models
● Regression
● Classification
● Recommendation
● Supervised and unsupervised learning
● Evaluation
● Metrics
● Assumptions about data

Case Study 2

This case involves a media company that’s decided to move their in-house data processing into BigQuery. This example is focused on security and compliance. As part of the migration, they’ve been moving their data centers from on-prem to BigQuery and the cloud. They have a lot of concerns about security. Who has access to the data they’re migrating into the cloud? How is access audited and logged? What kind of controls can be placed on top of that? They’re very concerned about data exfiltration. 

Each group was isolated in separate projects and allowed limited access between them using VPC Service Controls. BigQuery allows separation of access by role, so we were able to limit some roles to only loading data and others to only running queries. Some groups were able to run queries in their own project using datasets for which they only had read access, and the data was stored in a separate repository. We made sure that at the folder level of the resource hierarchy we had aggregated log exports enabled. That ensured that even if you were the owner of a project and had the ability to redirect exports, you wouldn’t be able to do so with specific exports, because those rights were set at the folder level, where most team members didn’t have access. So by using aggregated log exports we were able to scoop up all the logs, store them in Cloud Storage, and create a record of who was running what query at what time against what dataset.

Preparing for Performance and Optimization

Modeling business processes for analysis and optimization

One of the themes in ML is to start simply and build to production. Shown is the general progression of building an ML solution. You can start with big data, go through feature engineering, then create the model and deploy it.

Your exam tip: consider using data where it is, in place, maybe from Cloud Storage, rather than using extract, transform, and load (ETL).

Your exam tip, grouping the work can be efficient and give additional control over the processing of the data. 

Feature Engineering and performance

Feature engineering is a unique discipline. Selecting which feature or features to use in a model is critical, and there are a lot of things to consider, including whether the data of the feature is dense or sparse, and, if the value is numeric, whether the magnitude is meaningful or abstract. Also, a good feature needs to have enough examples available to train, validate, and evaluate the model. Hyperparameters can determine whether your model converges on the truth quickly or not at all.

Schema and performance

The trade-off between a relational and a flat structure comes down to normalization. This process of breaking out fields into another lookup table, increasing the relations between the tables, is called normalization. Normalization represents relations between tables.

The trade-off is performance versus efficiency: normalized is more efficient, denormalized is more performant.

BigQuery can use nested schemas for highly scalable queries. In the example shown, the company field has multiple nested transactions.
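A sketch of defining such a nested, repeated field with the google-cloud-bigquery Python client; the project, dataset, table, and field names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("company", "STRING"),
    bigquery.SchemaField(
        "transactions", "RECORD", mode="REPEATED",   # nested and repeated field
        fields=[
            bigquery.SchemaField("amount", "NUMERIC"),
            bigquery.SchemaField("timestamp", "TIMESTAMP"),
        ],
    ),
]

table = bigquery.Table("my-project.my_dataset.companies", schema=schema)  # placeholder name
table = client.create_table(table)
```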

Pipeline and performance

Understand the fields you’re using for keys when you’re using joins. Limit the use of user-defined functions; use native SQL whenever possible.

Dividing work

Partitioning by time is often a way to evenly distribute work and avoid hotspots in processing data. In this example, a partitioned table includes a pseudo-column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. This improves efficiency for BigQuery.

But the opposite advice is recommended for selecting training data for machine learning. When you identify data for training ML, be careful to randomize rather than organize by time. Otherwise, you might train the model on the first part, for example on summer data, and test using the second part, which might be winter data, and it will appear that the model isn’t working.

Combine allows Cloud Dataflow to distribute a key to multiple workers and process it in parallel. In this example, CombineByKey first aggregates values and then processes the aggregates with multiple workers. Consider, when data is unbounded or streaming, that using windows to divide the data into groups can make processing the data much more manageable. Of course, then you have to consider the size of the window and whether the windows are overlapping.

Bigtable performance

A Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored in a storage service.

Spanner’s organization uses a wide table design; in other words, it’s optimized for explicit columns and for the columns to be grouped under column families. Bigtable, on the other hand, is a sparse table design, meaning that it’s optimized for a single row key and undifferentiated column data.

Because the data is stored sequentially in Bigtable, events starting with the same timestamp will all be stored on the same tablet and that means the processing isn’t distributed.

So Bigtable knows to use copies on other nodes in the file system to improve performance. Here’s some tips to growing a Bigtable cluster. There are a number of steps you can take to increase the performance. One item I would highlight is that there can be a delay of several minutes to hours when you add nodes to a cluster before you see the performance improvement.

Price estimation

So, you don’t pay to load the data at ingest, but you do pay to store the data once it’s loaded. An exam tip is to be suspicious of anything with “select all”; you need to understand the use of wildcards. The pricing calculator can be used with BigQuery to estimate the cost of a query before you submit it. The query validator also estimates how much data will be used by the query before you run it. You can plug this into the pricing calculator to get an estimate of how much you’ll spend before you run the query.
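One way to get that estimate programmatically is a dry run with the google-cloud-bigquery Python client; this sketch queries a public dataset purely for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query_job = client.query(
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` WHERE state = 'TX'",
    job_config=job_config,
)

# A dry run reads no data and bills nothing; it reports the bytes the query would scan.
gib = query_job.total_bytes_processed / 1024 ** 3
print(f"This query would process about {gib:.2f} GiB")
```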

Outline review

Case Study 3

The overall business requirement was to migrate to the Cloud an on-premises reporting solution aimed at producing daily reports for regulators.

The on-premises solution was coded using a ‘SQL-like language’ and it was run on a Hadoop cluster using Spark/MapReduce (leveraging proprietary third-party software).

The client wanted to optimize the target cloud solution to:

  • Minimize changes to their processes and codebase.
  • Leverage PaaS and native cloud solution (i.e., avoid third-party software).
  • Significantly improve performance (from hours to minutes).

We mapped that to technical requirements like this…

Business Requirements

  • Minimum changes to their processes and codebase
  • Leverage PaaS and native Cloud Solution (i.e., avoid third-party software)
  • Significantly improve performance (from hours to minutes)

Technical Requirements

  • Programmatically convert ‘SQL-like’ into ANSI SQL
  • No changes to source/target input structures
  • Automated ‘black box’ regression testing
  • Minimize number of systems/interfaces, aim for full execution in BigQuery (i.e., remove the need for a Hadoop cluster)
  • Analyze and (manually) optimize the SQL code in BigQuery for performance

This is how we implemented that technical requirement. Source data is ingested in cloud storage, no changes to sourcing. Output reports are generated entirely in BigQuery. Cloud Analytics allows users to review the reports. Data is egressed on-prem, so there’s no changes to the target structures, no changes downstream. Logs and controls out of the box in Stackdriver. We ended up with a fairly simple solution. The most notable part is what’s not in the picture and that’s Hadoop. Recall that the customer’s application that was performing the processing was on Hadoop with some MapReduce. We were able to port the data out of the Hadoop Cluster to Cloud Storage.

Reliability, Policy, and Security

Reliable means that the service produces consistent outputs and operates as expected. If we were to quantify it, it would be a measure of how long the service performs its intended function. Available and durable are real-world values, and they’re usually not 100 percent. Available means that the service is accessible on demand, a measure of the percentage of time that the item is in an operable state. Durable has to do with data loss; it means the data does not disappear and information is not lost over time.

 So, the important thing to consider is what are the business requirements to recover from different kinds of problems and how much time is allowed for each kind of recovery? For example, disaster recovery of a week might be acceptable for flood damage to a store front. On the other hand, loss of a financial transaction might be completely unacceptable. So, the transaction itself needs to be atomic, backed up and redundant.

The exam tip here is that you can monitor infrastructure and data services with Stackdriver.

TensorBoard is a collection of visualization tools designed specifically to help you visualize TensorFlow.

The exam tip here is that service specific monitoring may be available. TensorBoard is an example of monitoring tailored to TensorFlow.

So, the overall exam tip is that there might be quality processes or reliability processes built into the technology such as this demonstrates.

Practice Exam Questions 4

Question 1

Use Data Studio to visualize YouTube titles and aggregated view counts summarized over 30 days and segmented by Country Code in the fewest steps.

In this case, you would use a connector. A filter on Country Code would simply drop data out, not segment it; dimensions describe and group data, so using Country Code as a dimension has the effect of segmenting the report. However, Data Studio includes a feature called segments, which is set separately and used for Google Analytics segments. B is correct because there’s no need to export; you can use the existing YouTube data source. Country Code is a dimension because it’s a string and should be displayed as such, that is, showing all countries instead of filtering.

Preparing for Accountability

Data visualization and reporting tools

The exam tip here is that troubleshooting and improving data quality and processing performance are distributed through all the technologies. Security and troubleshooting are the lateral subjects that cut across all technologies.

Designing for security and compliance 

Security is a broad term that includes privacy, authentication, authorization, and identity and access management. It can also include intrusion detection, attack mitigation, resilience, and recovery.

Outline

Tip: IAM—Understand permissions and custom roles. Under what conditions are custom roles preferred over standard predefined roles?

Tip: Data security, data loss prevention—Cloud DLP allows you to minimize what you collect, store, expose, or copy. Classify or automatically redact sensitive data from text streams before you write to disk, generate logs, or perform analysis.

Be familiar with all these:
● Cloud IAM
● Encryption, Key Management
● Data Loss Prevention API
● HIPAA, COPPA, FedRAMP, GDPR

Tip: A lot of resource administration is presented in the GCP Console, but a lot of runtime information, such as logs and performance, is presented and reported in Stackdriver. Stackdriver provides information for troubleshooting both functional and performance issues.
● Stackdriver

Tip: Establishing standard data quality at ingress by using Cloud Dataprep or by running an ETL pipeline can prevent many problems later in processing that would be difficult to troubleshoot.

Tip: Remember the business purpose of the data processing. How resilient does the application need to be? For example, financial transactions usually cannot be dropped and must not be duplicated, but a statistical analysis might be equally valid if a small amount of data is lost. These assumptions influence the approach to rerunning failed jobs.

Study these:
● Cloud Dataprep
● Fault-tolerance
● Rerunning failed jobs
● Performing retrospective re-analysis

Tip: Where is the official authoritative data (sometimes called the source of truth) and where are the replicas? How frequently does data need to be shared or updated? Can smaller parts of the data be synchronized to reduce costs?

Tip: Where is the data stored? Where is the data going to be processed? Can data storage and data processing be in locations near each other?

Tip: When will the data need to be exported? How difficult and expensive will it be to export data? For example, you might want to store data in a different location or in a different type of storage to meet business requirements for portability.

● Multi-cloud data residency requirements

Practice Exam Questions 5

Question 1

Groups analyst one and analyst two should not have access to each other’s BigQuery data.

That’s because BigQuery access is controlled at the dataset level. So, you can’t lock a user to specific tables in the dataset, but you don’t have to give them access to all the resources in a project either.

Question 2

Provide analyst three secure access to BigQuery query results, but not the underlying tables or datasets.

You need to copy or store the query results in a separate dataset and provide authorization to view and/or use that dataset. The other solutions are not secure.

Secure data infrastructure and legal compliance

Remember that Google Cloud Platform does a lot of security work behind the scenes, so your data solution inherits a lot of that automatically. Here’s an exam tip: know the default behavior of GCP, so you don’t try to duplicate it unnecessarily. For example, a client used disk encryption on their computers in the data center. When they migrated their application to the Cloud, they planned to implement disk encryption again on the VMs, only to discover that the encryption requirement was already met by default on the platform.
