Professional Data Engineer on Google Cloud Platform
278 questions in total
Question 41
You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud. You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service.
What should you do?
Deploy a Cloud Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
Deploy a Cloud Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://
Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://
Answer is Deploy a Cloud Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
Cloud Dataproc is the managed, cloud-native service for Hadoop workloads, and standard (HDD) persistent disks with preemptible workers keep the cluster cost-effective.
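For illustration, a minimal PySpark sketch of the script change once the data sits in Cloud Storage; the bucket name and paths are placeholders, and the only real change to the job is the URI scheme:

# Minimal sketch: after migration, only the storage URIs change from hdfs:// to gs://.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-batch").getOrCreate()

# Before migration: spark.read.csv("hdfs:///data/events/*.csv", header=True)
events = spark.read.csv("gs://example-bucket/data/events/*.csv", header=True)

(events.groupBy("event_type")
       .count()
       .write.mode("overwrite")
       .parquet("gs://example-bucket/output/event_counts/"))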
Question 42
You need to choose a database for a new project that has the following requirements:
- Fully managed
- Able to automatically scale up
- Transactionally consistent
- Able to scale up to 6 TB
- Able to be queried using SQL
Which database do you choose?
Cloud SQL
Cloud Bigtable
Cloud Spanner
Cloud Datastore
Answer is Cloud SQL
The requirements call for scaling up, which Cloud SQL supports; it is horizontal scaling (scaling out) that Cloud SQL cannot do. With automatic storage increase up to 30 TB, it also covers the 6 TB requirement.
Automatic storage increase
If you enable this setting, Cloud SQL checks your available storage every 30 seconds. If the available storage falls below a threshold size, Cloud SQL automatically adds additional storage capacity. If the available storage repeatedly falls below the threshold size, Cloud SQL continues to add storage until it reaches the maximum of 30 TB.
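As a rough sketch (not the only way to do it), the setting can also be toggled through the Cloud SQL Admin API with the google-api-python-client library; the project and instance names below are placeholders:

# Rough sketch: enable automatic storage increase on an existing instance.
# Requires Application Default Credentials with Cloud SQL Admin permissions.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")
body = {"settings": {"storageAutoResize": True}}
operation = sqladmin.instances().patch(
    project="example-project", instance="example-instance", body=body).execute()
print(operation["status"])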
Question 43
What are two of the benefits of using denormalized data structures in BigQuery?
A. Reduces the amount of data processed, reduces the amount of storage required
B. Increases query speed, makes queries simpler
C. Reduces the amount of storage required, increases query speed
D. Reduces the amount of data processed, increases query speed
Answer is Increases query speed, makes queries simpler
Cannot be A or C because:
"Denormalized schemas aren't storage-optimal, but BigQuery's low cost of storage addresses concerns about storage inefficiency."
Cannot be D because the amount of data processed is the same.
As for why it is "simpler", I don't see it stated directly, but it is hinted at: "Expressing records by using nested and repeated fields simplifies data load using JSON or Avro files." and "Expressing records using nested and repeated structures can provide a more natural representation of the underlying data."
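A short sketch of what querying such a denormalized table looks like with the BigQuery Python client; the project, dataset, table, and the repeated RECORD column named items are hypothetical:

# Sketch: query a denormalized orders table whose line items live in a
# repeated RECORD column called "items" (all names are placeholders).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT order_id, item.sku, item.quantity
    FROM `example-project.sales.orders`, UNNEST(items) AS item
    WHERE item.quantity > 1
"""
for row in client.query(query).result():
    print(row.order_id, row.sku, row.quantity)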
Question 44
Which of the following are feature engineering techniques? (Select 2 answers)
A. Hidden feature layers
B. Feature prioritization
C. Crossed feature columns
D. Bucketization of a continuous feature
Answers are:
C. Crossed feature columns
D. Bucketization of a continuous feature
Selecting and crafting the right set of feature columns is key to learning an effective model.
Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into.
Using each base feature column separately may not be enough to explain the data. To learn the differences between different feature combinations, we can add crossed feature columns to the model.
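A short sketch of both techniques using the TensorFlow feature-column API; the feature names, vocabulary, and bucket boundaries are made up for illustration:

import tensorflow as tf

# Bucketization: turn a continuous feature into a categorical one by binning it.
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 35, 50, 65])

# Feature cross: combine two categorical columns into a single crossed feature.
country = tf.feature_column.categorical_column_with_vocabulary_list(
    "country", ["US", "CA", "MX"])
age_x_country = tf.feature_column.crossed_column(
    [age_buckets, country], hash_bucket_size=1000)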
Question 45
You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?
Both batch and streaming
BigQuery cannot be used as a sink
Only batch
Only streaming
Answer is Both batch and streaming
When you apply a BigQueryIO.Write transform in batch mode to write to a single table, Dataflow invokes a BigQuery load job. When you apply a BigQueryIO.Write transform in streaming mode or in batch mode using a function to specify the destination table, Dataflow uses BigQuery's streaming inserts.
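The quote describes the Java BigQueryIO.Write transform; the Beam Python SDK's counterpart is WriteToBigQuery. A minimal batch-mode sketch, with placeholder bucket, project, dataset, and table names:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: read newline-delimited JSON from Cloud Storage and write to BigQuery.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://example-bucket/events/*.json")
     | "Parse" >> beam.Map(json.loads)
     | "Write" >> beam.io.WriteToBigQuery(
           "example-project:analytics.events",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))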
Question 46
You have a job that you want to cancel. It is a streaming pipeline, and you want to ensure that any data that is in-flight is processed and written to the output.
Which of the following commands can you use on the Dataflow monitoring console to stop the pipeline job?
Cancel
Drain
Stop
Finish
Answer is Drain
Using the Drain option to stop your job tells the Dataflow service to finish your job in its current state. Your job will immediately stop ingesting new data from input sources, but the Dataflow service will preserve any existing resources (such as worker instances) to finish processing and writing any buffered data in your pipeline.
Question 47
Which of the following statements is NOT true regarding Bigtable access roles?
Using IAM roles, you cannot give a user access to only one table in a project, rather than all tables in a project.
To give a user access to only one table in a project, grant the user the Bigtable Editor role for that table.
You can configure access control only at the project level.
To give a user access to only one table in a project, you must configure access through your application.
Answer is To give a user access to only one table in a project, grant the user the Bigtable Editor role for that table.
For Cloud Bigtable, you can configure access control at the project level. For example, you can grant the ability to: Read from, but not write to, any table within the project. Read from and write to any table within the project, but not manage instances. Read from and write to any table within the project, and manage instances.
Question 48
What is the general recommendation when designing your row keys for a Cloud Bigtable schema?
Include multiple time series values within the row key
Keep the row key as an 8-bit integer
Keep your row key reasonably short
Keep your row key as long as the field permits
Answer is Keep your row key reasonably short
A general guide is to keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Cloud Bigtable server.
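For example, a short, human-readable key such as an entity id plus a coarse timestamp is usually enough. A minimal write sketch with the Bigtable Python client, where the project, instance, table, and column family names are placeholders:

from google.cloud import bigtable

# Sketch: write one row using a short, human-readable row key.
client = bigtable.Client(project="example-project")
table = client.instance("example-instance").table("sensor-metrics")

row_key = b"sensor-42#20240101-0000"   # entity id + coarse timestamp
row = table.direct_row(row_key)
row.set_cell("readings", "temp_c", "21.5")
row.commit()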
Question 49
All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Cloud Bigtable node.
before
after
only if
once
Answer is before
In a Cloud Bigtable architecture all client requests go through a front-end server before they are sent to a Cloud Bigtable node. The nodes are organized into a Cloud Bigtable cluster, which belongs to a Cloud Bigtable instance, which is a container for the cluster. Each node in the cluster handles a subset of the requests to the cluster. When additional nodes are added to a cluster, you can increase the number of simultaneous requests that the cluster can handle, as well as the maximum throughput for the entire cluster.