Professional Data Engineer on Google Cloud Platform

Question 101

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?
Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user.
In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.
In Cloud SQL, with separate database user names to each user. The Cloud SQL Admin activity logs will be used to provide the auditability.
In a bucket on Cloud Storage that is accessible only by an AppEngine service that collects user information and logs the access before providing a link to the bucket.




Answer is In a bucket on Cloud Storage that is accessible only by an AppEngine service that collects user information and logs the access before providing a link to the bucket.

BigQuery is not the best option here for storing the auditable record of access to this data, because to comply with the government regulations businesses may have to keep the records for up to 7 years. Cloud Storage is cheaper for file storage, and the files can still be registered as an external (federated) source and queried through BigQuery.
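As a rough illustration of the federated-query point above, a Cloud Storage file can be registered as an external BigQuery table and queried in place. A minimal sketch with the google-cloud-bigquery Python client; the project, dataset, bucket, and file names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical destination table and GCS path.
    table_id = "my-project.audit_dataset.access_records"

    # Define an external (federated) table backed by CSV files in Cloud Storage.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-audit-bucket/records/*.csv"]
    external_config.autodetect = True

    table = bigquery.Table(table_id)
    table.external_data_configuration = external_config
    client.create_table(table)  # the data itself stays in Cloud Storage

    # Query the external table like any other table.
    rows = client.query(f"SELECT COUNT(*) AS n FROM `{table_id}`").result()
    print(list(rows)[0].n)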

Question 102

After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You've loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.

What should you do?
Select random samples from the tables using the RAND() function and compare the samples.
Select random samples from the tables using the HASH() function and compare the samples.
Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
Create stratified random samples using the OVER() function and compare equivalent samples from each table.




Answer is Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.

Using Cloud Storage with big data

Cloud Storage is a key part of storing and working with Big Data on Google Cloud. Examples include:

Loading data into BigQuery.
Using Dataproc, which automatically installs the HDFS-compatible Cloud Storage connector, enabling the use of Cloud Storage buckets in parallel with HDFS.
Using a bucket to hold staging files and temporary data for Dataflow pipelines.
For Dataflow, a Cloud Storage bucket is required. For BigQuery and Dataproc, using a Cloud Storage bucket is optional but recommended.
gsutil is a command-line tool that enables you to work with Cloud Storage buckets and objects easily and robustly, in particular in big data scenarios. For example, with gsutil you can copy many files in parallel with a single command, copy large files efficiently, calculate checksums on your data, and measure performance from your local computer to Cloud Storage.
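As a rough sketch of the comparison approach chosen for Question 102: read both tables into Spark on Dataproc, drop timestamp columns, hash each row, sort the hashes, and reduce each table to a single digest. This uses the spark-bigquery connector rather than the older BigQuery Hadoop connector named in the answer, and the project, dataset, and table names are hypothetical.

    import hashlib
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("compare-etl-outputs").getOrCreate()

    def table_digest(table):
        """Reduce a BigQuery table to a single order-independent digest."""
        df = spark.read.format("bigquery").option("table", table).load()
        # Keep only non-timestamp columns, then hash each row.
        cols = [c for c, t in df.dtypes if t != "timestamp"]
        hashed = df.select(
            F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256).alias("h")
        )
        # Sort the per-row hashes and fold them into one digest on the driver
        # (fine for a sketch; a distributed aggregation would scale better).
        row_hashes = [r.h for r in hashed.orderBy("h").collect()]
        return hashlib.sha256("".join(row_hashes).encode()).hexdigest()

    original = table_digest("my-project.etl.original_output")  # hypothetical table IDs
    migrated = table_digest("my-project.etl.migrated_output")
    print("identical" if original == migrated else "different")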

Question 103

You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their query and you need to correct this. You'd like to avoid introducing new projects to your account.

What should you do?
Convert your batch BQ queries into interactive BQ queries.
Create an additional project to overcome the 2K on-demand per-project quota.
Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
Increase the amount of concurrent slots per project at the Quotas page at the Cloud Console.




Answer is Switch to flat-rate pricing and establish a hierarchical priority model for your projects.

With flat-rate pricing you purchase dedicated slot capacity, so queries are no longer subject to the 2K on-demand per-project slot quota, and reservations let you set up a hierarchical priority model across business units without creating new projects.

Reference:
https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery
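Flat-rate (capacity-based) pricing is managed through BigQuery Reservations: you buy slot capacity, carve it into reservations per business unit, and assign projects to them. A minimal sketch with the google-cloud-bigquery-reservation Python client, assuming a capacity commitment has already been purchased; the project, location, and reservation names are hypothetical.

    from google.cloud import bigquery_reservation_v1 as br

    client = br.ReservationServiceClient()
    parent = "projects/admin-project/locations/US"  # hypothetical admin project

    # Carve a reservation for one business unit out of the purchased slot capacity.
    reservation = client.create_reservation(
        parent=parent,
        reservation_id="marketing-bu",
        reservation=br.Reservation(slot_capacity=500, ignore_idle_slots=False),
    )

    # Assign that business unit's project to the reservation; its queries now
    # run on dedicated slots instead of the shared on-demand pool.
    client.create_assignment(
        parent=reservation.name,
        assignment=br.Assignment(
            job_type=br.Assignment.JobType.QUERY,
            assignee="projects/marketing-project",  # hypothetical
        ),
    )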

Question 104

You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.

What should you do?
Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.
Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.




Answer is Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.

The question specifically asks for mirroring and for avoiding deployment of Kafka Connect plugins; the other options all rely on the Pub/Sub Kafka connector, which is a Kafka Connect plugin.

Reference:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330
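A minimal sketch of the final step (a Dataflow job reading from the mirrored Kafka cluster on GCE and writing to GCS), using the Beam Python SDK; the broker address, topic, bucket, and project names are hypothetical, and ReadFromKafka is a cross-language transform that needs a Java expansion environment on the runner.

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner", project="my-project",
        region="us-central1", temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromMirror" >> ReadFromKafka(
               consumer_config={"bootstrap.servers": "kafka-mirror-vm:9092"},  # GCE mirror
               topics=["web-app-logs"],
               max_num_records=100000)  # bounded read for this sketch; drop for streaming
         | "ValueToText" >> beam.Map(lambda kv: kv[1].decode("utf-8"))
         | "WriteToGCS" >> beam.io.WriteToText(
               "gs://my-bucket/kafka-logs/logs", file_name_suffix=".txt"))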

Question 105

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve reliability of the pipeline (incl. being able to reprocess all failing data).

What should you do?
Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
Add a try... catch block to your DoFn that transforms the data, extract erroneous rows from logs.
Add a try... catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.
Add a try... catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.




Answer is Add a try... catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.

The erroneous records are written directly to Pub/Sub from within the DoFn (or its Python equivalent). Routing them through a side output does not save any work: each failing record still has to be extracted and published individually, so you may as well publish it from the DoFn itself. A whole PCollection can, however, be written to BigQuery as a dead-letter table, as explained in the reference below.

Reference:
https://medium.com/google-cloud/dead-letter-queues-simple-implementation-strategy-for-cloud-pub-sub-80adf4a4a800
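A minimal sketch of the chosen pattern in the Beam Python SDK, with a hypothetical transform and topic name: the DoFn wraps its logic in try/except and publishes failing records straight to a Pub/Sub dead-letter topic so they can be reprocessed later.

    import json
    import apache_beam as beam
    from google.cloud import pubsub_v1

    class TransformWithDeadLetter(beam.DoFn):
        """Transform rows; publish rows that fail to a Pub/Sub dead-letter topic."""

        def __init__(self, project, topic):
            self.topic_path = f"projects/{project}/topics/{topic}"

        def setup(self):
            # One publisher client per worker.
            self.publisher = pubsub_v1.PublisherClient()

        def process(self, row):
            try:
                yield self.transform(row)  # hypothetical business logic
            except Exception as err:
                payload = json.dumps({"row": row, "error": str(err)}).encode("utf-8")
                self.publisher.publish(self.topic_path, payload)  # dead-letter the record

        def transform(self, row):
            # Placeholder for the real transformation.
            return {"value": int(row["value"])}

In the pipeline this would be applied as, for example, beam.ParDo(TransformWithDeadLetter("my-project", "etl-dead-letter")).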

Question 106

You're using Bigtable for a real-time application, and you have a heavy load that is a mix of reads and writes. You've recently identified an additional use case and need to perform an hourly analytical job to calculate certain statistics across the whole database. You need to ensure the reliability of both your production application and the analytical workload.

What should you do?
Export Bigtable dump to GCS and run your analytical job on top of the exported files.
Add a second cluster to an existing instance with a multi-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.
Add a second cluster to an existing instance with a single-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.
Increase the size of your existing cluster twice and execute your analytics workload on your new resized cluster.




Answer is Add a second cluster to an existing instance with a single-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.

You need to run an hourly batch job against a cluster that already carries a heavy production workload. In such cases the best option is to add a replicated cluster and use single-cluster routing: the original cluster continues to serve the live reads and writes, while the second cluster absorbs the analytical workload without impacting production. Multi-cluster routing is useful when high availability is the goal, but here the requirement is only to isolate the analytical workload from the existing cluster.

Reference:
https://cloud.google.com/bigtable/docs/replication-overview#use-cases
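A minimal sketch with the google-cloud-bigtable Python client, assuming the instance already has a second cluster; the instance, cluster, and table names are hypothetical. The app profile uses single-cluster routing, so the hourly analytics job only ever touches the analytics cluster.

    from google.cloud import bigtable
    from google.cloud.bigtable import enums

    client = bigtable.Client(project="my-project", admin=True)
    instance = client.instance("prod-instance")  # already replicated to a second cluster

    # App profile that pins the analytics job to the analytics cluster only.
    analytics_profile = instance.app_profile(
        app_profile_id="batch-analytics",
        routing_policy_type=enums.RoutingPolicyType.SINGLE,
        cluster_id="analytics-cluster",
        allow_transactional_writes=False,
    )
    analytics_profile.create(ignore_warnings=True)

    # The analytics job opens the table through that profile, so its scans
    # never compete with the cluster serving the live application.
    table = instance.table("events", app_profile_id="batch-analytics")
    for row in table.read_rows():
        pass  # compute the hourly statistics here

A second app profile (e.g. "live-traffic") pinned to the original cluster would be used by the production application in the same way.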

Question 107

You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.

What should you do?
Use Cloud Dataflow with Beam to detect errors and perform transformations.
Use Cloud Dataprep with recipes to detect errors and perform transformations.
Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.
Use federated tables in BigQuery with queries to detect errors and perform transformations.




Answer is Use Cloud Dataprep with recipes to detect errors and perform transformations.

The key requirement here is that no programming or SQL knowledge is needed; Cloud Dataprep recipes provide a visual way to detect invalid entries and apply transformations.

Question 108

Your company needs to upload their historic data to Cloud Storage. The security rules don't allow access from external IPs to their on-premises resources. After an initial upload, they will add new data from existing on-premises applications every day.

What should they do?
Execute gsutil rsync from the on-premises servers.
Use Cloud Dataflow and write the data to Cloud Storage.
Write a job template in Cloud Dataproc to perform the data transfer.
Install an FTP server on a Compute Engine VM to receive the files and move them to Cloud Storage.




Answer is Execute gsutil rsync from the on-premises servers.

Dataflow runs in Google Cloud, so it would have to reach the on-premises resources from external IPs, which the security rules forbid. Running gsutil rsync from the on-premises servers is an outbound-only connection, and it handles both the initial upload and the daily incremental additions.

Reference:
https://cloud.google.com/solutions/migration-to-google-cloud-transferring-your-large-datasets#options_available_from_google
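A minimal sketch of the daily incremental upload, intended to run on the on-premises server itself (outbound connection only, so no external IP needs to reach on-prem); the local path and bucket name are hypothetical, and this simply shells out to gsutil, which is assumed to be installed and authenticated.

    import subprocess

    # Run on the on-premises server, e.g. from cron once a day.
    # -m parallelizes the transfer; rsync -r copies only new or changed files.
    subprocess.run(
        ["gsutil", "-m", "rsync", "-r", "/srv/app-exports", "gs://my-historic-data/exports"],
        check=True,
    )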

Question 109

You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient.

What should you do?
Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.




Answer is Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.

The records are sensitive patient data, so the two options that expose the file through a public URL are ruled out. The choice then comes down to uploading with gsutil or shipping a Transfer Appliance, and this is where it gets tricky (see how to choose a Transfer Appliance: https://cloud.google.com/transfer-appliance/docs/2.0/overview).
Without knowing the available bandwidth, there is no way to be sure a 10 TB upload would complete within the 7 days Google recommends, so the safest and most time-efficient approach is the Transfer Appliance.
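Once the appliance data has been rehydrated into a Cloud Storage bucket, the final load step looks roughly like this (shown with the google-cloud-bigquery Python client rather than the web UI; the bucket, dataset, and table names are hypothetical).

    from google.cloud import bigquery

    client = bigquery.Client()

    # Avro is self-describing, so no schema needs to be supplied.
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

    load_job = client.load_table_from_uri(
        "gs://patient-records-staging/records/*.avro",  # hypothetical staging bucket
        "my-project.clinical.patient_records",          # hypothetical destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish
    table = client.get_table("my-project.clinical.patient_records")
    print(f"Loaded {table.num_rows} rows")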

Question 110

You used Cloud Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes.

What should you do?
Create a cron schedule in Cloud Dataprep.
Create an App Engine cron job to schedule the execution of the Cloud Dataprep job.
Export the recipe as a Cloud Dataprep template, and create a job in Cloud Scheduler.
Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.




Answer is Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.

A Cloud Dataprep job can be exported as a Cloud Dataflow template and run on Dataflow, and Cloud Composer can express the dependency on the load job so the recipe runs only after that variable-length job completes.

Reference:
https://cloud.google.com/blog/products/data-analytics/how-to-orchestrate-cloud-dataprep-jobs-using-cloud-composer
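A minimal sketch of the Composer (Airflow) side, assuming a recent Airflow environment and that the Dataprep flow has already been exported as a Dataflow template in GCS; the template path, parameters, and the upstream "load" task are hypothetical placeholders for however the variable-length load job is modeled in the DAG.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.providers.google.cloud.operators.dataflow import (
        DataflowTemplatedJobStartOperator,
    )

    with DAG("daily_dataprep_recipe", start_date=datetime(2024, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:

        # Placeholder for the daily load job with variable execution time.
        load_done = EmptyOperator(task_id="wait_for_daily_load")

        run_recipe = DataflowTemplatedJobStartOperator(
            task_id="run_exported_dataprep_recipe",
            template="gs://my-templates/dataprep/cleanup_recipe",  # hypothetical
            project_id="my-project",
            location="us-central1",
            parameters={"inputTable": "my-project:raw.daily_upload"},  # hypothetical
        )

        load_done >> run_recipe  # the recipe runs only after the load completes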
