Professional Data Engineer on Google Cloud Platform


Question 21

You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally.

You also want to optimize data for range queries on non-key columns.

What should you do?
Use Cloud SQL for storage. Add secondary indexes to support query patterns.
Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.




Answer is Use Cloud Spanner for storage. Add secondary indexes to support query patterns.

Cloud Spanner scales relational, transactional tables horizontally, and its secondary indexes support efficient range queries on non-key columns.

Reference:
https://cloud.google.com/spanner/docs/secondary-indexes
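
As an illustration, a secondary index can be added with a DDL statement through the Spanner client library and then used in a range query. This is a minimal sketch; the instance, database, table, and column names are assumptions.

```python
# Minimal sketch (assumed instance, database, table, and column names) of
# adding a secondary index to a Spanner table to support range queries on a
# non-key column.
import datetime

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Create the secondary index on the non-key column used in range queries.
operation = database.update_ddl(
    ["CREATE INDEX OrdersByOrderDate ON Orders(OrderDate)"]
)
operation.result()  # wait for the schema change to complete

# Range query that uses the index via a FORCE_INDEX hint.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT OrderId, OrderDate FROM Orders@{FORCE_INDEX=OrdersByOrderDate} "
        "WHERE OrderDate BETWEEN @start AND @end",
        params={
            "start": datetime.date(2024, 1, 1),
            "end": datetime.date(2024, 1, 31),
        },
        param_types={
            "start": spanner.param_types.DATE,
            "end": spanner.param_types.DATE,
        },
    )
    for row in rows:
        print(row)
```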

Question 22

Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.

Which product should they use to store the data?
Cloud Bigtable
Google BigQuery
Google Cloud Storage
Google Cloud Datastore




Answer is Cloud Bigtable

Bigtable is GCP’s managed wide-column database. It is also a good option for migrating on-premises Hadoop HBase databases to a managed service, because Bigtable exposes an HBase-compatible interface, which helps when moving the company's existing Hadoop jobs.

Cloud Bigtable is a wide-column NoSQL database built for high-volume workloads that require low, single-digit-millisecond latency. It is commonly used for IoT, time-series, financial, and similar applications.

Reference:
https://cloud.google.com/blog/products/databases/getting-started-with-time-series-trend-predictions-using-gcp
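
For illustration, time-series points can be written to Bigtable with the Python client, using a row-key design that puts the series identifier and the timestamp in the key so rows for one instrument sort together. The instance, table, column-family, and column names below are assumptions.

```python
# Minimal sketch (assumed instance, table, and column-family names) of writing
# a financial time-series point to Cloud Bigtable with the Python client.
import time

from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("my-instance").table("market-data")

# Row key combines the series identifier with a timestamp so that rows for one
# instrument sort together and support efficient time-range scans.
row_key = f"GOOG#{int(time.time())}".encode("utf-8")
row = table.direct_row(row_key)
row.set_cell("prices", "close", b"172.34")
row.commit()
```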

Question 23

You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?
PigLatin using Pig
HiveQL using Hive
Java using MapReduce
Python using MapReduce




Answer is PigLatin using Pig

Pig is a scripting platform for Hadoop. Its Pig Latin language makes it straightforward to checkpoint intermediate results (STORE) and to split a pipeline into multiple data flows (SPLIT), which would take considerably more boilerplate in hand-written Java or Python MapReduce.

Question 24

You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database and cost to operate is of primary concern.

Which service do you select for storing and serving your data?
Cloud Spanner
Cloud Bigtable
Cloud Firestore
Cloud SQL




Answer is Cloud SQL

Cloud SQL is the managed relational database service (MySQL, PostgreSQL, and SQL Server) and provides up to 624 GB of RAM and 30 TB of data storage, with the option to automatically increase the storage size as needed. A 2 TB database fits comfortably, the application keeps its existing schema and drivers without significant refactoring, and Cloud SQL is much cheaper to operate than Cloud Spanner.

Reference:
https://cloud.google.com/sql/docs/features
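
Because Cloud SQL exposes standard MySQL/PostgreSQL interfaces, the existing application can keep its SQL code and only change its connection setup, for example with the Cloud SQL Python Connector. The instance connection name, credentials, and table below are assumptions.

```python
# Minimal sketch (assumed instance connection name, credentials, and table) of
# connecting to a Cloud SQL for MySQL instance with the Cloud SQL Python
# Connector and a standard PyMySQL driver -- no application refactoring needed.
from google.cloud.sql.connector import Connector

connector = Connector()
conn = connector.connect(
    "my-project:us-central1:my-instance",  # instance connection name
    "pymysql",
    user="app_user",
    password="change-me",
    db="orders",
)

with conn.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM orders")
    print(cursor.fetchone())

conn.close()
connector.close()
```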

Question 25

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis.

Which job type and transforms should this pipeline use?
Batch job, PubSubIO, side-inputs
Streaming job, PubSubIO, JdbcIO, side-outputs
Streaming job, PubSubIO, BigQueryIO, side-inputs
Streaming job, PubSubIO, BigQueryIO, side-outputs




Answer is Streaming job, PubSubIO, BigQueryIO, side-inputs

PubSubIO reads the incoming messages (which makes this a streaming job), and BigQueryIO writes the enriched results back to BigQuery. Because the reference data is small enough to fit in memory on a single worker, it can be broadcast to every worker as a side input and joined against each incoming element.

Reference:
https://cloud.google.com/architecture/e-commerce/patterns/slow-updating-side-inputs
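
A minimal sketch of such a pipeline with the Beam Python SDK is below; the project, topic, dataset, table, and field names are assumptions, and the output table is assumed to already exist.

```python
# Minimal sketch (assumed topic, dataset, table, and field names) of a
# streaming Beam pipeline that enriches Pub/Sub messages with a small BigQuery
# reference table passed as a side input.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def enrich(message, reference):
    record = json.loads(message.decode("utf-8"))
    record["category"] = reference.get(record["sku"], "unknown")  # static lookup
    return record


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Small, static reference data: read once and broadcast to all workers.
    reference = (
        p
        | "ReadReference" >> beam.io.ReadFromBigQuery(
            query="SELECT sku, category FROM `my_project.my_dataset.reference`",
            use_standard_sql=True)
        | "ToKV" >> beam.Map(lambda row: (row["sku"], row["category"]))
    )

    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my_project/topics/events")
        | "Enrich" >> beam.Map(enrich, reference=beam.pvalue.AsDict(reference))
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.enriched",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```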

Question 26

You want to analyze hundreds of thousands of social media posts daily at the lowest cost and with the fewest steps.
You have the following requirements:
- You will batch-load the posts once per day and run them through the Cloud Natural Language API.
- You will extract topics and sentiment from the posts.
- You must store the raw posts for archiving and reprocessing.
- You will create dashboards to be shared with people both inside and outside your organization.

You need to store both the data extracted from the API to perform analysis as well as the raw social media posts for historical archiving.

What should you do?
Store the social media posts and the data extracted from the API in BigQuery.
Store the social media posts and the data extracted from the API in Cloud SQL.
Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery.
Feed the social media posts into the API directly from the source, and write the extracted data from the API into BigQuery.




Answer is Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery.

Raw social media posts can include images and videos, which are not suitable for BigQuery storage; Cloud Storage handles them and also covers the archiving and reprocessing requirement, while the structured data extracted from the API goes into BigQuery for analysis and dashboards.
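
A hedged sketch of one step of the daily batch: read a raw post from Cloud Storage, call the Cloud Natural Language API, and write only the extracted sentiment into BigQuery. The bucket, object, project, dataset, and table names are assumptions.

```python
# Minimal sketch (assumed bucket, object, and table names) of the daily batch:
# raw posts stay archived in Cloud Storage, and only the extracted results are
# written to BigQuery for analysis and dashboards.
from google.cloud import bigquery, language_v1, storage

# Read one raw post that was batch-loaded into the archive bucket.
post_text = (
    storage.Client()
    .bucket("raw-social-posts")
    .blob("2024-05-01/post-00001.txt")
    .download_as_text()
)

# Extract sentiment with the Cloud Natural Language API.
nl_client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content=post_text, type_=language_v1.Document.Type.PLAIN_TEXT
)
sentiment = nl_client.analyze_sentiment(
    request={"document": document}
).document_sentiment

# Store only the structured results in BigQuery (table assumed to exist).
bq_client = bigquery.Client()
errors = bq_client.insert_rows_json(
    "my_project.social.post_sentiment",
    [{
        "post_uri": "gs://raw-social-posts/2024-05-01/post-00001.txt",
        "score": sentiment.score,
        "magnitude": sentiment.magnitude,
    }],
)
assert not errors
```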

Question 27

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day.

Which tool should you use?
cron
Cloud Composer
Cloud Scheduler
Workflow Templates on Cloud Dataproc




Answer is Cloud Composer

Cloud Composer is a managed Apache Airflow service, which makes it a good fit for orchestrating interdependent Dataproc and Dataflow jobs on a daily schedule. Cloud Scheduler, by contrast, is just a managed cron service and cannot express dependencies between steps, and Dataproc Workflow Templates cannot orchestrate the Dataflow jobs.

Reference:
https://stackoverflow.com/questions/59841146/cloud-composer-vs-cloud-scheduler
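
To illustrate, a Composer (Airflow) DAG can chain a Dataproc job and a Dataflow template run with explicit dependencies and a daily schedule. The project, region, cluster, bucket, and template paths below are assumptions, and operator arguments vary by Airflow provider version.

```python
# Minimal sketch (assumed project, region, cluster, and template paths;
# operator arguments vary by provider version) of a daily Composer/Airflow DAG
# that runs a Dataproc job followed by a Dataflow template job.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

with DAG(
    dag_id="daily_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prepare = DataprocSubmitJobOperator(
        task_id="dataproc_prepare",
        project_id="my-project",
        region="us-central1",
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/prepare.py"},
        },
    )

    load = DataflowTemplatedJobStartOperator(
        task_id="dataflow_load",
        project_id="my-project",
        location="us-central1",
        template="gs://my-bucket/templates/load_to_bq",
        parameters={"inputDir": "gs://my-bucket/staging/"},
    )

    prepare >> load  # Dataflow job only runs after the Dataproc job succeeds
```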

Question 28

You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards, but the scanners currently transmit recipients' personally identifiable information (PII) to analytics systems, which violates user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems.

What should you do?
Create an authorized view in BigQuery to restrict access to tables with sensitive data.
Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
Use Stackdriver logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.




Answer is Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.

Protection of sensitive data, like personally identifiable information (PII), is critical to your business. The Cloud Data Loss Prevention (DLP) API is the managed service for detecting and de-identifying PII, and it can be applied to migrations, batch data workloads, and real-time data collection and processing, which is what the Cloud Function does here before passing or quarantining each message.

Reference:
https://cloud.google.com/dlp
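
A hedged sketch of the Cloud Function's core logic: inspect each incoming Pub/Sub message with the Cloud DLP API and route it based on whether PII findings come back. The info types, likelihood threshold, project, and bucket names are assumptions.

```python
# Minimal sketch (assumed info types, likelihood threshold, and bucket names)
# of a Pub/Sub-triggered Cloud Function handler that inspects a message with
# the Cloud DLP API and quarantines it when likely PII is detected.
import base64

from google.cloud import dlp_v2, storage

dlp = dlp_v2.DlpServiceClient()
gcs = storage.Client()

INSPECT_CONFIG = {
    "info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "PERSON_NAME"},
    ],
    "min_likelihood": dlp_v2.Likelihood.LIKELY,
}


def handle_message(event, context):
    """Triggered by a Pub/Sub message containing scanner data."""
    text = base64.b64decode(event["data"]).decode("utf-8")

    response = dlp.inspect_content(
        request={
            "parent": "projects/my-project",
            "inspect_config": INSPECT_CONFIG,
            "item": {"value": text},
        }
    )

    # Quarantine for review if any likely PII finding was returned;
    # otherwise pass the record through to the analytics bucket.
    bucket_name = "quarantine-for-review" if response.result.findings else "clean-data"
    gcs.bucket(bucket_name).blob(f"{context.event_id}.txt").upload_from_string(text)
```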

Question 29

You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems.

Which solutions should you choose?
Cloud Speech-to-Text API
Cloud Natural Language API
Dialogflow Enterprise Edition
Cloud AutoML Natural Language




Answer is Dialogflow Enterprise Edition

Dialogflow is the right fit because the solution must both recognize the customer's spoken commands and interpret the intent behind them so an order can be issued to the backend. Speech-to-Text alone only transcribes audio, and the Natural Language API alone does not manage conversational intents or fulfillment.

Reference:
https://cloud.google.com/blog/products/gcp/introducing-dialogflow-enterprise-edition-a-new-way-to-build-voice-and-text-conversational-apps
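
For illustration, once the assistant platform captures the utterance, the backend can detect its intent with the Dialogflow API and then place the order. The project ID, session ID, language code, and intent name below are assumptions.

```python
# Minimal sketch (assumed project ID, session ID, and language code) of
# detecting the intent of a customer utterance with the Dialogflow API so the
# backend can issue the corresponding order.
from google.cloud import dialogflow

sessions = dialogflow.SessionsClient()
session = sessions.session_path("my-project", "customer-session-123")

query_input = dialogflow.QueryInput(
    text=dialogflow.TextInput(
        text="Reorder my usual coffee beans", language_code="en-US"
    )
)

response = sessions.detect_intent(
    request={"session": session, "query_input": query_input}
)

result = response.query_result
print(result.intent.display_name)  # matched intent, e.g. an assumed "place.order"
print(result.fulfillment_text)     # response text configured in the agent
```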

Question 30

You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?
Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.




Answer is Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.

Cloud Pub/Sub provides scalable, at-least-once message ingestion, and Cloud Dataflow autoscales as load increases and supports windowing, so messages can be processed and ordered within one-hour windows. Kafka on Compute Engine and Dataproc would leave scaling to you.
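
As a hedged sketch, the "ordered within windows of 1 hour" requirement maps to fixed windows in the Beam pipeline running on Dataflow, with elements sorted per key inside each window after grouping. The topic, key, and field names are assumptions.

```python
# Minimal sketch (assumed topic and field names) of a streaming Beam pipeline
# on Dataflow that ingests from Pub/Sub, groups events into 1-hour fixed
# windows, and sorts each window's events by timestamp before processing.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(json.loads)
        | "KeyByDevice" >> beam.Map(lambda e: (e["device_id"], e))
        | "HourWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))
        | "GroupPerWindow" >> beam.GroupByKey()
        # Ordering is applied per key within each 1-hour window.
        | "SortByEventTime" >> beam.MapTuple(
            lambda key, events: (key, sorted(events, key=lambda e: e["timestamp"])))
        | "Emit" >> beam.Map(print)
    )
```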

