Professional Data Engineer on Google Cloud Platform
Question 21
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally.
You also want to optimize data for range queries on non-key columns.
What should you do?
Use Cloud SQL for storage. Add secondary indexes to support query patterns.
Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
Answer is Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
Cloud Spanner provides transactions that scale horizontally, and secondary indexes can be added on non-key columns to optimize range queries.
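As an illustration only, a minimal sketch of adding such an index with the Spanner Python client; the project, instance, database, table, and column names (Orders, OrderDate) are hypothetical:

    import datetime
    from google.cloud import spanner

    # Hypothetical project, instance, and database names.
    client = spanner.Client(project="my-project")
    database = client.instance("orders-instance").database("orders-db")

    # Secondary index on a non-key column so range scans do not need a full table scan.
    operation = database.update_ddl(
        ["CREATE INDEX OrdersByOrderDate ON Orders(OrderDate)"]
    )
    operation.result(300)  # wait for the schema change to complete

    # Range query on the non-key column, forcing the new index.
    with database.snapshot() as snapshot:
        results = snapshot.execute_sql(
            "SELECT OrderId, OrderDate "
            "FROM Orders@{FORCE_INDEX=OrdersByOrderDate} "
            "WHERE OrderDate BETWEEN @start AND @end",
            params={"start": datetime.date(2024, 1, 1), "end": datetime.date(2024, 1, 31)},
            param_types={"start": spanner.param_types.DATE, "end": spanner.param_types.DATE},
        )
        for row in results:
            print(row)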
Question 22
Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.
Which product should they use to store the data?
Cloud Bigtable
Google BigQuery
Google Cloud Storage
Google Cloud Datastore
Answer is Cloud Bigtable
Bigtable is GCP's managed wide-column database. It is also a good option for migrating on-premises Hadoop HBase databases to a managed database because Bigtable has an HBase interface.
Cloud Bigtable is a wide-column NoSQL database used for high-volume databases that require low millisecond (ms) latency. Cloud Bigtable is used for IoT, time-series, finance, and similar applications.
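For illustration, a minimal sketch of writing one time-series point with the Bigtable Python client; the project, instance, table, column family, and row-key scheme are hypothetical:

    import datetime
    from google.cloud import bigtable

    # Hypothetical names; the row key puts the series ID first and the timestamp last,
    # a common Bigtable pattern for time-series data.
    client = bigtable.Client(project="my-project", admin=False)
    table = client.instance("timeseries-instance").table("market-ticks")

    row_key = b"EURUSD#2024-06-01T09:30:00Z"
    row = table.direct_row(row_key)
    row.set_cell(
        "quotes",                 # column family
        "price",                  # column qualifier
        b"1.1042",                # values are stored as bytes
        timestamp=datetime.datetime.now(datetime.timezone.utc),
    )
    row.commit()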
Question 23
You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?
PigLatin using Pig
HiveQL using Hive
Java using MapReduce
Python using MapReduce
Answer is PigLatin using Pig
Pig Latin is a data-flow scripting language for Hadoop. It supports splitting pipelines (the SPLIT operator) and checkpointing intermediate results (STORE), which makes it a better fit here than HiveQL or hand-written MapReduce.
Question 24
You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database and cost to operate is of primary concern.
Which service do you select for storing and serving your data?
Cloud Spanner
Cloud Bigtable
Cloud Firestore
Cloud SQL
Answer is Cloud SQL
Cloud SQL supports MySQL 5.6 or 5.7, and provides up to 624 GB of RAM and 30 TB of data storage, with the option to automatically increase the storage size as needed, so a 2 TB database fits comfortably. Because the application cannot be significantly refactored and operating cost is the primary concern, a MySQL-compatible Cloud SQL instance is a better fit than the more expensive Cloud Spanner.
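As a hedged illustration of why no refactoring is needed: the migrated application keeps using an ordinary MySQL driver, simply pointed at the Cloud SQL instance. The host, credentials, and schema below are hypothetical:

    import pymysql

    # Connect to Cloud SQL exactly as to any MySQL server: via its private IP,
    # or via 127.0.0.1 when the Cloud SQL Auth Proxy runs alongside the app.
    conn = pymysql.connect(
        host="10.1.2.3",
        user="app_user",
        password="app_password",
        database="orders",
    )

    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM customers")
        print(cur.fetchone())

    conn.close()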
Question 25
You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis.
Which job type and transforms should this pipeline use?
Batch job, PubSubIO, side-inputs
Streaming job, PubSubIO, JdbcIO, side-outputs
Streaming job, PubSubIO, BigQueryIO, side-inputs
Streaming job, PubSubIO, BigQueryIO, side-outputs
Answer is Streaming job, PubSubIO, BigQueryIO, side-inputs
Enriching Pub/Sub messages requires a streaming job. PubSubIO reads the stream, BigQueryIO reads the reference table and writes the enriched results back to BigQuery, and because the reference data fits in memory on a single worker it can be passed to the enrichment transform as a side input.
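A minimal Apache Beam (Python) sketch of this pattern, with hypothetical subscription, dataset, and field names; the small BigQuery reference table is turned into a dict and handed to the enrichment step as a side input:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(streaming=True)  # plus runner/project/temp_location for Dataflow

    with beam.Pipeline(options=opts) as p:
        # Bounded read of the small reference table from BigQuery.
        ref = (
            p
            | "ReadRef" >> beam.io.ReadFromBigQuery(
                query="SELECT sku, category FROM `my-project.my_ds.ref_table`",
                use_standard_sql=True)
            | "ToKV" >> beam.Map(lambda r: (r["sku"], r["category"]))
        )

        # Unbounded stream of events, enriched via a side input and written back to BigQuery.
        (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")
            | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
            | "Enrich" >> beam.Map(
                lambda e, ref_map: {**e, "category": ref_map.get(e["sku"], "unknown")},
                ref_map=beam.pvalue.AsDict(ref))
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:my_ds.enriched",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )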
Question 26
You want to analyze hundreds of thousands of social media posts daily at the lowest cost and with the fewest steps.
You have the following requirements:
- You will batch-load the posts once per day and run them through the Cloud Natural Language API.
- You will extract topics and sentiment from the posts.
- You must store the raw posts for archiving and reprocessing.
- You will create dashboards to be shared with people both inside and outside your organization.
You need to store both the data extracted from the API to perform analysis as well as the raw social media posts for historical archiving.
What should you do?
Store the social media posts and the data extracted from the API in BigQuery.
Store the social media posts and the data extracted from the API in Cloud SQL.
Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery.
Feed the social media posts into the API directly from the source, and write the extracted data from the API into BigQuery.
Answer is Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery.
Social media posts can contain images and videos, which are not suited to BigQuery. Archiving the raw posts in Cloud Storage keeps them available for reprocessing, while the structured topics and sentiment extracted by the Natural Language API go to BigQuery for analysis and dashboarding.
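A sketch of the daily batch step under this design, using hypothetical bucket, dataset, and column names; raw posts stay archived in Cloud Storage, and only the extracted sentiment goes to BigQuery:

    from google.cloud import bigquery, language_v1, storage

    storage_client = storage.Client()
    nl_client = language_v1.LanguageServiceClient()
    bq_client = bigquery.Client()

    bucket = storage_client.bucket("raw-social-posts")       # archive of raw posts
    table_id = "my-project.social.post_sentiment"            # extracted results for dashboards

    rows = []
    for blob in bucket.list_blobs(prefix="2024-06-01/"):
        text = blob.download_as_text()
        doc = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
        sentiment = nl_client.analyze_sentiment(request={"document": doc}).document_sentiment
        rows.append({
            "post_uri": f"gs://{bucket.name}/{blob.name}",
            "score": sentiment.score,
            "magnitude": sentiment.magnitude,
        })

    errors = bq_client.insert_rows_json(table_id, rows)  # table assumed to exist with matching schema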
Question 27
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day.
Which tool should you use?
cron
Cloud Composer
Cloud Scheduler
Workflow Templates on Cloud Dataproc
Answer is Cloud Composer
Cloud Composer is a managed Apache Airflow service and is well suited to orchestrating interdependent Dataproc and Dataflow jobs on a daily schedule; Cloud Scheduler is just a managed cron service and cannot express dependencies between steps.
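A minimal Cloud Composer (Airflow) DAG sketch of such a daily pipeline; the project, cluster, template path, and exact operator parameters are hypothetical and depend on the installed google provider package:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

    with DAG(
        dag_id="daily_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        spark_job = DataprocSubmitJobOperator(
            task_id="transform_on_dataproc",
            project_id="my-project",
            region="us-central1",
            job={
                "placement": {"cluster_name": "etl-cluster"},
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
            },
        )

        load_job = DataflowTemplatedJobStartOperator(
            task_id="load_with_dataflow",
            project_id="my-project",
            location="us-central1",
            template="gs://my-bucket/templates/load_template",
            parameters={"inputDir": "gs://my-bucket/staging/"},
        )

        spark_job >> load_job   # the Dataflow job runs only after the Dataproc job succeeds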
Question 28
You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards: the scanned labels contain recipients' personally identifiable information (PII), and transmitting that PII to the analytics systems would violate user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems.
What should you do?
Create an authorized view in BigQuery to restrict access to tables with sensitive data.
Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
Use Stackdriver logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.
Answer is Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.
The Cloud Data Loss Prevention (DLP) API inspects content for sensitive data such as PII and returns findings with likelihood (confidence) levels, so each message can either be passed through to the analytics systems or quarantined in a bucket for human review.
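A sketch of that approach, assuming a Pub/Sub-triggered (1st gen) Cloud Function and hypothetical project and bucket names; messages with findings at or above the chosen likelihood are quarantined, the rest pass through:

    import base64
    import json
    from google.cloud import dlp_v2, storage

    dlp = dlp_v2.DlpServiceClient()
    storage_client = storage.Client()

    PROJECT = "my-project"                  # hypothetical
    QUARANTINE_BUCKET = "quarantine-scans"  # hypothetical

    def check_scan(event, context):
        """Pub/Sub-triggered Cloud Function: inspect a scanner message for PII."""
        text = base64.b64decode(event["data"]).decode("utf-8")

        response = dlp.inspect_content(
            request={
                "parent": f"projects/{PROJECT}",
                "inspect_config": {
                    "info_types": [
                        {"name": "PERSON_NAME"},
                        {"name": "STREET_ADDRESS"},
                        {"name": "PHONE_NUMBER"},
                    ],
                    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
                },
                "item": {"value": text},
            }
        )

        if response.result.findings:
            # Quarantine for human review instead of forwarding to analytics.
            blob = storage_client.bucket(QUARANTINE_BUCKET).blob(context.event_id + ".json")
            blob.upload_from_string(json.dumps({"payload": text}))
        else:
            pass  # forward the clean message to the analytics pipeline (not shown)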
Question 29
You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems.
You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?
Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.
Answer is Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.
Cloud Pub/Sub provides scalable, at-least-once message ingestion, and Cloud Dataflow autoscales with load and supports windowing, so messages can be ordered within one-hour windows.
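A short Beam (Python) sketch of the windowing requirement, with hypothetical subscription and field names: Pub/Sub delivers each message at least once, and Dataflow groups and orders them within fixed one-hour windows:

    import json
    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(streaming=True)  # plus Dataflow runner/project options in practice

    with beam.Pipeline(options=opts) as p:
        (
            p
            | "ReadCommands" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/voice-orders")
            | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
            | "HourlyWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))
            | "KeyByCustomer" >> beam.Map(lambda cmd: (cmd["customer_id"], cmd))
            | "GroupPerWindow" >> beam.GroupByKey()
            | "OrderByTime" >> beam.MapTuple(
                lambda cust, cmds: (cust, sorted(cmds, key=lambda c: c["ts"])))
        )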