Professional Data Engineer on Google Cloud Platform
Question 31
You need to set access to BigQuery for different departments within your company. Your solution should comply with the following requirements:
- Each department should have access only to their data.
- Each department will have one or more leads who need to be able to create and update tables and provide them to their team.
- Each department has data analysts who need to be able to query but not modify data.
How should you set access to the data in BigQuery?
Create a dataset for each department. Assign the department leads the role of OWNER, and assign the data analysts the role of WRITER on their dataset.
Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.
Create a table for each department. Assign the department leads the role of Owner, and assign the data analysts the role of Editor on the project the table is in.
Create a table for each department. Assign the department leads the role of Editor, and assign the data analysts the role of Viewer on the project the table is in.
Answer is Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.
Access must be granted at the dataset level, which is what the primitive dataset roles READER, WRITER, and OWNER provide: WRITER lets the department leads create and update tables in their own dataset, while READER lets the data analysts query the data without modifying it.
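As a rough illustration (not part of the original answer), dataset-level access can be granted with the BigQuery client library; the project, dataset, and group names below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # placeholder project
dataset = client.get_dataset("my-project.finance_dept")   # one dataset per department

entries = list(dataset.access_entries)
# Department leads: create and update tables in the dataset.
entries.append(bigquery.AccessEntry("WRITER", "groupByEmail", "finance-leads@example.com"))
# Data analysts: query but not modify data.
entries.append(bigquery.AccessEntry("READER", "groupByEmail", "finance-analysts@example.com"))

dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```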
Question 32
You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time.
Which two methods can accomplish this?
(Choose two.)
Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.
Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.
Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.
Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.
Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.
Answers are:
A. Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.
B. Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.
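A minimal sketch of option A using the Datastore Admin client; the project ID and bucket name are placeholders, and the bucket would be created with the Nearline or Coldline storage class:

```python
from google.cloud import datastore_admin_v1

client = datastore_admin_v1.DatastoreAdminClient()

# Managed export of all entities to a (Nearline/Coldline) Cloud Storage bucket.
operation = client.export_entities(
    request={
        "project_id": "my-project",
        "output_url_prefix": "gs://my-datastore-snapshots/2024-01-01",
    }
)
response = operation.result()   # blocks until the export completes
print(response.output_url)      # export metadata path, usable later with import_entities
```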
Question 33
You are designing a cloud-native historical data processing system to meet the following conditions:
- The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
- A streaming data pipeline stores new data daily.
- Performance is not a factor in the solution.
- The solution design should maximize availability.
How should you design data storage for this solution?
Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
Answer is Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
A multi-regional Cloud Storage bucket maximizes availability, and Cloud Storage can hold all three formats, including PDF, which BigQuery cannot store natively. Cloud Dataproc, BigQuery, and Compute Engine can all read the objects directly from the bucket.
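For instance (bucket, dataset, and table names are hypothetical), BigQuery can query the CSV or Avro files in place through an external table, while Dataproc jobs and Compute Engine instances read the same gs:// paths directly:

```python
from google.cloud import bigquery

client = bigquery.Client()

# External table over CSV files stored in a multi-regional bucket.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://analytics-multiregion/historical/*.csv"]
external_config.autodetect = True

table = bigquery.Table("my-project.analytics.historical_csv")
table.external_data_configuration = external_config
client.create_table(table)

# Dataproc and Compute Engine access the same data directly via
# gs://analytics-multiregion/historical/... paths.
```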
Question 34
Your United States-based company has created an application for assessing and responding to user actions. The primary table's data volume grows by 250,000 records per second. Many third parties use your application's APIs to build the functionality into their own frontend applications. Your application's APIs should comply with the following requirements:
- Single global endpoint
- ANSI SQL support
- Consistent access to the most up-to-date data
What should you do?
Implement BigQuery with no region selected for storage or processing.
Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.
Implement Cloud SQL for PostgreSQL with the master in North America and read replicas in Asia and Europe.
Implement Cloud Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.
Answer is Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.
Cloud Spanner has three types of replicas (read-write, read-only, and witness) and provides a single global endpoint, ANSI SQL support, and externally consistent reads, so every client sees the most up-to-date data. BigQuery is an analytics warehouse and is not designed for 250,000 record-per-second operational ingestion with consistent point reads, Cloud SQL read replicas lag the master, and Cloud Bigtable does not support ANSI SQL, which leaves Spanner as the only option that meets all the requirements.
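A brief sketch of why Spanner satisfies the consistency requirement: a default (strong) read returns the most recently committed data from any replica. The instance, database, and table names below are assumptions:

```python
import datetime
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("global-user-actions")   # multi-region instance config
database = instance.database("actions-db")

since = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)

# database.snapshot() performs a strong read by default: every API caller,
# regardless of region, sees the latest committed data.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT user_id, action, event_time FROM user_actions WHERE event_time > @since",
        params={"since": since},
        param_types={"since": spanner.param_types.TIMESTAMP},
    )
    for row in rows:
        print(row)
```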
Question 35
You are building an application to share financial market data with consumers, who will receive data feeds. Data is collected from the markets in real time.
Consumers will receive the data in the following ways:
- Real-time event stream
- ANSI SQL access to real-time stream and historical data
- Batch historical exports
Which solution should you use?
Cloud Dataflow, Cloud SQL, Cloud Spanner
Cloud Pub/Sub, Cloud Storage, BigQuery
Cloud Dataproc, Cloud Dataflow, BigQuery
Cloud Pub/Sub, Cloud Dataproc, Cloud SQL
Answer is Cloud Pub/Sub, Cloud Storage, BigQuery
The data is collected from the markets in real time, and the three delivery methods are the requirements; option B covers all of them.
Real-time Event Stream: Cloud Pub/Sub is a managed messaging service that can handle real-time event streams efficiently. You can use Pub/Sub to ingest and publish real-time market data to consumers.
ANSI SQL Access: BigQuery supports ANSI SQL queries, making it suitable for both real-time and historical data analysis. You can stream data into BigQuery tables from Pub/Sub and provide ANSI SQL access to consumers.
Batch Historical Exports: Cloud Storage can be used for batch historical exports. You can export data from BigQuery to Cloud Storage in batch, making it available for consumers to download.
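As an illustration of two of the three paths (topic, dataset, and bucket names are placeholders): the real-time feed is published to Pub/Sub, and batch historical exports are produced by extracting BigQuery tables to Cloud Storage:

```python
import json
from google.cloud import pubsub_v1, bigquery

# Real-time event stream: publish each market tick to a Pub/Sub topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "market-ticks")
tick = {"symbol": "GOOG", "price": 172.5, "ts": "2024-01-01T09:30:00Z"}
publisher.publish(topic_path, json.dumps(tick).encode("utf-8")).result()

# Batch historical export: extract a BigQuery table to Cloud Storage.
bq = bigquery.Client()
extract_job = bq.extract_table(
    "my-project.market_data.ticks_2023",
    "gs://market-exports/ticks_2023/*.csv",
)
extract_job.result()  # waits for the export job to finish
```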
Question 36
You are building a new data pipeline to share data between two different types of applications: job generators and job runners.
Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones.
What should you do?
Create an API using App Engine to receive and send messages to the applications
Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them
Create a table on Cloud SQL, and insert and delete rows with the job information
Create a table on Cloud Spanner, and insert and delete rows with the job information
Answer is Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them
Cloud Pub/Sub decouples the job generators (publishers) from the job runners (subscribers): it scales automatically with message volume, and new applications can be added as additional subscriptions without affecting the performance of existing ones.
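A minimal sketch, with hypothetical topic and subscription names: generators publish to one topic, and each runner application gets its own subscription, so new runners can be added without touching existing ones:

```python
from concurrent import futures
from google.cloud import pubsub_v1

project = "my-project"

# Job generator: publish a job to the shared topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "jobs")
publisher.publish(topic_path, b'{"job_id": "123", "action": "resize-image"}').result()

# Job runner: each runner type uses its own subscription on the same topic.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project, "jobs-runner-a")

def handle(message):
    print("running job:", message.data)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=handle)
try:
    streaming_pull.result(timeout=30)   # listen for 30 seconds in this sketch
except futures.TimeoutError:
    streaming_pull.cancel()
```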
Question 37
You need to move 2 PB of historical data from an on-premises storage appliance to Cloud Storage within six months, and your outbound network capacity is constrained to 20 Mb/sec. How should you migrate this data to Cloud Storage?
Use Transfer Appliance to copy the data to Cloud Storage
Use gsutil cp -J to compress the content being uploaded to Cloud Storage
Create a private URL for the historical data, and then use Storage Transfer Service to copy the data to Cloud Storage
Use trickle or ionice along with gsutil cp to limit the amount of bandwidth gsutil utilizes to less than 20 Mb/sec so it does not interfere with the production traffic
Answer is Use Transfer Appliance to copy the data to Cloud Storage
The data volume is huge and the available bandwidth is low: at 20 Mb/sec, transferring 2 PB over the network would take roughly 25 years, far beyond the six-month deadline. Transfer Appliance is the recommended option for offline transfer of datasets of this size (generally anything over about 100 TB, or whenever an online transfer would take too long).
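A quick back-of-the-envelope check (pure arithmetic, not from the original answer) shows why an online transfer is not viable:

```python
data_bits = 2 * 10**15 * 8      # 2 PB expressed in bits
link_bps = 20 * 10**6           # 20 Mb/sec of outbound capacity
seconds = data_bits / link_bps
print(seconds / (86400 * 365))  # ~25 years, versus a six-month deadline
```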
Question 38
You receive data files in CSV format monthly from a third party. You need to cleanse this data, but every third month the schema of the files changes. Your requirements for implementing these transformations include:
- Executing the transformations on a schedule
- Enabling non-developer analysts to modify transformations
- Providing a graphical tool for designing transformations
What should you do?
Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis
Load each month's CSV data into BigQuery, and write a SQL query to transform the data to a standard schema. Merge the transformed tables together with a SQL query
Help the analysts write a Cloud Dataflow pipeline in Python to perform the transformation. The Python code should be stored in a revision control system and modified as the incoming data's schema changes
Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file before creating a Dataframe. Then implement the transformations in Spark SQL before writing the data out to Cloud Storage and loading into BigQuery
Answer is Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis
Cloud Dataprep provides a graphical, no-code interface for building transformation recipes, so non-developer analysts can adjust the transformations themselves when the schema changes, and the recipes can be executed on a schedule.
Question 39
You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit.
Which solution should you choose?
Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
Answer is B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
For this scenario, where you need to automate the detection of damaged packages in real time while they are in transit, the most suitable solution among the provided options would be B.
Here's why this option is the most appropriate:
Real-Time Analysis: AutoML provides the capability to train a custom model specifically tailored to recognize patterns of damage in packages. This model can process images in real-time, which is essential in your scenario.
Integration with Existing Systems: By building an API around the AutoML model, you can seamlessly integrate this solution with your existing package tracking applications. This ensures that the system can flag damaged packages for human review efficiently.
Customization and Accuracy: Since the model is trained on your specific corpus of images, it can be more accurate in detecting damages relevant to your use case compared to pre-trained models.
Let's briefly consider why the other options are less suitable:
A. Use BigQuery machine learning: BigQuery is great for handling large-scale data analytics but is not optimized for real-time image processing or complex image recognition tasks like damage detection on packages.
C. Use the Cloud Vision API: While the Cloud Vision API is powerful for general image analysis, it might not be as effective for the specific task of detecting damage on packages, which requires a more customized approach.
D. Use TensorFlow in Cloud Datalab: While this is a viable option for creating a custom model, it might be more complex and time-consuming compared to using AutoML. Additionally, setting up a real-time analysis system through a Python notebook might not be as straightforward as an API integration.
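A rough sketch of how the trained AutoML image model could sit behind an API used by the package-tracking applications. The project, location, model ID, and the "damaged" label are assumptions:

```python
from google.cloud import automl_v1

prediction_client = automl_v1.PredictionServiceClient()
model_name = automl_v1.AutoMlClient.model_path(
    "my-project", "us-central1", "ICN1234567890"   # placeholder model ID
)

def is_damaged(image_bytes: bytes, threshold: float = 0.8) -> bool:
    """Classify one camera frame with the deployed AutoML model."""
    payload = {"image": {"image_bytes": image_bytes}}
    response = prediction_client.predict(name=model_name, payload=payload)
    return any(
        result.display_name == "damaged"
        and result.classification.score >= threshold
        for result in response.payload
    )

# The package-tracking application calls is_damaged() for each frame and
# flags the package for human review when it returns True.
```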
Question 40
You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second.
What should you do?
Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Cloud Bigtable in the last hour. If that number falls below 4000, send an alert.
Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.
Answer is Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
The KafkaIO connector lets Dataflow consume the stream directly, regardless of where the Kafka cluster is located (on-premises, Google Cloud, or another cloud).
A sliding window of 1 hour that advances every 5 minutes computes the moving average continuously, so the alert can fire within minutes of the rate dropping below 4000 messages per second; a fixed 1-hour window would only evaluate once per hour.
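A minimal Beam (Dataflow) sketch of the chosen option, assuming a Kafka topic named "telemetry" and a placeholder send_alert helper:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.transforms import window

def send_alert(avg_rate):
    # Placeholder: in practice this could publish to Pub/Sub or call an
    # alerting webhook instead of printing.
    print(f"ALERT: moving average dropped to {avg_rate:.0f} msg/s")

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka-broker:9092"},
            topics=["telemetry"],
        )
        | "Sliding1hEvery5min" >> beam.WindowInto(
            window.SlidingWindows(size=3600, period=300)  # 1 h window, every 5 min
        )
        | "CountPerWindow" >> beam.combiners.Count.Globally().without_defaults()
        | "ToMsgPerSec" >> beam.Map(lambda n: n / 3600.0)
        | "AlertIfBelow4000" >> beam.Map(
            lambda rate: send_alert(rate) if rate < 4000 else None
        )
    )
```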