Professional Data Engineer on Google Cloud Platform


Question 81

The Development and External teams have the project viewer Identity and Access Management (IAM) role in a folder named Visualization. You want the Development Team to be able to read data from both Cloud Storage and BigQuery, but the External Team should only be able to read data from BigQuery.

What should you do?
Remove the External Team's Cloud Storage IAM permissions on the acme-raw-data project.
Create Virtual Private Cloud (VPC) firewall rules on the acme-raw-data project that deny all ingress traffic from the External Team CIDR range.
Create a VPC Service Controls perimeter containing both projects and BigQuery as a restricted API. Add the External Team users to the perimeter's Access Level.
Create a VPC Service Controls perimeter containing both projects and Cloud Storage as a restricted API. Add the Development Team users to the perimeter's Access Level.




Answer is Create a VPC Service Controls perimeter containing both projects and Cloud Storage as a restricted API. Add the Development Team users to the perimeter's Access Level.

Development Team: needs to read from both Cloud Storage and BigQuery -> so we add the Development Team users to the perimeter's Access Level, which lets them reach both services from outside the perimeter boundary.
External Team: should read only from BigQuery -> so we make Cloud Storage the restricted API and leave the External Team outside the perimeter; they can still read BigQuery, but they are blocked from Cloud Storage.
"The grouping of GCP Project(s) and Service API(s) in the Service Perimeter result in restricting unauthorized access outside of the Service Perimeter to Service API endpoint(s) referencing resources inside of the Service Perimeter."

Reference:
https://scalesec.com/blog/vpc-service-controls-in-plain-english/

Question 82

Your startup has a web application that currently serves customers out of a single region in Asia. You are targeting funding that will allow your startup to serve customers globally.
Your current goal is to optimize for cost, and your post-funding goal is to optimize for global presence and performance. You must use a native JDBC driver.

What should you do?
Use Cloud Spanner to configure a single region instance initially, and then configure multi-region Cloud Spanner instances after securing funding.
Use a Cloud SQL for PostgreSQL highly available instance first, and Bigtable with US, Europe, and Asia replication after securing funding.
Use a Cloud SQL for PostgreSQL zonal instance first, and Bigtable with US, Europe, and Asia after securing funding.
Use a Cloud SQL for PostgreSQL zonal instance first, and Cloud SQL for PostgreSQL with highly available configuration after securing funding.




Answer is Use Cloud Spanner to configure a single region instance initially, and then configure multi-region Cloud Spanner instances after securing funding.

When you create a Cloud Spanner instance, you must configure it as either regional (all resources contained within a single Google Cloud region) or multi-region (resources spanning more than one region).
You can move the instance to a multi-region configuration at any time, so a regional instance keeps costs down now while leaving a clear path to global presence and performance after funding. Cloud Spanner also offers an open-source JDBC driver, which satisfies the native JDBC requirement; Cloud Bigtable does not provide a JDBC driver.
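
For illustration, a hedged sketch of provisioning the initial single-region instance with the google-cloud-spanner Python client; the project, instance, and configuration names are placeholders, and moving to a multi-region configuration later does not change application code or the JDBC connection string format (jdbc:cloudspanner:/projects/<project>/instances/<instance>/databases/<database>):

    # Sketch only: create a cost-optimized regional Cloud Spanner instance.
    from google.cloud import spanner

    client = spanner.Client(project="my-startup-project")

    instance = client.instance(
        "webapp-instance",
        configuration_name="projects/my-startup-project/instanceConfigs/regional-asia-southeast1",
        display_name="Web app (regional, cost-optimized)",
        node_count=1,
    )

    operation = instance.create()
    operation.result(timeout=300)  # block until the instance is ready
    print("Created:", instance.name)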

Reference:
https://cloud.google.com/spanner/docs/jdbc-drivers
https://cloud.google.com/spanner/docs/instance-configurations#tradeoffs_regional_versus_multi-region_configurations

Question 83

An aerospace company uses a proprietary data format to store its flight data.
You need to connect this new data source to BigQuery and stream the data into BigQuery. You want to efficiently import the data into BigQuery while consuming as few resources as possible.

What should you do?
Write a shell script that triggers a Cloud Function that performs periodic ETL batch jobs on the new data source.
Use a standard Dataflow pipeline to store the raw data in BigQuery, and then transform the format later when the data is used.
Use Apache Hive to write a Dataproc job that streams the data into BigQuery in CSV format.
Use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format.




Answer is Use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format.

The key reasons:
• Dataflow provides managed, autoscaling resources for efficient stream processing
• Avro offers schema evolution and compact, efficient serialization for flight telemetry data
• An Apache Beam custom connector lets you integrate the proprietary data source with relatively little code
• Streaming moves data into BigQuery continuously and efficiently, unlike periodic batch jobs

In contrast, option A relies on a Cloud Function driven by periodic batch jobs, which is not a streaming solution. Option B ingests the raw, unparsed data and pushes the transformation cost onto every later query. Option C uses Dataproc, which requires cluster management overhead.
So Dataflow + Beam + Avro provides the most efficient way to stream the proprietary flight data into BigQuery while using minimal resources.
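
As a hedged illustration (not the company's actual connector), here is a streaming Beam pipeline in Python that parses records of a proprietary format and streams rows into BigQuery; the Pub/Sub topic and parse_flight_record() are hypothetical stand-ins for the custom Beam connector the answer describes:

    # Sketch: Dataflow streaming pipeline from a flight-data feed into BigQuery.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def parse_flight_record(raw_bytes):
        # Placeholder for the proprietary-format decoder; returns a dict that
        # matches the BigQuery schema declared below.
        return {
            "tail_number": "N12345",
            "altitude_ft": 35000,
            "recorded_at": "2024-01-01T00:00:00Z",
        }


    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFlightFeed" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/flight-data")
            | "ParseProprietaryFormat" >> beam.Map(parse_flight_record)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-project:aerospace.flight_telemetry",
                schema="tail_number:STRING,altitude_ft:INTEGER,recorded_at:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )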

Question 84

An online brokerage company requires a high volume trade processing architecture.
You need to create a secure queuing system that triggers jobs.
The jobs will run in Google Cloud and call the company's Python API to execute trades. You need to efficiently implement a solution.

What should you do?
Use a Pub/Sub push subscription to trigger a Cloud Function to pass the data to the Python API.
Write an application hosted on a Compute Engine instance that makes a push subscription to the Pub/Sub topic.
Write an application that makes a queue in a NoSQL database.
Use Cloud Composer to subscribe to a Pub/Sub topic and call the Python API.




Answer is Use a Pub/Sub push subscription to trigger a Cloud Function to pass the data to the Python API.

Pub/Sub provides a secure, managed queue, and a push subscription delivers each trade message directly to a Cloud Function, which then calls the Python API. This is the most efficient option: there are no servers to manage and no polling. Cloud Composer is an orchestration tool, so it would still need something else (for example, a Cloud Function) to trigger a run for each message, adding cost and latency.
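
A minimal sketch of the chosen option in Python; the trade API endpoint and message fields are assumptions for illustration:

    # Sketch: Pub/Sub-triggered Cloud Function (CloudEvents signature) that
    # forwards each trade message to the company's Python trade API.
    import base64
    import json

    import functions_framework
    import requests

    TRADE_API_URL = "https://trades.internal.example.com/execute"  # hypothetical endpoint


    @functions_framework.cloud_event
    def execute_trade(cloud_event):
        # Pub/Sub delivers the payload base64-encoded inside the event data.
        payload = base64.b64decode(cloud_event.data["message"]["data"])
        trade = json.loads(payload)

        # Call the trade API; raising on failure makes Pub/Sub retry the message.
        response = requests.post(TRADE_API_URL, json=trade, timeout=10)
        response.raise_for_status()

Deploying this function with a Pub/Sub trigger on the trade topic provides the queue, the access control (IAM on the topic and subscription), and the scaling without managing any servers.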

Reference:
https://cloud.google.com/functions/docs/calling/pubsub#deployment

Question 85

You have 15 TB of data in your on-premises data center that you want to transfer to Google Cloud.
Your data changes weekly and is stored in a POSIX-compliant source.
The network operations team has granted you 500 Mbps bandwidth to the public internet.
You want to follow Google-recommended practices to reliably transfer your data to Google Cloud on a weekly basis.

What should you do?
Use Cloud Scheduler to trigger the gsutil command. Use the -m parameter for optimal parallelism.
Use Transfer Appliance to migrate your data into a Google Kubernetes Engine cluster, and then configure a weekly transfer job.
Install Storage Transfer Service for on-premises data in your data center, and then configure a weekly transfer job.
Install Storage Transfer Service for on-premises data on a Google Cloud virtual machine, and then configure a weekly transfer job.




Answer is Install Storage Transfer Service for on-premises data in your data center, and then configure a weekly transfer job.

Like gsutil, Storage Transfer Service for on-premises data enables transfers from network file system (NFS) storage to Cloud Storage. Although gsutil can support small transfer sizes (up to 1 TB), Storage Transfer Service for on-premises data is designed for large-scale transfers (up to petabytes of data, billions of files). The transfer agents must be installed in your own data center, not on a Google Cloud VM, because they need direct access to the POSIX-compliant source file system; the service then handles scheduling, retries, and bandwidth management for the recurring weekly transfer.
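
A hedged sketch of what the weekly job definition could look like with the google-cloud-storage-transfer Python client; the project, agent pool, source path, and bucket names are placeholders, and the transfer agents are assumed to already be running in the data center:

    # Sketch: weekly agent-based transfer from an on-premises POSIX source to GCS.
    from datetime import datetime

    from google.cloud import storage_transfer
    from google.protobuf.duration_pb2 import Duration

    client = storage_transfer.StorageTransferServiceClient()
    today = datetime.utcnow()

    transfer_job = {
        "project_id": "my-project",
        "description": "Weekly on-prem POSIX to Cloud Storage transfer",
        "status": storage_transfer.TransferJob.Status.ENABLED,
        "schedule": {
            "schedule_start_date": {"year": today.year, "month": today.month, "day": today.day},
            "repeat_interval": Duration(seconds=7 * 24 * 60 * 60),  # run weekly
        },
        "transfer_spec": {
            "source_agent_pool_name": "projects/my-project/agentPools/on-prem-pool",
            "posix_data_source": {"root_path": "/mnt/exports/weekly"},
            "gcs_data_sink": {"bucket_name": "my-weekly-landing-bucket"},
        },
    }

    job = client.create_transfer_job({"transfer_job": transfer_job})
    print("Created transfer job:", job.name)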

Reference:
https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#storage-transfer-service-for-large-transfers-of-on-premises-data

Question 86

You are using BigQuery and Data Studio to design a customer-facing dashboard that displays large quantities of aggregated data.
You expect a high volume of concurrent users.
You need to optimize the dashboard to provide quick visualizations with minimal latency.

What should you do?
Use BigQuery BI Engine with materialized views.
Use BigQuery BI Engine with logical views.
Use BigQuery BI Engine with streaming data.
Use BigQuery BI Engine with authorized views.




Answer is Use BigQuery BI Engine with materialized views.

In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever possible reads only delta changes from the base tables to compute up-to-date results. Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.

Queries that use materialized views are generally faster and consume fewer resources than queries that retrieve the same data only from the base tables, so they significantly improve workloads built around common, repeated queries such as dashboard aggregations. BI Engine adds an in-memory analysis layer on top of BigQuery, which keeps the Data Studio dashboard responsive for a high volume of concurrent users.
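
For illustration, a short sketch of creating such a materialized view with the BigQuery Python client; the dataset, table, and column names are hypothetical:

    # Sketch: precompute the dashboard aggregates in a materialized view so
    # BI Engine and Data Studio can serve them with minimal latency.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    ddl = """
    CREATE MATERIALIZED VIEW `my-project.dashboards.daily_sales_mv` AS
    SELECT
      sale_date,
      region,
      SUM(amount) AS total_amount,
      COUNT(*) AS order_count
    FROM `my-project.dashboards.orders`
    GROUP BY sale_date, region
    """

    client.query(ddl).result()  # wait for the DDL statement to finish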

Reference:
https://cloud.google.com/bigquery/docs/materialized-views-intro

Question 87

Government regulations in the banking industry mandate the protection of clients' personally identifiable information (PII).
Your company requires PII to be access controlled, encrypted, and compliant with major data protection standards. In addition to using Cloud Data Loss Prevention (Cloud DLP), you want to follow Google-recommended practices and use service accounts to control access to PII.

What should you do?
Assign the required Identity and Access Management (IAM) roles to every employee, and create a single service account to access project resources.
Use one service account to access a Cloud SQL database, and use separate service accounts for each human user.
Use Cloud Storage to comply with major data protection standards. Use one service account shared by all users.
Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group.




Answer is Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group.

To align with Google's recommended practices for managing access to personally identifiable information (PII) in compliance with banking industry regulations, let's analyze the options:
A. Assign the required IAM roles to every employee, and create a single service account to access project resources: While assigning specific IAM roles to employees is a good practice for access control, using a single service account for all access to PII is not ideal. Service accounts should be used for applications and automated processes, not as a shared account for multiple users or employees.

B. Use one service account to access a Cloud SQL database, and use separate service accounts for each human user: Again, service accounts are intended for automated tasks or applications, not for individual human users. Assigning separate service accounts to each human user is not a recommended practice and does not align with the principle of least privilege.

C. Use Cloud Storage to comply with major data protection standards. Use one service account shared by all users: Using Cloud Storage can indeed help comply with data protection standards, especially when configured correctly with encryption and access controls. However, sharing a single service account among all users is not a best practice. It goes against the principle of least privilege and does not provide adequate granularity for access control.

D. Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group: This approach is more aligned with best practices. Using Cloud Storage can ensure compliance with data protection standards. Creating multiple service accounts, each with specific access controls attached to different IAM groups, allows for more granular and controlled access to PII. This setup adheres to the principle of least privilege, ensuring that each service (or group of services) only has access to the resources necessary for its function.

Based on these considerations, option D is the most appropriate choice. It ensures compliance with data protection standards, uses Cloud Storage for secure data management, and employs multiple service accounts tied to IAM groups for granular access control, aligning well with Google-recommended practices and regulatory requirements in the banking industry.
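
As a hedged example of the group-based pattern in option D, here is a short Python sketch that grants a user group and a dedicated pipeline service account read access to a bucket holding PII; the bucket, group, and service account names are placeholders:

    # Sketch: grant read access on a PII bucket to an IAM group and to the
    # service account used by the automated pipeline, instead of sharing one
    # account across all users.
    from google.cloud import storage

    client = storage.Client(project="my-project")
    bucket = client.bucket("pii-secure-bucket")

    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectViewer",
            "members": {
                "group:pii-analysts@example.com",
                "serviceAccount:pii-pipeline@my-project.iam.gserviceaccount.com",
            },
        }
    )
    bucket.set_iam_policy(policy)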

Question 88

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?
Disable caching by editing the report settings.
Disable caching in BigQuery by editing table details.
Refresh your browser tab showing the visualizations.
Clear your browser history for the past hour, then reload the tab showing the visualizations.




Answer is Disable caching by editing the report settings.

Reference: https://support.google.com/datastudio/answer/7020039?hl=en

Question 89

Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse.

You need to discover what everyone is doing. What should you do first?
Use Google Stackdriver Audit Logs to review data access.
Get the Identity and Access Management (IAM) policy of each table.
Use Stackdriver Monitoring to see the usage of BigQuery query slots.
Use the Google Cloud Billing API to see what account the warehouse is being billed to.




Answer is Use Google Stackdriver Audit Logs to review data access.

Cloud Audit Logs (specifically the Data Access logs) record who read or queried which BigQuery datasets, which is exactly what you need to discover current usage before locking anything down. Slots are just the virtual CPUs BigQuery uses to execute queries, so slot monitoring shows capacity consumption, not who is accessing which data.
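
A hedged sketch of pulling the relevant Data Access audit entries with the Cloud Logging Python client; the project ID is a placeholder and the filter can be narrowed further (for example by dataset or principal):

    # Sketch: list recent BigQuery Data Access audit log entries to see who is
    # reading or querying the data warehouse.
    from google.cloud import logging

    client = logging.Client(project="my-project")

    log_filter = (
        'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Fdata_access" '
        'AND protoPayload.serviceName="bigquery.googleapis.com"'
    )

    for entry in client.list_entries(filter_=log_filter,
                                     order_by=logging.DESCENDING,
                                     max_results=50):
        payload = entry.payload or {}
        principal = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
        print(entry.timestamp, principal)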

Reference:
https://cloud.google.com/bigquery/docs/slots

Question 90

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive.

What should you do?
Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate it from the numeric values in the column DT for each row. Reference the column TS instead of the column DT from now on.
Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
Add two columns to the table CLICK STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.




Answer is E. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.

Creating a new table from the existing table with a transformed TIMESTAMP column is simple and involves minimal migration effort. Because TS is stored natively as a TIMESTAMP, future queries avoid repeated casting and stay computationally cheap, unlike the view in option C, which would re-cast the strings on every query.
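
A minimal sketch of that approach with the BigQuery Python client; the dataset name and the assumption that DT holds epoch seconds stored as strings are illustrative:

    # Sketch: materialize NEW_CLICK_STREAM with TS as a native TIMESTAMP column.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    sql = """
    SELECT
      * EXCEPT (DT),
      TIMESTAMP_SECONDS(CAST(DT AS INT64)) AS TS
    FROM `my-project.web.CLICK_STREAM`
    """

    job_config = bigquery.QueryJobConfig(
        destination="my-project.web.NEW_CLICK_STREAM",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    client.query(sql, job_config=job_config).result()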

Reference:
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
