Professional Data Engineer on Google Cloud Platform
Question 11
You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
- The user profile: What the user likes and doesn't like to eat
- The user account information: Name, address, preferred meal times
- The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schema.
Which Google Cloud Platform product should you use?
A. BigQuery (Most Voted)
B. Cloud SQL
C. Cloud Bigtable
D. Cloud Datastore
Answer is B. Cloud SQL
The question says the database will be used to store all the transactional data of the product.
What we need is a database purely for storing transactional data, not for analysis and ML,
so the answer should be the database built for transactional data, which is Cloud SQL.
If you later want to analyze the data or do ML, you can simply expose Cloud SQL as a federated data source (see the sketch below).
A: BigQuery is good for analysis, but it costs too much to read and write data this frequently, so it is not suited to transactional storage.
C: Bigtable is not a good fit for transactional data.
D: Datastore does support transactions, but it is weaker than an RDBMS in that respect, and since the data schema is already defined in this case, you should use a relational database.
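As a rough illustration of the federated-source point above, here is a minimal sketch (Python, google-cloud-bigquery) of a BigQuery federated query against Cloud SQL. The project, connection name, and table are hypothetical placeholders.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# EXTERNAL_QUERY pushes the inner SQL down to the Cloud SQL instance behind
# the named connection resource; the connection and table names are hypothetical.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT user_id, COUNT(*) AS order_count FROM orders GROUP BY user_id;')
"""

for row in client.query(sql).result():
    print(row.user_id, row.order_count)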
Question 12
Your company produces 20,000 files every hour. Each data file is formatted as a comma separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.
You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)
A. Introduce data compression for each file to increase the rate of file transfer.
B. Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
C. Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel.
D. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
E. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
Answers are:
A. Introduce data compression for each file to increase the rate of file transfer.
C. Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel (see the sketch after the rules of thumb below).
B - wrong: the solution should not depend on buying more Internet bandwidth, and bandwidth utilization is already low.
D - batching 1,000 files into a TAR archive might work for the current volume, but the volume keeps growing, so how would we keep choosing the right batch size?
E - Storage Transfer Service still pushes the files over the same connection and does not remove the per-file overhead, so it will not help here.
Follow these rules of thumb when deciding whether to use gsutil or Storage Transfer Service:
- Transferring from another cloud storage provider: use Storage Transfer Service.
- Transferring less than 1 TB from on-premises: use gsutil.
- Transferring more than 1 TB from on-premises: use Transfer Service for on-premises data.
- Transferring less than 1 TB from another Cloud Storage region: use gsutil.
- Transferring more than 1 TB from another Cloud Storage region: use Storage Transfer Service.
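A minimal sketch of what the redesigned upload step (answers A and C) could look like, assuming a hypothetical staging directory and bucket name; gsutil's -m flag copies files in parallel and -z applies gzip compression in transit to files with the given extension.

import glob
import subprocess

# Hypothetical local staging directory and destination bucket.
csv_files = glob.glob("/data/outgoing/*.csv")

# -m runs the copy in parallel; -z csv gzip-compresses *.csv files in transit.
subprocess.run(
    ["gsutil", "-m", "cp", "-z", "csv", *csv_files, "gs://my-ingest-bucket/"],
    check=True,
)

In practice you would run this in batches so the argument list stays manageable, but the key point is that parallelism plus compression removes the per-file latency bottleneck without needing more bandwidth.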
Question 13
You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID).
However, high availability and low latency are required.
You need to analyze the data by querying against individual fields.
Which three databases meet your requirements? (Choose three.)
A. Redis
B. HBase
C. MySQL
D. MongoDB
E. Cassandra
F. HDFS with Hive
Answers are B. HBase, D. MongoDB, and E. Cassandra
A. Redis - Redis is an in-memory, non-relational key-value store. It is a great choice for a highly available in-memory cache that decreases data access latency, increases throughput, and takes load off your relational or NoSQL database, but the question is not asking for a cache, so A is discarded.
B. HBase - meets the requirements.
C. MySQL - a relational (ACID) database, not NoSQL; ACID is explicitly not required here, so it is not a fit.
D. MongoDB - meets the requirements; querying against individual fields of a document is exactly what it is built for (see the sketch below).
E. Cassandra - Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it well suited to mission-critical data.
F. HDFS with Hive - Hive lets users read, write, and manage petabytes of data using SQL. It is built on top of Apache Hadoop and is designed to work on petabytes of data, but Hive is not a database.
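To illustrate the "query against individual fields" requirement, here is a minimal MongoDB sketch using pymongo; the connection string, collection, and field names are hypothetical.

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
telemetry = client["iot"]["telemetry"]

# A secondary index makes queries against an individual attribute efficient.
telemetry.create_index("temperature")

# Query individual fields directly, with no need for ACID transactions.
for doc in telemetry.find({"device_id": "sensor-42", "temperature": {"$gt": 80}}):
    print(doc["device_id"], doc["temperature"])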
Question 14
You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query: SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1 (the query plan screenshot is not reproduced here):
What is the most likely cause of the delay for this query?
A. Users are running too many concurrent queries in the system
B. The [myproject:mydataset.mytable] table has too many partitions
C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
Answer is D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
In the query plan visualization, purple represents reading and blue represents writing, so the stage time is dominated by reading.
Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition.
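A quick way to confirm this kind of skew is to look at the distribution of the GROUP BY key. This sketch uses the standard SQL form of the table name from the question.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# One country holding most of the rows confirms the skewed partition that
# slows down the GROUP BY stage.
sql = """
SELECT country, COUNT(*) AS row_count
FROM `myproject.mydataset.mytable`
GROUP BY country
ORDER BY row_count DESC
LIMIT 10
"""

for row in client.query(sql).result():
    print(row.country, row.row_count)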
Question 15
Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp.
You want to collate those bid events into a single location in real time to determine which user bid first.
What should you do?
A. Create a file on a shared file system and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
B. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL. (Most Voted)
C. Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.
Answer is D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.
The need is to collate the bid events in real time and de-duplicate them based on the timestamp of when each event occurred. This can be done by publishing to Pub/Sub and consuming the events with Dataflow.
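A minimal Apache Beam (Python) sketch of the Dataflow side: it keys each bid by item and keeps the bid with the earliest event timestamp per window. The project, subscription, and message field names are hypothetical.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def earliest_bid(bids):
    """Keep the bid with the smallest event timestamp for an item."""
    return min(bids, key=lambda bid: bid["timestamp"])


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadBids" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/bid-events")  # hypothetical
            | "Parse" >> beam.Map(json.loads)
            | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
            | "KeyByItem" >> beam.Map(lambda bid: (bid["item"], bid))
            | "FirstBidPerItem" >> beam.CombinePerKey(earliest_bid)
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()

Using min over the event timestamp as the combine function works because it is commutative and associative, so Dataflow can apply it incrementally across workers.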
Question 16
Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the events data via an ODBC connection. You need to ensure the applications can connect.
Which two actions should you take? (Choose two.)
A. Create a new view over events using standard SQL
B. Create a new partitioned table using a standard SQL query
C. Create a new view over events_partitioned using standard SQL
D. Create a service account for the ODBC connection to use for authentication
E. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared "events"
Answers are:
C. Create a new view over events_partitioned using standard SQL
D. Create a service account for the ODBC connection to use for authentication
C = A standard SQL query cannot reference a view defined using legacy SQL syntax, so a new standard SQL view has to be created directly over events_partitioned (not over the legacy events view, which rules out A).
D = The ODBC drivers need a service account to authenticate with, and that service account is granted a standard BigQuery role.
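Putting C and D together, here is a sketch of creating the new standard SQL view while authenticating as the service account; the key file path, project, dataset, and view names are hypothetical, and ingestion-time partitioning (_PARTITIONTIME) is assumed.

from google.cloud import bigquery  # pip install google-cloud-bigquery
from google.oauth2 import service_account

# Authenticate as the service account the ODBC connection will also use
# (key file and project are hypothetical).
credentials = service_account.Credentials.from_service_account_file(
    "/secrets/odbc-reader.json")
client = bigquery.Client(credentials=credentials, project="my-project")

# Recreate the 14-day view in standard SQL directly over the partitioned table.
view = bigquery.Table("my-project.mydataset.events_sql")
view.view_query = """
SELECT *
FROM `my-project.mydataset.events_partitioned`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
"""
client.create_table(view)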
Question 17
Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly.
How should you optimize the cluster for cost?
A. Migrate the workload to Google Cloud Dataflow
B. Use pre-emptible virtual machines (VMs) for the cluster
C. Use a higher-memory node so that the job runs faster
D. Use SSDs on the worker nodes so that the job can run faster
Answer is B. Use pre-emptible virtual machines (VMs) for the cluster
Hadoop/Spark jobs run on Dataproc, and preemptible machines cost up to 80% less than standard VMs, which is ideal for a short, fault-tolerant weekly batch job like this one.
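A rough sketch of creating such a cluster with preemptible secondary workers; the cluster name, region, and worker counts are hypothetical, and flag names and defaults should be verified against the current gcloud release.

import subprocess

# Secondary workers on Dataproc default to preemptible VMs; keeping a couple
# of standard primary workers preserves HDFS/shuffle stability while the
# preemptible nodes supply cheap extra compute for the 30-minute Spark job.
subprocess.run(
    [
        "gcloud", "dataproc", "clusters", "create", "weekly-model-cluster",
        "--region=us-central1",
        "--num-workers=2",             # standard (primary) workers
        "--num-secondary-workers=13",  # preemptible secondary workers
    ],
    check=True,
)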
Question 18
Your infrastructure includes a set of YouTube channels. You have been tasked with creating a process for sending the YouTube channel data to Google Cloud for analysis. You want to design a solution that allows your world-wide marketing teams to perform ANSI SQL and other types of analysis on up-to-date YouTube channels log data.
How should you set up the log data transfer into Google Cloud?
A. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
B. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Regional bucket as a final destination.
C. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination. (Most Voted)
D. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Regional storage bucket as a final destination.
Answer is A. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
The destination is Cloud Storage, and a Multi-Regional bucket suits the world-wide marketing teams, so A is the best option available.
Although BigQuery Data Transfer Service supports Google application sources such as Google Ads, Campaign Manager, Google Ad Manager, and YouTube, it cannot write to any destination other than a BigQuery dataset, which rules out C and D.
Question 19
You are designing storage for very large text files for a data pipeline on Google Cloud. You want to support ANSI SQL queries. You also want to support compression and parallel load from the input locations using Google recommended practices.
What should you do?
A. Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and query.
B. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.
C. Compress text files to gzip using the Grid Computing Tools. Use BigQuery for storage and query.
D. Compress text files to gzip using the Grid Computing Tools. Use Cloud Storage, and then import into Cloud Bigtable for query.
Answer is B. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.
Both A and B would work, but B is the better answer.
The advantage of creating external (linked) tables is that they are fast to create: you skip the import step, and no additional BigQuery storage charges accrue, because the data stays in the data lake, where storage is comparatively cheaper than in BigQuery.
A: Importing the data into BigQuery takes longer than creating external tables over it, and BigQuery storage is more expensive than Cloud Storage.
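A sketch of answer B's query side: a permanent external (linked) table over the compressed Avro files in Cloud Storage, created with the BigQuery Python client. The bucket, dataset, and table names are hypothetical.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Permanent external (linked) table over the compressed Avro files that the
# Dataflow job wrote to Cloud Storage.
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://my-pipeline-bucket/avro/*.avro"]

table = bigquery.Table("my-project.mydataset.text_files_linked")
table.external_data_configuration = external_config
client.create_table(table)

# The linked table can now be queried with ANSI (standard) SQL.
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.mydataset.text_files_linked`"
).result()
print(list(rows)[0].n)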
Question 20
You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines.
Which storage service and schema design should you use?
A. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
D. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.
Answer is C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
BigQuery can access data in external sources, known as federated sources. Instead of first loading data into BigQuery, you can create a reference to an external source. External sources can be Cloud Bigtable, Cloud Storage, and Google Drive.
When accessing external data, you can create either permanent or temporary external tables. Permanent tables are created in a dataset and linked to an external source, and dataset-level access controls can be applied to them. When you use a temporary table, a table is created in a special dataset and is available for approximately 24 hours. Temporary tables are useful for one-time operations, such as loading data into a data warehouse.
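For contrast with the permanent linked table chosen in the answer, here is a sketch of the temporary variant, where the external definition exists only for the duration of a single query job; the bucket, schema, and field names are hypothetical.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Describe the CSV files sitting in Cloud Storage (hypothetical bucket and schema).
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-pipeline-bucket/csv/*.csv"]
external_config.schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("amount", "FLOAT"),
]
external_config.options.skip_leading_rows = 1

# table_definitions attaches the definition only to this query job, so the
# "temp_csv" table exists just for this one-off aggregate.
job_config = bigquery.QueryJobConfig(table_definitions={"temp_csv": external_config})
query = "SELECT user_id, SUM(amount) AS total FROM temp_csv GROUP BY user_id"

for row in client.query(query, job_config=job_config).result():
    print(row.user_id, row.total)

Because the question involves multiple users querying repeatedly, the permanent linked table (with dataset-level access controls) is the better fit than this one-shot approach.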