Use Professional-Data-Engineer Exam Dumps (2024 PDF Dumps) To Have Reliable Professional-Data-Engineer Test Engine
Professional-Data-Engineer PDF Recently Updated Questions Dumps to Improve Exam Score
NEW QUESTION # 139
By default, which of the following windowing behaviors does Dataflow apply to unbounded data sets?
- A. Windows at every 1 minute
- B. Single, Global Window
- C. Windows at every 10 minutes
- D. Windows at every 100 MB of data
Answer: B
Explanation:
Dataflow's default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.
Reference: https://cloud.google.com/dataflow/model/pcollection
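As a rough illustration of the difference between the default and a fixed-size alternative, window assignment can be modeled as a function from an element's timestamp to a window. This is a plain-Python sketch, not the Dataflow or Apache Beam API; `assign_window`, `GLOBAL_WINDOW`, and the sample timestamps are hypothetical names and values chosen for the example.

```python
import math

# Hypothetical stand-in for the single global window: it spans all of time.
GLOBAL_WINDOW = (-math.inf, math.inf)


def assign_window(timestamp_s, fixed_size_s=None):
    """Return the (start, end) window for an element timestamp in seconds.

    With no size given, every element maps to the one global window,
    mirroring the default behavior described above.
    """
    if fixed_size_s is None:
        return GLOBAL_WINDOW
    start = (timestamp_s // fixed_size_s) * fixed_size_s
    return (start, start + fixed_size_s)


# Made-up event timestamps (seconds since the start of the stream).
events = [5, 61, 125, 3599]
default_windows = {assign_window(t) for t in events}
minute_windows = {assign_window(t, 60) for t in events}
```

With the default, all four sample events share one window; with a 60-second fixed window they fall into four different windows.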
NEW QUESTION # 140
You are deploying a new storage system for your mobile application, which is a media streaming service.
You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of
which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the
property 'tags' have multiple values but the property 'date_released' does not. A typical query
would ask for all movies with actor=<actorname> ordered by date_released or all movies with
tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the
number of indexes?
- A. Manually configure the index in your index config as follows:

- B. Set the following in your entity options: exclude_from_indexes = 'date_published'
- C. Manually configure the index in your index config as follows:

- D. Set the following in your entity options: exclude_from_indexes = 'actors, tags'
Answer: C
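The "combinatorial explosion" in the question can be made concrete with a little arithmetic. The sketch below is plain Python, not the Datastore API, and it does not reproduce the index configurations from options A or C (those are not shown in this text). It assumes the usual rule that an index over several multi-valued properties needs one entry per combination of values; the `movie` entity and the `index_entries` helper are hypothetical.

```python
from itertools import product

# Hypothetical entity with two multi-valued properties.
movie = {
    "actors": ["A1", "A2", "A3"],
    "tags": ["Comedy", "Family"],
    "date_released": "2024-01-01",
}


def index_entries(entity, properties):
    """List one index entry per combination of values of the given properties."""
    value_lists = [
        value if isinstance(value, list) else [value]
        for value in (entity[name] for name in properties)
    ]
    return list(product(*value_lists))


# One index covering both multi-valued properties at once.
combined = index_entries(movie, ["actors", "tags", "date_released"])

# Two narrower indexes, one per query shape in the question.
separate = (
    index_entries(movie, ["actors", "date_released"])
    + index_entries(movie, ["tags", "date_released"])
)
```

For this entity the combined index needs 3 x 2 = 6 entries while the two narrower indexes need 3 + 2 = 5; the gap grows multiplicatively as the lists get longer.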
NEW QUESTION # 141
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow.
Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for new key features in the logs.
BigQueryIO.Read
.named("ReadLogData")
.from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read. What should you do?
- A. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
- B. Use .fromQuery operation to read specific fields from the table.
- C. Specify the TableReference object in the code.
- D. Use of both the Google BigQuery TableSchema and TableFieldSchema classes.
Answer: A
NEW QUESTION # 142
You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?
- A. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
- B. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.
- C. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
- D. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Cloud Bigtable in the last hour. If that number falls below 4000, send an alert.
Answer: D
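For reference, the "sliding time window of 1 hour every 5 minutes" mentioned in the options can be sketched in plain Python to show what a one-hour moving average looks like. This is not Dataflow or Kafka IO code; the function names, the per-minute input format, and the sample traffic are assumptions made for the example.

```python
def sliding_window_averages(counts_per_minute, window_min=60, period_min=5):
    """Return (window_end_minute, average messages per second) per full window."""
    results = []
    for end in range(window_min, len(counts_per_minute) + 1, period_min):
        window = counts_per_minute[end - window_min:end]
        avg_per_second = sum(window) / (window_min * 60)
        results.append((end, avg_per_second))
    return results


def alerts(counts_per_minute, threshold=4000):
    """Return the window end times whose moving average is below the threshold."""
    return [
        end
        for end, avg in sliding_window_averages(counts_per_minute)
        if avg < threshold
    ]


# Made-up traffic: one hour at 5000 msg/s, then one hour at 3000 msg/s.
traffic = [5000 * 60] * 60 + [3000 * 60] * 60
```

In this sample the average first dips below 4000 at the window ending at minute 95, that is, 35 minutes after the rate drops; a single fixed one-hour window would only report at the hour boundary.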
NEW QUESTION # 143
You are designing a cloud-native historical data processing system to meet the following conditions:
* The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
* A streaming data pipeline stores new data daily.
* Performance is not a factor in the solution.
* The solution design should maximize availability.
How should you design data storage for this solution?
- A. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
- B. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
- C. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
- D. Store the data in BigQuery. Access the data using the BigQuery Connector or Cloud Dataproc and Compute Engine.
Answer: A
NEW QUESTION # 144
What are two of the benefits of using denormalized data structures in BigQuery?
- A. Reduces the amount of data processed, increases query speed
- B. Reduces the amount of storage required, increases query speed
- C. Increases query speed, makes queries simpler
- D. Reduces the amount of data processed, reduces the amount of storage required
Answer: C
Explanation:
Denormalization increases query speed for tables with billions of rows because BigQuery's performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don't have to use JOINs, since all of the data has been combined into one table. Denormalization also makes queries simpler because you do not have to use JOIN clauses. Denormalization increases the amount of data processed and the amount of storage required because it creates redundant data.
https://cloud.google.com/solutions/bigquery-data-warehouse#denormalizing_data
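The trade-off described in the explanation can be seen in a toy example. The sketch below uses plain Python dictionaries rather than BigQuery tables; the `customers` and `orders` data and all function names are hypothetical.

```python
# Hypothetical normalized data: orders reference customers by id.
customers = {
    1: {"name": "Ada", "region": "EU"},
    2: {"name": "Grace", "region": "US"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25},
    {"order_id": 11, "customer_id": 2, "amount": 40},
    {"order_id": 12, "customer_id": 1, "amount": 15},
]


def denormalize(orders, customers):
    """Copy the customer fields into every order row (redundant but join-free)."""
    return [
        {
            "order_id": o["order_id"],
            "amount": o["amount"],
            "customer_name": customers[o["customer_id"]]["name"],
            "region": customers[o["customer_id"]]["region"],
        }
        for o in orders
    ]


def revenue_by_region_joined(orders, customers):
    """Normalized form: each row needs a lookup in the customer table (the JOIN)."""
    totals = {}
    for o in orders:
        region = customers[o["customer_id"]]["region"]
        totals[region] = totals.get(region, 0) + o["amount"]
    return totals


def revenue_by_region_flat(rows):
    """Denormalized form: the query reads a single table."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals


flat_orders = denormalize(orders, customers)
```

Both queries return the same totals, but the flat version needs no lookup, while the customer name "Ada" is now stored twice, which is the redundancy the explanation mentions.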
NEW QUESTION # 145
Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?
- A. Maintain a database table to store the hash value and other metadata for each data entry.
- B. Compute the hash value of each data entry, and compare it with all historical data.
- C. Assign global unique identifiers (GUID) to each data entry.
- D. Store each data entry as the primary key in a separate database and apply an index.
Answer: D
NEW QUESTION # 146
Which methods can be used to reduce the number of rows processed by BigQuery?
- A. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
- B. Splitting tables into multiple tables; putting data in partitions
- C. Putting data in partitions; using the LIMIT clause
- D. Splitting tables into multiple tables; using the LIMIT clause
Answer: B
Explanation:
If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be separated by the day. If you use the LIMIT clause, BigQuery will still process the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
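The explanation's point about LIMIT can be illustrated with a toy model. The class below is a plain-Python sketch, not BigQuery: it assumes, as the explanation states, that a partition filter restricts which rows are read while LIMIT only trims the result afterwards. `PartitionedTable` and its methods are hypothetical names.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table whose rows are grouped by day, as in a date-partitioned table."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def scan(self, days=None, limit=None):
        """Return (rows, rows_scanned); the limit is applied after scanning."""
        if days is None:
            selected = list(self.partitions.values())
        else:
            selected = [self.partitions[d] for d in days if d in self.partitions]
        rows_scanned = sum(len(part) for part in selected)
        rows = [row for part in selected for row in part]
        if limit is not None:
            rows = rows[:limit]
        return rows, rows_scanned


# Made-up data: three days with 100 rows each.
table = PartitionedTable()
for day in ("2024-01-01", "2024-01-02", "2024-01-03"):
    for i in range(100):
        table.insert(day, {"day": day, "value": i})
```

In this model `table.scan(limit=10)` still scans all 300 rows, whereas `table.scan(days=["2024-01-02"])` scans only 100.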
NEW QUESTION # 147
You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed. What should you do?
- A. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
- B. Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.
- C. Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.
- D. Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
Answer: D
NEW QUESTION # 148
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
- A. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
- B. Create a Google Cloud Dataflow job to process the data.
- C. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
- D. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
- E. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
Answer: D
Explanation:
Dataproc is used to migrate Hadoop and Spark jobs to GCP. Dataproc with the Google Cloud Storage connector keeps data available after the life of the cluster. When a job is highly I/O intensive, a small persistent disk is also needed.
NEW QUESTION # 149
You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?
- A. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
- B. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to Cloud Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Cloud Bigtable in the last hour. If that number falls below 4000, send an alert.
- C. Use Kafka Connect to link your Kafka message queue to Cloud Pub/Sub. Use a Cloud Dataflow template to write your messages from Cloud Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.
- D. Consume the stream of data in Cloud Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
Answer: B
NEW QUESTION # 150
You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud.
Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?
- A. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
- B. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
- C. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.
- D. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
Answer: D
NEW QUESTION # 151
Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service
in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If
there are any concerns about a transmission, the system re-transmits the data. How should you
deduplicate the data most efficiently?
- A. Compute the hash value of each data entry, and compare it with all historical data.
- B. Maintain a database table to store the hash value and other metadata for each data entry.
- C. Assign global unique identifiers (GUID) to each data entry.
- D. Store each data entry as the primary key in a separate database and apply an index.
Answer: B
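A minimal sketch of the hash-table approach named in option B follows. It is plain Python with an in-memory dictionary standing in for the database table. One assumption is made explicit: only the payload is hashed, not the transmission timestamp, since a re-transmission would presumably carry a new timestamp. `Deduplicator` and its methods are hypothetical names.

```python
import hashlib
import json


class Deduplicator:
    """Keeps one record per payload hash; the dict stands in for a database table."""

    def __init__(self):
        self.seen = {}  # payload hash -> metadata about the first transmission

    @staticmethod
    def payload_hash(payload):
        # Sort keys so the same fields in a different order hash identically.
        canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def accept(self, payload, transmitted_at):
        """Return True for a new entry, False for a re-transmission."""
        key = self.payload_hash(payload)
        if key in self.seen:
            return False
        self.seen[key] = {"first_seen": transmitted_at}
        return True


# Made-up transmissions: the second repeats the first payload.
dedup = Deduplicator()
first = dedup.accept({"sku": "A", "qty": 3}, "2024-01-01T00:00:00")
repeat = dedup.accept({"qty": 3, "sku": "A"}, "2024-01-01T00:05:00")
changed = dedup.accept({"sku": "A", "qty": 4}, "2024-01-01T06:00:00")
```

Each incoming entry costs one hash computation and one keyed lookup, rather than a comparison against all historical data.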
NEW QUESTION # 152
You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors.
You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?
- A. Have the data acquisition devices publish data to Cloud Pub/Sub.
- B. Establish a Cloud Interconnect between all remote data centers and Google.
- C. Write a Cloud Dataflow pipeline that aggregates all data in session windows.
- D. Deploy small Kafka clusters in your data centers to buffer events.
Answer: A
NEW QUESTION # 153
Which of the following is NOT one of the three main types of triggers that Dataflow supports?
- A. Trigger that is a combination of other triggers
- B. Trigger based on element count
- C. Trigger based on element size in bytes
- D. Trigger based on time
Answer: C
Explanation:
There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
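The three kinds of triggers can be modeled in a few lines. The sketch below is plain Python, not the Dataflow trigger API, and it simplifies one thing: the time-based trigger is only checked when an element arrives. All class and function names are hypothetical.

```python
class CountTrigger:
    """Data-driven: fires once n elements have arrived."""

    def __init__(self, n):
        self.n = n
        self.count = 0

    def on_element(self, now):
        self.count += 1
        return self.count >= self.n

    def reset(self):
        self.count = 0


class TimeTrigger:
    """Time-based: fires once delay_s seconds have passed since the first element."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.first = None

    def on_element(self, now):
        if self.first is None:
            self.first = now
        return now - self.first >= self.delay_s

    def reset(self):
        self.first = None


class AnyOf:
    """Composite: fires when any of the wrapped triggers fires."""

    def __init__(self, *triggers):
        self.triggers = triggers

    def on_element(self, now):
        # Evaluate every trigger so each one updates its own state.
        fired = [t.on_element(now) for t in self.triggers]
        return any(fired)

    def reset(self):
        for t in self.triggers:
            t.reset()


def run(trigger, events):
    """Feed (timestamp, value) events to a trigger and return the emitted panes."""
    panes, buffer = [], []
    for timestamp, value in events:
        buffer.append(value)
        if trigger.on_element(timestamp):
            panes.append(buffer)
            buffer = []
            trigger.reset()
    return panes
```

For example, `run(AnyOf(CountTrigger(3), TimeTrigger(10)), events)` emits a pane as soon as either three elements have arrived or ten seconds have elapsed. None of the three kinds looks at element size in bytes.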
NEW QUESTION # 154
You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added. What should you do to improve the performance of your application?
- A. Use Cloud Dataflow to write summary of each day's stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.
- B. Change the row key syntax in your Cloud Bigtable table to begin with a random number per second.
- C. Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
- D. Change the data pipeline to use BigQuery for storing stock trades, and update your application.
Answer: C
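Cloud Bigtable keeps rows sorted lexicographically by row key, so the leading part of the key decides which rows sit next to each other. The sketch below imitates that with a sorted Python list; it is not the Bigtable client API, and the `symbol#timestamp` key layout, the sample trades, and the helper names are assumptions made for illustration.

```python
from bisect import bisect_left


def row_key(symbol, trade_time):
    """Build a key that starts with the stock symbol, followed by the trade time."""
    return f"{symbol}#{trade_time}"


# Made-up trades arriving interleaved across symbols.
trades = [
    ("GOOG", "2024-01-02T09:30:00"),
    ("AAPL", "2024-01-02T09:30:00"),
    ("GOOG", "2024-01-02T09:31:00"),
    ("AAPL", "2024-01-02T09:31:00"),
    ("MSFT", "2024-01-02T09:30:00"),
]

# Stand-in for the table: keys kept in sorted order.
table = sorted(row_key(symbol, t) for symbol, t in trades)


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous run of keys that start with the prefix."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result
```

With the symbol first, all trades for one company form a single contiguous range that can be read with one prefix scan, and writes for different symbols land in different parts of the key space instead of all piling up at the current timestamp.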
NEW QUESTION # 155
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
- A. Continuously retrain the model on a combination of existing data and the new data.
- B. Train on the new data while using the existing data as your test set.
- C. Continuously retrain the model on just the new data.
- D. Train on the existing data while using the new data as your test set.
Answer: A
Explanation:
We have to use a combination of old and new test data as well as training data.
NEW QUESTION # 156
Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?
- A. Serialization
- B. Dimensionality Reduction
- C. Dropout Methods
- D. Threading
Answer: C
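To show what dropout does mechanically, here is a minimal sketch of "inverted" dropout in plain Python. It is not TensorFlow code; the `dropout` function, its parameters, and the sample activations are hypothetical, and a real model would use the framework's own dropout layer.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` and rescale the survivors.

    Scaling kept values by 1 / (1 - rate) keeps the expected sum unchanged,
    so nothing needs to be adjusted at inference time.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Made-up layer output: 1000 activations, all equal to 1.0.
rng = random.Random(0)
layer_output = [1.0] * 1000
dropped = dropout(layer_output, rate=0.5, rng=rng)
unchanged = dropout(layer_output, rate=0.5, rng=rng, training=False)
```

Because a different random subset of neurons is silenced on each training step, the network cannot rely on any single neuron, which is why the technique is used against the overfitting pattern described in the question.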
NEW QUESTION # 157
Your company is implementing a data warehouse using BigQuery, and you have been tasked with designing the data model. You move your on-premises sales data warehouse with a star data schema to BigQuery but notice performance issues when querying the data of the past 30 days. Based on Google's recommended practices, what should you do to speed up the query without increasing storage costs?
- A. Denormalize the data
- B. Shard the data by customer ID
- C. Materialize the dimensional data in views
- D. Partition the data by transaction date
Answer: D
NEW QUESTION # 158
......
Professional-Data-Engineer Dumps Full Questions with Free PDF Questions to Pass: https://www.prep4king.com/Professional-Data-Engineer-exam-prep-material.html
Free Google Cloud Certified Professional-Data-Engineer Official Cert Guide PDF Download: https://drive.google.com/open?id=1EgJ2dsAzC4XOySvUZc-xcAeYWwKVTh_V

