GCP Data Engineer Certification Preparation Guide

04 Dec 2021 by dzlab
I recently passed the Google Professional Data Engineer certification. During my preparation I went through a lot of resources about Google Cloud. I also read this book, but as Google updates its services very often, a lot of the information in the book has become outdated. The book is still a good read if you have little knowledge of Google Cloud services, but make sure to also read the official documentation.
In this article, I compile the resources I found most useful and accurate while preparing, which should help anyone else getting ready for the exam. The exam itself is not very tough, but many of the questions are ambiguous, so you need to be well prepared. You can learn more about the certification on the official page - link.
The Data Engineer role is all about data, hence the focus on the storage technologies Google Cloud provides.
BigQuery takes a big chunk of the exam; many questions are about how to design your tables, optimize performance, migrate data into BigQuery, and use other Google Cloud services along with BigQuery.
- Moving BigQuery data between locations - link
- Partitioning tables. What can they be partitioned on (ingestion time, timestamp, date)? How are partitions named, and how are they accessed in queries using the _PARTITIONTIME pseudo-column? - link
- How to cluster partitioned tables - link
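As a quick illustration (all table and column names below are made up), a standard SQL filter on the _PARTITIONTIME pseudo-column prunes an ingestion-time partitioned table, and CREATE TABLE can combine partitioning with clustering:

```sql
-- Prune an ingestion-time partitioned table by filtering on _PARTITIONTIME:
SELECT *
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2021-12-01') AND TIMESTAMP('2021-12-02');

-- A date-partitioned table clustered on a high-cardinality column:
CREATE TABLE `my_project.my_dataset.events_clustered`
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id AS
SELECT * FROM `my_project.my_dataset.events`;
```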
- User defined functions in BigQuery - link
- Geospatial data analytics in BigQuery - link
- Accessing historical data using time travel - link
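Time travel is exposed through the FOR SYSTEM_TIME AS OF clause; a minimal sketch with a hypothetical table name (history is limited to the past 7 days):

```sql
-- Read the table as it looked one hour ago:
SELECT *
FROM `my_project.my_dataset.events`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```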
- How to manage BigQuery flat-rate slots within a project - link
- MERGE examples - link
- BigQuery. Know what a federated table is. While you are at it, learn also about clustered tables. - link
- BigQuery + GCS. Know how to link tables between GCS and BigQuery as permanent tables and temporary tables. - link
- BigQuery query plan. BigQuery allows you to see the query plan and execution profile for queries that you run. Know the phases, difference between average and max time, why there can be skew in the plan, and how to optimize for it. - link
- Table date ranges in bq. Accessing tables with dates and partitioned tables with functions like TABLE_QUERY. - link
- Syntax for wildcards in BigQuery table names. And in legacy SQL? - link
- _TABLE_SUFFIX contains the value matched by the table wildcard:

```sql
#standardSQL
SELECT max, ROUND((max-32)*5/9,1) celsius, mo, da, year
FROM `bigquery-public-data.noaa_gsod.gsod194*`
WHERE max != 9999.9
  AND (_TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4')
ORDER BY max DESC
```
- BigQuery Data Transfer Service - link
- Transaction support in Cloud Spanner - link
- Schema design best practices - link
- Secondary indexes for Cloud Spanner. How indexes are created for you and how you can create secondary indexes. - link
- Understand the architecture of Bigtable - link
- Key reasons for Bigtable's high performance and ways to optimize it - link
- Know the Key Visualizer - link
- Know when to scale BigTable - link
- Know performant key/schema design: row key scheme. What are the recommended ways for creating the row key? How do you avoid hotspotting? Should you use timestamp, and where? - link
- Scaling up BigTable - link
- If you need to double your reads for a prolonged period, what can you do to guarantee the same read latency?
- Dev to Prod cluster promotion
- HDD to SSD data migration
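On row-key design: a key that starts with a timestamp sends all writes to one node (hotspotting). Two commonly recommended mitigations are field promotion (lead with a high-cardinality field such as a device ID) and reversing the timestamp so the newest row per entity sorts first. The helper below is a hypothetical plain-Python sketch of that key scheme, not real Bigtable API:

```python
import sys

MAX_TS = sys.maxsize  # assumed upper bound for the reversed-timestamp trick


def row_key(device_id: str, ts: int) -> str:
    # Promote device_id ahead of the time component to spread writes across
    # nodes, and reverse the timestamp so the latest event for a device
    # sorts first in a lexicographic scan.
    return f"{device_id}#{MAX_TS - ts}"


k_old = row_key("sensor-42", 1000)
k_new = row_key("sensor-42", 2000)
# The later event (ts=2000) sorts before the earlier one for the same device.
assert k_new < k_old
```

A prefix scan on `sensor-42#` then returns that device's rows newest-first, without every writer hammering the same tablet.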
- How to use multiple indexes for datastore. Default indexes. Syntax for creating custom, composite indexes. - link
- Export and import entities - link
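Composite indexes are declared in an index.yaml file deployed alongside the app; a minimal sketch with a hypothetical Task kind:

```yaml
indexes:
# Composite index for queries that filter Task on `done`
# and sort by `created` descending.
- kind: Task
  properties:
  - name: done
  - name: created
    direction: desc
```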
- Migrate from Kafka to PubSub - link
- Know the potential reasons a PubSub ingesting application can end up busier than initially planned
- What PubSub metrics are available in Stackdriver and how to debug producers/consumers - link
- Ordering messages - link
- Dealing with duplicate messages - link
- Monitoring - link
- At-Least-Once delivery - link
- Replay - link
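Because Pub/Sub is at-least-once, a subscriber can receive the same message twice (for example after an ack deadline expires). The standard remedy is idempotent processing keyed on a stable message ID. A minimal sketch, with a hypothetical consumer class and an in-memory seen-set standing in for a durable store:

```python
class DedupingConsumer:
    """Drops redelivered messages by tracking message IDs already handled."""

    def __init__(self):
        self.seen = set()    # in production this would be a durable store
        self.processed = []

    def handle(self, message_id: str, payload: str) -> bool:
        if message_id in self.seen:
            return False     # duplicate delivery: ack it and drop it
        self.seen.add(message_id)
        self.processed.append(payload)
        return True


c = DedupingConsumer()
assert c.handle("m1", "hello") is True
assert c.handle("m1", "hello") is False   # redelivery is ignored
assert c.processed == ["hello"]
```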
- Know when to use Data Transfer Appliance. Hint - slow network, huge dataset, no in-between refreshes. - link
- When to use Transfer Service and what are its limitations. - link
- Know the cost of storage and the availability for various products: BigQuery, Bigtable, Cloud SQL, GCS, so you can find the cheapest product for a given set of availability/durability criteria.
- How does Dedicated Interconnect impact your data transfer decisions? - link
- How to continuously sync data between on-prem and GCP - link
Data processing also takes a big part of the exam; good knowledge of Dataflow/Beam operators may be required, less so for Dataproc/Hadoop/Spark.
- Understand Apache Beam building blocks - Pipeline, PCollection, PTransform, ParDo - link
- Know Side Inputs - link
- Exactly once processing of PubSub messages - link
- Handling invalid inputs - link
- Templates https://cloud.google.com/dataflow/docs/guides/templates/overview
- Dataflow developer mode. https://cloud.google.com/dataflow/docs/concepts/access-control
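For handling invalid inputs, the documented Beam pattern is to catch the parse failure inside the DoFn and route the bad record to a separate "dead letter" output instead of crashing the pipeline. The plain-Python sketch below shows only the routing idea (the function name is made up; it is not real Beam API):

```python
import json


def process(records):
    # Split records into parsed output and a dead-letter collection,
    # mirroring Beam's main-output / side-output pattern.
    good, dead_letter = [], []
    for raw in records:
        try:
            good.append(json.loads(raw))
        except json.JSONDecodeError:
            dead_letter.append(raw)  # keep the raw record for later inspection
    return good, dead_letter


good, bad = process(['{"id": 1}', 'not-json'])
assert good == [{"id": 1}]
assert bad == ["not-json"]
```

The dead-letter side would typically be written somewhere queryable (e.g. a BigQuery table) so bad inputs can be inspected and replayed.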
- Preemptible workers - link
- Scaling clusters - link
- Cloud Storage connector: using GCS instead of the cluster file system. It is a best practice to use Google Cloud Storage instead of HDFS: you can destroy the compute nodes after data crunching and save their cost.
- Dataproc: how to control scaling? Configure autoscaling? https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
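An autoscaling policy is a YAML document you import with gcloud dataproc autoscaling-policies; the values below are illustrative only, a sketch of the documented shape:

```yaml
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20            # secondary (preemptible) workers scale separately
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.5        # fraction of pending YARN memory to act on
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
```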
There were a few ML questions, but knowing AI Platform (now called Vertex AI) may save you from surprises.
- Feature crosses. Learn what they are and what issues they solve. - link
- Go through the Coursera course on machine learning. - link
- Dealing with overfitting. - link
- Regularization. What does it mean to increase or decrease regularization? - link
- AutoML Vision Beginner’s guide - link
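On what it means to turn the regularization knob: increasing L2 regularization adds a lambda * w^2 penalty to the loss, which pushes weights toward zero and combats overfitting. A single gradient step on one weight, with made-up numbers, shows the effect:

```python
def step(w, grad_loss, lam, lr=0.1):
    # Gradient of (loss + lam * w**2) w.r.t. w is grad_loss + 2 * lam * w,
    # so the penalty term shrinks the weight on every update.
    return w - lr * (grad_loss + 2 * lam * w)


w_no_reg = step(w=1.0, grad_loss=0.0, lam=0.0)
w_reg = step(w=1.0, grad_loss=0.0, lam=0.5)
assert w_no_reg == 1.0   # no penalty: the weight is untouched
assert w_reg < w_no_reg  # stronger regularization shrinks the weight
```

Decreasing regularization does the opposite: the model fits the training data more closely, at the risk of overfitting.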
Cloud Data Loss Prevention (DLP)
- Supported cryptographic methods in Cloud DLP (only read this section) - link
- Kubeflow - link
- Edge TPU - link
The remainder of the exam touches, more or less, on the following services; you should at least read the overview of each:
- Overview - link
- Overview - link
- Overview - link
- Dataprep: jobs. How are Dataprep jobs created and run? What permissions do you need? Dataprep is described as a more 'casual' way of cleaning data, since Dataproc/Dataflow require programming knowledge. - link
- DataStudio: visualisation. What are the causes of stale data? And how do you get the latest? What caching options do you need to set?
- BigQuery+DataStudio — caching/pre-fetch cache. Learn how you connect DataStudio to storage solutions. Learn the difference between default caching (which cannot be disabled) and pre-fetch caching (which can be disabled). What is the difference between doing that with Viewer credentials and Owner credentials? - link
- How to allow cross team data access to BigQuery and GCS in a large organisation
- Key Management Service. Using KMS with non-GCP products. Note that there is default key management where Google manages all the keys, then there are customer-managed encryption keys (CMEK), and also customer-supplied encryption keys (CSEK). - link