GCP Data Engineer Certification Preparation Guide

04 Dec 2021 by dzlab Professional Data Engineer Certification

I recently passed Google Professional Data Engineer Certification, during the preparation I went throught lot resources about Google Cloud. I also read this book but as Google update its services very often lot of the information in the book become out dated. The book is still a good read if you have little knowledge on Google Cloud services but make sure to also read the official documentation.

In this article, I compile the different resources I found most usefull/accurate during the preparation for the exam which can be useful to someone preparing for the exam. The exam itself is not very tough, although most of the questions are ambiguous and hence you need to be well prepared. You can learn more about the certification in the official page - link.

Storage

Data Engineer role is all about data, hence the focus on storage technologies Google Cloud provides.

BigQuery

BigQuery takes big chunk of the exam, lot questions are around how to design you table, optimize performance, migrated data into bigquery and how to use other Google cloud resources along with BigQuery.

Overview

Moving BigQuery data between locations - link
Partitioning tables. Based on what are they partitioned — ingestion time, timestamp, date. How are they named? How are they then accessed in queries? Using _PARTITIONTIME. - link
How to cluster parititioned tables - link
User defined functions in BigQuery - link
Geospatial data analytics in BigQuery - link
Accessing historical data using time travel - link
How to manage BigQuery flat-rate slots within a project - link
SQL MERGE examples - link
Bigquery. Know what a federated table is. While you are at it, learn also about clustered tables. - link
BigQuery + GCS. Know how to link tables between GCS and BigQuery as permanent tables and temporary tables. - link
BigQuery query plan. BigQuery allows you to see the query plan and execution profile for queries that you run. Know the phases, difference between average and max time, why there can be skew in the plan, and how to optimize for it. - link
table date range for bq. Accessing tables with dates and partitioned tables with functions like TABLE_DATE_RANGE, _TABLE_SUFFIX, TABLE_QUERY. - link
Syntax for wildcards in big query names. And in legacy SQL? - link

Pseudo columns

_TABLE_SUFFIX contains the values matched by the table wildcard

#standardSQL
SELECT max, ROUND((max-32)*5/9,1) celsius, mo, da, year
FROM `bigquery-public-data.noaa_gsod.gsod194*`
WHERE max != 9999.9 AND ( _TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4' )
ORDER BY max DESC

Security

Basic roles for datasets - link
Access at Table level - link

Data Transfer

BigQuery Data Transfer Service - link

BigQuery ML

Overview - link
Data transformations with BigQuery ML - link

Cloud Spanner

Transaction support in Cloud Spanner - link
Schema design best practices - link
Secondary index for cloud spanner. How indexes are created for you and how you can create secondary indexes. - link

Cloud Bigtable

Understand architecture of bigtable - link
key reasons for high performance and ways to optimize - link
Know Key Visualiser - link
Know when to scale BigTable - link
Know performant key/schema design: row key scheme. What are the recommended ways for creating the row key? How do you avoid hotspotting? Should you use timestamp, and where? - link
Scaling up BigTable - link
If you need to double your reads for a prolonged period, what can you do to guarantee the same read latency?
Dev to Prod cluster promotion
HDD to SSD data migration

Cloud Datastore

How to use multiple indexes for datastore. Default indexes. Syntax for creating custom, composite indexes. - link
Export and import entities - link

Pub/Sub

Migrate from Kafka to PubSub - link
Know potential reasons for PubSub ingesting applications being busier than initially planned
What PubSub metrics are available in Stackdriver and how to debug producers/consumers - link
Ordering messages - link
Dealing with duplicate messages - link
Monitoring - link
At-Least-Once delivery - link
Replay - link

Data migrations

Know when to use Data Transfer Appliance. Hint - slow network, huge dataset, no in-between refreshes. - link
When to use Transfer Service and what are its limitations. - link
Know the cost of storage and availability for various products: BigQuery, BigTable, Cloud SQL, GCS to be able to find the cheapest product for a set of availability/durability criteria.
How Dedicated Interconnect impacts your data transfer decisions? - link
How to continuously sync data between on-prem and GCP - link

Processing

Data processing also takes big part of the exam, good knowledge of Dataflow/Beam operators may be required, and less for Dataproc/Hadoop/Spark.

Cloud Dataflow

Understand Apache Beam building blocks - Pipeline, PCollection, PTransform, ParDO - link
Know Side Inputs - link
Exactly once processing of PubSub messages - link
Handling invalid inputs - link
Templates https://cloud.google.com/dataflow/docs/guides/templates/overview
Dataflow developer mode. https://cloud.google.com/dataflow/docs/concepts/access-control

Cloud Dataproc

Preemptible workers - link
Scaling clusters - link
Cloud Storage connector: usage of gcs instead of existing file system. It is a best practice to use Google Cloud Storage instead of using HDFS. You can destroy the compute nodes after data crunching and save cost on them.
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
Dataproc: how to control scaling? Configure autoscaling? https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling

Machine Learning

There was few ML questions, but AI platform (now called vertext) may save you from surprises.

ML Concepts

feature crosses. Learn what these are and what issues it solves. - link
Go through the Coursera course on machine learning. - link
Dealing with overfitting. - link
Regularization. What does it mean to increase or decrease regularization? - link

Cloud AutoML

AutoML Vision Beginner’s guide - link

Cloud Data Loss Prevention (DLP)

Supported cryptographic methods in Cloud DLP (only read this section) - link

Kubeflow

Kubeflow - link

Edge TPU

Edge TPU - link

Cloud services

The remaining of the exam can touch more or less the following services, you need to at least read the overview of each one of the following services:

Cloud Composer

Overview - link

Data Catalog

Overview - link

Cloud Dataprep

Overview - link
Dataprep: jobs. How are Dataprep jobs created and run? What permissions do you need? A term I saw was that this is a more ‘casual’ way of data cleaning. As Dataproc/Dataflow requires programming knowledge - link

Data Studio

DataStudio: visualisation. What are the causes of stale data? And how do you get the latest? What caching options do you need to set?
BigQuery+DataStudio — caching/pre-fetch cache. Learn how you connect DataStudio to storage solutions. Learn the difference between default caching (which cannot be disabled) and pre-fetch caching (which can be disabled). What is the difference between doing that with Viewer credentials and Owner credentials. - link

IAM

How to allow cross team data access to BigQuery and GCS in a large organisation
Key Management Service. Using KMS with non-GCP products. Note that there is a default key management where Google manages all the keys, then there is a customer managed encryption keys, and also a customer supplied encryption keys. - link

All things

GCP Data Engineer Certification Preparation Guide

Storage

BigQuery

Cloud Spanner

Cloud Bigtable

Cloud Datastore

Pub/Sub

Data migrations

Processing

Cloud Dataflow

Cloud Dataproc

Machine Learning

ML Concepts

Cloud AutoML

Cloud Data Loss Prevention (DLP)

Kubeflow

Edge TPU

Cloud services

Cloud Composer

Data Catalog

Cloud Dataprep

Data Studio

IAM

Other topics

All things

GCP Data Engineer Certification Preparation Guide

Storage

BigQuery

Cloud Spanner

Cloud Bigtable

Cloud Datastore

Pub/Sub

Data migrations

Processing

Cloud Dataflow

Cloud Dataproc

Machine Learning

ML Concepts

Cloud AutoML

Cloud Data Loss Prevention (DLP)

Kubeflow

Edge TPU

Cloud services

Cloud Composer

Data Catalog

Cloud Dataprep

Data Studio

IAM

Other topics

Related Posts

07 Jun 2025

Reverse Engineering Zed's AI Coding Assistant with mitmproxy 07 Jun 2025

Advanced Retrieval Techniques to Supercharge Your RAG 24 May 2025