GCP Data Engineer Certification Preparation Guide
04 Dec 2021 by dzlab

I recently passed the Google Professional Data Engineer certification. During the preparation I went through a lot of resources about Google Cloud. I also read this book, but as Google updates its services very often, a lot of the information in the book has become outdated. The book is still a good read if you have little knowledge of Google Cloud services, but make sure to also read the official documentation.
In this article, I compile the resources I found most useful and accurate during my preparation, which should help anyone preparing for the exam. The exam itself is not very tough, but most of the questions are ambiguous, so you need to be well prepared. You can learn more about the certification on the official page - link.
Storage
The Data Engineer role is all about data, hence the exam's focus on the storage technologies Google Cloud provides.
BigQuery
BigQuery takes a big chunk of the exam; a lot of questions are about how to design your tables, optimize performance, migrate data into BigQuery, and how to use other Google Cloud resources along with BigQuery.
Overview
- Moving BigQuery data between locations - link
- Partitioning tables. Based on what are they partitioned (ingestion time, timestamp, date)? How are they named? How are they then accessed in queries using `_PARTITIONTIME`? - link (see the sketch at the end of this subsection)
- How to cluster partitioned tables - link
- User defined functions in BigQuery - link
- Geospatial data analytics in BigQuery - link
- Accessing historical data using time travel - link
- How to manage BigQuery flat-rate slots within a project - link
- SQL `MERGE` examples - link
- BigQuery. Know what a federated table is. While you are at it, learn also about clustered tables. - link
- BigQuery + GCS. Know how to link tables between GCS and BigQuery as permanent tables and temporary tables. - link
- BigQuery query plan. BigQuery allows you to see the query plan and execution profile for queries that you run. Know the phases, difference between average and max time, why there can be skew in the plan, and how to optimize for it. - link
- Table date ranges in bq. Accessing tables with dates and partitioned tables with functions like `TABLE_DATE_RANGE`, `_TABLE_SUFFIX`, `TABLE_QUERY`. - link
- Syntax for wildcards in BigQuery table names. And in legacy SQL? - link
Pseudo columns: `_TABLE_SUFFIX` contains the values matched by the table wildcard:

```sql
#standardSQL
SELECT
  max,
  ROUND((max-32)*5/9,1) celsius,
  mo,
  da,
  year
FROM
  `bigquery-public-data.noaa_gsod.gsod194*`
WHERE
  max != 9999.9
  AND (_TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4')
ORDER BY
  max DESC
```
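Partitioning, clustering, and time travel come together nicely in code. Here is a minimal sketch using the `google-cloud-bigquery` Python client; the project, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up the default project from the environment

# Hypothetical table id; replace with your own project/dataset.
table_id = "my-project.my_dataset.events"

# A table partitioned by a DATE column and clustered on two columns.
schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("country", "STRING"),
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # omit `field` to partition by ingestion time instead
)
table.clustering_fields = ["user_id", "country"]
client.create_table(table)

# Ingestion-time partitioned tables are pruned by filtering on the
# _PARTITIONTIME pseudo column (illustration only, not executed here):
pruning_query = """
    SELECT user_id, country
    FROM `my-project.my_dataset.ingested_events`
    WHERE _PARTITIONTIME = TIMESTAMP('2021-12-01')
"""

# Time travel: read the table as it was one hour ago
# (assumes the table has existed for at least an hour).
snapshot_query = """
    SELECT COUNT(*) AS n
    FROM `my-project.my_dataset.events`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(snapshot_query).result():
    print(row.n)
```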
Security
Data Transfer
- BigQuery Data Transfer Service - link
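The same service also powers scheduled queries. A minimal sketch assuming the `google-cloud-bigquery-datatransfer` Python client, with made-up project and dataset names:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
project_id = "my-project"  # hypothetical

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",  # hypothetical dataset
    display_name="daily-aggregation",
    data_source_id="scheduled_query",  # others: "google_cloud_storage", "amazon_s3", ...
    params={
        "query": "SELECT CURRENT_DATE() AS run_date",
        "destination_table_name_template": "daily_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)
```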
BigQuery ML
Cloud Spanner
- Transaction support in Cloud Spanner - link
- Schema design best practices - link
- Secondary index for cloud spanner. How indexes are created for you and how you can create secondary indexes. - link
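To make the secondary-index bullet concrete, here is a minimal sketch with the `google-cloud-spanner` Python client; the instance, database, table, and index names are all hypothetical:

```python
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")    # hypothetical
database = instance.database("my-database")  # hypothetical

# Adding a secondary index is a schema update (a long-running operation).
operation = database.update_ddl(
    ["CREATE INDEX SingersByLastName ON Singers(LastName)"]
)
operation.result()  # wait for the schema change to complete

# Spanner will not always pick the index on its own; you can force it.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT SingerId, LastName "
        "FROM Singers@{FORCE_INDEX=SingersByLastName} "
        "WHERE LastName = @name",
        params={"name": "Smith"},
        param_types={"name": spanner.param_types.STRING},
    )
    for row in rows:
        print(row)
```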
Cloud Bigtable
- Understand architecture of bigtable - link
- key reasons for high performance and ways to optimize - link
- Know Key Visualiser - link
- Know when to scale BigTable - link
- Know performant key/schema design: row key scheme. What are the recommended ways for creating the row key? How do you avoid hotspotting? Should you use a timestamp, and where? - link (see the sketch after this list)
- Scaling up BigTable - link
  - If you need to double your reads for a prolonged period, what can you do to guarantee the same read latency?
  - Dev to Prod cluster promotion
  - HDD to SSD data migration
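For the row-key bullet above, a minimal sketch assuming the `google-cloud-bigtable` Python client and made-up instance/table names. It shows one common trick to avoid hotspotting: lead the key with a well-distributed field and put a reversed timestamp last, rather than starting the key with a plain timestamp (which funnels all writes to one node):

```python
import datetime
import sys

from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)  # hypothetical project
instance = client.instance("my-instance")                   # hypothetical instance
table = instance.table("sensor-metrics")  # assumed to exist with a "metrics" family

# Anti-pattern: row key = timestamp; all writes land on the same tablet.
# Better: device id first, reversed timestamp last, so writes spread across
# devices and the most recent reading sorts first for a given device.
device_id = "sensor-42"
now = datetime.datetime.utcnow()
reversed_ts = sys.maxsize - int(now.timestamp() * 1000)
row_key = f"device#{device_id}#{reversed_ts}".encode()

row = table.direct_row(row_key)
row.set_cell("metrics", "temperature", b"21.5", timestamp=now)
row.commit()
```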
Cloud Datastore
- How to use multiple indexes for Datastore. Default indexes. Syntax for creating custom, composite indexes. - link (see the sketch after this list)
- Export and import entities - link
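As a sketch of why composite indexes matter (assuming the `google-cloud-datastore` Python client and hypothetical kind/property names): filtering on a single property is served by the built-in indexes, but combining an equality filter with an inequality or sort on another property requires a composite index declared in `index.yaml`:

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")  # hypothetical project

# Served by the built-in (default) single-property indexes.
query = client.query(kind="Task")
query.add_filter("done", "=", False)

# Needs a composite index on (done, priority), declared in index.yaml and
# deployed with `gcloud datastore indexes create index.yaml`.
query = client.query(kind="Task")
query.add_filter("done", "=", False)
query.add_filter("priority", ">=", 4)
query.order = ["priority"]

for task in query.fetch(limit=10):
    print(task)
```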
Pub/Sub
- Migrate from Kafka to PubSub - link
- Know potential reasons for PubSub ingesting applications being busier than initially planned
- What PubSub metrics are available in Stackdriver and how to debug producers/consumers - link
- Ordering messages - link
- Dealing with duplicate messages - link (see the sketch after this list)
- Monitoring - link
- At-Least-Once delivery - link
- Replay - link
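Ordering, duplicates, and at-least-once delivery all surface directly in the client API. A minimal sketch with the `google-cloud-pubsub` Python client and hypothetical topic/subscription names; since delivery is at-least-once, the subscriber should be idempotent, e.g. by deduplicating on a stable id:

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # hypothetical

# Publisher: ordering keys require explicitly enabling message ordering
# (the subscription must be created with ordering enabled as well).
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path(project_id, "orders")
future = publisher.publish(
    topic_path, b'{"order_id": "1234"}', ordering_key="customer-42"
)
print(future.result())  # server-assigned message id

# Subscriber: messages may be redelivered, so process them idempotently.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "orders-sub")

seen_ids = set()  # toy dedup store; use a database in a real pipeline

def callback(message):
    if message.message_id in seen_ids:
        message.ack()  # duplicate redelivery, safe to drop
        return
    seen_ids.add(message.message_id)
    print("processing", message.data)
    message.ack()  # unacked messages are redelivered after the ack deadline

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
```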
Data migrations
- Know when to use Data Transfer Appliance. Hint - slow network, huge dataset, no in-between refreshes. - link
- When to use Transfer Service and what are its limitations. - link
- Know the cost of storage and availability for various products: BigQuery, BigTable, Cloud SQL, GCS to be able to find the cheapest product for a set of availability/durability criteria.
- How does Dedicated Interconnect impact your data transfer decisions? - link
- How to continuously sync data between on-prem and GCP - link
Processing
Data processing also takes a big part of the exam; good knowledge of Dataflow/Beam operators may be required, and to a lesser extent Dataproc/Hadoop/Spark.
Cloud Dataflow
- Understand Apache Beam building blocks - Pipeline, PCollection, PTransform, ParDo - link (see the sketch after this list)
- Know Side Inputs - link
- Exactly once processing of PubSub messages - link
- Handling invalid inputs - link
- Templates https://cloud.google.com/dataflow/docs/guides/templates/overview
- Dataflow developer mode. https://cloud.google.com/dataflow/docs/concepts/access-control
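A minimal Beam sketch tying the building blocks together: a `Pipeline`, a couple of `PCollection`s, a `ParDo`, and a side input. It runs locally on the DirectRunner; the names and data are made up:

```python
import apache_beam as beam

# Side input: a small lookup from country code to country name.
countries = [("fr", "France"), ("de", "Germany")]

class ExpandCountry(beam.DoFn):
    """A ParDo that enriches each element using a dict side input."""

    def process(self, element, country_names):
        code, amount = element
        yield {"country": country_names.get(code, "unknown"), "amount": amount}

with beam.Pipeline() as pipeline:  # DirectRunner by default
    lookup = pipeline | "CreateLookup" >> beam.Create(countries)
    sales = pipeline | "CreateSales" >> beam.Create([("fr", 10), ("de", 7), ("es", 3)])

    enriched = sales | "Enrich" >> beam.ParDo(
        ExpandCountry(), country_names=beam.pvalue.AsDict(lookup)
    )
    enriched | "Print" >> beam.Map(print)
```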
Cloud Dataproc
- Preemptible workers - link
- Scaling clusters - link
- Cloud Storage connector: using GCS instead of the cluster's file system. It is a best practice to use Google Cloud Storage instead of HDFS: you can then destroy the compute nodes after data crunching and save their cost (see the sketch after this list).
- https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
- Dataproc: how to control scaling? Configure autoscaling? https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
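To illustrate the Cloud Storage connector point: the connector is pre-installed on Dataproc, so Spark jobs address GCS through `gs://` paths exactly like HDFS ones, and the cluster keeps no precious state. A short PySpark sketch with a hypothetical bucket:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

# Read from and write to Cloud Storage via gs:// URIs; nothing lives on HDFS,
# so the cluster can be deleted after the job without losing any data.
df = spark.read.json("gs://my-bucket/raw/events/*.json")  # hypothetical bucket
daily = df.groupBy("event_date").count()
daily.write.parquet("gs://my-bucket/aggregates/daily/")

spark.stop()
```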
Machine Learning
There were only a few ML questions, but knowing AI Platform (now called Vertex AI) may save you from surprises.
ML Concepts
- Feature crosses. Learn what these are and what issues they solve. - link
- Go through the Coursera course on machine learning. - link
- Dealing with overfitting. - link
- Regularization. What does it mean to increase or decrease regularization? - link
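For the regularization bullet, a tiny Keras sketch (assuming TensorFlow; the model and factor are illustrative). Increasing the L2 factor penalizes large weights more and pushes the model toward simpler functions, fighting overfitting; decreasing it lets the model fit the training data more closely:

```python
import tensorflow as tf

l2_factor = 0.01  # increase to reduce overfitting, decrease if underfitting

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(l2_factor),
    ),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```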
Cloud AutoML
- AutoML Vision Beginner’s guide - link
Cloud Data Loss Prevention (DLP)
- Supported cryptographic methods in Cloud DLP (only read this section) - link
Kubeflow
- Kubeflow - link
Edge TPU
- Edge TPU - link
Cloud services
The remainder of the exam may touch on the following services; you need to at least read the overview of each one:
Cloud Composer
- Overview - link
Data Catalog
- Overview - link
Cloud Dataprep
- Overview - link
- Dataprep: jobs. How are Dataprep jobs created and run? What permissions do you need? A term I saw was that this is a more ‘casual’ way of data cleaning, since Dataproc/Dataflow require programming knowledge. - link
Data Studio
- DataStudio: visualisation. What are the causes of stale data? And how do you get the latest? What caching options do you need to set?
- BigQuery+DataStudio — caching/pre-fetch cache. Learn how you connect DataStudio to storage solutions. Learn the difference between default caching (which cannot be disabled) and pre-fetch caching (which can be disabled). What is the difference between doing that with Viewer credentials and Owner credentials? - link
IAM
- How to allow cross team data access to BigQuery and GCS in a large organisation
- Key Management Service. Using KMS with non-GCP products. Note that there is default encryption where Google manages all the keys; then there are customer-managed encryption keys (CMEK), and also customer-supplied encryption keys (CSEK). - link (see the sketch below)
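To make the CMEK option concrete, here is a minimal sketch that writes a BigQuery query result into a table protected by a Cloud KMS key, using the Python client. The key path and table are hypothetical, and the key must already exist with BigQuery's service account granted the encrypter/decrypter role:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical KMS key; BigQuery's service account needs
# roles/cloudkms.cryptoKeyEncrypterDecrypter on it.
kms_key_name = (
    "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"
)

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.protected_table",
    destination_encryption_configuration=bigquery.EncryptionConfiguration(
        kms_key_name=kms_key_name
    ),
)
client.query("SELECT 1 AS x", job_config=job_config).result()
# The destination table is now encrypted with the customer-managed key.
```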