Google Cloud – Professional Data Engineer Certification Exam Notes

Google Cloud Data Engineer Study Guide

Section 1: Designing Data Processing Systems (~22%)

1.1 Designing for Security and Compliance

  • Identity and Access Management (e.g., Cloud IAM and organization policies)
  • Data security (encryption and key management)
  • Privacy (e.g., personally identifiable information, and Cloud Data Loss Prevention API)
  • Regional considerations (data sovereignty) for data access and storage
  • Legal and regulatory compliance
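
To make the privacy item above concrete, here is a minimal sketch of de-identifying text before it is stored. It uses a plain regular expression as a stand-in and is not the Cloud Data Loss Prevention API; the function name `redact_emails` and the `[EMAIL]` placeholder are assumptions made for illustration only.

```python
import re

# Simplified email pattern for illustration; a managed inspection
# service would cover far more PII types than this single regex.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace every email-like substring in text with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    print(redact_emails("Contact jane.doe@example.com for access."))
```

The point for the exam is the pattern (inspect, then mask or tokenize before data lands in analytics storage), not this particular regex.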

1.2 Designing for Reliability and Fidelity

  • Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)
  • Monitoring and orchestration of data pipelines
  • Disaster recovery and fault tolerance
  • Making decisions related to ACID compliance and availability
  • Data validation
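
The data validation item above can be sketched as a small rule check that routes bad records to a reject (dead-letter) list instead of failing the whole pipeline. The field names `trip_id` and `distance_km` and the rules themselves are hypothetical; a real pipeline would take its rules from the actual schema.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical rules: field name -> predicate returning True for a valid value.
RULES: Dict[str, Callable[[Any], bool]] = {
    "trip_id": lambda v: isinstance(v, str) and v != "",
    "distance_km": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, is_valid in RULES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not is_valid(record[field]):
            problems.append(f"invalid value for {field}: {record[field]!r}")
    return problems


def split_valid_invalid(
    records: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], List[str]]]]:
    """Separate clean records from rejected ones, keeping the reject reasons."""
    valid, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected
```

Keeping the reject reasons alongside each bad record makes it possible to monitor data quality and replay records after a fix.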

1.3 Designing for Flexibility and Portability

  • Mapping current and future business requirements to the architecture
  • Designing for data and application portability (multi-cloud, data residency)
  • Data staging, cataloging, and discovery (data governance)

1.4 Designing Data Migrations

  • Analyzing current stakeholder needs and planning for desired state
  • Planning migration to Google Cloud (e.g., BigQuery, Datastream)
  • Designing migration validation strategy
  • Designing dataset and table architecture for proper governance
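
One common way to approach the migration validation item above is to compare row counts and an order-independent checksum between the source and the migrated table. The sketch below assumes both sides can be read as sequences of tuples with identical value types; the function names are hypothetical and this is not a feature of any specific Google Cloud tool.

```python
import hashlib
from typing import Dict, Iterable, Sequence, Tuple

_MODULUS = 1 << 256  # keep the running sum within the SHA-256 digest size


def table_fingerprint(rows: Iterable[Sequence]) -> Tuple[int, int]:
    """Return (row_count, checksum); the checksum ignores row order."""
    count = 0
    combined = 0
    for row in rows:
        # repr() is type-sensitive: 1 and 1.0 hash differently, so both
        # sides must be normalized to the same types before comparing.
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        # Addition (not XOR) so that duplicated rows do not cancel out.
        combined = (combined + int.from_bytes(digest, "big")) % _MODULUS
        count += 1
    return count, combined


def validate_migration(source_rows, target_rows) -> Dict[str, bool]:
    """Compare source and target fingerprints and report each check."""
    src_count, src_sum = table_fingerprint(source_rows)
    tgt_count, tgt_sum = table_fingerprint(target_rows)
    return {
        "row_count_match": src_count == tgt_count,
        "checksum_match": src_sum == tgt_sum,
    }
```

For example, `validate_migration([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])` reports both checks as matching because row order is ignored.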

Important points

  • BigQuery is Google's fully managed, NoOps, low-cost analytics database. You can query terabytes of data without managing any infrastructure or needing a database administrator. BigQuery uses SQL and offers pay-as-you-go pricing, so you can focus on analyzing data to find meaningful insights.

  • Sample command to load data from Cloud Storage into BigQuery:

      bq load \
        --source_format=CSV \
        --autodetect \
        --noreplace \
        nyctaxi.2018trips \
        gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv

  • Datastream for BigQuery replicates data from operational databases such as AlloyDB, MySQL, PostgreSQL, and Oracle directly into BigQuery, Google Cloud's serverless data warehouse. Because Datastream is serverless and auto-scaling, you can set up an ELT (Extract, Load, Transform) pipeline for low-latency replication that enables near-real-time insights.
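
The replication idea behind the Datastream item above is change data capture: a stream of insert, update, and delete events is applied in order to keep a replica in sync with the source. The sketch below is conceptual only. The event shape (`op`, `key`, `row`) is an assumption for illustration and is not Datastream's actual event format or BigQuery's merge logic.

```python
from typing import Any, Dict, List


def apply_changes(
    current: Dict[Any, Dict[str, Any]],
    events: List[Dict[str, Any]],
) -> Dict[Any, Dict[str, Any]]:
    """Apply ordered change events to a keyed replica and return the new state."""
    replica = dict(current)  # shallow copy so the caller's mapping is untouched
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("INSERT", "UPDATE"):
            # Upsert: the latest row image for a key wins.
            replica[key] = event["row"]
        elif op == "DELETE":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica
```

Applying events in source commit order matters: an UPDATE followed by a DELETE for the same key must leave the key absent, not present.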

Data services in clouds

Cloud Service Equivalents – GCP vs AWS vs Azure
| Category | GCP Service | AWS Equivalent | Azure Equivalent | Purpose |
|---|---|---|---|---|
| Data Warehouse | BigQuery | Amazon Redshift | Azure Synapse Analytics | Analytical SQL warehouse for large-scale queries. |
| Data Lake Storage | Cloud Storage (GCS) | Amazon S3 | Azure Data Lake Storage (ADLS) Gen2 / Blob Storage | Object storage for raw, semi-structured, and structured data. |
| Batch/Stream Processing | Dataflow (Apache Beam) | Kinesis Data Analytics / AWS Glue Streaming ETL | Azure Stream Analytics / Data Factory Mapping Data Flows | Serverless batch and streaming ETL. |
| Managed Hadoop/Spark | Dataproc | Amazon EMR | Azure HDInsight / Synapse Spark Pools | Managed Hadoop/Spark/Hive/Presto clusters. |
| Data Orchestration | Cloud Composer (Airflow) | Amazon MWAA / AWS Step Functions | Azure Data Factory | Workflow orchestration and scheduling. |
| Real-time Messaging | Pub/Sub | Kinesis Data Streams / SNS | Event Hubs / Service Bus | Publish/subscribe messaging for real-time ingestion. |
| ETL/ELT Service | Dataprep (Trifacta) | AWS Glue DataBrew | Azure Data Factory (ADF) Wrangling Data Flows | No-code data preparation for analytics. |
| Database Migration | Database Migration Service (DMS) | AWS DMS | Azure Database Migration Service | Migrate databases to the cloud. |
| NoSQL Wide-Column Store | Cloud Bigtable | Amazon DynamoDB | Cosmos DB (Cassandra API) | Low-latency, high-throughput NoSQL store. |
| Machine Learning | Vertex AI | SageMaker | Azure Machine Learning | Managed ML platform for training, deployment, and MLOps. |