PostgreSQL To Snowflake
Let’s look at how we can seamlessly perform Change-Data Capturing (CDC) from PostgreSQL to Snowflake.
Scenario
Suppose you have a banking application running on PostgreSQL. There are two tables: “users” and “transactions.” You want to sync these tables in real-time to Snowflake for analytics purposes, such as real-time fraud detection. Let’s see how we can make this happen within a few minutes and a few SQL commands using PeerDB.
Demo
This demo shows PeerDB syncing a table with 100 million rows from PostgreSQL to Snowflake in few minutes.
Prerequisites
- Enable logical decoding in Postgres. Ensure that the following settings/GUCs are properly configured:
- wal_level: logical
- max_wal_senders: >1
- max_replication_slots: 4
- Enable replication access for a PostgreSQL user - ALTER USER pg_user REPLICATION;
- Ensure that both tables have primary keys. Composite primaries are also fine. If not, make sure your tables have REPLICA IDENTITY FULL.
- If you are using PostgreSQL on the cloud, below links capture how to enable logical replication for each cloud:
- Ensure that permissions for the Snowflake user and database you wish to use is properly configured. More details are available here.
Step 1: Add Postgres and Snowflake Peers
Run the following commands to let PeerDB know about the existing Postgres and Snowflake Peers.
-- Connect to PeerDB
psql "port=9900 host=localhost password=peerdb"
-- Add Postgres and Snowflake peers
CREATE PEER postgres_peer FROM postgres (...);
CREATE PEER snowflake_peer FROM snowflake (...);
Make sure to replace (…)
with the appropriate connection details for both the PostgreSQL and Snowflake instances. More details on adding PEERs are available here.
Step 2: Real-Time CDC from PostgreSQL to Snowflake
With the peers set up, you can create a mirror that facilitates real-time CDC from PostgreSQL to Snowflake.
Create MIRROR using SQL
-- Real-time CDC from PostgreSQL to Snowflake
CREATE MIRROR real_time_cdc
FROM postgres_peer TO snowflake_peer
WITH TABLE MAPPING (public.transactions:public.transactions, public.users:public.users) -- make sure tables are schema qualified
WITH (
do_initial_copy = true,
snapshot_sync_mode='avro',
snapshot_num_rows_per_partition = 500000,
snapshot_max_parallel_workers = 4,
snapshot_num_tables_in_parallel = 4,
snapshot_staging_path = ''
);
Since no CDC sync mode has been specified above, CDC will be performed in sql
mode.
To perform CDC via AVRO mode
, you must set the following:
cdc_sync_mode = 'avro'
cdc_staging_path = '<stage>'
You must set the staging path to be an existing S3 bucket URL, or an empty string for PeerDB to stage the AVRO files internally.
If you observe, TABLE MAPPING represents the table name mapping between the two Postgres peers. The final WITH
clause captures if you wanted to include initial snapshot as a part of the MIRROR. If you don’t include that WITH
, PeerDB assumes that you don’t want to perform an initial snapshot. If just reads the slot and replays the changes to the target.
- Data type mapping between Postgres and Snowflake can be found here.
- If you want additional types to be supported or want to alter the existing data type mapping, please reach out to us. We can aim to support that within a few days. Also, PeerDB is fully open source, so feel free to submit a PR.
PeerDB also supports replicating TOAST columns very efficiently. Unlike most CDC tools, you don’t need to set up REPLICA IDENTITY FULL for replicating TOAST columns. This PR captures the infrastructural optimizations that PeerDB takes to support TOAST columns.
Create MIRROR using UI
If you prefer a UI, you can easily create a mirror using the PeerDB UI (localhost:3000). Refer to the below video:
Step 3: Validate the Mirror
Through the same PeerDB’s Postgres-compatible SQL interface, you can quickly validate the MIRROR (real-time CDC).
-- Validate the mirror
SELECT COUNT(*) FROM snowflake_peer.transactions;
SELECT COUNT(*) FROM postgres_peer.transactions;
Step 4: Monitor the MIRROR
You can use the UI (localhost:3000) to monitor the status of the initial load. Refer to the below video:
You can use the UI (localhost:3000) to monitor the status of the Change Data Capture (CDC). Refer to the below video:
For deeper monitoring, you can connect to localhost:8085
to get full visibility into the different jobs and steps that PeerDB is taking under the covers to manage the MIRROR.
Step 5: DROP MIRROR
To make it easy in your development and test environments, PeerDB also introduces the DROP MIRROR command. DROP MIRROR drops all the underlying objects that CREATE MIRROR generates. More details are available in this PR.
-- drop the mirror
DROP MIRROR real_time_cdc;