PostgreSQL to Snowflake
Scenario
Let’s consider a scenario where we have an “events” table and a “users” table in PostgreSQL. We want to periodically sync data from these two tables in PostgreSQL to Snowflake. However, we only want to sync a few columns filtered based on the country (USA) and create a single denormalized view of the data in Snowflake. Let’s see how we can achieve this within a few minutes and a few SQL commands using PeerDB.
Demo
This demo shows PeerDB syncing a table with 100 million rows from PostgreSQL to Snowflake in few minutes.
Step 1: Add PostgreSQL and Snowflake Peers
Run the following commands to add the PostgreSQL and Snowflake Peers to PeerDB:
Ensure that permissions for the Snowflake user and database you wish to use is properly configured.
Make sure to replace (…)
with the appropriate connection details for both the PostgreSQL and Snowflake instances. More details on adding peers are available here.
Step 2: Set up MIRROR to Transform and Sync Data
With the peers set up, you can create a mirror that facilitates periodic sync from PostgreSQL to Snowflake with custom transformations.
- In the SELECT query, we address the scenario by choosing only a few columns, enriching some columns using the JOIN operation, and filtering based on the country.
- The OPTIONS clause provides granular control when configuring the MIRROR:
- watermark_column (Required) represents a sequentially incrementing integer or timestamp column. PeerDB uses this column to keep track of the processed rows and the ones that need to be processed.
- watermark_table (Required) is the table that PeerDB iterates sequentially based on the watermark_column. The fact table is a good candidate for the watermark_table.
- mode (Required) tells PeerDB whether it should append data to the target or perform an upsert operation (update if the row exists).
- If your data is append-only, a sequential ID or a created_at column can be a good watermark, and you can choose the mode append_only.
- If your data is updated, make sure to have an “updated_at” column for PeerDB to identify the most recent version of the row. Choose the mode upsert to update already existing rows on the target.
- unique_key: In the upsert mode, you need to specify a set of columns (unique_key) that uniquely identify a row. This helps perform the upsert operation.
- parallelism (Optional) represents the number of threads used to read from the source and sync to the target. This helps with workload management on the source and target.
- If you have a powerful PostgreSQL server or if it is a read-replica, you can configure a higher parallelism value. In scenarios where you want to put less load on the source, you can choose a lower value.
- refresh_interval (Optional) helps configure the frequency (in minutes) at which the sync should run. If not specified, PeerDB syncs the data continuously without pauses.\
- More details on the OPTIONs can be found here.
updated_at BETWEEN {{.start}} AND {{.end}}
in the SELECT statement provides input to PeerDB on which part(s) of the query should be scoped by the watermark column for the rows to be synced.
Step 3: Validate the Mirror
Using the same PostgreSQL-compatible SQL interface of PeerDB, you can quickly validate the MIRROR.
Step 4: Monitor the MIRROR
You can connect to localhost:8085
to gain full visibility into the different jobs and steps that PeerDB performs under the hood to manage the MIRROR.
Support
If you run into any issues, join our slack channel and reach out to us. You can file an issue on our github repository or reach out to founders@peerdb.io . We will follow up!