Scenario
Let’s consider a scenario where we have an “events” table and a “users” table in PostgreSQL. We want to periodically sync data from these two tables in PostgreSQL to Snowflake. However, we only want to sync a few columns filtered based on the country (USA) and create a single denormalized view of the data in Snowflake. Let’s see how we can achieve this within a few minutes and a few SQL commands using PeerDB.Demo
This demo shows PeerDB syncing a table with 100 million rows from PostgreSQL to Snowflake in few minutes.Step 1: Add PostgreSQL and Snowflake Peers
Run the following commands to add the PostgreSQL and Snowflake Peers to PeerDB:(…)
with the appropriate connection details for both the PostgreSQL and Snowflake instances. More details on adding peers are available here.
Step 2: Set up MIRROR to Transform and Sync Data
With the peers set up, you can create a mirror that facilitates periodic sync from PostgreSQL to Snowflake with custom transformations.- In the SELECT query, we address the scenario by choosing only a few columns, enriching some columns using the JOIN operation, and filtering based on the country.
- The OPTIONS clause provides granular control when configuring the MIRROR:
- watermark_column (Required) represents a sequentially incrementing integer or timestamp column. PeerDB uses this column to keep track of the processed rows and the ones that need to be processed.
- watermark_table (Required) is the table that PeerDB iterates sequentially based on the watermark_column. The fact table is a good candidate for the watermark_table.
- mode (Required) tells PeerDB whether it should append data to the target or perform an upsert operation (update if the row exists).
- If your data is append-only, a sequential ID or a created_at column can be a good watermark, and you can choose the mode append_only.
- If your data is updated, make sure to have an “updated_at” column for PeerDB to identify the most recent version of the row. Choose the mode upsert to update already existing rows on the target.
- unique_key: In the upsert mode, you need to specify a set of columns (unique_key) that uniquely identify a row. This helps perform the upsert operation.
- parallelism (Optional) represents the number of threads used to read from the source and sync to the target. This helps with workload management on the source and target.
- If you have a powerful PostgreSQL server or if it is a read-replica, you can configure a higher parallelism value. In scenarios where you want to put less load on the source, you can choose a lower value.
- refresh_interval (Optional) helps configure the frequency (in minutes) at which the sync should run. If not specified, PeerDB syncs the data continuously without pauses.\
- More details on the OPTIONs can be found here.
updated_at BETWEEN {{.start}} AND {{.end}}
in the SELECT statement provides input to PeerDB on which part(s) of the query should be scoped by the watermark column for the rows to be synced.
Step 3: Validate the Mirror
Using the same PostgreSQL-compatible SQL interface of PeerDB, you can quickly validate the MIRROR.Step 4: Monitor the MIRROR
You can connect tolocalhost:8085
to gain full visibility into the different jobs and steps that PeerDB performs under the hood to manage the MIRROR.
