PeerDB support Change-Data-Capture (CDC) from PostgreSQL to S3 and GCS buckets in the form of AVRO files in the destination. We utilise the interoperability between GCS and S3 here.

Scenario

Suppose you have a banking application running on PostgreSQL. There are two tables: “users” and “transactions.” You want to sync these tables in real-time to AVRO files in your S3 or GCS buckets which other services or clients can pick up.

Demo

Let’s look at a short video showcasing CDC from PostgreSQL to Google Cloud Storage with initial load.

Prerequisites

Postgres Peer

  1. Enable logical decoding in PostgreSQL. Ensure that the following settings/GUCs are properly configured:
    1. wal_level: logical
    2. max_wal_senders: >1
    3. max_replication_slots: 4
  2. Enable replication access for a PostgreSQL user - ALTER USER pg_user REPLICATION;
  3. Ensure that both tables have primary keys. Composite primaries are also fine. If not, make sure your tables have REPLICA IDENTITY FULL.
  4. If you are using PostgreSQL on the cloud, below links capture how to enable logical replication for each cloud:
    1. AWS RDS and Aurora PostgreSQL
    2. Azure Database for PostgreSQL - Flexible Server
    3. GCP Cloud SQL PostgreSQL

S3 peer

  1. Ensure you have an existing bucket to use.

  2. For PeerDB to access the S3 Peer, you can either use an existing AWS user or create a new AWS user.

  3. Create access keys for that user through AWS Console or AWS CLI or AWS API

  4. For the same user, create and attach a policy as below using JSON editor. Sharing AWS docs for reference. PeerDB requires s3:ListAllMyBuckets, s3:GetObject, s3:PutObject, s3:ListBucket and s3:DeleteObject on that bucket.

       {
           "Version": "2012-10-17",
           "Statement": [
               {
                   "Effect": "Allow",
                   "Action": "s3:ListAllMyBuckets",
                   "Resource": "*"
               },
               {
                   "Effect": "Allow",
                   "Action": [
                       "s3:GetObject",
                       "s3:PutObject",
                       "s3:ListBucket",
                       "s3:DeleteObject"
                   ],
                   "Resource": [
                       "arn:aws:s3:::<your-bucket-name>",
                       "arn:aws:s3:::<your-bucket-name>/*"
                   ]
               }
           ]
       }
    

GCS peer

  1. Ensure you have an existing bucket to use.
  2. For GCS, you must create a HMAC key-pair and use this for ACCESS_KEY_ID and SECRET_ACCESS_KEY in the Docker compose file.

Step 1: Add PostgreSQL and S3/GCS Peers

Run the following commands to let PeerDB know about the existing PostgreSQL and S3.

-- Connect to PeerDB
psql "port=9900 host=localhost password=peerdb"

-- Add PostgreSQL and Event Hubs peers
CREATE PEER postgres_peer FROM PostgreSQL (...);
CREATE PEER s3_peer FROM S3 (...);
-- Or, CREATE PEER gcs_peer FROM S3 (...);

Please refer to our CREATE PEER documentation for storage peers.

Step 2: Real-Time CDC from PostgreSQL to S3/GCS by creating a MIRROR

Create MIRROR using UI

If you prefer a UI, you can easily create a mirror using the PeerDB UI. Refer to the below video:

Create MIRROR using SQL

To facilitate real-time Change Data Capture (CDC) from PostgreSQL to S3 or GCS, set up your peers and then create a mirror using the following SQL syntax:

CREATE MIRROR IF NOT EXISTS <mirror-name>
FROM <postgres-peer-name> TO <storage-peer-name>
WITH TABLE MAPPING(
    <source-schema>.<table>:<destination_name>,
    ... -- Repeat as required for multiple tables
)
WITH(
    max_batch_size = <number>,
    publication_name = '<publication-name>'
);

Example:

CREATE MIRROR IF NOT EXISTS test_mirror_interop
FROM test_pg_peer TO test_gcs_peer
WITH TABLE MAPPING(
    schema1.table1:dest_table1,
    schema1.table2:dest_table2
    -- Add more tables as required
)
WITH(
    max_batch_size = 300000,
    publication_name = 'test_publication'
);

Parameters:

  • mirror-name: Desired name for the mirror.
  • postgres-peer-name: Name of the PostgreSQL peer.
  • storage-peer-name: Name of the S3/GCS peer.
  • max_batch_size: Maximum number of records in a batch.
  • publication_name: Name of the publication.

Remember to adjust placeholder values (<…>) with your specific details and preferences. The example above has been abbreviated for clarity; ensure you provide all the necessary mappings and configurations in practice.

Step 3: Monitor the MIRROR

You can connect to localhost:8085 to get full visibility into the different jobs and steps that PeerDB is taking under the covers to manage the MIRROR.

Managing Mirror

Step 5: DROP MIRROR

To make it easy in your development and test environments, PeerDB also introduces the DROP MIRROR command. DROP MIRROR drops all the underlying objects that CREATE MIRROR generates. More details are available in this PR.

-- drop the mirror
DROP MIRROR test_mirror_interop;