Creating Mirrors
Guide to creating synchronization jobs in PeerDB
PeerDB introduces CREATE MIRROR command for blazing fast syncs between Peers. This section captures different types of MIRRORs and how to kick them off.
Command Reference
MIRROR for CDC
This type of MIRROR streams change feed in real-time from source peer to target peer.
The table mapping section supports two syntaxes, show below, and both can be used in the same mapping.
The second syntax with the JSON-like structure is useful when you want to exclude certain columns from the sync with exclude:[]
Note that for BigQuery and Clickhouse mirrors, the target tables should not be schema qualified.
CREATE MIRROR [IF NOT EXISTS] <mirror_name>
FROM <source_peer> TO <target_peer>
WITH TABLE MAPPING
(
<schema_name>.<source_table_name>:<schema_name>.<target_table_name>,
<schema_name>.<source_table_name>:<schema_name>.<target_table_name>,
{
from: <schema_name>.<source_table_name>,
to: <schema_name>.<target_table_name>,
exclude: [column1, column2 ,column3...]
}
)
WITH (
do_initial_copy = <true|false>, -- required
max_batch_size = <max_batch_size>,
sync_interval = <sync_interval>,
publication_name = '<publication_name>',
replication_slot_name = '<replication_slot_name>', -- applicable only when do_initial_copy is false
snapshot_num_rows_per_partition = <number_of_rows_in_each_partition>,
snapshot_max_parallel_workers = <number_of_threads_per_table>,
snapshot_num_tables_in_parallel = <number_of_tables_in_parallel>,
soft_delete = <true|false>,
synced_at_col_name = '<column_name>',
soft_delete_col_name = '<column_name>'
);
- do_initial_copy:
true
if you want to include initial snapshot of tables as a part of the MIRROR, elsefalse
. - max_batch_size: Maximum number of rows to be synced in a single batch.
- sync_interval: Interval in seconds at which the mirror should sync the changes.
- publication_name: Name of the publication on the source peer. This is optional as PeerDB creates the publication it needs anyway. By creating a publication and providing it here, however, you do not need to provide write access to the database and tables (since that’s needed for PeerDB to create the publication).
- replication_slot_name: Name of the replication slot on the source peer. Do not set this if initial load is enabled. This is optional as PeerDB creates the slot it needs by default. A use-case for this setting is when you want to resume a change-data capture from an existing replication slot.
- snapshot_num_tables_in_parallel: In the initial snapshot, number of tables to be snapshotted at in parallel.
- snapshot_num_rows_per_partition: In the initial snapshot, PeerDB divides the rows of each table into partitions of these many rows, and syncs those partitions.
- snapshot_max_parallel_workers: In the initial snapshot, number of threads used per table. Max number of concurrent connections on source will be
snapshot_num_tables_in_parallel * snapshot_max_parallel_workers
- soft_delete:
true
if you want to soft delete rows on the target peer when they are deleted on the source peer. - synced_at_col_name: Column name to be used for tracking the last synced timestamp. Default is
_PEERDB_SYNCED_AT
. - soft_delete_col_name: Column name to be used for tracking the soft delete status. Default is
_PEERDB_IS_DELETED
.
MIRROR for Streaming Query Results
This type of MIRROR periodically streams query results on the source peer to the target peer.
CREATE MIRROR <mirror_name> [IF NOT EXISTS] FROM
<source_peer> TO <target_peer> FOR
$$
SELECT * FROM <source_table_name> WHERE
<watermark_column> BETWEEN {{.start}} AND {{.end}}
$$
WITH (
destination_table_name = '<schema_qualified_destination_table_name>',
watermark_column = '<watermark_column>',
watermark_table_name = '<table_on_source_on_which_watermark_filter_should_be_applied>',
mode = '<mode>',
unique_key_columns = '<unique_key_columns>',
parallelism = <parallelism>,
refresh_interval = <refresh_interval_in_seconds>,
num_rows_per_partition = 100000,
initial_copy_only = <true|false>,
setup_watermark_table_on_destination = <true|false>,
);
The OPTIONS
clause provides granular control when configuring the MIRROR:
- watermark_column (Required) represents a sequentially incrementing integer or timestamp column. PeerDB uses this column to keep track of the processed rows and the ones that need to be processed.
- watermark_table (Required) is the table that PeerDB iterates sequentially based on the watermark_column. The fact table is a good candidate for the watermark_table.
- mode (Required) tells PeerDB whether it should append data to the target or perform an upsert operation (update if the row exists).
- If your data is append-only, a sequential ID or a created_at column can be a good watermark, and you can choose the mode append.
- If your data is updated, make sure to have an “updated_at” column for PeerDB to identify the most recent version of the row. Choose the mode upsert to update already existing rows on the target.
- unique_key: In the upsert mode, you need to specify a set of columns (unique_key) that uniquely identify a row. This helps perform the upsert operation.
- parallelism (Optional) represents the number of threads used to read from the source and sync to the target. This helps with workload management on the source and target.
- If you have a powerful PostgreSQL server or if it is a read-replica, you can configure a higher parallelism value. In scenarios where you want to put less load on the source, you can choose a lower value.
- refresh_interval (Optional) helps configure the frequency (in seconds) at which the sync should run. If not specified, PeerDB attempts to sync new rows every 10 seconds.
- initial_copy_only (Optional): If only a one-time copy needs to be done, this can be set to true. The mirror stops after a one-time data load. Apart from specifying append or upsert modes, a new mode overwrite can be specified when this option is set to true. This truncates all destination tables before starting the mirror, and is functionally identical to append otherwise.
- num_rows_per_parition (Required): PeerDB internally splits a table into partitions of a certain size which it distributes amongst the worker threads. This option allows you to tune the size of each partition based on memory or network requirements.
- setup_watermark_table_on_destination (Optional): If set to true, a table with the same schema as the watermark table is created on the destination peer. Its name will be set to destination_table_name.
- More details on the OPTIONs can be found here.
if_not_exists
: If specified, the MIRROR will be created only if it does not exist.
XMIN Query Replication
PeerDB supports xmin
as watermark column. Note that DELETEs are not supported in XMIN Query Replication.
CREATE MIRROR xmin_mirror_1 FROM pg_test_peer TO sf_test_peer FOR
$$ SELECT * FROM users $$
WITH (
destination_table_name = 'public.users',
watermark_column = 'xmin',
watermark_table_name = 'public.users',
mode = 'append', -- upsert (only updates) and overwrite are also supported
parallelism = 10,
refresh_interval = 30,
num_rows_per_partition = 100
);