Creating Peers
Guide to creating Peers in PeerDB
To connect a database to PeerDB, you need to create a Peer. A Peer is a connection to a database that PeerDB can query. Peers are created using the CREATE PEER command. PeerDB currently supports the following types of Peers:
- BigQuery
- Snowflake
- PostgreSQL
- SQL Server
- MySQL
- Storage Peers - S3 and Google Cloud Storage (GCS)
- Azure EventHubs
- ClickHouse
BigQuery Peer
Create Peer using UI
Using the PeerDB UI (localhost:3000), create the BigQuery Peer. See the below video as a reference:
Create Peer using connection string
CREATE PEER bq_peer FROM BIGQUERY WITH
(
type = 'service_account',
project_id = '<project_id>',
private_key_id = '<private_key_id>',
private_key = '<private_key>',
client_email = '<client_email>',
client_id = '<client_id>',
auth_uri = '<auth_uri>',
token_uri = '<token_uri>',
auth_provider_x509_cert_url = '<auth_provider_x509_cert_url>',
client_x509_cert_url = '<client_x509_cert_url>',
dataset_id = '<dataset_id>'
);
-- Query away tables in BigQuery
SELECT * FROM bq_peer.test_table;
Authentication
PeerDB authenticates the creation of a BigQuery peer based on the service account. All the fields within the CREATE PEER command can be acquired from a service account JSON.
If you already have a service account key, you can use that, or create a new one dedicated to PeerDB.
Make sure the service account has the right permissions: PeerDB requires at least read permissions for querying. For MIRROR, it requires both read and write permissions.
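For example, if you manage dataset access with BigQuery's SQL GRANT statements (run in BigQuery itself, not through PeerDB), the grants might look roughly like the sketch below; the project, dataset, and service account names are placeholders:
-- read access, needed for querying (placeholder names)
GRANT `roles/bigquery.dataViewer` ON SCHEMA `my_project.my_dataset`
TO "serviceAccount:peerdb-sa@my_project.iam.gserviceaccount.com";
-- write access, additionally needed for MIRROR
GRANT `roles/bigquery.dataEditor` ON SCHEMA `my_project.my_dataset`
TO "serviceAccount:peerdb-sa@my_project.iam.gserviceaccount.com";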
Considerations
- Dataset Specific: PeerDB only supports querying tables within a dataset. Cross dataset joins are not supported yet.
- Supported Datatypes: All string, numeric and timestamp-based datatypes are supported for querying. Record and Struct datatypes are projected as JSONB. The JSON datatype in BigQuery is not supported yet.
- SQL Coverage: Most SQL constructs in reads, including simple SELECTs, JOINs, aggregations, window functions, CTEs, etc., are supported. You can run both Postgres-compatible and BigQuery-compatible queries through PeerDB's interface (see the example after this list).
- Error Handling: If a query fails on the BigQuery side due to missing permissions, unsupported syntax, or a timeout, PeerDB handles the error and propagates the entire message as JSON text to the end user. We also capture this ERROR message within PeerDB logs.
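As an illustration of the SQL coverage above, a read with a CTE, an aggregation, and an ORDER BY can be sent through the peer like any other query. The table and columns below are hypothetical; adjust them to your own dataset:
-- test_table and its columns (user_id, amount, created_at) are placeholders
WITH daily AS (
SELECT user_id, DATE(created_at) AS day, SUM(amount) AS total
FROM bq_peer.test_table
GROUP BY user_id, DATE(created_at)
)
SELECT day, COUNT(DISTINCT user_id) AS active_users, SUM(total) AS revenue
FROM daily
GROUP BY day
ORDER BY day;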
Snowflake Peer
Create Peer using UI
Using the PeerDB UI (localhost:3000), create the Snowflake Peer. See the below video as a reference:
Create Peer using connection string
CREATE PEER sf_peer FROM SNOWFLAKE WITH
(
account_id = '<snowflake_account_identifier>',
username = '<user_name>',
private_key ='<private_key>',
password = '<password>', -- only provide when the private key is encrypted
database = '<database_name>',
schema = '<schema>',
warehouse = '<ware_house>',
role = '<role>',
query_timeout = '<query_timeout_in_seconds>'
);
-- Query away tables in Snowflake
SELECT * FROM sf_peer.<schema>.<tablename>;
Granting permissions
PeerDB needs to be able to create, read and write to objects in the database you provide. The following commands ensure a smooth mirror:
GRANT ALL PRIVILEGES ON DATABASE <dbname> TO ROLE <role_used_while_creating_peer>;
GRANT USAGE ON SCHEMA <schema> TO ROLE <role_used_while_creating_peer>;
GRANT ALL ON SCHEMA <schema> TO ROLE <role_used_while_creating_peer>;
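If the role also needs to read existing and future tables in the schema (for example, when using the peer as a query source), grants along these lines may help; adjust the privileges and object names to your environment:
GRANT SELECT ON ALL TABLES IN SCHEMA <schema> TO ROLE <role_used_while_creating_peer>;
GRANT SELECT ON FUTURE TABLES IN SCHEMA <schema> TO ROLE <role_used_while_creating_peer>;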
More information can be found here.
Authentication
PeerDB authenticates the creation of a Snowflake peer based on key pair authentication. Encrypted private keys are also supported. Refer to this doc to create a key pair, assign it to the right role, and populate the above fields.
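For reference, registering the public half of the key pair with the Snowflake user is typically done with an ALTER USER statement run in Snowflake; the user name and key value below are placeholders:
ALTER USER <user_name> SET RSA_PUBLIC_KEY='MIIBIjANBgkqh...';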
Considerations
- Database Specific: PeerDB only supports querying tables within a database. Cross database joins are not supported yet.
- Supported Datatypes: All string, numeric and timestamp-based datatypes are supported for querying.
- SQL Coverage: Most SQL constructs in reads, including simple SELECTs, JOINs, aggregations, window functions, CTEs, etc., are supported. You can run both Postgres-compatible and Snowflake-compatible queries through PeerDB's interface.
- Error Handling: If a query fails on the Snowflake side due to missing permissions, unsupported syntax, or a timeout, PeerDB handles the error and propagates the entire message as JSON text to the end user. We also capture this ERROR message within PeerDB logs.
PostgreSQL Peer
Create Peer using UI
Using the PeerDB UI (localhost:3000), create the Postgres Peer. See the below video as a reference:
Create Peer using connection string
CREATE PEER postgres_peer FROM POSTGRES WITH
(
host = '<hostname>',
port = '<port>',
user = '<user>',
password = '<password>',
database = '<dbname>'
);
-- Query away tables in Postgres
SELECT * FROM postgres_peer.<schema>.<tablename>;
If you are connecting to a Docker Postgres instance on the same Docker network as PeerDB, then the hostname will be the service name of the Postgres server and the port will be 5432.
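For example, assuming your compose file defines the Postgres service under the (hypothetical) name postgres, the peer would be created like this:
CREATE PEER docker_postgres_peer FROM POSTGRES WITH
(
host = 'postgres', -- docker compose service name (hypothetical)
port = '5432',
user = '<user>',
password = '<password>',
database = '<dbname>'
);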
Considerations
- Supported Datatypes: All string, numeric and timestamp-based datatypes are supported for querying. We are actively adding support for other types.
- Unsupported COPY command: The COPY command is not supported yet.
SSH Tunneling Configuration (Optional)
You can connect to a PostgreSQL database through a bastion host using SSH tunneling. Here’s a guide on setting up an SSH tunnel for PeerDB.
Then, create a PostgreSQL peer using an SSH tunnel:
CREATE PEER ssh_postgres_peer FROM POSTGRES WITH
(
host = '<hostname>',
port = 5432,
user = '<user>',
password = '<password>',
database = '<dbname>',
ssh_config = '{
"host": "<ssh_host>",
"port": 22,
"user": "<ssh_user>",
"password": "<ssh_password>",
"private_key": "<base64_encoded_ssh_private_key>"
"host_key": "host_key" -- optional
}'
);
Note that the private_key field here expects a base64-encoded private key. The host_key field is optional and helps prevent MITM attacks.
Storage Peers - S3 and GCS
Create Peer using UI
Using the PeerDB UI (localhost:3000), create the S3/GCS Peer. See the below video as a reference:
Create Peer using connection string
For CDC, you will need a PostgreSQL instance to use as an external metadata store. The CREATE PEER command for S3/GCS depends on how you set the following parameters.
CREATE PEER storage_peer FROM S3 WITH
(
url = 's3://<gcs/s3_bucket_name>/<prefix>', -- bucket should exist
access_key_id = '<hmac_key>', -- or AWS equivalent
secret_access_key = '<hmac_secret>', -- or AWS equivalent
region = 'auto', -- for S3, use the region of your bucket
endpoint = 'https://storage.googleapis.com', -- or empty for S3
metadata_db = 'host=<host> port=<port> user=<user> password=<password> database=<database>'
);
Note that if you leave any of these parameters empty, PeerDB will use the corresponding Docker compose environment variable instead:
x-flow-worker-env: &flow-worker-env
...
# For GCS, these will be your HMAC keys instead
# For more information:
# https://cloud.google.com/storage/docs/authentication/managing-hmackeys
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
# For GCS, set this to "auto" without the quotes
AWS_REGION: ${AWS_REGION:-}
# For GCS, set this as: https://storage.googleapis.com
AWS_ENDPOINT: ${AWS_ENDPOINT:-}
...
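For instance, a plain S3 bucket (rather than GCS) could be configured along these lines; the bucket name, prefix, and region are placeholders, and the endpoint is left empty as noted above:
CREATE PEER s3_peer FROM S3 WITH
(
url = 's3://my-peerdb-bucket/cdc', -- placeholder bucket and prefix; the bucket must already exist
access_key_id = '<aws_access_key_id>',
secret_access_key = '<aws_secret_access_key>',
region = 'us-east-1', -- region of your bucket
endpoint = '', -- empty for S3
metadata_db = 'host=<host> port=<port> user=<user> password=<password> database=<database>'
);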
Considerations
- The bucket specified in the URL must already exist.
- Only the .avro file format is currently supported.
- Currently only data-movement is supported for this type of peer. Querying this peer from PeerDB’s interface is not supported.
ClickHouse Peer
You can create a ClickHouse peer using the following SQL syntax:
CREATE PEER <clickhouse_peer_name> FROM CLICKHOUSE
WITH
(
host='<host>',
port=<port>,
user='<user>',
password='<password>',
database='<database>'
-- disable_tls = true
);
- For local Docker installations of ClickHouse, if you’re using PeerDB OSS, the host will be host.docker.internal, with the port being the externally mapped port.
- The disable_tls parameter is optional and can be set to true if you are using a ClickHouse instance without TLS.
- PeerDB maintains an ephemeral internal stage for ClickHouse using a MinIO bucket. You can configure your own bucket (S3/MinIO/GCS) by passing the following optional parameters:
CREATE PEER <my_staged_clickhouse_peer> FROM CLICKHOUSE
WITH
(
host='<host>',
port=<port>,
user='<user>',
password='<password>',
database='<database>',
-- (Optional) S3 stage of yours:
s3_path = 's3://<bucket-name>',
access_key_id = '<access_key_id>',
secret_access_key = '<secret_access_key>',
region = '<region>',
endpoint = '<endpoint>' -- for GCS: https://storage.googleapis.com
);
SQLServer Peer
CREATE PEER sqlserver_peer FROM SQLSERVER WITH (
server='<your-hostname>',
port='<your-port>',
user='<user>',
password='<password>',
database='<database>'
);
If you are connecting to a Docker SQL Server/Azure SQL Edge instance on the same Docker network as PeerDB, then the hostname will be the service name of the SQL Server instance and the port will be 1433.
Considerations
- Currently only data-movement is supported for this type of peer. Querying this peer from PeerDB’s interface is not supported.
MySQL Peer
CREATE PEER mysql_peer FROM MYSQL WITH (
host='<your-hostname>',
port=3306,
user='<user>',
password='<password>',
database='<database>',
setup='set session transaction read only',
disable_tls=false
);
Considerations
- Currently only querying is supported for this type of peer. Mirrors are not supported.
- setup is optional, but can be used to run queries during connection setup. This is useful for setting transactions as read-only by default, or for setting the catalog/database with use catalog.database for StarRocks (see the sketch after this list).
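As a sketch of how setup can be used for StarRocks, a peer pointing at a StarRocks frontend over its MySQL protocol might look like this; the host, catalog, and database names are placeholders, and 9030 is the usual StarRocks query port (adjust to your deployment):
CREATE PEER starrocks_peer FROM MYSQL WITH (
host='<starrocks-fe-hostname>',
port=9030, -- typical StarRocks FE MySQL-protocol port; adjust if yours differs
user='<user>',
password='<password>',
database='<database>',
setup='use <catalog>.<database>', -- selects the catalog/database on connection setup
disable_tls=false
);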
EventHub Peer
To create an Eventhub peer, use the following SQL syntax:
CREATE PEER eventhubs_peer_x FROM EVENTHUBS WITH (
eventhubs = '[{"subscription_id":"my-sub-id",
"resource_group":"myresource",
"namespace":"mynamespace-1",
"location":"eastus",
"partition_count":5,
"message_retention_in_days":2
},
{"subscription_id":"my-sub-id",
"resource_group":"myresource",
"namespace":"mynamespace-2",
"location":"eastus",
"partition_count":3,
"message_retention_in_days":1
}
]',
unnest_columns = [] -- can omit this field too
);
Parameters:
- eventhubs: A JSON array of namespace details. Each item should have the following fields:
  - subscription_id: The ID of the Azure subscription.
  - resource_group: The name of the Azure resource group.
  - namespace: The namespace for the Eventhub.
  - location: The location of the Eventhub namespace’s resource group, e.g., eastus.
  - partition_count: The number of partitions for the Eventhub.
  - message_retention_in_days: The number of days the messages will be retained.
- unnest_columns: Specifies the columns you’d like to unnest.
To use this peer in a self-hosted PeerDB instance, your Azure credentials must be filled into the environment section of the flow-worker service in the PeerDB docker-compose file:
environment:
AZURE_CLIENT_ID: <client-id>
AZURE_CLIENT_SECRET: <secret>
AZURE_TENANT_ID: <tenant-id>
Considerations
- Currently only CDC data-movement is supported for this type of peer. Querying this peer from PeerDB’s interface is not supported.