Replicating to analytics v1.7

Suggest edits

Use Postgres Analytics Accelerator (PGAA) within a Postgres Distributed (PGD) cluster to synchronize local heap tables with an analytical data lake by converting them to Hybrid Analytics and Transactional Processing (HTAP) tables.

In this state, every INSERT, UPDATE, and DELETE operation is captured and streamed in near real-time to object storage. While the table remains a standard transactional endpoint for applications, PGD maintains a high-fidelity analytical replica in the background, making data immediately available for BI tools and query engines without impacting production performance.

The key benefits of this process are:

Near real-time analytics: Access the latest transactional data in your data lake without waiting for batch ETL processes.
Dual-state accessibility: Maintain local high-performance storage for transactions while providing a columnar version for heavy analytical queries.
Unified management: Manage your data lifecycle and replication settings directly through standard SQL and PGAA functions.
Efficiency with Merge-on-Read: PGAA uses Iceberg Merge-on-Read (MoR) to synchronize data. Instead of rewriting large files every time a row changes, it simply records the change in a delete file. This keeps replication fast and significantly reduces the load on your object storage.

Process overview

To enable continuous replication for a table, follow these steps:

Meet the prerequisites and point to object storage by choosing either a storage location or a catalog service.
Enable replication to start the data stream.
Monitor replication status and health.
Disable replication if you no longer require synchronization.

Enabling replication

Use the pgaa.enable_analytics_replication() function to configure an existing heap table to stream data to object storage. The function converts the heap table to an analytical HTAP table and starts the logical replication worker.

CALL pgaa.enable_analytics_replication('my_table'::regclass, purge_analytics_target := TRUE);

Set purge_analytics_target:= TRUE to purge existing data at the target location before replication begins (requires PGD 6.3 or later).

Alternatively, switch the replication state directly on the table object:

ALTER TABLE my_table SET (pgd.replicate_to_analytics = TRUE);

You can enable the bdr.prefer_analytics_engine parameter to route queries to the PGAA engine (Seafowl or Spark):

SET bdr.prefer_analytics_engine = TRUE;

If disabled, queries are executed by the standard Postgres engine against the local heap.

Monitoring replication

Once replication is enabled, you can verify the HTAP status by checking that the table appears in both the local relation list and the analytics metadata.

View mappings between local tables and analytics targets:

SELECT * FROM bdr.analytics_table;

List all tables currently being managed by PGAA:

SELECT * FROM pgaa.list_analytics_tables();

Disabling replication

Disabling replication stops the logical worker from capturing changes. The local table reverts to a standard heap table, and the remote copy in object storage becomes a static snapshot of the data.

Use the pgaa.disable_analytics_replication() function to stop the synchronization:

CALL pgaa.disable_analytics_replication('my_table'::regclass);

Alternatively, use ALTER TABLE:

ALTER TABLE my_table SET (pgd.replicate_to_analytics = FALSE);

SET bdr.prefer_analytics_engine= FALSE;

Note

If you re-enable replication on a table where it was previously disabled, the system performs a full reload. This process effectively removes the existing analytics data in object storage, re-uploads the current local heap data in its entirety, and starts continuous synchronization.

← Prev

Implementing tiered tables

↑ Up

Replicating with PGD

Offloading to analytics