Postgres Analytics Accelerator (PGAA) v1.7

Postgres Analytics Accelerator (PGAA) is a high-performance extension that enables Postgres to query large-scale data stored in open table and file formats such as Delta Lake, Apache Iceberg, and Parquet. By offloading heavy analytical queries to a vectorized execution engine, PGAA bridges the gap between operational databases and data lakes.

Get started

Compatibility: Check supported PostgreSQL versions, operating systems, and other requirements.
Architecture: Understand the core architecture and how the vectorized engine works.
Core concepts: Understand the fundamental principles of vectorized execution, data lake integration, and DirectScan.
Quickstart guide: Install PGAA, create a storage location, and read a table from our sample benchmark datasets.

Using PGAA

Installation: Step-by-step instructions for installing the extension and enabling the Seafowl background worker.
Configure storage locations: Securely connect PGAA to AWS S3, GCS, and Azure Blob Storage.
Read from object storage: Connect directly to S3, GCS, or Azure Blob Storage to query Parquet, Delta, or Iceberg files via a PGFS storage location.
Read using Iceberg catalogs: Integrate with external Iceberg REST catalogs to manage table metadata.
Write to object storage using CTAS: Use CREATE TABLE AS SELECT (CTAS) to export Postgres data into optimized lakehouse formats in your object store.

Replicating with PGD

Implementing tiered tables: Combine PGD AutoPartition and PGAA to create an automated data lifecycle. Move older partitions to object storage while keeping recent data in Postgres tables.
Replicating to analytics: Convert standard heap tables into HTAP tables. Use continuous logical replication to maintain a real-time analytical copy of your transactional data in the data lake.
Offloading to analytics: Manually move entire HTAP tables to the cold tier and truncate the local data to reclaim disk space immediately.

Performance & optimization

Accelerate with Spark: Offload massive datasets and complex distributed joins to a remote Spark cluster via Spark Connect. PGAA offers two integration modes depending on your performance requirements:
- Standard Spark integration: Leverage a remote Spark cluster for high-concurrency analytical queries and distributed processing.
- GPU-Accelerated Spark: Integrate with the NVIDIA RAPIDS Accelerator for Apache Spark to leverage GPU acceleration.
Monitor and maintain your analytical tables: Audit storage utilization, monitor table health, and perform table maintenance tasks for PGAA-managed tables.
Optimize query performance: Maximize query speed by managing DirectScan execution, configuring compute pushdowns, and troubleshooting path fallbacks.

Reference

Configuration parameters: The behavior of the PGAA extension is governed by Grand Unified Configuration (GUC) variables. These parameters allow you to switch executors, enable performance optimizations, and manage security credentials.
Functions: PGAA introduces a suite of SQL functions for administrative tasks, such as mapping new tables, monitoring storage health, and launching maintenance background jobs.
Table options: When mapping or creating analytical tables, specific options allow you to define how data is read from or written to your object store.
Data types: PGAA maps native Postgres data types to optimized columnar formats in the data lake.
Datasets: Access pre-configured schemas and data loading instructions for analytical datasets to baseline your performance.