Postgres Analytics Accelerator (PGAA) v1.7
Postgres Analytics Accelerator (PGAA) is a high-performance extension that enables Postgres to query large-scale data stored in open table and file formats such as Delta Lake, Apache Iceberg, and Parquet. By offloading heavy analytical queries to a vectorized execution engine, PGAA bridges the gap between operational databases and data lakes.
Get started
Compatibility: Check supported PostgreSQL versions, operating systems, and other requirements.
Architecture: Understand the core architecture and how the vectorized engine works.
Core concepts: Understand the fundamental principles of vectorized execution, data lake integration, and DirectScan.
Quickstart guide: Install PGAA, create a storage location, and read a table from the sample benchmark datasets.
Using PGAA
Installation: Step-by-step instructions for installing the extension and enabling the Seafowl background worker.
Configure storage locations: How to securely connect PGAA to AWS S3, GCS, and Azure Blob Storage.
Read from object storage: Connect directly to S3, GCS, or Azure Blob Storage to query Parquet, Delta, or Iceberg files via a PGFS storage location.
Read using Iceberg catalogs: Integrate with external Iceberg REST catalogs to manage table metadata.
Write to object storage using CTAS: Use CREATE TABLE AS SELECT (CTAS) to export Postgres data into optimized lakehouse formats in your object store.
Replicating with PGD
- Implementing tiered tables: Combine PGD AutoPartition and PGAA to create an automated data lifecycle. Move older partitions to object storage while keeping recent data in Postgres tables.
- Replicating to analytics: Convert standard heap tables into HTAP tables. Use continuous logical replication to maintain a real-time analytical copy of your transactional data in the data lake.
- Offloading to analytics: Manually move entire HTAP tables to the cold tier, then truncate the local data to reclaim disk space immediately.
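
The tiering idea behind tiered tables can be sketched in a few lines of Python. This is only an illustration of the age-based split (recent partitions stay in Postgres, older ones move to object storage); it does not use the PGD AutoPartition or PGAA interfaces. The `Partition` class, the `split_tiers` function, the partition names, and the 60-day retention window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Partition:
    name: str          # hypothetical partition name
    upper_bound: date  # exclusive upper bound of the partition's date range


def split_tiers(partitions, today, keep_hot):
    """Split partitions into (hot, cold) by age.

    A partition counts as cold only when its whole date range lies
    at or before the cutoff, so no recent rows leave the hot tier.
    """
    cutoff = today - keep_hot
    hot = [p for p in partitions if p.upper_bound > cutoff]
    cold = [p for p in partitions if p.upper_bound <= cutoff]
    return hot, cold


# Hypothetical monthly partitions of an "orders" table.
partitions = [
    Partition("orders_2024_01", date(2024, 2, 1)),
    Partition("orders_2024_02", date(2024, 3, 1)),
    Partition("orders_2024_03", date(2024, 4, 1)),
]

hot, cold = split_tiers(
    partitions,
    today=date(2024, 4, 15),
    keep_hot=timedelta(days=60),  # assumed retention window for the hot tier
)
```

Comparing against the partition's upper bound, rather than its lower bound, is the conservative choice here: a partition is offloaded only once every row it can contain has aged past the retention window.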
Performance & optimization
Accelerate with Spark: Offload massive datasets and complex distributed joins to a remote Spark cluster via Spark Connect. PGAA offers two integration modes depending on your performance requirements:
Standard Spark integration: Leverage a remote Spark cluster for high-concurrency analytical queries and distributed processing.
GPU-Accelerated Spark: Integrate with the NVIDIA RAPIDS Accelerator for Apache Spark to leverage GPU acceleration.
Monitor and maintain your analytical tables: Audit storage utilization, monitor table health, and perform table maintenance tasks for PGAA-managed tables.
Optimize query performance: Maximize query speeds by managing DirectScan execution, configuring compute pushdowns, and troubleshooting path fallbacks.
Reference
Configuration parameters: The behavior of the PGAA extension is governed by Grand Unified Configuration (GUC) variables. These parameters allow you to switch executors, enable performance optimizations, and manage security credentials.
Functions: PGAA introduces a suite of SQL functions for administrative tasks, such as mapping new tables, monitoring storage health, and launching maintenance background jobs.
Table options: When mapping or creating analytical tables, specific options allow you to define how data is read from or written to your object store.
Data types: PGAA maps native Postgres data types to optimized columnar formats in the data lake.
Datasets: Access pre-configured schemas and data loading instructions for analytical datasets to baseline your performance.