Accelerating with Spark v1.7

By default, the Postgres Analytics Accelerator (PGAA) utilizes Seafowl, an embedded analytical engine, to accelerate queries. However, for large-scale data processing that exceeds the resources of a single Postgres instance, you can offload execution to a remote Apache Spark cluster via Spark Connect.

Spark Connect is a thin client-server protocol for Apache Spark that decouples the application from the Spark driver. It acts as a high-speed bridge, allowing Postgres to send query instructions to a remote, distributed Spark cluster. This enables you to leverage the massive compute power of an external cluster without requiring Spark to run on the same machine as your database.
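Spark Connect clients reach the cluster through a gRPC endpoint, conventionally written as an `sc://` URL. As a sketch of the cluster-side setup (the package version is illustrative; match it to your Spark release), the Spark Connect server can be started with the script shipped in the Spark distribution:

```shell
# Start a Spark Connect server on the cluster side (Spark 3.4+).
# The connector version below is illustrative; align it with your Spark release.
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.1

# Clients then address the server with a Spark Connect URL of the form:
#   sc://<host>:15002
# where 15002 is the default Spark Connect port.
```

Postgres, acting as the client, only needs network access to that endpoint; no Spark binaries are installed on the database host.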

Choosing your executor engine

The pgaa.executor_engine configuration parameter determines where the heavy lifting of your analytical queries happens. When moving beyond the default Seafowl engine, you have two primary options for distributed execution.

Distributed execution options

To leverage Spark, you can configure your environment in one of two ways:

  • Distributed Spark execution: Use standard Apache Spark clusters to process massive datasets using distributed CPU cores. This is the standard approach for large-scale data processing and heavy maintenance tasks.
  • GPU-accelerated Spark: Integrate the NVIDIA RAPIDS Accelerator for Apache Spark. This option offloads query execution to GPUs, significantly reducing processing time and infrastructure costs for compute-intensive joins and aggregations.
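Either option is selected through the pgaa.executor_engine parameter. A minimal sketch follows; note that the engine value 'spark_connect' and the endpoint parameter pgaa.spark_connect_url are illustrative assumptions, so check your release documentation for the exact names:

```sql
-- Sketch: switch the current session from the default Seafowl engine
-- to a remote Spark cluster. The engine value and the endpoint
-- parameter name are illustrative assumptions.
SET pgaa.executor_engine = 'spark_connect';
SET pgaa.spark_connect_url = 'sc://spark-master.example.com:15002';

-- Analytical queries in this session are now offloaded to the cluster.
```

Setting these at the session level lets you reserve Spark for heavy workloads while routine queries continue to run on Seafowl.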
The following table compares the three engines:

| Feature | Seafowl | Spark Connect (CPU) | Spark + RAPIDS (GPU) |
|---|---|---|---|
| Architecture | Process alongside Postgres. | Remote Spark cluster. | Remote Spark with NVIDIA GPUs. |
| Best for | Small/medium datasets. | Petabyte-scale, complex ETL. | Massive joins & aggregations. |
| Scalability | Single host limits. | Horizontal (multi-node). | Horizontal + GPU acceleration. |
| Complexity | Zero-config. | Requires Spark endpoint. | Requires GPU-enabled nodes. |
| Cost efficiency | Best for datasets < 1 TB. | Best for datasets between 1 TB and 3 TB. | Best for datasets 3 TB or higher. |

While Seafowl is highly optimized for single-node performance, consider switching to a Spark-based executor if:

  • Memory constraints: Your queries exceed the pgaa.autostart_seafowl_max_memory_mb limit.
  • Maintenance-heavy workloads: You are performing resource-intensive operations such as large-scale compaction or Z-ordering on Iceberg tables.
  • Extreme scale: You need to process joins and aggregations across datasets that require the distributed compute of an external cluster.
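A quick way to evaluate the memory criterion is to inspect the current Seafowl ceiling before deciding. The sketch below uses standard Postgres SHOW/SET; the engine value 'spark_connect' is an illustrative assumption:

```sql
-- Inspect the per-instance memory ceiling Seafowl runs under.
SHOW pgaa.autostart_seafowl_max_memory_mb;

-- If working sets routinely exceed this limit, move the session
-- to the Spark executor (engine value is illustrative).
SET pgaa.executor_engine = 'spark_connect';
```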