Accelerating with Spark v1.7

By default, the Postgres Analytics Accelerator (PGAA) utilizes Seafowl, an embedded analytical engine, to accelerate queries. However, for large-scale data processing that exceeds the resources of a single Postgres instance, you can offload execution to a remote Apache Spark cluster via Spark Connect.

Spark Connect is a thin client-server protocol for Apache Spark that decouples the application from the Spark driver. It acts as a high-speed bridge, allowing Postgres to send query instructions to a remote, distributed Spark cluster. This enables you to leverage the massive compute power of an external cluster without requiring Spark to run on the same machine as your database.
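Spark Connect clients reach the cluster through a gRPC endpoint, conventionally written as an `sc://` URL. As a sketch of the cluster-side setup (the package version is illustrative; match it to your Spark release), the Spark Connect server can be started with the script shipped in the Spark distribution:

```shell
# Start a Spark Connect server on the cluster side (Spark 3.4+).
# The connector version below is illustrative; align it with your Spark release.
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.1

# Clients then address the server with a Spark Connect URL of the form:
#   sc://<host>:15002
# where 15002 is the default Spark Connect port.
```

Postgres, acting as the client, only needs network access to that endpoint; no Spark binaries are installed on the database host.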

Choosing your executor engine

The pgaa.executor_engine configuration parameter determines where the heavy lifting of your analytical queries happens. When moving beyond the default Seafowl engine, you have two primary options for distributed execution.

Distributed execution options

To leverage Spark, you can configure your environment in one of two ways:

  • Distributed Spark execution: Use standard Apache Spark clusters to process massive datasets using distributed CPU cores. This is the standard approach for large-scale data processing and heavy maintenance tasks.
  • GPU-accelerated Spark: Integrate the NVIDIA RAPIDS Accelerator for Apache Spark. This option offloads query execution to GPUs, significantly reducing processing time and infrastructure costs for compute-intensive joins and aggregations.
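Either option is selected through the pgaa.executor_engine parameter. A minimal sketch follows; note that the engine value 'spark_connect' and the endpoint parameter pgaa.spark_connect_url are illustrative assumptions, so check your release documentation for the exact names:

```sql
-- Sketch: switch the current session from the default Seafowl engine
-- to a remote Spark cluster. The engine value and the endpoint
-- parameter name are illustrative assumptions.
SET pgaa.executor_engine = 'spark_connect';
SET pgaa.spark_connect_url = 'sc://spark-master.example.com:15002';

-- Analytical queries in this session are now offloaded to the cluster.
```

Setting these at the session level lets you reserve Spark for heavy workloads while routine queries continue to run on Seafowl.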
The following table compares the three engines:

| Feature | Seafowl | Spark Connect (CPU) | Spark + RAPIDS (GPU) |
|---|---|---|---|
| Architecture | Process alongside Postgres. | Remote Spark cluster. | Remote Spark with NVIDIA GPUs. |
| Best for | Small/medium datasets. | Petabyte-scale, complex ETL. | Massive joins & aggregations. |
| Scalability | Single host limits. | Horizontal (multi-node). | Horizontal + GPU acceleration. |
| Complexity | Zero-config. | Requires Spark endpoint. | Requires GPU-enabled nodes. |
| Cost efficiency | Best for datasets < 1 TB. | Best for datasets between 1 TB and 3 TB. | Best for datasets 3 TB or higher. |

While Seafowl is highly optimized for single-node performance, consider switching to a Spark-based executor if:

  • Memory constraints: Your queries exceed the pgaa.autostart_seafowl_max_memory_mb limit.
  • Maintenance-heavy workloads: You are performing resource-intensive operations such as large-scale compaction or Z-ordering on Iceberg tables.
  • Extreme scale: You need to process joins and aggregations across datasets that require the distributed compute of an external cluster.
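A quick way to evaluate the memory criterion is to inspect the current Seafowl ceiling before deciding. The sketch below uses standard Postgres SHOW/SET; the engine value 'spark_connect' is an illustrative assumption:

```sql
-- Inspect the per-instance memory ceiling Seafowl runs under.
SHOW pgaa.autostart_seafowl_max_memory_mb;

-- If working sets routinely exceed this limit, move the session
-- to the Spark executor (engine value is illustrative).
SET pgaa.executor_engine = 'spark_connect';
```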