Amazon Redshift out-of-the-box performance innovations for data lake queries

Databases and query engines, including Amazon Redshift, rely on statistics about the underlying data, such as the number of distinct values in a column or how selective a predicate is, to determine the most effective way to execute a query. When Amazon Redshift receives a query such as

SELECT insert_date, sum(sales)
FROM receipts
WHERE insert_date BETWEEN '2024-12-01' AND '2024-12-31'
GROUP BY insert_date

the query planner uses statistics to make an educated guess about the most efficient way to load and process data from storage. More statistics about the underlying data can often help a query planner select a plan that delivers the best query performance, but this comes with a tradeoff: computing, storing, and maintaining statistics has a cost, and can add query planning time.
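
For local tables in Amazon Redshift, you can manage this tradeoff directly with the ANALYZE command. The following sketch assumes the receipts table from the query above exists as a local Amazon Redshift table:

-- Collect statistics only for columns actually used in predicates,
-- joins, and GROUP BY clauses, reducing the cost of maintaining statistics
ANALYZE receipts PREDICATE COLUMNS;

-- Alternatively, restrict statistics collection to specific columns
ANALYZE receipts (insert_date, sales);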

Data lakes are a powerful architecture to organize data for analytical processing, because they let builders use efficient analytical columnar formats like Apache Parquet, while letting them continue to modify the shape of their data as their applications evolve with open table formats like Apache Iceberg. One challenge with data lakes is that they don’t always have statistics about their underlying data, making it difficult for query engines to determine the optimal execution path. This can lead to issues, including slow queries and unexpected changes in query performance.

In 2024, Amazon Redshift customers queried over 77 EB (exabytes) of data residing in data lakes. Given this usage, the Amazon Redshift team works to innovate on data lake query performance to help customers efficiently access their open data to get near real-time insights to make critical business decisions. In 2024, Amazon Redshift launched several features that improve query performance for data lakes, including faster query times when a data lake doesn’t have statistics. With Amazon Redshift patch 190, the TPC-DS 3TB benchmark showed an overall 2x query performance improvement on Apache Iceberg tables without statistics, including TPC-DS Query #72, which improved by 125 times from 690 seconds to 5.5 seconds.

In this post, we first briefly review how planner statistics are collected and what impact they have on queries. Then, we discuss Amazon Redshift features that deliver optimal plans on Iceberg tables and Parquet data even with the lack of statistics. Finally, we review some example queries that now execute faster because of these latest Amazon Redshift innovations.

Prerequisites

The benchmarks in this post were run using the following environment:

  • Amazon Redshift Serverless with a base capacity of 88 RPUs (Redshift Processing Units)
  • The Cloud Data Warehouse Benchmark derived from the TPC-DS 3TB dataset. The following tables were partitioned in this dataset (the rest were unpartitioned):
    • catalog_returns on cr_returned_date_sk
    • catalog_sales on cs_sold_date_sk
    • store_returns on sr_returned_date_sk
    • store_sales on ss_sold_date_sk
    • web_returns on wr_returned_date_sk
    • web_sales on ws_sold_date_sk
    • inventory on inv_date_sk

For more information on loading the Cloud Data Warehouse Benchmark into your Amazon Redshift Serverless workgroup, see the Cloud Data Warehouse Benchmark documentation.
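
To query data lake tables like these from Amazon Redshift, you can register them through an external schema that points at your AWS Glue Data Catalog database. A minimal sketch, assuming a hypothetical Glue database named tpcds_3tb and a default IAM role associated with your workgroup:

CREATE EXTERNAL SCHEMA tpcds_datalake
FROM DATA CATALOG
DATABASE 'tpcds_3tb'
IAM_ROLE default;

-- Data lake tables can then be queried like any other schema
SELECT count(*) FROM tpcds_datalake.store_sales;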

Now, let’s review how database statistics work and how they impact query performance.

Overview of the impact of planner statistics on query performance

To understand why database statistics are important, first let's review what a query planner does. A query planner is the brain of a database: when you send a query to a database, the query planner must determine the most efficient way to load and compute all of the data required to answer the query. Information about the underlying dataset, such as the number of rows or the distribution of values, helps the query planner generate an optimal plan for retrieving the data. Amazon Redshift uses table-level and column-level statistics about the underlying data to determine how to build an optimal query execution plan.

Let’s see how this works in an example. Consider the following query to determine the top five sales dates in December 2024 for stores in North America:

SELECT insert_date, sum(sales) AS total_sales
FROM receipts
JOIN stores ON stores.id = receipts.store_id
WHERE
  stores.region = 'NAMER' AND
  receipts.insert_date BETWEEN '2024-12-01' AND '2024-12-31'
GROUP BY receipts.insert_date
ORDER BY total_sales DESC
LIMIT 5;

In this query, the query planner has to consider several factors, including:

  • Which table is larger, stores or receipts? Am I able to query the smaller table first to reduce the amount of searching on the larger table?
  • Which returns more rows, receipts.insert_date BETWEEN '2024-12-01' AND '2024-12-31' or stores.region = 'NAMER'?
  • Is there any partitioning on the tables? Can I search over a smaller set of data to speed up the query?

Having information about the underlying data can help to generate an optimal query plan. For example, stores.region = 'NAMER' might only return a few rows (that is, it’s highly selective), meaning it’s more efficient to execute that step of the query first before filtering through the receipts table. What helps a query planner make this decision is the statistics available on columns and tables.
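
You can see which plan the query planner actually chose by prefixing a query with EXPLAIN. The following sketch reuses the hypothetical receipts and stores tables:

EXPLAIN
SELECT insert_date, sum(sales) AS total_sales
FROM receipts
JOIN stores ON stores.id = receipts.store_id
WHERE
  stores.region = 'NAMER' AND
  receipts.insert_date BETWEEN '2024-12-01' AND '2024-12-31'
GROUP BY receipts.insert_date
ORDER BY total_sales DESC
LIMIT 5;

-- The plan output shows the join order, the join algorithm (for example,
-- a hash join), and estimated row counts derived from table statistics.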

Table statistics (also known as planner statistics) provide a snapshot of the data available in a table to help the query planner make an informed decision on execution strategies. Databases collect table statistics through sampling, which involves reviewing a subset of rows to determine the overall distribution of data. The quality of statistics, including the freshness of data, can significantly impact a query plan, which is why databases will reanalyze and regenerate statistics after a certain threshold of the underlying data changes.
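
In Amazon Redshift, you can check how stale the statistics of a local table are through the stats_off column of the SVV_TABLE_INFO system view (0 means current; higher values mean staler), and refresh them with ANALYZE:

SELECT "table", stats_off
FROM svv_table_info
WHERE "table" = 'receipts';

-- Regenerate statistics when they have drifted from the underlying data
ANALYZE receipts;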

Amazon Redshift supports several table-level and column-level statistics to assist in building query plans. These include:

  • Number of rows (numrows): the number of rows in a table. Used to estimate the overall size of query results and JOIN sizes; influences JOIN ordering, JOIN algorithms, and resource allocation.
  • Number of distinct values (NDV): the number of unique values in a column. Used to estimate selectivity, that is, how many rows will be returned by predicates (for example, a WHERE clause) and the size of JOIN results; influences JOIN ordering and algorithms.
  • NULL count: the number of NULL values in a column. Used to estimate the number of rows eliminated by IS NULL or IS NOT NULL; influences filter pushdown (that is, which nodes execute a query) and JOIN strategies.
  • Min/max values: the smallest and largest values in a column. Helps range-based optimizations (for example, WHERE x BETWEEN 10 AND 20); influences JOIN order, JOIN algorithms, and resource allocation.
  • Column size: the total size of column data in memory. Used to estimate the overall size of scans (reading data), JOINs, and query results; influences JOIN algorithms and ordering.

Open formats such as Apache Parquet don't have any of the preceding statistics by default, and table formats like Apache Iceberg have only a subset of them, such as the number of rows, NULL count, and min/max values. This can make it challenging for query engines to plan efficient queries. Amazon Redshift has added innovations that improve overall query performance on data lake data stored in Apache Iceberg and Apache Parquet formats even when some or all table-level or column-level statistics are unavailable.
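
As an aside, engines with Iceberg metadata-table support, such as Amazon Athena, let you inspect the per-file statistics that Iceberg does maintain. A hedged sketch, assuming a hypothetical Iceberg table receipts in a database mydb:

SELECT file_path, record_count, null_value_counts
FROM "mydb"."receipts$files";

The next section reviews features in Amazon Redshift that help improve query performance on data lakes even when table statistics aren't present or are limited.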

Amazon Redshift features when data lakes don’t have statistics for Iceberg tables and Parquet

As mentioned previously, there are many cases where tables stored in data lakes lack statistics, which creates challenges for query engines to make informed decisions on selecting the best query plan. However, Amazon Redshift has released a series of innovations that improve performance for queries on data lakes even when there aren’t table statistics available. In this section, we review some of these enhancements and how they impact your query performance.

Dynamic partition elimination through distributed joins

Dynamic partition elimination is a query optimization technique that allows Amazon Redshift to skip reading data unnecessarily during query execution on a partitioned table. It does this by determining which partitions of a table are relevant to a query and only scanning those partitions, significantly reducing the amount of data that needs to be processed.

For example, imagine a schema that has two tables:

  • sales (fact table) with columns:
    • sale_id
    • product_id
    • sale_amount
    • sale_date
  • products (dimension table) with columns:
    • product_id
    • product_name
    • category

The sales table is partitioned by product_id. In the following example, you want to find the total sales amount for products in the Electronics category in December 2024.

SQL query:

SELECT SUM(s.sale_amount)
FROM sales s
JOIN products p ON s.product_id = p.product_id
WHERE p.category = 'Electronics'
  AND s.sale_date BETWEEN '2024-12-01' AND '2024-12-31';

How Amazon Redshift improves this query:

  1. Filter on dimension table:
    • The query filters the products table to only include products in the Electronics category.
  2. Identify relevant partitions:
    • With the new improvements, Amazon Redshift analyzes this filter and determines which partitions of the sales table need to be scanned.
    • It looks at the product_id values in the products table that match the Electronics category and only scans those specific partitions in the sales table.
    • Instead of scanning the entire sales table, Amazon Redshift only scans the partitions that contain sales data for electronics products.
    • This significantly reduces the amount of data Amazon Redshift needs to process, making the query faster.

Previously, this optimization was only applied to broadcast joins when all child joins below the join were also broadcast joins. The Amazon Redshift team extended this capability to work on all broadcast joins, regardless of whether the child joins below them are broadcast joins. This allows more queries to benefit from dynamic partition elimination, such as TPC-DS Q64 and Q75 for Iceberg tables, and TPC-DS Q25 in Parquet.
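
To check whether a query over your own partitioned tables benefits from dynamic partition elimination, you can inspect its plan. A sketch using the hypothetical sales and products tables from above:

EXPLAIN
SELECT SUM(s.sale_amount)
FROM sales s
JOIN products p ON s.product_id = p.product_id
WHERE p.category = 'Electronics'
  AND s.sale_date BETWEEN '2024-12-01' AND '2024-12-31';

-- In the plan, look at the scan step over the partitioned sales table:
-- with dynamic partition elimination, the scan is restricted at runtime
-- to partitions whose product_id values survive the filter on products.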

Metadata caching for Iceberg tables

The Iceberg open table format employs a two-layer structure: a metadata layer and a data layer. The metadata layer has three levels of files (metadata.json, manifest lists, and manifests), which allows for performance features such as faster scan planning and advanced data filtering. Amazon Redshift uses the Iceberg metadata structure to efficiently identify the relevant data files to scan, using partition value ranges and column-level statistics and eliminating unnecessary data processing.

The Amazon Redshift team observed that Iceberg metadata is frequently fetched multiple times both within and across queries, leading to potential performance bottlenecks. We implemented an in-memory LRU (least recently used) cache for parsed metadata, manifest list files, and manifest files. This cache keeps the most recently used metadata in memory so that we avoid fetching it repeatedly from Amazon Simple Storage Service (Amazon S3) across queries. This caching has contributed overall performance improvements of up to 2% in a TPC-DS 3TB workload. We observe more than 90% cache hits for these metadata structures, reducing Iceberg metadata processing times considerably.

Stats inference for Iceberg tables

As mentioned previously, the Apache Iceberg table format stores some statistics, such as the number of rows, NULL count, column min/max values, and column storage size, in metadata files called manifest files. However, manifests don't always provide all the statistics Amazon Redshift needs, especially the average column width, which is important for its cost-based optimizer.

We delivered a feature that estimates the average width of variable-length columns, such as string and binary, from Iceberg metadata. We do this by dividing the column storage size by the number of rows, adjusting for column compression when necessary. By inferring these additional statistics, the optimizer can make more accurate cost estimates for different query plans. This stats inference feature, released in Amazon Redshift patch 186, offers up to a 7% improvement in the TPC-DS benchmarks. We have also enhanced the Amazon Redshift optimizer's cost model, including planner optimizations that improve the estimation of the different join distribution strategies by accounting for the networking cost of distributing data between the nodes of an Amazon Redshift cluster. These enhancements, the culmination of several years of research, testing, and implementation, demonstrated up to a 45% improvement in a collection of TPC-DS benchmarks.
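
As a rough illustration of the width inference described above, using entirely hypothetical numbers: a string column storing 500 MB compressed across 10 million rows, with an assumed 2x compression ratio, would be inferred to average roughly 105 bytes per value:

-- (column storage size / number of rows) * compression ratio
SELECT (500 * 1024 * 1024.0) / 10000000 * 2.0 AS inferred_avg_width_bytes;
-- Returns approximately 104.86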

Example: TPC-DS benchmark highlights on Amazon Redshift no stats queries on data lakes

One way to measure data lake query performance for Amazon Redshift is using the TPC-DS benchmark. The TPC-DS benchmark is a standardized benchmark designed to test decision support systems, specifically looking at concurrently accessed systems where queries can range from shorter analytical queries (for example, reporting, dashboards) to longer running ETL-style queries for moving and transforming data into a different system. For these tests, we used the Cloud Data Warehouse Benchmark derived from the TPC-DS 3TB to align our testing with many common analytical workloads, and provide a standard set of comparisons to measure improvements to Amazon Redshift data lake query performance.

We ran these tests across data stored both in the Apache Parquet data format and in Apache Iceberg tables with data in Apache Parquet files. Because we focused these tests on out-of-the-box performance, none of these datasets had any table statistics available. We performed these tests using the Amazon Redshift patch versions specified in the following table, and used Amazon Redshift Serverless with 88 RPUs without any additional tuning. The following results represent a power run (the total time, in seconds, to run all queries in the workload), measured warm, that is, after at least one prior execution of the workload:

Dataset                                        P180 (12/2023)    P190 (5/2025)
Apache Parquet (only numrows)                  7,796             3,553
Apache Iceberg (out-of-the-box, no tuning)     4,411             1,937

We saw notable improvements in several query run times. For this post, we focus on the improvements we saw in query 82:

SELECT
    i_brand_id brand_id, i_brand brand,
    sum(ss_ext_sales_price) ext_price
FROM date_dim, store_sales, item
WHERE d_date_sk = ss_sold_date_sk AND
    ss_item_sk = i_item_sk AND
    i_manager_id = 83 AND
    d_moy = 12 AND
    d_year = 2002
GROUP BY i_brand, i_brand_id
ORDER BY ext_price desc, i_brand_id
LIMIT 100;

In this query, we’re searching for the top 100 selling brands from a specific manager in December 2002, which represents a typical dashboard-style analytical query. In our power run, we saw a reduction in query time from 512 seconds to 18.1 seconds for Apache Parquet data, a 28.2x improvement in performance. The accelerated query performance for this query in a warm run is due to the improvements to the cost-based optimizer and dynamic partition elimination.

We saw query performance improvements across many of the queries found in the Cloud Data Warehouse Benchmark derived from the TPC-DS test suite. We encourage you to try your own performance tests using Amazon Redshift Serverless on your data lake data to see what performance gains you can observe.

Cleanup

If you ran these tests on your own and don’t need the resources anymore, you’ll need to delete your Amazon Redshift Serverless workgroup. See Shutting down and deleting a cluster. If you don’t need to store the Cloud Data Warehouse Benchmark data in your S3 bucket anymore, see Deleting Amazon S3 objects.

Conclusion

In this post, you learned how cost-based optimizers for databases work, and how statistical information about your data can help Amazon Redshift execute queries more efficiently. You can further optimize query performance for Iceberg tables by automatically collecting Puffin statistics, which lets Amazon Redshift use these recent innovations to query your data even more efficiently. Giving more information to your query planner, the brain of Amazon Redshift, helps provide more predictable performance and helps you further scale how you interact with your data in your data lakes and data lakehouses.


About the authors

Martin Milenkoski is a Software Development Engineer on the Amazon Redshift team, currently focusing on data lake performance and query optimization. Martin holds an MSc in Computer Science from the École Polytechnique Fédérale de Lausanne.

Kalaiselvi Kamaraj is a Sr. Software Development Engineer on the Amazon Redshift team. She has worked on several projects within the Amazon Redshift Query processing team and currently focusing on performance related projects for Amazon Redshift DataLake and query optimizer.

Jonathan Katz is a Principal Product Manager – Technical on the AWS Analytics team and is based in New York. He is a Core Team member of the open-source PostgreSQL project and an active open-source contributor, including to the pgvector project.
