
fix: cache object stores and bucket regions to reduce DNS query volume#3802

Draft
andygrove wants to merge 5 commits into apache:main from andygrove:cache-object-store

Conversation

@andygrove
Member

@andygrove andygrove commented Mar 26, 2026

Which issue does this PR close?

Part of #3799.

Rationale for this change

When reading Parquet files from S3, each call to initRecordBatchReader creates a new SessionContext, RuntimeEnv, and S3 ObjectStore client. Each ObjectStore instance creates a new reqwest HTTP client with its own connection pool, requiring fresh DNS resolution for every file opened.

In TPC-DS-scale workloads with thousands of Parquet files, this generates excessive DNS queries that can overwhelm DNS resolvers (e.g., Route53 Resolver limits in EKS environments), causing UnknownHostException errors, intermittent S3 connectivity failures, and job failures under high concurrency.

Native Spark does not have this problem because Hadoop's S3AFileSystem maintains singleton instances per filesystem URI (via FileSystem.get() cache) and benefits from JVM-level DNS caching. Comet's native Rust layer bypasses all of this.

Additionally, when no region is explicitly configured, each file read triggers a HeadBucket API call with a throwaway reqwest::Client to auto-detect the bucket region, doubling the DNS query volume.

What changes are included in this PR?

Two caches are introduced in the native Rust layer:

  1. ObjectStore cache (parquet_support.rs): A global cache keyed by (URL prefix, config hash) in prepare_object_store_with_configs(). Subsequent reads from the same bucket with the same configuration reuse the existing ObjectStore instance, enabling HTTP connection pooling via reqwest and eliminating redundant DNS lookups.

  2. Region resolution cache (s3.rs): A per-bucket cache in resolve_bucket_region() that ensures the HeadBucket API call (used to auto-detect bucket region when not explicitly configured) happens only once per bucket per JVM lifetime.

Together these changes reduce DNS query volume from O(num_files) to O(num_distinct_buckets) per executor.
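As a rough illustration of the first cache (this is not the PR's actual code; the `Store` type, the key shape, and `get_or_create_store` are stand-ins), a process-wide map guarded by `OnceLock` can hand back the same `Arc` for repeated (prefix, config-hash) lookups:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Stand-in for an object_store::ObjectStore trait object.
type Store = String;

// Process-wide cache keyed by (URL prefix, hash of the store config).
fn store_cache() -> &'static Mutex<HashMap<(String, u64), Arc<Store>>> {
    static CACHE: OnceLock<Mutex<HashMap<(String, u64), Arc<Store>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

// Return the cached store for this (prefix, config) pair, building it once.
// Reusing the same instance is what lets reqwest pool connections.
fn get_or_create_store(prefix: &str, config_hash: u64) -> Arc<Store> {
    let mut cache = store_cache().lock().unwrap();
    cache
        .entry((prefix.to_string(), config_hash))
        .or_insert_with(|| Arc::new(format!("store for {prefix}")))
        .clone()
}
```

Because the config hash is part of the key, a changed configuration produces a new entry rather than silently reusing a stale client.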

Workaround for users on current versions: Set fs.s3a.endpoint.region explicitly to avoid the HeadBucket auto-detection entirely.
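For example (the region value below is a placeholder; use your bucket's actual region, and adapt the flag to however your deployment passes Hadoop configuration):

```shell
# Pin the bucket region so no HeadBucket auto-detection is needed.
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint.region=us-east-1 \
  ...
```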

How are these changes tested?

  • Existing unit tests in parquet_support.rs continue to pass (they test local filesystem paths which exercise the same prepare_object_store_with_configs code path).
  • Clippy passes with no warnings.
  • The caching is transparent — it returns the same ObjectStore instances that would have been created without caching, so existing S3 integration tests cover correctness.
  • Manual validation recommended in an EKS environment with S3-backed TPC-DS workloads to confirm DNS query volume reduction.

When reading Parquet files from S3, each call to initRecordBatchReader
creates a new SessionContext, RuntimeEnv, and S3 ObjectStore client.
Each ObjectStore instance creates a new reqwest HTTP client with its own
connection pool, requiring fresh DNS resolution for every file opened.
In TPC-DS-scale workloads with thousands of files, this generates
excessive DNS queries that can overwhelm resolvers (e.g., Route53
Resolver limits in EKS environments), causing UnknownHostException
errors and intermittent S3 connectivity failures.

This commit introduces two caches:

1. A global ObjectStore cache keyed by (URL prefix, config hash) in
   prepare_object_store_with_configs(). Subsequent reads from the same
   bucket with the same configuration reuse the existing ObjectStore,
   enabling HTTP connection pooling and eliminating redundant DNS lookups.

2. A region resolution cache in resolve_bucket_region(). When no region
   is explicitly configured, each file read triggers a HeadBucket API
   call (with its own reqwest client) to determine the bucket region.
   The cache ensures this call happens only once per bucket.
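The second cache can be sketched the same way (again illustrative: `resolve_region_uncached` stands in for the real HeadBucket call, and the counter exists only to show the memoization): the expensive lookup runs at most once per bucket, and later callers get the cached value.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

// Counts how many times the "HeadBucket" stand-in actually runs.
static LOOKUPS: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the real HeadBucket call that discovers a bucket's region.
fn resolve_region_uncached(_bucket: &str) -> String {
    LOOKUPS.fetch_add(1, Ordering::SeqCst);
    "us-east-1".to_string()
}

fn region_cache() -> &'static Mutex<HashMap<String, String>> {
    static CACHE: OnceLock<Mutex<HashMap<String, String>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

// Memoized region lookup: one resolution per bucket per process.
fn resolve_bucket_region(bucket: &str) -> String {
    let mut cache = region_cache().lock().unwrap();
    cache
        .entry(bucket.to_string())
        .or_insert_with(|| resolve_region_uncached(bucket))
        .clone()
}
```

One caveat this sketch glosses over: holding the mutex across a real network call would serialize all lookups, so production code would typically use per-bucket locking or an async-aware cache.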
@mbutrovich
Contributor

Seems similar to the challenges I'm seeing in iceberg-rust: apache/iceberg-rust#2177, apache/iceberg-rust#2172.

Path::parse re-encodes percent-encoded characters, causing
double-encoding (e.g. %27 → %2527) when reusing cached object stores.
Use Path::from_url_path which correctly handles already-encoded URL
paths, consistent with other call sites in the codebase.
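The bug this follow-up commit describes is easy to reproduce in miniature. The toy `percent_encode` below is not the `object_store` implementation; it just stands in for the re-encoding a path parser might apply. Encoding an already-encoded path escapes the `%` itself, producing the double-encoded form:

```rust
// Toy percent-encoder: escapes anything outside a small safe set.
fn percent_encode(s: &str) -> String {
    s.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || "-_./".contains(c) {
                c.to_string()
            } else {
                format!("%{:02X}", c as u32)
            }
        })
        .collect()
}

// First pass: the apostrophe becomes %27.
// Second pass over the already-encoded string: the '%' itself
// becomes %25, yielding the double-encoded %2527.
```

This is why the fix matters more once stores are cached: the cached path handling must agree on whether incoming paths are raw or already URL-encoded.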
@parthchandra
Contributor

How would we verify that the number of DNS requests has gone down?
(Note: On EKS we can get CoreDNS metrics https://docs.aws.amazon.com/eks/latest/userguide/coredns-metrics.html)

@andygrove andygrove marked this pull request as ready for review March 26, 2026 23:21
Member

@wForget wForget left a comment


Thanks @andygrove , LGTM

Contributor

@mbutrovich mbutrovich left a comment


Thanks for investigating this @andygrove! I am wary of global singletons. Please justify that life cycle and add documentation explaining why it's not an issue, particularly in expired-credentials scenarios.

/// excessive DNS queries that can overwhelm DNS resolvers (e.g., Route53 Resolver limits
/// in EKS environments).
fn object_store_cache() -> &'static ObjectStoreCache {
static CACHE: OnceLock<ObjectStoreCache> = OnceLock::new();
Contributor


Anything static (global singleton) should be documented why static is the reasonable life cycle. In particular, I wonder about the unbounded size of a static cache, invalidation scenarios (what if a job runs long enough and needs new credentials passed into the object_store?), and why there was no other location with a reasonable life cycle to own this cache.

Member Author


Added docs, but I have not yet confirmed that the part about credential provider interaction is actually correct, so moved to draft for now.

@andygrove andygrove marked this pull request as draft March 27, 2026 14:04
…ches

Address review comment asking for justification of global singleton
caches. Add comments explaining:
- Why process lifetime is the right scope (JNI creates a new RuntimeEnv
  per file, leaving no executor-scoped Rust object to own the cache)
- Why unbounded size is acceptable (bounded by distinct bucket+config
  combinations, which is small in practice)
- Credential invalidation behaviour (dynamic providers refresh
  transparently; static-credential config changes produce a new hash
  and a new cache entry)
- Why region cache needs no invalidation (S3 bucket regions are
  immutable after bucket creation)
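To make the "static-credential config changes produce a new hash and a new cache entry" point concrete (a sketch, not the PR's hashing code; `config_hash` is a stand-in), hashing the sorted config pairs gives a key that is stable within a process yet changes whenever any credential value changes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash S3 config key/value pairs in a deterministic order.
// DefaultHasher::new() uses fixed keys, so the result is stable
// within one process, which is all a process-lifetime cache needs.
fn config_hash(configs: &[(&str, &str)]) -> u64 {
    let mut pairs: Vec<_> = configs.to_vec();
    pairs.sort(); // order-independence: same configs, same hash
    let mut h = DefaultHasher::new();
    pairs.hash(&mut h);
    h.finish()
}
```

A dynamic credential provider, by contrast, refreshes inside the cached client without ever touching the key, which is why no invalidation hook is needed for that case.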
@andygrove
Member Author

I tested with TPC-H @ 1TB in AWS and this does appear to have resolved the "Generic S3 error: an error occurred while loading credentials" error that I was previously seeing.

In the standard Spark-on-Kubernetes deployment model each executor pod
is dedicated to a single Spark application, so the static cache
lifetime is effectively bounded by the application lifetime.