HEAD-ache requests #272

Open
H-Plus-Time opened this issue Aug 1, 2023 · 2 comments

@H-Plus-Time
Contributor

TLDR: HEAD requests are the correct way to check content-length, but object stores (and overly restrictive policies) don't play nice.

The problem:

  1. Determine the content length of the target URL.
  2. Call read_metadata_async(signedUrl, contentLength). This works.
  3. Call read_row_group(signedUrl, /* etc sans contentLength */).
  4. Receive a 4xx response on the unavoidable HEAD request (because a signed URL is only valid for the one HTTP method it was signed for).

Obviously the biggest contributor to this is S3 (it's the motivating example), but there are also plenty of servers in the wild configured to accept GET requests but deny HEAD requests (for whatever inane reason).

Since range request support is mandatory for async reads, there's the option of falling back to a GET with Range: bytes=0-0 and reading the total size from the Content-Range header of the 206 response (the Content-Length of that response only covers the single byte returned). The only question really is whether to do this via a reader option, via a try/catch fallback (incurring an additional request), or by restoring direct contentLength as an option on read_row_group.
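A rough sketch of the try/catch fallback variant, in case it helps the discussion. `resolveContentLength` and `totalFromContentRange` are hypothetical names, not part of the library, and this assumes a runtime with a global `fetch`. In a browser, the server also needs to expose `Content-Range` via CORS for the header to be readable.

```typescript
// Parse the total object size out of a Content-Range header value,
// e.g. "bytes 0-0/12345" gives 12345. Returns null when the header is
// missing, malformed, or reports an unknown total ("*").
function totalFromContentRange(header: string | null): number | null {
  if (header === null) return null;
  const match = /^bytes\s+(?:\d+-\d+|\*)\/(\d+)$/.exec(header.trim());
  return match ? Number(match[1]) : null;
}

// Hypothetical helper: try HEAD first, then fall back to a one-byte ranged GET.
async function resolveContentLength(url: string): Promise<number> {
  try {
    const head = await fetch(url, { method: "HEAD" });
    const len = head.headers.get("Content-Length");
    if (head.ok && len !== null) return Number(len);
  } catch {
    // Network or CORS failure on HEAD; fall through to the ranged GET.
  }

  // This is the extra request the fallback costs when HEAD is rejected.
  const res = await fetch(url, { headers: { Range: "bytes=0-0" } });
  const total =
    res.status === 206
      ? totalFromContentRange(res.headers.get("Content-Range"))
      : null;
  if (total === null) {
    throw new Error("Server did not report a total size via Content-Range");
  }
  return total;
}
```

With a URL signed only for GET, the HEAD attempt would fail with a 4xx and the ranged GET would supply the size, at the cost of that one wasted request. A reader option would avoid the wasted request but pushes the decision onto the caller.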

@kylebarron
Owner

Ideally we wouldn't need to know the content length at all; it should be possible to fetch the last bytes of a parquet file to get the byte range of the metadata, and from there get the byte ranges of the columns. But arrow2 doesn't support that, and I think arrow2 development has mostly stopped.
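For illustration, a minimal sketch of that idea using suffix range requests, which need no prior knowledge of the file size. It relies on the Parquet file layout (the file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`) and handles only unencrypted files. `footerMetadataLength` and `fetchFooterMetadata` are hypothetical names, and this is not how arrow2 or this library currently reads files.

```typescript
const PARQUET_MAGIC = [0x50, 0x41, 0x52, 0x31]; // "PAR1"

// Given a buffer that ends with the last 8 bytes of a parquet file,
// return the length in bytes of the Thrift-encoded file metadata.
function footerMetadataLength(tail: Uint8Array): number {
  if (tail.length < 8) throw new Error("Need at least the last 8 bytes");
  const last8 = tail.subarray(tail.length - 8);
  const magicOk = PARQUET_MAGIC.every((byte, i) => last8[4 + i] === byte);
  if (!magicOk) throw new Error("Missing PAR1 magic at end of file");
  // The metadata length is stored as a little-endian uint32.
  return new DataView(last8.buffer, last8.byteOffset, 8).getUint32(0, true);
}

// Hypothetical helper: fetch the raw metadata bytes with two suffix-range GETs.
async function fetchFooterMetadata(url: string): Promise<Uint8Array> {
  const tailRes = await fetch(url, { headers: { Range: "bytes=-8" } });
  if (tailRes.status !== 206) throw new Error("Range requests not supported");
  const metaLen = footerMetadataLength(new Uint8Array(await tailRes.arrayBuffer()));

  // The metadata sits immediately before the 8-byte footer.
  const metaRes = await fetch(url, {
    headers: { Range: `bytes=-${metaLen + 8}` },
  });
  if (metaRes.status !== 206) throw new Error("Range requests not supported");
  const buf = new Uint8Array(await metaRes.arrayBuffer());
  return buf.subarray(0, metaLen);
}
```

The column chunk offsets recorded in the decoded metadata are absolute file positions, so subsequent reads could use ordinary `bytes=start-end` ranges. Not every server honors suffix ranges, so this would still need a fallback.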

So your suggestion is mostly to use a GET request instead of a HEAD request?

@kylebarron
Owner

(arrow-rs might support async reads without knowing the content length; I haven't checked)
