
docs: S3 SDK users might believe that OpenDAL does not support multipart upload pre-sign URL #4627

Closed
TonyPythoneer opened this issue May 17, 2024 · 9 comments


@TonyPythoneer

TonyPythoneer commented May 17, 2024

A Python code snippet for a multipart upload demo

In the OpenDAL SDK, I believe the following three APIs, available in the AWS S3 Python SDK (boto3), are missing for full multipart upload support:

  • create_multipart_upload - To initiate a multipart upload and obtain an upload ID for the object that will receive multiple parts.
  • generate_presigned_url - OpenDAL already has the capability to generate presigned URLs, but it may need to extend its parameters to support multipart uploads by including the upload ID and part number.
  • complete_multipart_upload - To signal that the multipart upload for the object is complete after uploading all parts.
import boto3
import requests

s3 = boto3.client('s3')
max_size = 5 * 1024 * 1024  # part size; you can define your own

# 1. Initiate the multipart upload and keep the upload ID.
res = s3.create_multipart_upload(Bucket=bucket_name, Key=key)
upload_id = res['UploadId']

# Note: this shows only one part; repeat for every part and collect each
# (ETag, PartNumber) pair in the `parts` list.
parts = []
signed_url = s3.generate_presigned_url(
    ClientMethod='upload_part',
    Params={'Bucket': bucket_name, 'Key': key,
            'UploadId': upload_id, 'PartNumber': part_no},
)

with target_file.open('rb') as f:
    file_data = f.read(max_size)  # read the content of a single part

# 2. The client uploads the part data through the presigned URL.
res = requests.put(signed_url, data=file_data)

etag = res.headers['ETag']
parts.append({'ETag': etag, 'PartNumber': part_no})  # record each part's ETag and number

# 3. After all parts are uploaded, complete the upload with the parts list.
res = s3.complete_multipart_upload(
    Bucket=bucket_name, Key=key,
    MultipartUpload={'Parts': parts}, UploadId=upload_id,
)
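The "repeat for every part" step that the snippet's comments allude to can be factored into a small chunking helper. This is only an illustrative sketch (`iter_parts` is not a boto3 API); each yielded pair would feed one presigned `upload_part` PUT:

```python
import io

PART_SIZE = 5 * 1024 * 1024  # S3 requires at least 5 MiB for every part except the last


def iter_parts(fileobj, part_size=PART_SIZE):
    """Yield (part_number, data) pairs; S3 part numbers start at 1."""
    part_no = 1
    while True:
        data = fileobj.read(part_size)
        if not data:
            break
        yield part_no, data
        part_no += 1
```

For example, a 12 MiB file yields parts 1, 2, and 3 of sizes 5 MiB, 5 MiB, and 2 MiB.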

Note: Copied from boto/boto3#2305 (comment)

The provided Python code snippet demonstrates how these APIs are used in boto3 for a multipart upload. This could serve as a reference for implementing similar functionality in OpenDAL.


Additional Reference

It's worth noting what full compatibility with the AWS S3 Multipart Uploads API would require.
If the OpenDAL team needs an S3 API compatibility document for reference, it's the following.

@Xuanwo
Member

Xuanwo commented May 17, 2024

In the OpenDAL SDK, I believe the following three APIs are missing to fully support multipart uploads, which are available in the AWS S3 Python SDK (boto3):

Hi, OpenDAL has its own abstractions that hide implementation details. By design, we do not expose these APIs to users. When users create a Writer on s3, OpenDAL uses create_multipart and complete_multipart under the hood.

This design allows us to build a unified API layer across azblob, gcs and many other storage services.

No matter which storage the users use, they only need to:

let mut w = op.writer(path).await?;
w.write(bs1).await?;
w.write(bs2).await?;
w.close().await?;

@Xuanwo
Member

Xuanwo commented May 17, 2024

I thought we discussed pre-sign support for multipart in #4282. It seems there's some misunderstanding here.

So:

  • OpenDAL does support S3 multipart uploads, although we don't directly expose those APIs to users.
  • We don't support presigned URLs for multipart operations yet.

@TonyPythoneer
Author

TonyPythoneer commented May 17, 2024

My idea was to find a way to generate pre-signed URLs for multipart operations and let the client app run them, such as in a browser or on mobile.

It will inevitably call underlying APIs like CreateMultipartUpload and CompleteMultipartUpload.

So I'm just dumping my thoughts from experience with other SDKs.

@Xuanwo
Member

Xuanwo commented May 17, 2024

It will inevitably call underlying APIs like CreateMultipartUpload and CompleteMultipartUpload.

That's what OpenDAL wants to avoid.


The problem is finding a good API design that can generate such an upload URL for users without leaking the storage details.

@Xuanwo
Member

Xuanwo commented May 17, 2024

The problem is finding a good API design that can generate such an upload URL for users without leaking the storage details.

I'm guessing the most important part is allowing the client to upload data, right? The initiate-multipart and complete-multipart calls can be made directly from the server.

Is supporting URL generation for part uploads sufficient for you?

@TonyPythoneer
Author

TonyPythoneer commented May 17, 2024

I'm guessing the most important part is allowing the client to upload data, right? The initiate multipart and complete multipart can be called directly from the server.

Yes, it allows the client to upload data, but through multipart operations when the object is large (say, over 100 MB).

The purpose is that the upload can resume from any part if the network connection suddenly breaks. In addition, it may need ListParts to learn the current progress and continue from there.
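The resume flow described here could build on the `Parts` list from boto3's `list_parts` response. A minimal sketch of the bookkeeping, kept as pure Python (`remaining_parts` is an illustrative helper, not an SDK call):

```python
def remaining_parts(total_parts, listed_parts):
    """Given the 'Parts' entries from a list_parts-style response, return
    (part numbers still to upload, parts list for complete_multipart_upload)."""
    done = {p["PartNumber"]: p["ETag"] for p in listed_parts}
    todo = [n for n in range(1, total_parts + 1) if n not in done]
    completed = [{"PartNumber": n, "ETag": etag} for n, etag in sorted(done.items())]
    return todo, completed
```

After a broken connection, the server would call ListParts, feed the result to a helper like this, and presign upload URLs only for the `todo` part numbers.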

Is supporting URL generation for part uploads sufficient for you?

No, for the reason above. But I see your point of view, and I'm thinking it over.

@TonyPythoneer
Author

TonyPythoneer commented May 19, 2024

I'll dump my rough idea here.
The following code is written in Python.

# Trigger the underlying CreateMultipartUpload API and get back a
# MultiPartUploadFile that carries the UploadId
multipart_file = op.multipart("filename")


# Calling part(...) returns a PartUploadFile;
# calling presign on it exposes a URL to serve to the client
multipart_file.part(number).presign(expire_second)
# or
multipart_file.presign_part(part_number: int, expire_second: int)


# Call the underlying CompleteMultipartUpload API to tell the S3 server
multipart_file.complete([{'ETag': etag, 'PartNumber': part_no}, ...])

I'm happy to hear your feedback.
Thank you.

@Xuanwo
Member

Xuanwo commented May 19, 2024

Hi, as I mentioned in previous comments, OpenDAL's vision is to provide free data access. Any design that violates this vision is unlikely to be accepted. In reality, opendal users should be able to write code without knowing which underlying storage service is being used.

In our current design, there are the following places where this doesn't work:

  • multipart: Not all storage services offer a multipart abstraction, e.g. gcs, azblob, hdfs, ..
  • The URL generated by presign_part could return different responses that users would need to handle.

I think it's hard to create such an abstraction for multipart presign, since completing the upload needs the data returned by upload_part.

Other storage services like gcs and azblob may be feasible, since users can just upload without parsing the response.

@TonyPythoneer
Author

multipart: Not all storage services offer multipart abstraction, like gcs, azblob, hdfs, ..

Oops, I thought gcs and azblob had the same capability, but they actually don't.
For this reason, I won't pursue this discussion any further.
