Is there a problem of duplicate downloads in S3 segmented downloads? #1673

Open
FourSpaces opened this issue May 17, 2024 · 10 comments
@FourSpaces

FourSpaces commented May 17, 2024

Question

Is there a problem of duplicate downloads in S3 segmented downloads?

While checking the default-worker logs, I found that, with S3 storage, the same file is downloaded multiple times in segments and the requested ranges overlap, so 2-3 copies of the same file may be pulled. This looks unreasonable to me; please take a look.

Logs of vw-default-1 node:

2024.05.16 13:28:08.198688 [ 525 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/dbcd24f5-94ea-4d3c-0efb-fd1c61af8d96/data
2024.05.16 13:28:08.232677 [ 525 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 13:28:08.232709 [ 525 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 05:28:08 GMT; Content-Type: application/xml; Content-Length: 18482198404; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 3254060092-21736258495/21736258496; ETag: "743898dd8e61224c8842cd916bf60150-1220"; Last-Modified: Sat, 04 May 2024 16:30:35 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 54285e0ecff7a52155e28b256ef91a6942aba64f5275133a9915e5a03a2b0fe3; X-Amz-Request-Id: 17CFE0EC38B207BE; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449534533240356868;
2024.05.16 13:28:08.360277 [ 457 ] {} <Debug> AWSClient: AWS S3 slow read(100ms): http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/dbcd24f5-94ea-4d3c-0efb-fd1c61af8d96/data, time = 20051ms, header = Server: nginx/1.21.5; Date: Thu, 16 May 2024 05:28:08 GMT; Content-Type: application/xml; Content-Length: 62896; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 291850071-291912966/21736258496; ETag: "743898dd8e61224c8842cd916bf60150-1220"; Last-Modified: Sat, 04 May 2024 16:30:35 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 941e76dd7d6fa756cdff7ccf88d4481bffbe769d1028d5b43704b4c55c73ddfa; X-Amz-Request-Id: 17CFE0E797717C07; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449534533240356868;
.16 13:28:08.360913 [ 457 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/dbcd24f5-94ea-4d3c-0efb-fd1c61af8d96/data
2024.05.16 13:28:08.391827 [ 457 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 13:28:08.391853 [ 457 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 05:28:08 GMT; Content-Type: application/xml; Content-Length: 21724365923; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 11892573-21736258495/21736258496; ETag: "743898dd8e61224c8842cd916bf60150-1220"; Last-Modified: Sat, 04 May 2024 16:30:35 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: d4ff7959db658b9e0dffd743dd392e33bc300cd3bbd7a5bd3d29810f53c3c9c8; X-Amz-Request-Id: 17CFE0EC4277F777; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449534533240356868;
2024.05.16 13:28:09.732969 [ 502 ] {} <Debug> AWSClient: AWS S3 slow read(100ms): http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/dbcd24f5-94ea-4d3c-0efb-fd1c61af8d96/data, time = 20032ms, header = Server: nginx/1.21.5; Date: Thu, 16 May 2024 05:28:09 GMT; Content-Type: application/xml; Content-Length: 62896; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 11623284-11686179/21736258496; ETag: "743898dd8e61224c8842cd916bf60150-1220"; Last-Modified: Sat, 04 May 2024 16:30:35 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: f26f03c24c5ec8801cb87012b4a127bedc053def9caa5a112cf1f83805ebe38e; X-Amz-Request-Id: 17CFE0E7EA1D555F; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449534533240356868;
2024.05.16 13:28:09.733863 [ 502 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/dbcd24f5-94ea-4d3c-0efb-fd1c61af8d96/data
2024.05.16 13:28:09.814513 [ 502 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 13:28:09.814544 [ 502 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 05:28:09 GMT; Content-Type: application/xml; Content-Length: 21727900351; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 8358145-21736258495/21736258496; ETag: "743898dd8e61224c8842cd916bf60150-1220"; Last-Modified: Sat, 04 May 2024 16:30:35 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 1b73111f7edd0bc74bca9ee528d43c48f91ab0a501a53129569d2af3a2fa2865; X-Amz-Request-Id: 17CFE0EC9475C151; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449534533240356868;

Logs of vw-default-2 node:

2024.05.16 16:17:47.217211 [ 493 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/d530f076-cda9-d83c-1900-af39694d000c/data
2024.05.16 16:17:47.239594 [ 493 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 16:17:47.239661 [ 493 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 08:17:47 GMT; Content-Type: application/xml; Content-Length: 44640; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 209293991-209338630/15460208929; ETag: "30e2a9132c1d83500dcc2f8535b825ce-868"; Last-Modified: Sun, 05 May 2024 14:16:20 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 6b55756232eaebad29d4d8b397bf9c659f243dca8d7ab18592690d0cc4c34798; X-Amz-Request-Id: 17CFEA2E327FEC1A; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449554659452387339;
2024.05.16 16:17:47.240946 [ 493 ] {} <Debug> DiskLocal: Reserving 192.41 MiB on disk `server_local_0`, having unreserved 1.34 TiB.
2024.05.16 16:17:47.241188 [ 493 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/d530f076-cda9-d83c-1900-af39694d000c/data
2024.05.16 16:17:47.255143 [ 493 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 16:17:47.255166 [ 493 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 08:17:47 GMT; Content-Type: application/xml; Content-Length: 15452667054; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 7541875-15460208928/15460208929; ETag: "30e2a9132c1d83500dcc2f8535b825ce-868"; Last-Modified: Sun, 05 May 2024 14:16:20 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 941e76dd7d6fa756cdff7ccf88d4481bffbe769d1028d5b43704b4c55c73ddfa; X-Amz-Request-Id: 17CFEA2E339618C0; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449554659452387339;
2024.05.16 16:17:47.335909 [ 470 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 08:17:47 GMT; Content-Type: application/xml; Content-Length: 44640; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 7497235-7541874/15460208929; ETag: "30e2a9132c1d83500dcc2f8535b825ce-868"; Last-Modified: Sun, 05 May 2024 14:16:20 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 1b73111f7edd0bc74bca9ee528d43c48f91ab0a501a53129569d2af3a2fa2865; X-Amz-Request-Id: 17CFEA2E384E4EDD; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449554659452387339; 
2024.05.16 16:17:47.336322 [ 470 ] {} <Debug> DiskLocal: Reserving 99.45 KiB on disk `server_local_0`, having unreserved 1.34 TiB.
2024.05.16 16:17:47.336536 [ 470 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/d530f076-cda9-d83c-1900-af39694d000c/data
2024.05.16 16:17:47.354031 [ 470 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 16:17:47.354055 [ 470 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 08:17:47 GMT; Content-Type: application/xml; Content-Length: 15452813534; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 7395395-15460208928/15460208929; ETag: "30e2a9132c1d83500dcc2f8535b825ce-868"; Last-Modified: Sun, 05 May 2024 14:16:20 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 080f919c38a146b0c9e87ef5ccef4e7f57727f513de461329f04b2a946d47397; X-Amz-Request-Id: 17CFEA2E3951B7FB; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449554659452387339; 
2024.05.16 16:17:47.360179 [ 470 ] {} <Debug> DiskLocal: Reserving 43.59 KiB on disk `server_local_0`, having unreserved 1.34 TiB.
2024.05.16 16:17:47.372596 [ 490 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/d530f076-cda9-d83c-1900-af39694d000c/data
2024.05.16 16:17:47.392572 [ 490 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 16:17:47.392605 [ 490 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 08:17:47 GMT; Content-Type: application/xml; Content-Length: 44640; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 7350755-7395394/15460208929; ETag: "30e2a9132c1d83500dcc2f8535b825ce-868"; Last-Modified: Sun, 05 May 2024 14:16:20 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: 080f919c38a146b0c9e87ef5ccef4e7f57727f513de461329f04b2a946d47397; X-Amz-Request-Id: 17CFEA2E3B908F6F; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449554659452387339; 
2024.05.16 16:17:47.393722 [ 490 ] {} <Debug> DiskLocal: Reserving 2.22 MiB on disk `server_local_0`, having unreserved 1.34 TiB.
2024.05.16 16:17:47.393941 [ 490 ] {} <Debug> AWSClient: Make request to: http://minio-nginx-svc.minio-nginx.svc.cluster.local/bigdata-olap-data/pandora_data/d530f076-cda9-d83c-1900-af39694d000c/data
2024.05.16 16:17:47.419049 [ 490 ] {} <Debug> AWSClient: Response status: 206, Partial Content
2024.05.16 16:17:47.419073 [ 490 ] {} <Debug> AWSClient: Received headers: Server: nginx/1.21.5; Date: Thu, 16 May 2024 08:17:47 GMT; Content-Type: application/xml; Content-Length: 15455181897; Connection: keep-alive; Accept-Ranges: bytes; Content-Range: bytes 5027032-15460208928/15460208929; ETag: "30e2a9132c1d83500dcc2f8535b825ce-868"; Last-Modified: Sun, 05 May 2024 14:16:20 GMT; Strict-Transport-Security: max-age=31536000; includeSubDomains; Vary: Origin; Vary: Accept-Encoding; X-Amz-Id-2: a6df522adb4071567f8fb40dc8952caed1cdaa48348601461d84f5ba69b473c1; X-Amz-Request-Id: 17CFEA2E3CDA29CB; X-Content-Type-Options: nosniff; X-Xss-Protection: 1; mode=block; x-amz-meta-pg-id: T#449554659452387339;

Extracting the Content-Length and Content-Range values from the logs above:

vw-default-1 node:

Content-Length: 18482198404; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 3254060092-21736258495/21736258496;

Content-Length: 62896; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 291850071-291912966/21736258496

Content-Length: 21724365923; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 11892573-21736258495/21736258496; 

Content-Length: 62896; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 11623284-11686179/21736258496;

Content-Length: 21727900351; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 8358145-21736258495/21736258496;

vw-default-2 node:

Content-Length: 44640; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 209293991-209338630/15460208929;

Content-Length: 15452667054; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 7541875-15460208928/15460208929;

Content-Length: 44640; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 7497235-7541874/15460208929;

Content-Length: 15452813534; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 7395395-15460208928/15460208929;

Content-Length: 44640; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 7350755-7395394/15460208929;

Content-Length: 15455181897; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 5027032-15460208928/15460208929;

As shown above, the ranges requested for the same file overlap, and this has a significant impact on S3 bandwidth.
Is it reasonable for a segmented read to request a range exceeding 10 GB?
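
The overlap can be checked mechanically. The following is a minimal Python sketch, not taken from the project; the names `ByteRange`, `parse_content_range` and `overlap_bytes` are hypothetical. It parses two of the Content-Range values extracted above for the vw-default-2 node and reports how many byte offsets the two ranges share. Note that it only measures what each response advertises, not how many bytes were actually transferred.

```python
import re
from typing import NamedTuple


class ByteRange(NamedTuple):
    start: int  # first byte offset, inclusive
    end: int    # last byte offset, inclusive
    total: int  # full object size in bytes

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def parse_content_range(value: str) -> ByteRange:
    """Parse a header value such as 'bytes 7541875-15460208928/15460208929'."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value.strip().rstrip(";"))
    if m is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    return ByteRange(*(int(g) for g in m.groups()))


def overlap_bytes(a: ByteRange, b: ByteRange) -> int:
    """Number of byte offsets covered by both ranges (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


# Two open-ended requests from the vw-default-2 log.
r1 = parse_content_range("bytes 7541875-15460208928/15460208929")
r2 = parse_content_range("bytes 7395395-15460208928/15460208929")

print(r1.length)              # equals the logged Content-Length 15452667054
print(r2.length)              # equals the logged Content-Length 15452813534
print(overlap_bytes(r1, r2))  # advertised overlap in bytes
```

Because both ranges end at the last byte of the file, the advertised overlap is simply the length of the range that starts later.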

@FourSpaces added the "question" (Further information is requested) label on May 17, 2024
@Alima777
Collaborator

Alima777 commented May 21, 2024

Hi, don't worry, it's not a bug but an optimization.

Look at the following two requests: both use the file size as the right offset. That doesn't mean we read all the data covered by the request; we read only what we need, and the rest is discarded. In your case, they are probably two columns being read concurrently by two threads.

We do this to reduce the total number of S3 GET requests when a lot of data will be read.

Content-Length: 15452667054; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 7541875-15460208928/15460208929;

Content-Length: 15452813534; Connection: keep-alive; Accept-Ranges: bytes; 
Content-Range: bytes 7395395-15460208928/15460208929;

By the way, which version are you using? We have since optimized this to use an accurate right offset rather than the file size.
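
To illustrate the behaviour described in this comment, here is a toy Python model. It is not the project's reader, and `FakeRangeResponse` and `read_needed` are hypothetical names. The fake response advertises everything from the start offset to the end of the file but produces bytes lazily; the reader takes what it needs and then closes the stream. On a real HTTP connection some extra bytes may already be in flight or buffered when the reader closes, so the bytes sent can be somewhat larger than the bytes consumed.

```python
class FakeRangeResponse:
    """Models an open-ended range response: large advertised length, lazy body."""

    def __init__(self, start: int, file_size: int, chunk: int = 64 * 1024):
        self.content_length = file_size - start  # what Content-Length would report
        self.bytes_sent = 0                      # what was actually produced
        self._chunk = chunk
        self._closed = False

    def read(self, n: int) -> bytes:
        if self._closed:
            raise ValueError("read from closed response")
        remaining = self.content_length - self.bytes_sent
        size = min(n, self._chunk, remaining)
        self.bytes_sent += size
        return b"\0" * size

    def close(self) -> None:
        # Closing early abandons the rest of the advertised body.
        self._closed = True


def read_needed(resp: FakeRangeResponse, needed: int) -> int:
    """Consume `needed` bytes, then close the response; return bytes consumed."""
    got = 0
    while got < needed:
        data = resp.read(needed - got)
        if not data:
            break
        got += len(data)
    resp.close()
    return got


# Offsets from the vw-default-2 log; the 100000-byte need is an assumed value.
resp = FakeRangeResponse(start=7395395, file_size=15460208929)
consumed = read_needed(resp, needed=100_000)
print(resp.content_length, consumed, resp.bytes_sent)
```

In this model the advertised length is about 15.4 GB while only the bytes the reader asked for are ever produced, which is the distinction between Content-Length and transferred data made above.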

@FourSpaces
Author

We are using version 0.4.0. If multiple threads are reading different columns, the total amount of data read should be less than or equal to the file size, not 1.5 to 2 times the file size. This has increased our S3 bandwidth usage.

@Alima777
Collaborator

How did you arrive at the figure of 1.5 to 2 times the file size?

@FourSpaces
Author

FourSpaces commented May 22, 2024

I determined this from the Content-Length values: the sum of two Content-Length values is already greater than the file size.
In the example, the two Content-Length values sum to 15452667054 + 15452813534 = 30905480588, while the file size is 15460208929, so 30905480588 / 15460208929 ≈ 1.999.
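
The arithmetic above can be reproduced with a few lines of Python. This sketch only sums the advertised Content-Length values from the vw-default-2 log; it says nothing about how many bytes were actually transferred, which the headers alone do not show.

```python
# Content-Length values of the two open-ended requests (vw-default-2 log).
content_lengths = [15452667054, 15452813534]
file_size = 15460208929

advertised_total = sum(content_lengths)
ratio = advertised_total / file_size

print(advertised_total)  # 30905480588
print(f"{ratio:.3f}")    # 1.999
```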

@FourSpaces
Author

We have set these parameters:

enable_io_scheduler: 1
enable_io_pfra: 1

@Alima777
Collaborator

> I determined this from the Content-Length values: the sum of two Content-Length values is already greater than the file size.

That's not true. As I said before, we send the request but do not read all the data from it, and S3 doesn't return all the data either. It works like a stream.

@Alima777
Collaborator

> enable_io_scheduler: 1
> enable_io_pfra: 1

You can try turning these two settings off; in that case the right offset will be accurate.

@FourSpaces
Author

We can give it a try

@kevinthfang
Contributor

any updates?

@FourSpaces
Author

After we made this change, data merges became slower and S3 QPS increased significantly, so we rolled it back.
