Allow configuring the thread pool Iceberg uses to read data (plan files/tasks) #10335
Comments
Checking: I thought we already had an API that lets users pass in a custom thread pool during planning? If not, I think it makes sense to add one.
Yeah, we have a planWith API on Scans already, and Trino is already using it when generating splits here: https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitManager.java#L91. @osscm, do you mind elaborating what kind of configuration you had in mind? It seems to me that with a custom executor (whose size is controlled by the engine), a user could control the things you were talking about, like memory consumption. Or maybe you're seeing that we're not leveraging that thread pool in certain cases when we should be, but that's a separate problem.
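To make the discussion concrete, here is a minimal sketch of what handing an engine-owned executor to planning looks like. The pool construction is plain JDK code; the actual `Scan#planWith` call (the API referenced above) is shown as a comment so the snippet compiles without Iceberg on the classpath, and the table variable in it is hypothetical:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PlanningPoolExample {

    // An engine-owned pool whose size the engine controls, instead of
    // relying on Iceberg's static, core-count-sized worker pool.
    static ExecutorService newPlanningPool(int threads) {
        AtomicInteger counter = new AtomicInteger();
        return Executors.newFixedThreadPool(threads, runnable -> {
            Thread t = new Thread(runnable,
                "engine-iceberg-planning-" + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        });
    }

    public static void main(String[] args) {
        ExecutorService planningPool = newPlanningPool(8);

        // With Iceberg on the classpath, the engine would pass this pool
        // to the scan, as Trino does in IcebergSplitManager:
        //
        //   CloseableIterable<FileScanTask> tasks =
        //       table.newScan().planWith(planningPool).planFiles();

        System.out.println("pool created: " + !planningPool.isShutdown());
        planningPool.shutdown();
    }
}
```

The point of the thread factory is that the engine can name, bound, and monitor the planning threads itself, rather than sharing one JVM-wide static pool across all catalogs and queries.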
Thanks Amogh,
For us it's the first problem (we haven't debugged whether the worker pool goes unused for certain tasks, but it is used for scanning).
I had a similar idea: either pass an executor from the engine, or let the engine provide the number of threads; we could offer both options. If engines can simply trust the Iceberg APIs to manage the executor, then passing just the number of threads is enough.
Happy to add the change, as it seems like it should be straightforward.
Also, do you know if any test or analysis has been done to understand how much memory a scan can take, based on the snapshots/manifests/data files?
Feature Request / Improvement
Right now there is only one static thread pool, which the Iceberg library uses internally. Its size is derived from the number of cores (https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SystemConfigs.java#L42), and engines like Trino and Spark have no way to control it.
Users use Trino to access multiple catalogs, so having this support is especially important for Trino.
The other problem with this approach is that engines using Iceberg have no control over, or visibility into, the memory used by these threads, which can be directly proportional to the number of threads. When Trino is running many concurrent queries, this can destabilize the coordinator.
See iceberg/core/src/main/java/org/apache/iceberg/util/ThreadPools.java, line 41 at commit a78aa2d.
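For completeness, the only knob that exists today is the global one read by SystemConfigs: the `iceberg.worker.num-threads` system property (with an environment-variable fallback). The sketch below mirrors that fallback logic as I understand it from SystemConfigs, using only the JDK; the exact resolution order in Iceberg may differ:

```java
public class WorkerPoolSizing {

    // Approximates Iceberg's SystemConfigs behavior: use the
    // "iceberg.worker.num-threads" system property if set,
    // otherwise fall back to the available core count.
    static int workerPoolSize() {
        String prop = System.getProperty("iceberg.worker.num-threads");
        if (prop != null) {
            return Integer.parseInt(prop);
        }
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        // Note: this must be set before Iceberg's ThreadPools class is
        // initialized; once the static pool is created, changing the
        // property has no effect, which is exactly the limitation this
        // issue is about.
        System.setProperty("iceberg.worker.num-threads", "4");
        System.out.println("worker pool size: " + workerPoolSize());
    }
}
```

Because the pool is static and JVM-wide, this property cannot be tuned per catalog or per query, which is why an engine-supplied executor is the more flexible fix.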
Related issues:
trinodb/trino#11920
trinodb/trino#11708
Query engine
None