Memory leak in MessageChannelPartitionHandler when polling the database #4598

Closed
a-del-devision opened this issue May 17, 2024 · 2 comments
Labels
for: backport-to-5.0.x, for: backport-to-5.1.x, in: integration, type: bug
Milestone
5.2.0-M1

Comments

@a-del-devision

Bug description
In remote partitioning jobs that use the MessageChannelPartitionHandler with database polling, each database poll that finds one or more newly completed worker StepExecutions loads an additional copy of the corresponding JobInstance, JobExecution, and all StepExecutions with their ExecutionContexts, and keeps it in memory until the partition step is completed.

This causes high memory consumption during the partition step and can lead to out-of-memory errors if the poll interval is short enough and the number of partitions is high enough, especially since the ExecutionContexts are held in memory as well.

Environment
Any environment using spring-batch-integration 5.0.1 or above (93800c6) that uses the MessageChannelPartitionHandler with database polling.

Steps to reproduce
Run a remote partitioning batch job with database polling, a short poll interval, a high number of partitions, and limited available memory.

Expected behavior
Polling of the database in remote partitioning jobs does not lead to a constant, gradual increase of consumed memory until the partition step completes.

Minimal Complete Reproducible example
The minimal complete reproducible example is available here: spring-batch-mcve-memory-leak.zip

The example runs a remote partitioning batch job with 1000 partitions, each having an ExecutionContext containing a single UUID. To exacerbate the memory consumption, the poll interval is set very low (2 ms) and the worker step sleeps for 50 ms before completing. This allows a new copy of the JobInstance, JobExecution, StepExecutions, and ExecutionContexts to be loaded and held in memory each time a worker step completes.
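
For context without opening the archive, here is a minimal sketch of this kind of setup. The class, bean, and step names, the exact values, and the omitted channel/worker wiring are illustrative assumptions and are not copied from the attached zip:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.core.MessagingTemplate;

@Configuration
public class PartitioningSketchConfiguration {

    // 1000 partitions, each carrying a single UUID in its ExecutionContext,
    // mirroring the shape described for the reproducer.
    @Bean
    public Partitioner partitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putString("uuid", UUID.randomUUID().toString());
                partitions.put("partition" + i, context);
            }
            return partitions;
        };
    }

    // Manager-side partition handler configured for database polling (a JobExplorer
    // is set instead of a reply channel). The very short poll interval makes the
    // memory growth visible quickly.
    @Bean
    public MessageChannelPartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
            JobExplorer jobExplorer) {
        MessageChannelPartitionHandler partitionHandler = new MessageChannelPartitionHandler();
        partitionHandler.setStepName("workerStep");
        partitionHandler.setGridSize(1000);
        partitionHandler.setMessagingOperations(messagingTemplate);
        partitionHandler.setJobExplorer(jobExplorer);
        partitionHandler.setPollInterval(2); // milliseconds
        return partitionHandler;
    }
}
```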

Please run the example with the -Xmx64m -XX:+HeapDumpOnOutOfMemoryError JVM options:

MAVEN_OPTS='-Xmx64m -XX:+HeapDumpOnOutOfMemoryError' mvn package exec:java -Dexec.mainClass=org.springframework.batch.MyBatchJobConfiguration

This should cause an OutOfMemoryError to be thrown rather quickly, and the resulting heap dump should contain the following:

  • 60-65 instances of JobInstance, JobParameters, and JobExecution
  • 60k-65k instances of StepExecution and ExecutionContext

Analysis
In MessageChannelPartitionHandler#pollReplies, the callback calls JobExplorer#getJobExecution. The SimpleJobExplorer implementation loads the JobExecution and all of its StepExecutions as well as their ExecutionContexts. Each of these StepExecutions also holds a reference to the JobExecution and thus, indirectly, to all other StepExecutions. Any loaded StepExecution that is completed and not yet present in the result Set is added to it, which causes the currently loaded JobExecution instance, and with it all of the other StepExecution instances, to be held in memory.
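
To make the retention pattern concrete, here is a simplified paraphrase of the polling logic described above. The class, method, and parameter names are illustrative; this is not the verbatim pollReplies source:

```java
import java.util.HashSet;
import java.util.Set;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;

// Illustrative sketch of the retention pattern described above.
class PollRepliesSketch {

    // The result set lives until the partition step completes.
    private final Set<StepExecution> result = new HashSet<>();

    // One poll iteration, roughly as described in the analysis.
    boolean pollOnce(JobExplorer jobExplorer, long jobExecutionId, String workerStepName, int partitionCount) {
        // Every call loads a fresh object graph: JobInstance, JobParameters,
        // the JobExecution, all StepExecutions and their ExecutionContexts.
        JobExecution jobExecution = jobExplorer.getJobExecution(jobExecutionId);
        for (StepExecution stepExecution : jobExecution.getStepExecutions()) {
            if (stepExecution.getStepName().startsWith(workerStepName)
                    && !stepExecution.getStatus().isRunning()
                    && !result.contains(stepExecution)) {
                // Retaining this StepExecution also pins its parent JobExecution and,
                // through it, every other StepExecution and ExecutionContext loaded
                // in this poll.
                result.add(stepExecution);
            }
        }
        return result.size() >= partitionCount;
    }
}
```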

@hpoettker
Contributor

Thanks for the very well-written bug report. I think your description and analysis are absolutely on point, and the example files are very concise and useful. Very much appreciated!

I've opened a PR to resolve the issue: #4599

With this fix, your reproducer runs successfully even with -Xmx32m.
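
For reference, that is the same command as above with only the maximum heap size changed:

MAVEN_OPTS='-Xmx32m -XX:+HeapDumpOnOutOfMemoryError' mvn package exec:java -Dexec.mainClass=org.springframework.batch.MyBatchJobConfiguration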

@fmbenhassine added the in: infrastructure label and removed the status: waiting-for-triage label on May 21, 2024
@fmbenhassine
Contributor

@a-del-devision Thank you for reporting this issue in detail and for providing a minimal complete example!

In fact, there is no need to hold a reference to the entire object graph of each completed worker step in memory until the partition step is completed. The change in #4599 removes the intermediate result that holds these references, which fixes the memory leak.
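
For illustration, one possible shape of such a change is sketched below: each poll only counts the completed worker steps instead of accumulating StepExecution objects across polls. The class and method names are hypothetical, and this is not necessarily the exact code in #4599:

```java
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.explore.JobExplorer;

// Hypothetical sketch: count completed worker steps on each poll instead of
// accumulating StepExecution objects across polls.
class CountingPollSketch {

    boolean allWorkersFinished(JobExplorer jobExplorer, long jobExecutionId,
            String workerStepName, int partitionCount) {
        JobExecution jobExecution = jobExplorer.getJobExecution(jobExecutionId);
        long finishedWorkers = jobExecution.getStepExecutions().stream()
                .filter(stepExecution -> stepExecution.getStepName().startsWith(workerStepName))
                .filter(stepExecution -> !stepExecution.getStatus().isRunning())
                .count();
        // Nothing loaded here outlives the call, so each polled object graph becomes
        // eligible for garbage collection as soon as the method returns.
        return finishedWorkers >= partitionCount;
    }
}
```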

I will plan that fix for the upcoming patch releases 5.1.2 and 5.0.6.

@fmbenhassine added this to the 5.2.0-M1 milestone on May 22, 2024
@fmbenhassine added the for: backport-to-5.0.x and for: backport-to-5.1.x labels on May 22, 2024
fmbenhassine pushed a commit that referenced this issue May 22, 2024
fmbenhassine pushed a commit that referenced this issue May 22, 2024