branch-4.0: [feature](multi-catalog) Add max_file_split_num session variable to prevent OOM in file scan #58759 #60732

Merged
yiguolei merged 2 commits into apache:branch-4.0 from
suxiaogang223:pick-58759-branch-4.0-v2
Feb 14, 2026
@suxiaogang223
Contributor

…revent OOM in file scan (apache#58759)

### What problem does this PR solve?

- Related PR: apache#58858

## Problem Summary

When querying tables in an external catalog (Hive, Iceberg, Paimon, etc.),
Doris divides files into multiple splits for parallel processing. In some
cases, especially with numerous small files, this can generate an
excessive number of splits, potentially causing:

1. **Memory pressure**: Too many splits consume significant memory in the FE
2. **OOM issues**: Excessive split generation can lead to
OutOfMemoryError
3. **Performance degradation**: Managing too many splits increases query
planning overhead

Previously, there was no upper limit on the number of splits in
non-batch mode, which could lead to problems when querying tables with
many small files.

## Solution

This PR introduces a new session variable `max_file_split_num` to limit
the maximum number of splits allowed per table scan in non-batch mode.

### Changes

1. **New Session Variable**: `max_file_split_num`
   - Type: `int`
   - Default: `100000`
   - Description: "In non-batch mode, the maximum number of splits allowed per table scan, to prevent OOM caused by generating too many splits."
   - Forward to BE: `true`

2. **Implementation in FileQueryScanNode**:
   - Added method `applyMaxFileSplitNumLimit(long targetSplitSize, long totalFileSize)`
   - Dynamically calculates the minimum split size so that the split count does not exceed the limit
   - Formula: `minSplitSizeForMaxNum = (totalFileSize + maxFileSplitNum - 1) / maxFileSplitNum`
   - Returns: `Math.max(targetSplitSize, minSplitSizeForMaxNum)`

3. **Applied to multiple scan nodes**:
   - `HiveScanNode`
   - `IcebergScanNode`
   - `PaimonScanNode`
   - `TVFScanNode`

4. **Unit Tests**:
   - `FileQueryScanNodeTest`: Test base logic
   - `HiveScanNodeTest`: Test Hive-specific implementation
   - `IcebergScanNodeTest`: Test Iceberg-specific implementation
   - `PaimonScanNodeTest`: Test Paimon-specific implementation
   - `TVFScanNodeTest`: Test TVF-specific implementation
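
The formula in item 2 can be sketched as a standalone helper. This is a minimal illustration, not the code merged in this PR: the class name `SplitLimitSketch` is hypothetical, the limit is passed as a third parameter here (the real method takes two arguments, so it presumably obtains the limit from the session variable), and the guard for a non-positive total size is an assumption.

```java
// Hypothetical standalone sketch of the split-size cap described above.
final class SplitLimitSketch {
    private SplitLimitSketch() {}

    /**
     * Returns a split size large enough that totalFileSize divided by it
     * does not exceed maxFileSplitNum. A limit of 0 or less disables the cap,
     * matching the behavior described in the Usage section.
     */
    static long applyMaxFileSplitNumLimit(long targetSplitSize, long totalFileSize, int maxFileSplitNum) {
        // Assumption: with no limit or nothing to scan, keep the target size.
        if (maxFileSplitNum <= 0 || totalFileSize <= 0) {
            return targetSplitSize;
        }
        // Ceiling division: smallest split size that keeps the count within the limit.
        long minSplitSizeForMaxNum = (totalFileSize + maxFileSplitNum - 1) / maxFileSplitNum;
        // Never shrink below the size the planner already wanted.
        return Math.max(targetSplitSize, minSplitSizeForMaxNum);
    }
}
```

As a worked example, with a 64 MiB target (67,108,864 bytes), 10 TiB of files (10,995,116,277,760 bytes), and the default limit of 100000, the ceiling division gives 109,951,163 bytes, which is larger than the target, so the split size is raised to roughly 105 MiB. For a small scan the division result falls below the target and the target size is returned unchanged.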

## Usage

Users can now control the maximum number of splits per table scan by
setting the session variable:

```sql
-- Set to 50000 splits maximum
SET max_file_split_num = 50000;

-- Disable the limit (set to 0 or negative)
SET max_file_split_num = 0;
```

(cherry picked from commit 3e5a70f)
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run buildall

@suxiaogang223
Contributor Author

run buildall

@suxiaogang223 suxiaogang223 force-pushed the pick-58759-branch-4.0-v2 branch from 4f944e4 to 75641e0 Compare February 13, 2026 04:21
@suxiaogang223
Contributor Author

run buildall

@github-actions github-actions bot added the approved label Feb 13, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@yiguolei yiguolei merged commit 3dccd75 into apache:branch-4.0 Feb 14, 2026
32 of 42 checks passed
Labels

approved Indicates a PR has been approved by one committer. reviewed

4 participants