DD pipeline: gate all relocation priorities, not just low-priority ones by gxglass · Pull Request #13381 · apple/foundationdb

gxglass · 2026-06-23T05:38:18Z

Previously, moves with priority >= PRIORITY_TEAM_UNHEALTHY (700) bypassed the DDQueue pipeline gate. In production, operator "exclude" operations emit moves at exactly PRIORITY_TEAM_UNHEALTHY, so the vast majority of real relocations escaped DD_MAX_PIPELINE_MOVES entirely -- defeating the back-pressure the gate was meant to provide. Remove the high-priority exemption so the gate applies to every relocation except cancellations (which reduce tracked metadata rather than adding to it).
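To make the before/after concrete, here is a minimal sketch of the gating predicate as the description words it. This is not the actual DDQueue code: `Relocation`, `gatedBefore`, `gatedAfter`, and `mayLaunch` are hypothetical names, and only the value 700 for PRIORITY_TEAM_UNHEALTHY is taken from the text above. How cancellations were treated before the change is not stated, so the "before" predicate looks at priority only.

```cpp
// Illustrative only: hypothetical types and function names, not FoundationDB source.
constexpr int PRIORITY_TEAM_UNHEALTHY = 700; // value quoted in the PR description

struct Relocation {
    int priority;
    bool isCancellation;
};

// Before (as described): moves at or above PRIORITY_TEAM_UNHEALTHY bypassed the gate.
// Assumption: cancellation handling before the change is not modelled here.
constexpr bool gatedBefore(const Relocation& r) {
    return r.priority < PRIORITY_TEAM_UNHEALTHY;
}

// After (as described): every relocation is gated except cancellations,
// which reduce tracked metadata rather than adding to it.
constexpr bool gatedAfter(const Relocation& r) {
    return !r.isCancellation;
}

// A gated relocation may launch only while the pipeline has room;
// an ungated one always may.
constexpr bool mayLaunch(bool gated, int inFlight, int maxPipelineMoves) {
    return !gated || inFlight < maxPipelineMoves;
}
```

Under this sketch, a move at exactly priority 700 (the "exclude" case) was ungated before and is gated after, which is the behavioral change this PR is after.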

Also:

- Raise the simulation BUGGIFY value of DD_MAX_PIPELINE_MOVES from 5 to 20. A limit of 5 created degenerate artificial scarcity that is not representative of any real cluster. The prior pipeline exemption for high priority moves was likely a result of using this low value.
- Add a "DD Pipeline Full" CODE_PROBE at the pipeline-full transition.
- Add tests/fast/DDPipelineSaturation.toml, which forces >20 concurrent relocations (tiny shards + machine kills, limit pinned to 20) so the non-rare probe stays covered across the Joshua ensemble. Lack of this test was likely the reason for using a pipeline size of 5 earlier (in order to observe some coverage of the pipeline-full condition).
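The interaction between the limit and the probe can be sketched roughly as follows. Everything here is a hypothetical stand-in: `maxPipelineMoves`, `PipelineGate`, and the `fullTransitions` counter are invented for illustration, the real knob machinery and CODE_PROBE macro are not reproduced, and the assumption that the probe fires when a relocation is first refused is mine. Only the values 20 (simulation BUGGIFY) and 1000 (buggify disabled) come from this PR's text.

```cpp
// Illustrative only: hypothetical names, not FoundationDB source.
constexpr int kDefaultMaxPipelineMoves = 1000; // value quoted for runs with buggify disabled
constexpr int kBuggifyMaxPipelineMoves = 20;   // raised from 5 by this change

// Simplification: treats buggify as a plain on/off switch for the knob value.
constexpr int maxPipelineMoves(bool buggifyEnabled) {
    return buggifyEnabled ? kBuggifyMaxPipelineMoves : kDefaultMaxPipelineMoves;
}

struct PipelineGate {
    int limit;
    int inFlight = 0;
    bool full = false;
    int fullTransitions = 0; // stands in for hits of the "DD Pipeline Full" probe

    explicit PipelineGate(int limit) : limit(limit) {}

    // Admit a relocation if there is room. Count a hit only on the
    // not-full -> full transition, not on every refused attempt.
    bool tryAdmit() {
        if (inFlight >= limit) {
            if (!full) {
                full = true;
                ++fullTransitions;
            }
            return false;
        }
        ++inFlight;
        return true;
    }

    // A relocation finished; a slot frees up and the gate is no longer full.
    void complete() {
        if (inFlight > 0) {
            --inFlight;
        }
        if (inFlight < limit) {
            full = false;
        }
    }
};
```

With a limit of 20, the counter in this sketch only moves once more than 20 relocations are attempted concurrently, which is why the new test has to force that many. With the limit at 1000, an ordinary simulation run would rarely if ever reach it.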

Testing:

Joshua correctness ensemble:
20260623-045806-gglass-c664dabd16a534de compressed=True data_size=37256856 duration=2909497 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:29:25 sanity=False stopped=20260623-052731 submitted=20260623-045806 timeout=5400 username=gglass

The single failure was tests/restarting/from_7.3.0_until_7.4.0/SnapTestRestart-2, a pre-existing known flake (filed 2025-10-08, predating this work) that times out in the phase-2 restart. It ran with buggify disabled, so DD_MAX_PIPELINE_MOVES was 1000 and the pipeline gate was inert -- the failure is in the restart/recovery path and unrelated to this change.

foundationdb-ci · 2026-06-23T05:59:01Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: f3c4846
Duration 0:20:32
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-23T06:10:47Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: f3c4846
Duration 0:32:20
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-23T06:25:05Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: f3c4846
Duration 0:46:34
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-23T06:34:37Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: f3c4846
Duration 0:56:10
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-23T06:38:47Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: f3c4846
Duration 1:00:20
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-23T06:38:51Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: f3c4846
Duration 1:00:21
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

gxglass · 2026-06-23T06:40:48Z

Some review back and forth with an unbiased (but overly cautious) independent agent.
ddpipeline-review-discussion.md

foundationdb-ci · 2026-06-23T06:43:43Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: f3c4846
Duration 1:05:17
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

Ronitsabhaya75 · 2026-06-24T01:20:58Z

@gxglass this does look good to me.

Exempting only cancellations makes sense, since they're actually freeing slots. One minor question: DDPipelineSaturation.toml has no [configuration] block, so if the cluster randomizes to ≤5 machines, Attrition skips killing any and the pipeline-full path never gets hit.

I was wondering whether adding machineCount = 10 would help keep the code probe covered.

Would love to hear your opinion on it.

gxglass · 2026-06-24T01:38:35Z

@Ronitsabhaya75 thanks for having a look. The truth is that an agent fine-tuned the settings in this toml file based on its own repeated runs of the test. I believe it looked at various log messages to check the size of the built-up queue (I have already forgotten the details, but I remember being pretty satisfied with its work). I did not adjust it further. BTW, all of the code in this PR was written by an agent, with aggressive prompting from me about what I wanted it to do and why. For example, I had to tell it "find a way to shrink the shard size" to ensure we got a lot of shards. It duly figured out how to do that. From the log messages it reported seeing, I think the CODE_PROBE is hit often enough that we can be satisfied with it.

Ronitsabhaya75 · 2026-06-24T01:52:41Z

I see, and that does make sense. I'm happy with this.

spraza · 2026-06-25T20:50:52Z

I'll review this today, or tomorrow morning at the latest.

Ronitsabhaya75 approved these changes Jun 24, 2026

gxglass requested review from saintstack and spraza June 25, 2026 06:01

gxglass wants to merge 1 commit into apple:main from gxglass:ddpipelinefix


4 participants
