DD pipeline: gate all relocation priorities, not just low-priority ones#13381
DD pipeline: gate all relocation priorities, not just low-priority ones#13381gxglass wants to merge 1 commit into
Conversation
Previously, moves with priority >= PRIORITY_TEAM_UNHEALTHY (700) bypassed the DDQueue pipeline gate. In production, operator "exclude" operations emit moves at exactly PRIORITY_TEAM_UNHEALTHY, so the vast majority of real relocations escaped DD_MAX_PIPELINE_MOVES entirely -- defeating the back-pressure the gate was meant to provide. Remove the high-priority exemption so the gate applies to every relocation except cancellations (which reduce tracked metadata rather than adding to it). Also: - Raise the simulation BUGGIFY value of DD_MAX_PIPELINE_MOVES from 5 to 20. A limit of 5 created degenerate artificial scarcity that is not representative of any real cluster. - Add a "DD Pipeline Full" CODE_PROBE at the pipeline-full transition. - Add tests/fast/DDPipelineSaturation.toml, which forces >20 concurrent relocations (tiny shards + machine kills, limit pinned to 20) so the non-rare probe stays covered across the Joshua ensemble. Testing: Joshua correctness ensemble: 20260623-045806-gglass-c664dabd16a534de compressed=True data_size=37256856 duration=2909497 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:29:25 sanity=False stopped=20260623-052731 submitted=20260623-045806 timeout=5400 username=gglass The single failure was tests/restarting/from_7.3.0_until_7.4.0/SnapTestRestart-2, a pre-existing known flake (filed 2025-10-08, predating this work) that times out in the phase-2 restart. It ran with buggify disabled, so DD_MAX_PIPELINE_MOVES was 1000 and the pipeline gate was inert -- the failure is in the restart/recovery path and unrelated to this change.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr on Linux RHEL 9
|
|
Some review back and forth with an unbiased (but overly cautious) independent agent. |
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
|
@gxglass this does look good to me. Minor question for only exempt cancellations since they're actually freeing slots. I was thinking if Adding would love to have your opinion on it. |
@Ronitsabhaya75 thanks for having a look. The truth is that an agent fine-tuned the settings in this toml file based on its own execution of the test a number of times. I believe it looked at various log messages to check the size of the built up queue (I forgot the details already but I remember being pretty satisfied with its work). I did not adjust it further. BTW, all of the code in this PR was written by an agent with aggressive prompting from me about what I wanted it to do and why. For example I had to tell it "find a way to shrink the shard size" to ensure we got a lot of shards. It duly figured out how to do that. From the log messages it reported seeing I think the CODE_PROBE is hit often enough that we can be satisfied with it. |
I see and does make sense. I'm happy with this. |
|
I'll review this by today or latest tomorrow morning. |
Previously, moves with priority >= PRIORITY_TEAM_UNHEALTHY (700) bypassed the DDQueue pipeline gate. In production, operator "exclude" operations emit moves at exactly PRIORITY_TEAM_UNHEALTHY, so the vast majority of real relocations escaped DD_MAX_PIPELINE_MOVES entirely -- defeating the back-pressure the gate was meant to provide. Remove the high-priority exemption so the gate applies to every relocation except cancellations (which reduce tracked metadata rather than adding to it).
Also:
Testing:
Joshua correctness ensemble:
20260623-045806-gglass-c664dabd16a534de compressed=True data_size=37256856 duration=2909497 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:29:25 sanity=False stopped=20260623-052731 submitted=20260623-045806 timeout=5400 username=gglass
The single failure was tests/restarting/from_7.3.0_until_7.4.0/SnapTestRestart-2, a pre-existing known flake (filed 2025-10-08, predating this work) that times out in the phase-2 restart. It ran with buggify disabled, so DD_MAX_PIPELINE_MOVES was 1000 and the pipeline gate was inert -- the failure is in the restart/recovery path and unrelated to this change.