Skip to content

DD pipeline: gate all relocation priorities, not just low-priority ones#13381

Open
gxglass wants to merge 1 commit into
apple:mainfrom
gxglass:ddpipelinefix
Open

DD pipeline: gate all relocation priorities, not just low-priority ones#13381
gxglass wants to merge 1 commit into
apple:mainfrom
gxglass:ddpipelinefix

Conversation

@gxglass

@gxglass gxglass commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Previously, moves with priority >= PRIORITY_TEAM_UNHEALTHY (700) bypassed the DDQueue pipeline gate. In production, operator "exclude" operations emit moves at exactly PRIORITY_TEAM_UNHEALTHY, so the vast majority of real relocations escaped DD_MAX_PIPELINE_MOVES entirely -- defeating the back-pressure the gate was meant to provide. Remove the high-priority exemption so the gate applies to every relocation except cancellations (which reduce tracked metadata rather than adding to it).

Also:

  • Raise the simulation BUGGIFY value of DD_MAX_PIPELINE_MOVES from 5 to 20. A limit of 5 created degenerate artificial scarcity that is not representative of any real cluster. The prior pipeline exemption for high priority moves was likely a result of using this low value.
  • Add a "DD Pipeline Full" CODE_PROBE at the pipeline-full transition.
  • Add tests/fast/DDPipelineSaturation.toml, which forces >20 concurrent relocations (tiny shards + machine kills, limit pinned to 20) so the non-rare probe stays covered across the Joshua ensemble. Lack of this test was likely the reason for using a pipeline size of 5 earlier (in order to observe some coverage of the pipeline-full condition).

Testing:

Joshua correctness ensemble:
20260623-045806-gglass-c664dabd16a534de compressed=True data_size=37256856 duration=2909497 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:29:25 sanity=False stopped=20260623-052731 submitted=20260623-045806 timeout=5400 username=gglass

The single failure was tests/restarting/from_7.3.0_until_7.4.0/SnapTestRestart-2, a pre-existing known flake (filed 2025-10-08, predating this work) that times out in the phase-2 restart. It ran with buggify disabled, so DD_MAX_PIPELINE_MOVES was 1000 and the pipeline gate was inert -- the failure is in the restart/recovery path and unrelated to this change.

Previously, moves with priority >= PRIORITY_TEAM_UNHEALTHY (700) bypassed the
DDQueue pipeline gate. In production, operator "exclude" operations emit moves
at exactly PRIORITY_TEAM_UNHEALTHY, so the vast majority of real relocations
escaped DD_MAX_PIPELINE_MOVES entirely -- defeating the back-pressure the gate
was meant to provide. Remove the high-priority exemption so the gate applies to
every relocation except cancellations (which reduce tracked metadata rather
than adding to it).

Also:
- Raise the simulation BUGGIFY value of DD_MAX_PIPELINE_MOVES from 5 to 20.
  A limit of 5 created degenerate artificial scarcity that is not
  representative of any real cluster.
- Add a "DD Pipeline Full" CODE_PROBE at the pipeline-full transition.
- Add tests/fast/DDPipelineSaturation.toml, which forces >20 concurrent
  relocations (tiny shards + machine kills, limit pinned to 20) so the
  non-rare probe stays covered across the Joshua ensemble.

Testing:

Joshua correctness ensemble:
  20260623-045806-gglass-c664dabd16a534de compressed=True data_size=37256856
  duration=2909497 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999
  priority=100 remaining=0 runtime=0:29:25 sanity=False stopped=20260623-052731
  submitted=20260623-045806 timeout=5400 username=gglass

The single failure was tests/restarting/from_7.3.0_until_7.4.0/SnapTestRestart-2,
a pre-existing known flake (filed 2025-10-08, predating this work) that times out
in the phase-2 restart. It ran with buggify disabled, so DD_MAX_PIPELINE_MOVES was
1000 and the pipeline gate was inert -- the failure is in the restart/recovery
path and unrelated to this change.
@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: f3c4846
  • Duration 0:20:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS 14.x

  • Commit ID: f3c4846
  • Duration 0:32:20
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux RHEL 9

  • Commit ID: f3c4846
  • Duration 0:46:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: f3c4846
  • Duration 0:56:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS 14.x

  • Commit ID: f3c4846
  • Duration 1:00:20
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: f3c4846
  • Duration 1:00:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass

gxglass commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Some review back and forth with an unbiased (but overly cautious) independent agent.
ddpipeline-review-discussion.md

@foundationdb-ci

Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: f3c4846
  • Duration 1:05:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@Ronitsabhaya75

Ronitsabhaya75 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@gxglass this does look good to me.

Minor question for only exempt cancellations since they're actually freeing slots. DDPipelineSaturation.toml has no [configuration] block, so if the cluster randomizes to ≤5 machines, Attrition skips killing any and the pipeline-full path never gets hit.

I was thinking if Adding machineCount = 10 would help the code probe to be covered.

would love to have your opinion on it.

@gxglass

gxglass commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator Author

@gxglass this does look good to me.

Minor question for only exempt cancellations since they're actually freeing slots. DDPipelineSaturation.toml has no [configuration] block, so if the cluster randomizes to ≤5 machines, Attrition skips killing any and the pipeline-full path never gets hit.

I was thinking if Adding machineCount = 10 would help the code probe to be covered.

would love to have your opinion on it.

@Ronitsabhaya75 thanks for having a look. The truth is that an agent fine-tuned the settings in this toml file based on its own execution of the test a number of times. I believe it looked at various log messages to check the size of the built up queue (I forgot the details already but I remember being pretty satisfied with its work). I did not adjust it further. BTW, all of the code in this PR was written by an agent with aggressive prompting from me about what I wanted it to do and why. For example I had to tell it "find a way to shrink the shard size" to ensure we got a lot of shards. It duly figured out how to do that. From the log messages it reported seeing I think the CODE_PROBE is hit often enough that we can be satisfied with it.

@Ronitsabhaya75

Copy link
Copy Markdown
Contributor

W, all of the code in this PR was written by an agent with aggressive prompting from me about what I wanted it to do and why. For example I had to tell it "find a way to shrink the shard size" to ensure we got a lot of shards. It duly figured out how to do that. From the log messages it reported seeing I think the CODE_PROBE is hit often enough that we can be satisfied with it.

I see and does make sense. I'm happy with this.

@gxglass gxglass requested review from saintstack and spraza June 25, 2026 06:01
@spraza

spraza commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

I'll review this by today or latest tomorrow morning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants