Skip to content

Parallel tests#2190

Open
Johan-Liebert1 wants to merge 8 commits into
bootc-dev:mainfrom
Johan-Liebert1:parallel-tests
Open

Parallel tests#2190
Johan-Liebert1 wants to merge 8 commits into
bootc-dev:mainfrom
Johan-Liebert1:parallel-tests

Conversation

@Johan-Liebert1

Copy link
Copy Markdown
Collaborator

No description provided.

@Johan-Liebert1 Johan-Liebert1 added the ci/merge Run full CI suite (all OSes) — equivalent to merge queue label May 8, 2026
@bootc-bot bootc-bot Bot requested a review from cgwalters May 8, 2026 08:36

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the tmt test runner to support parallel execution of test plans using std::thread. It introduces a RunPlanResult struct and extracts the core test logic into a run_plan function. The review feedback identifies several opportunities for improvement, including addressing interleaved console output from concurrent threads, optimizing the thread-joining logic to avoid waiting for entire batches, handling potential panics when querying system parallelism, and removing redundant clones of owned objects.

Comment thread crates/xtask/src/tmt.rs
Comment thread crates/xtask/src/tmt.rs Outdated
Comment thread crates/xtask/src/tmt.rs Outdated
Comment thread crates/xtask/src/tmt.rs Outdated
Comment thread crates/xtask/src/tmt.rs Outdated
@Johan-Liebert1 Johan-Liebert1 marked this pull request as draft May 8, 2026 08:48

@cgwalters cgwalters left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without any kind of deeper review, I think what we really want to do is upstream into https://github.com/teemtee/tmt support for bcvk.

I believe it already has support for concurrency (I mean I'd hope) and such - and that would make it a lot more sustainable for other projects to use tmt+bcvk (and we have many in the ecosystem that would make sense to do so)

Comment thread .github/workflows/ci.yml Outdated
Comment thread crates/xtask/src/tmt.rs Outdated
@Johan-Liebert1

Copy link
Copy Markdown
Collaborator Author

One or two test failures, but not as good of a speedup as I'd had locally. GA seems to have 4cpu

@cgwalters

Copy link
Copy Markdown
Collaborator

Yes, we can bump to larger runners though. I was experimenting with that previously in bootc-dev/ci-sandbox#1 but it's been a while

@Johan-Liebert1 Johan-Liebert1 added ci/tier-1 Run CI for tier-1 OS (centos-10) only and removed ci/merge Run full CI suite (all OSes) — equivalent to merge queue labels May 9, 2026
@Johan-Liebert1 Johan-Liebert1 force-pushed the parallel-tests branch 3 times, most recently from 5230a9f to a98f50a Compare May 11, 2026 10:41
@Johan-Liebert1

Copy link
Copy Markdown
Collaborator Author

Alright, so composefs integration tests now take ~30-40 mins instead of ~1hr - 1hr30min

Tests are still quite inconsistent tough. Especially the GC one randomly takes over 1000s to complete, dunno why. Needs some investigation

@Johan-Liebert1

Copy link
Copy Markdown
Collaborator Author

The ostree one takes around ~1hr instead of ~2h15min

Some tests are randomly taking a bit too long though

/tmt/plans/integration/plan-32-multi-device-esp: PASSED (1140.54328774s)

I'll try to debug this

@Johan-Liebert1 Johan-Liebert1 added ci/merge Run full CI suite (all OSes) — equivalent to merge queue and removed ci/tier-1 Run CI for tier-1 OS (centos-10) only labels May 11, 2026
Comment thread crates/xtask/src/tmt.rs Outdated
Johan-Liebert1 added a commit to Johan-Liebert1/composefs-rs that referenced this pull request May 25, 2026
`bootc status` for UKIs takes upto 250MB of memory as we load the entire
UKI into memory just to extract the cmdline. In bootc-dev/bootc#2190
tests for UKI get OOM killed

Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
Johan-Liebert1 added a commit to Johan-Liebert1/composefs-rs that referenced this pull request May 25, 2026
`bootc status` for UKIs takes upto 250MB of memory as we load the entire
UKI into memory just to extract the cmdline. In bootc-dev/bootc#2190
tests for UKI get OOM killed

Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
cgwalters pushed a commit to cgwalters/composefs-rs that referenced this pull request May 26, 2026
`bootc status` for UKIs takes upto 250MB of memory as we load the entire
UKI into memory just to extract the cmdline. In bootc-dev/bootc#2190
tests for UKI get OOM killed

Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
@Johan-Liebert1 Johan-Liebert1 force-pushed the parallel-tests branch 2 times, most recently from ff859ce to e437f70 Compare May 30, 2026 09:40
@Johan-Liebert1 Johan-Liebert1 added ci/tier-1 Run CI for tier-1 OS (centos-10) only and removed ci/merge Run full CI suite (all OSes) — equivalent to merge queue labels May 30, 2026
@Johan-Liebert1 Johan-Liebert1 force-pushed the parallel-tests branch 4 times, most recently from 00e2ca9 to d514d82 Compare June 1, 2026 08:25
@Johan-Liebert1

Copy link
Copy Markdown
Collaborator Author

Interesting

content: error: 
    Building UKI: Computing composefs digest: 
        Reading container root: 
            Reading filtered filesystem: 
                Async reading filesystem from .: 
                    Scanning directory .: 
                        Scanning inode usr: Scanning directory usr: Scanning inode lib: Scanning directory lib: 
                            Getting file stats: Reading extended attributes: Operation not supported (os error 95)

I'll look into this one

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
So we don't spawn VMs for tests like "composefs-gc" just to do nothing
and exit

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Sort tests in descending order of time taken for completion so longer
tests get scheduled together. Also, update to use mpsc channels for
communication between threads

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Use a Mutex guard for VM creation as CI sometimes is failing with

```
Error:
0: Failed to create libvirt domain
1: Failed to start libvirt domain: error: Failed to start domain 'bootc-..'
  error: failed to create directory '/run/user/1001/libvirt/qemu/run/swtpm': File exists
```

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
To reduce CI failures due to flakes, rerun failed tests

Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
@Johan-Liebert1

Copy link
Copy Markdown
Collaborator Author

I think this one's ready for review now

@Johan-Liebert1 Johan-Liebert1 marked this pull request as ready for review June 18, 2026 08:37
@Johan-Liebert1 Johan-Liebert1 requested a review from cgwalters June 18, 2026 08:37
@bootc-bot bootc-bot Bot requested a review from henrywang June 18, 2026 08:37
Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Comment thread crates/xtask/src/tmt.rs
// Log disk usage after each test run to help diagnose "no space left on device" failures
println!("Disk usage after plan {}:", plan);
let _ = cmd!(sh, "df -h").run();
let _ = cmd!(sh, "du -h /var/* -d 1").run();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df is a cheap operation, this one is not; I don't think we should do it by default.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only added this for testing. I'll remove it

Comment thread crates/xtask/src/tmt.rs
} else {
println!("Run ID not available - cannot generate detailed report");
}
let rerun_suffix = format!("{}-rerun", random_suffix);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be part of tmt?

I am really uncertain about doing this retry by default too at this level.

Ideally, we basically have retries around all network operations. Everything else - we really do want to drive flakes to zero.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes I can see tests in CI failing because the VM never starts after rebooting. This should hopefully catch that. I'm not sure about how tmt manages VMs, but in the said case, we would have to create a new VM

Comment thread crates/xtask/src/tmt.rs
// Launch VM with bcvk
let firmware_args_slice = firmware_args.as_slice();

let guard = match libvirt_lock.lock() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error: failed to create directory '/run/user/1001/libvirt/qemu/run/swtpm': File exists

That's a bug in libvirt that we're just working around, right?

If we have to do a workaround, I think it'd be better in bcvk.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this would be considered a bug. It only doesn't work sometimes so I believed it was due to a race condition

def main [] {
tap begin "bootc-image-builder qcow2 build test"

print ">>>>>>>>>>>>>>>>>> the current working directory <<<<<<<<<<<<<<<<<" ($env.PWD)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems...verbose

Comment thread crates/xtask/src/tmt.rs
const DISTRO_CENTOS_9: &str = "centos-9";

// Tests sorted by time taken (descending)
const TESTS_SORTED_BY_TIME: [&str; 23] = [

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of duplicating all of our test names, this will get out of date quickly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... would you prefer adding a key extra-time-taken or something similar so we can do this sort using that key?

Comment thread crates/xtask/src/tmt.rs

/// For tests that should only run for composefs systems
/// Ex. composefs-gc
const FIELD_SKIP_IF_OSTREE: &str = "skip_if_ostree";

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I like it

Comment thread crates/xtask/src/tmt.rs
struct RunPlanResult {
plan_name: String,
passed: bool,
time_taken: Option<Duration>,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess but isn't this part of tmt too?

Comment thread crates/xtask/src/tmt.rs

let mut handles: Vec<JoinHandle<RunPlanResult>> = vec![];

let num_cpu = std::thread::available_parallelism()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't tmt support parallelism? I think eventually the right thing to do here is to move the bcvk stuff as a first-class op in tmt.

But I don't object at all to doing it here in the interim.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/tier-1 Run CI for tier-1 OS (centos-10) only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants