feat: stage AMD SEV-SNP attestation support #703
|
SEV-SNP TCB/advisory policy slice is pushed. What changed:
Still fail-closed:
Validation:
|
|
Continued with the next quality-gate slice and pushed a small clippy cleanup commit. Commit:
What changed:
Validation now passing:
Independent review of the cleanup diff found no behavior/security regressions. |
|
Milestone 1 is done: PR #703 is now review-ready staging for AMD SEV-SNP, still without production key release. New commit:
What changed:
Validation passed after doc/proof refresh:
I am marking the PR ready for review now. Milestone 2 remains separate: production SNP key release policy + revocation/advisory collateral + guarded release enablement. |
|
Milestone 2 is now implemented and pushed. Commit: What changed:
Validation passed:
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run
Independent security review: no release-gate blockers found after the self-authorization startup-safety fix.
|
SNP E2E smoke follow-up
I kept going on the manual SNP smoke on
What the smoke found/fixed:
Smoke status:
Validation passed after the fixes:
cargo fmt --all
cargo test -p dstack-vmm --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run
|
AMD SEV-SNP manual E2E smoke update
I pushed a follow-up commit that completes the dstack-managed SNP smoke path:
What the smoke proved
Failure gate also exercised
The lab host reports verifier-derived
Then, with an explicit lab-only allowlist (
Fixes included
Validation run
All passed locally:
cargo fmt --all
cargo test -p dstack-attest --all-features
cargo test -p dstack-util --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run
No secret/key material was included in logs or this comment.
|
|
Thank you so much for this — it's a huge and impressively thorough piece of work. 🙏 I have some other things on my plate right now, but I'll review this once I'm through them. Thanks again! |
|
Fresh-box SNP smoke update (sanitized):
Validation after doc/script update: Commit: No secrets or credential material included in this update. |
|
Fresh-box SNP smoke follow-up pushed in What changed:
Remote smoke evidence from
So the remaining gap is not guest boot, VMM wiring, KMS startup, or release-policy plumbing. The current blocker is external AMD KDS collateral availability/rate-limiting for the app quote. Validation passed after this update: No secret material is included in the docs or PR evidence. |
|
Update: completed the fresh-box SNP smoke through both gates and pushed the follow-up fix. What changed in the latest commit (
Remote smoke result on Validation rerun after the final patch:
No secrets or key material were included in the smoke artifacts or PR note. |
85ace8b to
b9d968d
|
Published the AMD SEV-SNP support design and production-readiness tracker here: #713 This separates the broader design discussion from the PR implementation review. The issue captures:
|
|
Applied the Chipotle-agent feedback to the PR docs/body without pulling Chipotle-specific app config into dstack scope. Updates pushed in
Validated before push:
bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all --check
git diff --check
Chipotle-specific note deliberately left out of dstack implementation/docs except as out-of-scope in the PR body: the app needed
|
show-mrs special-cased AMD SEV-SNP to emit null MRs with a note claiming they were TDX-RTMR-only. The app-info path (Attestation::local()-> decode_app_info) computes mr_system/mr_aggregated for SEV too, so drop the special case and report the real values.
ensure_snp_key_release_config_safe refused to start the KMS when sev_snp_key_release was enabled without enforce_self_authorization. The self-authorization requirement is not needed for SEV key release, so remove the startup gate, its helper, and the associated test.
Add a real AMD SEV-SNP attestation captured from a live dstack CVM plus its pinned ASK/VCEK, and an integration test that verifies the full chain offline (builtin ARK -> ASK -> VCEK -> report signature) and asserts the report_data marker, launch measurement, and HOST_DATA. Fully deterministic — nothing is fetched from AMD KDS. See sev_snp_fixture.README.md for provenance.
Move the SEV-SNP launch-measurement recomputation and os_image_hash derivation into a new dstack-mr::sev module so the KMS (key release) and the verifier (attestation verification) compute identical values from a single source of truth, instead of the verifier lacking it entirely. Primitive-typed API (measurement/host_data byte arrays) keeps the module free of attestation/RA-TLS types, avoiding a dependency cycle. Includes a real-fixture regression test that recomputes the captured CVM's launch measurement (7f51e17f...) and os_image_hash (32b47673...).
Replace the in-tree launch-measurement recomputation, os_image_hash derivation, OVMF parsing and mr_config binding with re-exports from dstack-mr::sev. The KMS keeps its authorization BootInfo/policy layer on top. Behaviour is unchanged: all 28 KMS tests (incl. the pinned 88a479... measurement vector) pass against the shared implementation.
verify_os_image_hash previously bailed "Unsupported attestation quote" for DstackAmdSevSnp, so SEV-SNP attestations always returned is_valid=false. Add verify_os_image_hash_for_dstack_sev: recompute the launch measurement from the self-contained sev_snp_measurement inputs carried in the attestation config, require it to equal the hardware-signed MEASUREMENT, require HOST_DATA to bind the MrConfigV3 document, then derive and surface the image-invariant os_image_hash. Also fills tcb_status/advisory_ids for SEV. Same dstack-mr::sev code path the KMS uses for key release, so a quote the KMS would release keys for now verifies here too (is_valid=true).
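The order of checks described in this commit can be sketched as below. This is only an illustration under stated assumptions: `SignedReport`, `Recomputed`, `BindingError`, and `check_sev_binding` are hypothetical names, the recomputation itself is left out, and the 32-byte `os_image_hash` length is assumed. The 48-byte MEASUREMENT and 32-byte HOST_DATA sizes follow the SNP report layout.

```rust
/// Hypothetical error type for this sketch; not the crate's real error type.
#[derive(Debug, PartialEq, Eq)]
enum BindingError {
    MissingLaunchInputs,
    MeasurementMismatch,
    HostDataMismatch,
}

/// The two hardware-signed report fields the binding check consumes.
struct SignedReport {
    measurement: [u8; 48],
    host_data: [u8; 32],
}

/// Values the verifier recomputes from the inputs embedded in the attestation.
/// How they are recomputed is out of scope; only the comparison is shown.
struct Recomputed {
    measurement: [u8; 48],
    expected_host_data: [u8; 32],
    os_image_hash: [u8; 32], // length assumed for illustration
}

/// Fails closed when inputs are missing or either comparison fails;
/// only then is the derived os_image_hash surfaced.
fn check_sev_binding(
    report: &SignedReport,
    recomputed: Option<&Recomputed>,
) -> Result<[u8; 32], BindingError> {
    let r = recomputed.ok_or(BindingError::MissingLaunchInputs)?;
    if r.measurement != report.measurement {
        return Err(BindingError::MeasurementMismatch);
    }
    if r.expected_host_data != report.host_data {
        return Err(BindingError::HostDataMismatch);
    }
    Ok(r.os_image_hash)
}
```

Returning the derived hash only on success mirrors the "derived value wins" behaviour: a caller never sees an `os_image_hash` that was merely advertised.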
dstack-util quote was TDX-only (read the Intel configfs directly and failed on SEV hosts); make it detect the running TEE via Attestation::quote and emit the platform's raw hardware quote (TDX DCAP quote or SNP report). GetQuoteResponse gains an 'attestation' field carrying the platform-adaptive versioned attestation, populated on every platform. On non-TDX (SEV-SNP) the legacy quote/event_log fields are empty, so this is the verifier-ready payload to send to dstack-verifier's /verify 'attestation' field. Populated in the real, simulator and test backends; exposed in the Rust SDK GetQuoteResponse with a decode_attestation helper.
Extend the offline SEV-SNP fixture test to also run the verifier's full binding path with no network: after the hardware report verifies, recompute the launch measurement from the attestation's embedded sev_snp_measurement, confirm HOST_DATA binds the mr_config, and assert the derived os_image_hash (32b47673...) and HOST_DATA-bound app_id. Adds dstack-mr as a dev-dep.
The binary/PEM SEV-SNP fixtures can't carry inline SPDX headers; annotate them in REUSE.toml as CC0-1.0 alongside the existing nitro fixtures so the REUSE compliance check passes.
Adversarial negative tests for the SEV-SNP verification path:
dstack-mr::sev (synthetic, deterministic):
- forged hardware MEASUREMENT and HOST_DATA are rejected
- every measured launch field (ovmf/kernel/initrd hashes, cmdline, hash-table offset, reset eip, section gpa, vcpus, vcpu_type, guest_features) is caught by the measurement-equality check
- substituting a different MrConfigV3 (app/compose/instance id) breaks the HOST_DATA binding
- an advertised top-level os_image_hash is ignored (derived value wins)
- booting a different image cannot present an allow-listed image's inputs
- missing sev_snp_measurement / mr_config fail closed
- documents that rootfs_hash is os_image_hash-only (bound via the measured cmdline), so tampering it changes the derived os_image_hash rather than failing the measurement check
dstack-attest (real fixture, offline):
- flipping any signed report field (report_data/measurement/host_data) or the signature invalidates VCEK verification; zeroed/truncated reports rejected
- wrong collateral (ASK-as-VCEK, malformed VCEK) rejected
- forged measurement/host_data, tampered launch inputs, substituted mr_config and bogus advertised os_image_hash all handled correctly against real data
Derive Debug on SevImageBinding for test ergonomics.
Move the AMD SEV-SNP os_image_hash computation out of dstack-vmm into the dstack-mr crate, and add a `dstack-mr sev-os-image-hash <image_dir>` command that emits the value (digest.sev.txt). dstack-mr now parses metadata.json, measures the SEV firmware (GCTX over ovmf-sev.fd), hashes kernel/initrd and projects them through dstack_types::SevOsImageMeasurement — the single hashing path already shared with KMS/verifier. dstack-vmm no longer recomputes the SEV os_image_hash at deploy: Image::load reads digest.sev.txt and make_vm_config uses it directly (failing closed if the file is absent), mirroring how TDX uses digest.txt. The vmm `sev-os-image-hash` subcommand is removed. Verified the new CLI reproduces the existing digest.sev.txt byte-for-byte (32b47673...) on the nvidia-0.6.0.a2 image, matching the value the verifier and CVM report.
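The fail-closed read of `digest.sev.txt` could look roughly like the sketch below. The function name `load_sev_os_image_hash` is hypothetical, and the assumption that the file holds a single hex digest (optionally followed by a newline) is mine; the commit only states that the file is read and that a missing file is an error.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads `digest.sev.txt` from an image directory and returns the digest as
/// lowercase hex. Any problem (missing file, empty file, non-hex content)
/// is an error, so the caller cannot proceed with a guessed value.
fn load_sev_os_image_hash(image_dir: &Path) -> io::Result<String> {
    let raw = fs::read_to_string(image_dir.join("digest.sev.txt"))?;
    let digest = raw.trim();
    let is_hex = !digest.is_empty()
        && digest.len() % 2 == 0
        && digest.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "digest.sev.txt does not contain a hex digest",
        ));
    }
    Ok(digest.to_ascii_lowercase())
}
```

There is deliberately no fallback that recomputes the hash at deploy time, which matches the description of the VMM no longer recomputing it.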
The sev_snp_measurement launch-input document built at deploy time used vmm's own snp_measure.rs (OVMF footer parse + GCTX). That logic is byte-for-byte the same as dstack_mr::sev::ovmf_measurement_info (added for the os_image_hash CLI), so delegate to it and delete the duplicate module. dstack-mr becomes a normal vmm dependency. Output is unchanged: the measurement-doc test and its os_image_hash projection cross-check still pass.
TeePlatform::resolve() folded an 'Auto' variant into the resolved type, so
every match on a resolved platform carried a dead Auto arm (e.g. `Tdx | Auto`
in the -machine selection). Remove the Auto variant: the config field becomes
`Option<TeePlatform>` (None = auto-detect), and CvmConfig::resolved_platform()
returns the pinned platform or TeePlatform::detect(). Matches on the resolved
platform are now exhaustive over {Tdx, AmdSevSnp} with no unreachable arm.
A back-compat deserializer still accepts the literal `platform = "auto"`
(mapped to None) so existing vmm.toml configs keep working.
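A minimal sketch of the shape this commit describes, using only the standard library. `parse_platform` stands in for the back-compat deserializer and is hypothetical, as are the `"tdx"` spelling and the `describe` helper. Unlike the real `resolved_platform()`, which calls `TeePlatform::detect()`, this version takes the detector as a closure so it can run anywhere.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TeePlatform {
    Tdx,
    AmdSevSnp,
}

/// Hypothetical stand-in for the back-compat deserializer:
/// the legacy literal "auto" maps to None (auto-detect).
fn parse_platform(value: &str) -> Result<Option<TeePlatform>, String> {
    match value {
        "auto" => Ok(None),
        "tdx" => Ok(Some(TeePlatform::Tdx)), // spelling assumed
        "amd-sev-snp" => Ok(Some(TeePlatform::AmdSevSnp)),
        other => Err(format!("unknown platform: {other}")),
    }
}

struct CvmConfig {
    /// None means "auto-detect at resolve time".
    platform: Option<TeePlatform>,
}

impl CvmConfig {
    /// Returns the pinned platform, or falls back to the supplied detector.
    fn resolved_platform(&self, detect: impl FnOnce() -> TeePlatform) -> TeePlatform {
        self.platform.unwrap_or_else(detect)
    }
}

/// With no Auto variant, a match on the resolved value is exhaustive
/// over the two real platforms and needs no unreachable arm.
fn describe(platform: TeePlatform) -> &'static str {
    match platform {
        TeePlatform::Tdx => "Intel TDX guest",
        TeePlatform::AmdSevSnp => "AMD SEV-SNP guest",
    }
}
```

Keeping "auto" out of the enum means the compiler, rather than a convention, guarantees that code downstream of resolution never sees an unresolved platform.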
- broaden ARK/ASK/VCEK support to Milan/Genoa/Turin
- describe VMM platform auto-detection (sev_snp cpuinfo flag) vs explicit pin
- correct GetTempCaCert gating: per-RPC self BootInfo check, not a startup reject
- flag the golden-vector capture as stale (predates MrConfigV3/HOST_DATA binding)
- fix auth-simple validation command to use bun
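For the auto-detection item above, one plausible way to test for the `sev_snp` cpuinfo flag is sketched below. The helper `cpuinfo_has_flag` is hypothetical and operates on text the caller has already read (for example from `/proc/cpuinfo`); whether the VMM performs the check exactly this way is not stated in the commit.

```rust
/// Returns true if any `flags` line in /proc/cpuinfo-style text lists `flag`
/// as a whole, whitespace-separated word.
fn cpuinfo_has_flag(cpuinfo: &str, flag: &str) -> bool {
    cpuinfo
        .lines()
        .filter_map(|line| line.split_once(':'))
        .filter(|(key, _)| key.trim() == "flags")
        .any(|(_, value)| value.split_whitespace().any(|f| f == flag))
}
```

Matching whole words avoids treating a longer flag name that merely starts with `sev_snp` as a positive detection.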
64a33e8 to
439e3d5
|
Great, thanks for the excellent work! I force-pushed the rebased branch with the follow-up refactors. The main post-rebase changes are refactors to make the SNP path less ad-hoc and more platform-driven:
|
Summary
This PR stages AMD SEV-SNP as a first-class dstack attestation platform alongside existing TDX/Nitro/GCP paths, and includes a controlled/fail-closed SNP key/cert release path.
At a high level, this branch:
- `/dev/sev-guest` extended-report ioctl fallback.
- `DSTACK_AMD_KDS_PROXY_URL` / KMS `sev_snp.amd_kds_proxy_url` for lab hosts that hit AMD KDS throttling.
- `app_id` launch-measured for SNP by binding app identity into the measured kernel cmdline.
- `BootInfo` from verified evidence: measurement, chip id, app id, compose hash, rootfs hash, TCB status, and advisory ids.
- `test-scripts/snp-e2e-smoke.sh` as a reusable manual hardware smoke script.
Default security posture
SNP release remains fail-closed by default.
Defaults:
Sensitive release surfaces guarded by this gate:
- `GetAppKey`
- `GetKmsKey`
- `SignCert`
- `GetTempCaCert`
Additional safety: KMS startup rejects SNP release enablement unless `enforce_self_authorization = true`, so the self-authorized temp-CA path cannot silently bypass SNP release policy.
Even when local release is enabled, external auth must still allow the verified SNP `BootInfo`.
AMD KDS collateral proxy support
The lab SNP host hit direct AMD KDS `HTTP 429` while fetching VCEK/cert-chain collateral. This PR preserves fail-closed verification and adds an explicit proxy/cache path instead of bypassing cert verification.
Important details:
- `dstack-attest` respects `DSTACK_AMD_KDS_PROXY_URL` for AMD KDS cert-chain and VCEK fetches.
- `core.sev_snp.amd_kds_proxy_url`.
- `kms/src/main.rs` exports the configured proxy before attestation verification.
- `ra-rpc::QuoteVerifier` carries/re-applies the proxy around per-request quote verification.
- `dstack.amd_kds_proxy_url=...` in the kernel cmdline; `basefiles/dstack-prepare.sh` exports `DSTACK_AMD_KDS_PROXY_URL` and writes `/run/dstack/environment`; `basefiles/dstack-guest-agent.service` loads that file via `EnvironmentFile=-/run/dstack/environment`.
The Lit proxy shape used in the smoke is path-prefix passthrough:
not a `?url=` wrapper.
Hardware smoke proof
Manual hardware smoke was rerun on the SNP host:
Latest sanitized result:
Lab success used:
Production defaults still deny `OutOfDate` TCB and keep `allowed_advisory_ids = []`.
Image requirement
The working guest image was a coherent `meta-dstack` image built with:
Do not use the default TDX image for SNP smoke. A coherent PR image built with the default `tdx` machine produced a `6.18.24-dstack` kernel with `# CONFIG_AMD_MEM_ENCRYPT is not set`; controlled QEMU tests showed that kernel resets immediately after OVMF loads kernel/initrd. SNP-capable kernels booted the same QEMU/OVMF path to Linux/SNP markers.
Also do not rely on ad-hoc `dstack-util` injection into a stock image. That changed measurement/boot behavior and regressed the boundary. For full app-key success, use a coherent `meta-dstack` image whose kernel/modules/initramfs/rootfs/verity metadata and guest userspace include the same PR branch.
Quote / attestation proof
Earlier guest quote proof confirmed the SNP guest can produce a hardware report containing the expected challenge bytes:
The final KMS smoke additionally proves that the app guest's SNP evidence verifies through KMS/auth successfully enough to exercise strict denial and lab success release gates.
Measurement proof
A live golden-vector test on an SNP-capable host cross-checks dstack's pure Rust SNP measurement recomputation against `sev-snp-measure`:
cargo test -p dstack-kms --all-features recomputation_matches_sev_snp_measure_live_golden_vector -- --ignored --nocapture
Latest recorded proof:
See `docs/amd-sev-snp-review-readiness.md` for the fuller proof block and review boundary.
Important implementation notes
Key fixes discovered during E2E smoke:
- `.sys-config.json` now includes `sev_snp_measurement` so KMS can recompute the same SNP launch measurement used by QEMU.
- `rootfs_hash` only in kernel cmdline (`dstack.rootfs_hash=...`), so VMM/KMS now preserve and use that path.
- `docker_compose_hash`, `rootfs_hash`, and `app_id`.
- `EPYC-v4` and confidential virtio PCI options (`disable-legacy=on`, `iommu_platform=true`).
- `chip_id` + reported TCB when local evidence lacks cert collateral.
- `mr_config_id` checks while preserving non-SNP behavior.
- `dstack-prepare.sh` handles SNP guest detection, early `chronyc` unavailability, minimal smoke DNS fallback, and AMD KDS proxy propagation.
- `DSTACK_SNP_SMOKE_KDS_PROXY_URL`, configurable VMM ports/URL, port cleanup via `fuser`, better strict-TCB denial detection, and clearer KDS-blocked vs policy-denied logs.
Validation run
All passed locally on the final branch head:
Known limitations / follow-ups
- `platform = "auto"` remains conservative while SNP is experimental. Operators must explicitly set `platform = "amd-sev-snp"`.
- `advisory_ids` is currently explicit and empty. Future advisory/revocation collateral should populate it and will be denied unless explicitly allowlisted.
- `test-scripts/snp-e2e-smoke.sh`.
- `meta-dstack` SNP guest image.
- `tcbStatus = "OutOfDate"`; success required an explicit lab allowlist. Production defaults still deny this.
Human review focus
Please pay special attention to:
Fail-closed release semantics
- `UpToDate` only by default.
Measurement / identity binding
- `app_id`, compose hash, rootfs hash, kernel/initrd/cmdline, OVMF, vCPU model, guest features, and optional smoke proxy cmdline are all part of recomputation or policy input.
- `app_id` is launch-measured, not just auth metadata.
AMD KDS collateral fallback/proxy
Non-SNP regression risk
- `DstackAmdSevSnp`.
Operational policy choice
- `UpToDate` TCB in production should remain an explicit operator decision, not a default.
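To make the fail-closed semantics under review concrete, here is a hedged sketch of a local release gate. Only `allowed_advisory_ids` and the `UpToDate` / `OutOfDate` status names come from this PR; `ReleasePolicy`, `release_enabled`, `allowed_tcb_statuses`, and `may_release` are hypothetical, and the external auth check that must also allow the verified `BootInfo` is not modelled.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TcbStatus {
    UpToDate,
    OutOfDate,
}

/// Hypothetical policy shape for illustration only.
struct ReleasePolicy {
    release_enabled: bool,
    allowed_tcb_statuses: Vec<TcbStatus>,
    allowed_advisory_ids: BTreeSet<String>,
}

impl Default for ReleasePolicy {
    /// Fail closed: release off, only UpToDate accepted, no advisories allowlisted.
    fn default() -> Self {
        ReleasePolicy {
            release_enabled: false,
            allowed_tcb_statuses: vec![TcbStatus::UpToDate],
            allowed_advisory_ids: BTreeSet::new(),
        }
    }
}

/// Denies unless every condition is explicitly satisfied.
fn may_release(
    policy: &ReleasePolicy,
    tcb: TcbStatus,
    advisory_ids: &[String],
) -> Result<(), String> {
    if !policy.release_enabled {
        return Err("SNP release is disabled".to_string());
    }
    if !policy.allowed_tcb_statuses.contains(&tcb) {
        return Err(format!("TCB status {tcb:?} is not allowed"));
    }
    // Any advisory that is not explicitly allowlisted blocks release.
    if let Some(id) = advisory_ids
        .iter()
        .find(|id| !policy.allowed_advisory_ids.contains(*id))
    {
        return Err(format!("advisory {id} is not allowlisted"));
    }
    Ok(())
}
```

In this shape, the lab-only success path corresponds to an operator adding `OutOfDate` to the allowed statuses by hand, while `ReleasePolicy::default()` continues to deny it.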