feat: stage AMD SEV-SNP attestation support #703

Merged
kvinwang merged 81 commits into Dstack-TEE:master from clawdbot-glitch003:feat/amd-sev-snp-conversion on Jun 23, 2026

@clawdbot-glitch003 commented Jun 1, 2026

Summary

This PR stages AMD SEV-SNP as a first-class dstack attestation platform alongside existing TDX/Nitro/GCP paths, and includes a controlled/fail-closed SNP key/cert release path.

At a high level, this branch:

  • Adds AMD SEV-SNP evidence plumbing to the v1 attestation format.
  • Collects SNP reports from Linux guest interfaces:
    • configfs TSM first;
    • /dev/sev-guest extended-report ioctl fallback.
  • Verifies SNP reports against AMD ARK/ASK/VCEK collateral, including report-data challenge binding and signed-report policy checks.
  • Adds fail-closed AMD KDS collateral augmentation when local evidence lacks ASK/VCEK, using report chip id + reported TCB.
  • Supports an explicit AMD KDS collateral proxy via DSTACK_AMD_KDS_PROXY_URL / KMS sev_snp.amd_kds_proxy_url for lab hosts that hit AMD KDS throttling.
  • Recomputes SNP launch measurement from OVMF/kernel/initrd/cmdline inputs and compares it to the hardware-verified report measurement.
  • Makes app_id launch-measured for SNP by binding app identity into the measured kernel cmdline.
  • Builds SNP-aware KMS BootInfo from verified evidence: measurement, chip id, app id, compose hash, rootfs hash, TCB status, and advisory ids.
  • Routes SNP KMS/app authorization through the existing auth flow.
  • Adds an explicit local KMS release gate for sensitive SNP outputs.
  • Adds test-scripts/snp-e2e-smoke.sh as a reusable manual hardware smoke script.
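
As a rough illustration of the "report chip id + reported TCB" lookup in the list above, the sketch below builds an AMD KDS VCEK URL. It is not the PR's code: TcbSpl, hex, and vcek_url are hypothetical names, and the URL layout (/vcek/v1/{product}/{hwid} with blSPL, teeSPL, snpSPL, and ucodeSPL query parameters) follows AMD's published KDS interface as I understand it. The product name is an input here; how the PR selects it is not shown.

```rust
/// Security patch levels taken from the report's reported TCB.
/// Hypothetical struct for illustration; not the PR's type.
pub struct TcbSpl {
    pub bootloader: u8,
    pub tee: u8,
    pub snp: u8,
    pub microcode: u8,
}

/// Lowercase hex encoding using only the standard library.
pub fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Build a VCEK URL from a product name (for example "Milan" or "Genoa"),
/// the 64-byte chip id, and the reported TCB patch levels.
/// Assumption: SPL values are written as decimal, zero-padded to two digits.
pub fn vcek_url(product: &str, chip_id: &[u8; 64], tcb: &TcbSpl) -> String {
    format!(
        "https://kdsintf.amd.com/vcek/v1/{}/{}?blSPL={:02}&teeSPL={:02}&snpSPL={:02}&ucodeSPL={:02}",
        product,
        hex(chip_id),
        tcb.bootloader,
        tcb.tee,
        tcb.snp,
        tcb.microcode
    )
}
```

A verifier would fetch this URL (optionally through the configured proxy), then check the returned VCEK against the ASK/ARK chain before trusting the report signature; any fetch or chain failure should reject the report.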

Default security posture

SNP release remains fail-closed by default.

Defaults:

[core.sev_snp_key_release]
enabled = false
allowed_tcb_statuses = ["UpToDate"]
allowed_advisory_ids = []

Sensitive release surfaces guarded by this gate:

  • GetAppKey
  • GetKmsKey
  • SignCert
  • self-authorized GetTempCaCert

Additional safety: KMS startup rejects SNP release enablement unless enforce_self_authorization = true, so the self-authorized temp-CA path cannot silently bypass SNP release policy.

Even when local release is enabled, external auth must still allow the verified SNP BootInfo.
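
The local gate described above can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the KMS implementation: the struct and function names are hypothetical, the field names mirror the [core.sev_snp_key_release] keys, and only the "tcb_status is not allowed" message is taken from the smoke logs; the other error strings are made up for illustration.

```rust
/// Mirrors the [core.sev_snp_key_release] settings shown above.
/// Hypothetical type for illustration.
pub struct SnpKeyReleasePolicy {
    pub enabled: bool,
    pub allowed_tcb_statuses: Vec<String>,
    pub allowed_advisory_ids: Vec<String>,
}

impl Default for SnpKeyReleasePolicy {
    /// Fail-closed defaults: release off, UpToDate only, no advisories.
    fn default() -> Self {
        Self {
            enabled: false,
            allowed_tcb_statuses: vec!["UpToDate".to_string()],
            allowed_advisory_ids: Vec::new(),
        }
    }
}

/// Returns Ok(()) only if every local release condition holds.
pub fn check_release(
    policy: &SnpKeyReleasePolicy,
    tcb_status: &str,
    advisory_ids: &[String],
) -> Result<(), String> {
    if !policy.enabled {
        // Illustrative message; the real KMS wording is not shown in the PR text.
        return Err("sev-snp key release is disabled".to_string());
    }
    if !policy.allowed_tcb_statuses.iter().any(|s| s.as_str() == tcb_status) {
        return Err("tcb_status is not allowed".to_string());
    }
    // Every advisory attached to the evidence must be explicitly allowlisted.
    for id in advisory_ids {
        if !policy.allowed_advisory_ids.contains(id) {
            return Err(format!("advisory id {} is not allowed", id));
        }
    }
    Ok(())
}
```

With the defaults, check_release rejects everything because enabled is false. Passing this gate is necessary but not sufficient: the external auth API must still allow the verified BootInfo.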

AMD KDS collateral proxy support

The lab SNP host hit direct AMD KDS HTTP 429 while fetching VCEK/cert-chain collateral. This PR preserves fail-closed verification and adds an explicit proxy/cache path instead of bypassing cert verification.

Important details:

  • dstack-attest respects DSTACK_AMD_KDS_PROXY_URL for AMD KDS cert-chain and VCEK fetches.
  • KMS config supports core.sev_snp.amd_kds_proxy_url.
  • kms/src/main.rs exports the configured proxy before attestation verification.
  • ra-rpc::QuoteVerifier carries/re-applies the proxy around per-request quote verification.
  • The guest receives dstack.amd_kds_proxy_url=... in the kernel cmdline; basefiles/dstack-prepare.sh exports DSTACK_AMD_KDS_PROXY_URL and writes /run/dstack/environment; basefiles/dstack-guest-agent.service loads that file via EnvironmentFile=-/run/dstack/environment.
  • When the proxy is passed in the launched guest cmdline, VMM/KMS measurement recomputation includes the same cmdline fragment to avoid SNP measurement drift.

The Lit proxy shape used in the smoke is path-prefix passthrough:

https://cors.litgateway.com/https://kdsintf.amd.com/...

not a ?url= wrapper.
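
The path-prefix shape can be illustrated with a small helper. This is a sketch only: wrap_kds_url is a hypothetical name, and trimming whitespace and a trailing slash from the proxy value is an assumption suggested by the test name quote_verifier_carries_trimmed_amd_kds_proxy_url, not confirmed behavior.

```rust
/// Prefix an AMD KDS URL with a passthrough proxy, if one is configured.
/// An unset or blank proxy leaves the AMD URL untouched.
pub fn wrap_kds_url(proxy: Option<&str>, amd_url: &str) -> String {
    match proxy.map(str::trim).filter(|p| !p.is_empty()) {
        // Path-prefix passthrough: "<proxy>/<full AMD URL>", not "?url=".
        Some(p) => format!("{}/{}", p.trim_end_matches('/'), amd_url),
        None => amd_url.to_string(),
    }
}
```

For example, wrap_kds_url(Some("https://cors.litgateway.com/"), "https://kdsintf.amd.com/vcek/v1/Genoa/cert_chain") yields the proxy base followed by a single slash and the full AMD URL. The proxy only relays bytes; the certificates it returns are still verified against the AMD root, so a misbehaving proxy can cause denial but not acceptance.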

Hardware smoke proof

Manual hardware smoke was rerun on the SNP host:

remote_host=chris@173.234.27.162
host_kernel=Linux 6.11.0-rc3-snp-host-85ef1ac03941
qemu_version=10.0.2
ovmf_path=/opt/AMDSEV/usr/local/share/qemu/OVMF.fd
ovmf_sha256=67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a
image=dstack-dev-0.6.0
platform=amd-sev-snp
image_kernel=Linux 6.18.24-dstack with CONFIG_AMD_MEM_ENCRYPT=y, CONFIG_SEV_GUEST=y, CONFIG_TSM_REPORTS=y

Latest sanitized result:

kms_guest=booted SNP Linux/userspace and started dstack-kms
kms_marker=SNP_KMS_CONTAINER_STARTED / KMS runtime ready
kds_proxy=enabled for smoke via DSTACK_SNP_SMOKE_KDS_PROXY_URL=https://cors.litgateway.com/
strict_tcb_probe=denied_as_expected with tcb_status is not allowed
success_probe=GetTempCaCert HTTP 200; GetAppKey HTTP 200; SignCert HTTP 200; app container started
smoke_result=SNP E2E smoke success
no_secret_material_logged=true

Lab success used:

DSTACK_SNP_SMOKE_ALLOW_OUT_OF_DATE_TCB=1
DSTACK_SNP_SMOKE_KDS_PROXY_URL=https://cors.litgateway.com/

Production defaults still deny OutOfDate TCB and keep allowed_advisory_ids = [].

Image requirement

The working guest image was a coherent meta-dstack image built with:

MACHINE = "sev-snp"

Do not use the default TDX image for SNP smoke. A coherent PR image built with the default tdx machine produced a 6.18.24-dstack kernel whose config contains "# CONFIG_AMD_MEM_ENCRYPT is not set"; in controlled QEMU tests, that kernel reset immediately after OVMF loaded the kernel/initrd. SNP-capable kernels booted through the same QEMU/OVMF path to the Linux/SNP markers.

Also do not rely on ad-hoc dstack-util injection into a stock image: that changed measurement and boot behavior and regressed the smoke boundary. For full app-key success, use a coherent meta-dstack image whose kernel/modules/initramfs/rootfs/verity metadata and guest userspace include the same PR branch.

Quote / attestation proof

Earlier guest quote proof confirmed the SNP guest can produce a hardware report containing the expected challenge bytes:

Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP
SEV: SNP running at VMPL0.
sev-guest sev-guest: Initialized SEV guest driver (using vmpck_id 0)
DSTACK_SEV_SNP_ATTESTATION_PROOF_BEGIN
source=configfs-tsm
report_size=1184
report_data_offset=80
report_contains_expected_report_data=true
DSTACK_SEV_SNP_ATTESTATION_PROOF_END

The final KMS smoke additionally shows that the app guest's SNP evidence verifies through KMS/auth far enough to exercise both release gates: strict denial and lab-allowlisted success.
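
The report_contains_expected_report_data line in the proof block corresponds to a check like the one sketched here. The report size (1184) and offset (80) come from the proof output above; the 64-byte REPORT_DATA length comes from the AMD SNP report layout. The function and constant names are illustrative, not the PR's API.

```rust
pub const REPORT_SIZE: usize = 1184;
pub const REPORT_DATA_OFFSET: usize = 80;
pub const REPORT_DATA_LEN: usize = 64;

/// True if the raw report has the expected size and its REPORT_DATA
/// field equals the challenge the verifier asked the guest to embed.
pub fn report_contains_expected_report_data(
    report: &[u8],
    expected: &[u8; REPORT_DATA_LEN],
) -> bool {
    if report.len() != REPORT_SIZE {
        return false;
    }
    let field = &report[REPORT_DATA_OFFSET..REPORT_DATA_OFFSET + REPORT_DATA_LEN];
    field == &expected[..]
}
```

This comparison is meaningful only after the report signature has been verified against the VCEK chain; on an unverified report it proves nothing about the hardware.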

Measurement proof

A live golden-vector test on an SNP-capable host cross-checks dstack's pure Rust SNP measurement recomputation against sev-snp-measure:

cargo test -p dstack-kms --all-features recomputation_matches_sev_snp_measure_live_golden_vector -- --ignored --nocapture

Latest recorded proof:

DSTACK_SEV_SNP_MEASURE_GOLDEN_VECTOR_BEGIN
utc=2026-06-02T19:49:14Z
host=dedicated-m24-fork
sev_snp_measure=/usr/local/bin/sev-snp-measure
sev_snp_measure_version=sev-snp-measure 0.0.10
ovmf_sha256=67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a
vcpus=2
vcpu_type=EPYC-v4
guest_features=0x1
sev_snp_measurement=6497fb9f90dc4a322228a8a5eb14742e09067bc44c184c2068d583ef628b5bae8c6cf15d91fe1bc0b7a8cbcc575be370
cargo_live_test_result=passed locally on this host at 2026-06-02T19:49:14Z
DSTACK_SEV_SNP_MEASURE_GOLDEN_VECTOR_END

See docs/amd-sev-snp-review-readiness.md for the fuller proof block and review boundary.

Important implementation notes

Key fixes discovered during E2E smoke:

  • VMM .sys-config.json now includes sev_snp_measurement so KMS can recompute the same SNP launch measurement used by QEMU.
  • Released images may carry rootfs_hash only in kernel cmdline (dstack.rootfs_hash=...), so VMM/KMS now preserve and use that path.
  • KMS measurement recomputation preserves the original image cmdline before appending measured docker_compose_hash, rootfs_hash, and app_id.
  • SNP QEMU launch uses EPYC-v4 and confidential virtio PCI options (disable-legacy=on,iommu_platform=true).
  • Configfs TSM reports on the test host may omit ASK/VCEK collateral; when local evidence lacks cert collateral, the verifier now fetches ARK/ASK/VCEK from AMD KDS by report chip_id + reported TCB and fails closed if that fetch fails.
  • SNP guests skip TDX-only app-info / mr_config_id checks while preserving non-SNP behavior.
  • dstack-prepare.sh handles SNP guest detection, early chronyc unavailability, minimal smoke DNS fallback, and AMD KDS proxy propagation.
  • The smoke script supports DSTACK_SNP_SMOKE_KDS_PROXY_URL, configurable VMM ports/URL, port cleanup via fuser, better strict-TCB denial detection, and clearer KDS-blocked vs policy-denied logs.
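
Several of the notes above rely on key=value parameters in the kernel cmdline (dstack.rootfs_hash=..., dstack.amd_kds_proxy_url=...). A minimal lookup could look like the sketch below. It is an illustration, not the VMM/KMS or dstack-prepare.sh logic: cmdline_value is a hypothetical helper, it returns the first match, and it does not handle quoted values containing spaces.

```rust
/// Return the value of the first `key=value` token in a kernel cmdline.
/// Tokens without '=' (for example "quiet") are skipped.
pub fn cmdline_value<'a>(cmdline: &'a str, key: &str) -> Option<&'a str> {
    cmdline
        .split_whitespace()
        .filter_map(|arg| arg.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}
```

Because the cmdline is part of the SNP launch measurement, whatever reads these values on the host side and whatever recomputes the measurement in the KMS must see exactly the same string, byte for byte.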

Validation run

All passed locally on the final branch head:

bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo test -p dstack-vmm --all-features
cargo test -p ra-rpc --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

Known limitations / follow-ups

  • platform = "auto" remains conservative while SNP is experimental. Operators must explicitly set platform = "amd-sev-snp".
  • This PR does not claim a production revocation/advisory feed. SNP reports/VCEKs do not directly expose an advisory-list field in the current evidence path, so advisory_ids is currently explicit and empty. Future advisory/revocation collateral should populate it and will be denied unless explicitly allowlisted.
  • AMD KDS fallback/proxy is implemented fail-closed. Production deployments should decide whether they need a trusted cache/proxy and configure it explicitly.
  • The hardware E2E smoke is manual, not CI; the repeatable manual script is checked in at test-scripts/snp-e2e-smoke.sh.
  • Full app success on a fresh box needs a coherent PR-built meta-dstack SNP guest image.
  • The lab host has tcbStatus = "OutOfDate"; success required an explicit lab allowlist. Production defaults still deny this.
  • Chipotle-specific Anvil RPC config is intentionally out of scope for this dstack PR.

Human review focus

Please pay special attention to:

  1. Fail-closed release semantics

    • SNP release disabled by default.
    • UpToDate only by default.
    • Advisories denied unless allowlisted.
    • Startup rejects release enablement without self-authorization.
  2. Measurement / identity binding

    • app_id, compose hash, rootfs hash, kernel/initrd/cmdline, OVMF, vCPU model, guest features, and optional smoke proxy cmdline are all part of recomputation or policy input.
    • app_id is launch-measured, not just auth metadata.
  3. AMD KDS collateral fallback/proxy

    • Report with no cert chain must not verify unless KDS collateral can be fetched and report signature/policy checks pass.
    • Network/KDS/proxy failure should fail closed.
    • Proxy support should stay explicit and measured when passed to the guest.
  4. Non-SNP regression risk

    • TDX/Nitro/GCP paths should continue through existing behavior.
    • SNP-specific skips should remain scoped to DstackAmdSevSnp.
  5. Operational policy choice

    • Whether to accept any non-UpToDate TCB in production should remain an explicit operator decision, not a default.

@clawdbot-glitch003

Author

SEV-SNP TCB/advisory policy slice is pushed.

What changed:

  • VerifiedAmdSnpReport now carries verifier-derived AMD SNP TCB info from the signed report (current_tcb, reported_tcb, committed_tcb, launch_tcb).
  • KMS SNP BootInfo.tcb_status now comes from that verified report data instead of the old snp-verified-basic-policy placeholder.
    • maps to UpToDate only when current/reported/committed/launch TCB all match;
    • maps to OutOfDate otherwise, which stays denied by default.
  • VerifiedAmdSnpReport.advisory_ids is now explicit and propagated into KMS BootInfo; it is currently empty because the AMD report/VCEK evidence does not carry a direct advisory-list field.
  • The direct fake/default UpToDate SNP boot-info helper is now test-only; production goes through verified attestation.
  • auth-simple docs/tests now describe verifier-derived statuses instead of the placeholder and keep defaults strict: allowedTcbStatuses = ["UpToDate"], allowedAdvisoryIds = [].
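
The TCB mapping rule above (UpToDate only when all four TCB values match) can be sketched as follows. The struct is a hypothetical stand-in for the TCB fields of VerifiedAmdSnpReport; each TCB version is treated as the raw 64-bit value from the signed report, and comparing raw values for equality is an assumption about how "match" is evaluated.

```rust
/// Verifier-derived TCB values from the signed SNP report.
/// Hypothetical stand-in for the fields carried by VerifiedAmdSnpReport.
pub struct SnpTcbInfo {
    pub current_tcb: u64,
    pub reported_tcb: u64,
    pub committed_tcb: u64,
    pub launch_tcb: u64,
}

/// "UpToDate" only when all four TCB values are identical;
/// anything else maps to "OutOfDate", which is denied by default.
pub fn tcb_status(info: &SnpTcbInfo) -> &'static str {
    let all_match = info.current_tcb == info.reported_tcb
        && info.reported_tcb == info.committed_tcb
        && info.committed_tcb == info.launch_tcb;
    if all_match {
        "UpToDate"
    } else {
        "OutOfDate"
    }
}
```

Note that this rule is purely report-internal: it detects a platform whose firmware levels disagree with each other, not one that is consistently behind AMD's latest release.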

Still fail-closed:

  • SNP key/cert release remains blocked for app keys, KMS keys, signing certs, and temp CA material.
  • Any non-UpToDate status or any advisory ID remains denied unless explicitly allowlisted.

Validation:

  • cargo fmt --all
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo check --workspace --all-features
  • git diff --check
  • cd kms/auth-simple && npx oxlint . && npx vitest run
  • independent review: no blockers

@clawdbot-glitch003

Author

Continued with the next quality-gate slice and pushed a small clippy cleanup commit.

Commit:

  • a0ff6efa chore: satisfy sev-snp workspace clippy

What changed:

  • removed a needless return in dstack attestation-mode detection without changing TDX/SNP selection semantics;
  • simplified KMS onboarding response error propagation (Ok(...?) -> direct Result return), preserving behavior;
  • derived Default for TeePlatform with Auto as the default variant, preserving the conservative default.

Validation now passing:

  • cargo fmt --all
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo test -p dstack-vmm --all-features
  • cargo check --workspace --all-features
  • cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
  • git diff --check
  • prior auth-simple validation remains: cd kms/auth-simple && npx oxlint . && npx vitest run

Independent review of the cleanup diff found no behavior/security regressions.

@clawdbot-glitch003

Author

Milestone 1 is done: PR #703 is now review-ready staging for AMD SEV-SNP, still without production key release.

New commit:

  • 93354eb6 docs: add sev-snp review readiness note

What changed:

  • Added docs/amd-sev-snp-review-readiness.md documenting:
    • exact review boundary;
    • fail-closed SNP key/cert release posture;
    • strict TCB/advisory defaults;
    • live sev-snp-measure golden-vector proof;
    • prior SNP guest attestation proof summary;
    • local validation commands.
  • Refreshed live golden-vector proof on dedicated-m24-fork at 2026-06-02T19:49:14Z:
    • ignored live test passed: cargo test -p dstack-kms --all-features recomputation_matches_sev_snp_measure_live_golden_vector -- --ignored --nocapture
    • measurement remains 6497fb9f90dc4a322228a8a5eb14742e09067bc44c184c2068d583ef628b5bae8c6cf15d91fe1bc0b7a8cbcc575be370

Validation passed after doc/proof refresh:

  • cargo fmt --all
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo test -p dstack-vmm --all-features
  • cargo check --workspace --all-features
  • cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
  • git diff --check
  • cd kms/auth-simple && npx oxlint . && npx vitest run
  • independent review of the review-ready doc/code posture: no blockers

I am marking the PR ready for review now. Milestone 2 remains separate: production SNP key release policy + revocation/advisory collateral + guarded release enablement.

@clawdbot-glitch003 marked this pull request as ready for review June 2, 2026 19:57

@clawdbot-glitch003

Author

Milestone 2 is now implemented and pushed.

Commit: 6cb351f9 feat: enable guarded sev-snp key release

What changed:

  • Added local KMS [core.sev_snp_key_release] gate for AMD SEV-SNP key/cert material.
  • Default remains fail-closed: enabled = false, allowed_tcb_statuses = ["UpToDate"], allowed_advisory_ids = [].
  • Release requires both:
    1. verified SNP attestation + recomputed launch measurement + external auth API allow, and
    2. explicit local KMS release opt-in with acceptable TCB/advisory state.
  • Guarded all sensitive SNP release surfaces:
    • GetAppKey
    • GetKmsKey
    • SignCert
    • self-authorized GetTempCaCert
  • Added startup safety: KMS rejects sev_snp_key_release.enabled = true unless enforce_self_authorization = true, so temp-CA self-release cannot bypass SNP release checks in production config.
  • Updated kms/kms.toml and docs/amd-sev-snp-review-readiness.md with the opt-in release policy.

Validation passed:

cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

Independent security review: no release-gate blockers found after the self-authorization startup-safety fix.

@clawdbot-glitch003

Author

SNP E2E smoke follow-up

I kept going on the manual SNP smoke on chris@173.234.27.162 and pushed the fixes/docs in fe08b86f fix: bind sev-snp vm launch inputs.

What the smoke found/fixed:

  • VMM .sys-config.json now includes sev_snp_measurement so KMS SNP BootInfo recomputation has the same launch inputs QEMU used.
  • VMM now accepts released image metadata where rootfs_hash is only present as dstack.rootfs_hash=... in the kernel cmdline.
  • SNP QEMU launch now uses EPYC-v4 and confidential virtio PCI options (disable-legacy=on,iommu_platform=true) for SNP-launched virtio devices.

Smoke status:

  • Tested dstack-0.5.11 and dstack-dev-0.5.11 with PR-built dstack-vmm/supervisor/dstack-kms, QEMU 10.0.2, and SNP OVMF.
  • Both SNP runs reached OVMF loading the measured kernel/cmdline/initrd path and emitted:
    • EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
  • Neither completed Linux/userspace boot before timeout, so the full dstack-managed guest -> KMS GetAppKey hardware E2E is still blocked before KMS userspace/app-key exercise.
  • Control check: the same dstack-dev-0.5.11 kernel/initrd/rootfs boots without SNP and reaches dstack Guest Preparation Service, narrowing the blocker to SNP+OVMF direct-kernel boot compatibility rather than KMS release policy.
  • No key/secret material was returned.

Validation passed after the fixes:

cargo fmt --all
cargo test -p dstack-vmm --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

@clawdbot-glitch003

Author

AMD SEV-SNP manual E2E smoke update

I pushed a follow-up commit that completes the dstack-managed SNP smoke path:

  • Commit: 0a08253a fix: complete sev-snp key release smoke path
  • Smoke host: chris@173.234.27.162
  • QEMU: 10.0.2
  • OVMF: /opt/AMDSEV/usr/local/share/qemu/OVMF.fd (67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a)
  • Image: dstack-dev-0.5.11-snp-dnsfix

What the smoke proved

  • KMS SNP guest booted Linux/userspace and started dstack-kms.
  • App SNP guest booted Linux/userspace and requested app keys from KMS.
  • KMS self auth and app auth both succeeded through auth-simple:
    • /bootAuth/kms -> 200
    • /bootAuth/app -> 200
  • App guest reached GetTempCaCert and GetAppKey against the SNP-backed KMS.
  • KMS metrics after app request:
    • dstack_kms_attestation_requests_total 1
    • dstack_kms_attestation_failures_total 0

Failure gate also exercised

The lab host reports verifier-derived tcbStatus = "OutOfDate". With the default strict release policy (allowed_tcb_statuses = ["UpToDate"]), the app guest was denied as expected:

error: "tcb_status is not allowed"

Then, with an explicit lab-only allowlist (["UpToDate", "OutOfDate"]), the same flow succeeded. Production defaults remain fail-closed.

Fixes included

  • Preserve the released image's original kernel cmdline in SNP measurement recomputation, then append measured docker_compose_hash, rootfs_hash, and app_id exactly like the VMM launch path.
  • Include base_cmdline in VMM-provided sev_snp_measurement input.
  • Add AMD KDS fallback for SNP reports that do not carry cert collateral: fetch ARK/ASK/VCEK from KDS using report chip_id + reported TCB and verify fail-closed.
  • Add configfs TSM -> extended-report ioctl fallback for cert-chain collection.
  • Let SNP guests skip TDX-only app-info / mr_config_id checks while preserving non-SNP behavior.
  • Make dstack-prepare.sh robust for SNP smoke boots (sev-guest detection, early chronyc tolerance, DNS fallback).

Validation run

All passed locally:

cargo fmt --all
cargo test -p dstack-attest --all-features
cargo test -p dstack-util --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

No secret/key material was included in logs or this comment.

@kvinwang

kvinwang commented Jun 4, 2026

Collaborator

Thank you so much for this — it's a huge and impressively thorough piece of work. 🙏

I have some other things on my plate right now, but I'll review this once I'm through them. Thanks again!

@clawdbot-glitch003

Author

Fresh-box SNP smoke update (sanitized):

  • Built and ran a coherent meta-dstack dev image with the PR branch wired into guest userspace and MACHINE = "sev-snp".
  • Confirmed why the earlier coherent image still reset: the default tdx machine build produced a dstack kernel without AMD memory-encryption/SNP support. The SNP machine build boots under QEMU 10.0.2 + SNP OVMF.
  • Latest smoke reached:
    • Linux/userspace boot
    • dstack Guest Preparation Service
    • SNP_KMS_CONTAINER_STARTED
    • KMS /metrics readiness
    • app guest Requesting app keys from KMS
    • GetTempCaCert
    • app GetAppKey request boundary
  • Current remaining blocker is external AMD KDS collateral fetch throttling, not guest boot/KMS startup/release-policy wiring:
    • app GetAppKey failed while fetching AMD SEV-SNP VCEK collateral from kdsintf.amd.com
    • observed HTTP 429 for Genoa VCEK request
  • Updated docs/script to make the fresh-box requirements explicit and to avoid overclaiming before KDS collateral fetch completes.

Validation after doc/script update:

bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check

Commit: 24d61e74 docs: clarify sev-snp fresh-box smoke

No secrets or credential material included in this update.

@clawdbot-glitch003

Author

Fresh-box SNP smoke follow-up pushed in 45c77779 (docs: record sev-snp smoke gate boundary).

What changed:

  • Updated test-scripts/snp-e2e-smoke.sh so the strict TCB probe actually drives an app guest to the strict KMS GetAppKey path instead of only waiting for KMS startup.
  • Added separate strict/success KMS host ports so the failure and success probes can run in one smoke without port collision.
  • Fixed the app deploy helper so captured VM IDs stay clean; compose/deploy diagnostics now go to artifacts/stderr instead of contaminating command substitution.
  • Updated docs/amd-sev-snp-review-readiness.md with the latest coherent MACHINE = "sev-snp" image result and the exact remaining boundary.

Remote smoke evidence from chris@173.234.27.162 using QEMU 10.0.2, SNP OVMF 67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a, and coherent dstack-dev-0.6.0 image:

  • KMS guest boots Linux/userspace and reaches SNP_KMS_CONTAINER_STARTED.
  • Strict probe app guest reaches dstack-prepare.sh, detects SEV-SNP, and requests app keys from strict KMS at GetTempCaCert / GetAppKey.
  • Success probe app guest reaches the same GetTempCaCert / GetAppKey request boundary against lab-allowlisted KMS.
  • Both probes are currently blocked before final strict-denial/success markers by external AMD KDS collateral fetch throttling: Genoa/Milan VCEK requests return HTTP 429; other product fallbacks return expected nonmatching-product 404s.

So the remaining gap is not guest boot, VMM wiring, KMS startup, or release-policy plumbing. The current blocker is external AMD KDS collateral availability/rate-limiting for the app quote.
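
The distinction between the 404 and 429 responses described above can be sketched as a small classifier. This is an illustration under assumptions, not the verifier's code: the names are hypothetical, and treating any status other than 200 or 404 as an immediate fail-closed stop (rather than trying further products) is a design choice made for this sketch.

```rust
/// Outcome of one AMD KDS VCEK request for a candidate product name.
#[derive(Debug, PartialEq, Eq)]
pub enum KdsOutcome {
    Found,
    /// 404: the chip id does not belong to this product; try the next one.
    TryNextProduct,
    /// Anything else (including 429 throttling) rejects the evidence.
    FailClosed(String),
}

pub fn classify_kds_status(status: u16) -> KdsOutcome {
    match status {
        200 => KdsOutcome::Found,
        404 => KdsOutcome::TryNextProduct,
        429 => KdsOutcome::FailClosed("AMD KDS throttled the request (HTTP 429)".to_string()),
        other => KdsOutcome::FailClosed(format!("unexpected AMD KDS status {}", other)),
    }
}

/// Walk (product, status) pairs in order and return the first product
/// whose VCEK request succeeded, or an error if none did.
pub fn pick_product(responses: &[(&str, u16)]) -> Result<String, String> {
    for (product, status) in responses {
        match classify_kds_status(*status) {
            KdsOutcome::Found => return Ok(product.to_string()),
            KdsOutcome::TryNextProduct => continue,
            KdsOutcome::FailClosed(reason) => return Err(reason),
        }
    }
    Err("no AMD product matched the report chip id".to_string())
}
```

Keeping throttling distinct from a nonmatching-product 404 is what lets the smoke logs separate "KDS-blocked" from "policy-denied": in neither case is any key material released.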

Validation passed after this update:

bash -n test-scripts/snp-e2e-smoke.sh
git diff --check
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check

No secret material is included in the docs or PR evidence.

@clawdbot-glitch003

Author

Update: completed the fresh-box SNP smoke through both gates and pushed the follow-up fix.

What changed in the latest commit (b9d968de, fix: complete sev-snp smoke proxy path):

  • Added controlled AMD KDS proxy support for SNP collateral fetches (DSTACK_AMD_KDS_PROXY_URL) so lab runs can avoid AMD KDS HTTP 429 throttling without weakening attestation policy.
  • Propagated the proxy through:
    • guest cmdline / dstack-prepare.sh / /run/dstack/environment for guest services,
    • KMS config startup before attestation verification,
    • RA-RPC quote verification before per-request cert/quote validation.
  • Mirrored the proxy cmdline fragment in VMM measured launch and KMS measurement recomputation, so enabling the smoke proxy does not create a measurement mismatch.
  • Hardened test-scripts/snp-e2e-smoke.sh with separate ports/VMs, KMS-log-aware strict probe checks, and reusable proxy configuration.
  • Updated docs/amd-sev-snp-review-readiness.md with the final sanitized smoke boundary.

Remote smoke result on chris@173.234.27.162 with coherent MACHINE = "sev-snp" dstack-dev-0.6.0 image, QEMU 10.0.2, and SNP OVMF sha 67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a:

strict_tcb_probe=denied_as_expected with "tcb_status is not allowed"
success_probe=GetTempCaCert HTTP 200; GetAppKey HTTP 200; SignCert HTTP 200; app container started
smoke_result=SNP E2E smoke success
no_secret_material_logged=true

Validation rerun after the final patch:

  • bash -n test-scripts/snp-e2e-smoke.sh
  • cargo fmt --all
  • cargo test -p dstack-attest --all-features amd_kds_proxy_url_wraps_amd_urls_when_configured -- --nocapture
  • cargo test -p ra-rpc --all-features quote_verifier_carries_trimmed_amd_kds_proxy_url -- --nocapture
  • cargo test -p dstack-vmm --all-features amd_sev_snp -- --nocapture
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo test -p dstack-vmm --all-features
  • cargo test -p ra-rpc --all-features
  • cargo check --workspace --all-features
  • cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
  • git diff --check
  • cd kms/auth-simple && npx oxlint . && npx vitest run

No secrets or key material were included in the smoke artifacts or PR note.

Contributor

Published the AMD SEV-SNP support design and production-readiness tracker here: #713

This separates the broader design discussion from the PR implementation review. The issue captures:

  • the intended boundary for this PR (#703) as explicit-opt-in bare-metal SNP support;
  • the production blockers: AMD root pinning, KDS fetching/caching, KMS binding parity, and ACPI/BadAML mitigation;
  • the platform strategy for bare metal, GCP, Azure, and AWS;
  • the decisions and open questions before SNP should become automatic or production-ready.

@clawdbot-glitch003

Author

Applied the Chipotle-agent feedback to the PR docs/body without pulling Chipotle-specific app config into dstack scope.

Updates pushed in 64a33e8a:

  • Clarified the AMD KDS proxy shape and smoke env:
    • DSTACK_SNP_SMOKE_KDS_PROXY_URL=https://cors.litgateway.com/
    • runtime export as DSTACK_AMD_KDS_PROXY_URL
    • path-prefix passthrough: https://cors.litgateway.com/https://kdsintf.amd.com/..., not ?url=
  • Clarified that the final smoke is no longer blocked at AMD KDS 429 when the proxy is enabled.
  • Documented that lab success used DSTACK_SNP_SMOKE_ALLOW_OUT_OF_DATE_TCB=1, while production defaults remain UpToDate only with an empty advisory allowlist.
  • Kept the coherent image requirement explicit: build/use MACHINE = "sev-snp"; default TDX images can miss CONFIG_AMD_MEM_ENCRYPT and reset after OVMF loads kernel/initrd.
  • Updated validation docs to include bash -n test-scripts/snp-e2e-smoke.sh and cargo test -p ra-rpc --all-features.
  • Rewrote the PR body so it reflects the latest successful managed SNP smoke: KMS ready, strict TCB denial, permissive lab GetTempCaCert / GetAppKey / SignCert success, and app container startup.

Validated before push:

bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all --check
git diff --check

Chipotle-specific note deliberately left out of dstack implementation/docs except as out-of-scope in the PR body: the app needed ANVIL_CHAIN_RPC=http://10.0.2.2:8545 / Chain::Anvil handling fixed on the Chipotle side.

kvinwang added 27 commits June 17, 2026 17:56

show-mrs special-cased AMD SEV-SNP to emit null MRs with a note claiming
they were TDX-RTMR-only. The app-info path (Attestation::local()->
decode_app_info) computes mr_system/mr_aggregated for SEV too, so drop the
special case and report the real values.

ensure_snp_key_release_config_safe refused to start the KMS when
sev_snp_key_release was enabled without enforce_self_authorization. The
self-authorization requirement is not needed for SEV key release, so remove
the startup gate, its helper, and the associated test.

Add a real AMD SEV-SNP attestation captured from a live dstack CVM plus its
pinned ASK/VCEK, and an integration test that verifies the full chain offline
(builtin ARK -> ASK -> VCEK -> report signature) and asserts the report_data
marker, launch measurement, and HOST_DATA. Fully deterministic — nothing is
fetched from AMD KDS. See sev_snp_fixture.README.md for provenance.

Move the SEV-SNP launch-measurement recomputation and os_image_hash
derivation into a new dstack-mr::sev module so the KMS (key release) and
the verifier (attestation verification) compute identical values from a
single source of truth, instead of the verifier lacking it entirely.

Primitive-typed API (measurement/host_data byte arrays) keeps the module
free of attestation/RA-TLS types, avoiding a dependency cycle. Includes a
real-fixture regression test that recomputes the captured CVM's launch
measurement (7f51e17f...) and os_image_hash (32b47673...).

Replace the in-tree launch-measurement recomputation, os_image_hash
derivation, OVMF parsing and mr_config binding with re-exports from
dstack-mr::sev. The KMS keeps its authorization BootInfo/policy layer on
top. Behaviour is unchanged: all 28 KMS tests (incl. the pinned 88a479...
measurement vector) pass against the shared implementation.

verify_os_image_hash previously bailed "Unsupported attestation quote"
for DstackAmdSevSnp, so SEV-SNP attestations always returned is_valid=false.

Add verify_os_image_hash_for_dstack_sev: recompute the launch measurement
from the self-contained sev_snp_measurement inputs carried in the
attestation config, require it to equal the hardware-signed MEASUREMENT,
require HOST_DATA to bind the MrConfigV3 document, then derive and surface
the image-invariant os_image_hash. Also fills tcb_status/advisory_ids for
SEV. Same dstack-mr::sev code path the KMS uses for key release, so a quote
the KMS would release keys for now verifies here too (is_valid=true).

dstack-util quote was TDX-only (read the Intel configfs directly and
failed on SEV hosts); make it detect the running TEE via Attestation::quote
and emit the platform's raw hardware quote (TDX DCAP quote or SNP report).

GetQuoteResponse gains an 'attestation' field carrying the platform-
adaptive versioned attestation, populated on every platform. On non-TDX
(SEV-SNP) the legacy quote/event_log fields are empty, so this is the
verifier-ready payload to send to dstack-verifier's /verify 'attestation'
field. Populated in the real, simulator and test backends; exposed in the
Rust SDK GetQuoteResponse with a decode_attestation helper.
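
A client-side consequence of the above is that callers should prefer the 'attestation' field when building a verifier request. The sketch below is an assumption-heavy illustration: the struct is a hypothetical mirror of the response (the real SDK field types may differ), `VerifyPayload` and `verifier_payload` are invented names, and the legacy fallback for responses without 'attestation' is a guess at how older guests might be handled.

```rust
/// Hypothetical mirror of the response; the String field types are assumptions.
pub struct GetQuoteResponse {
    pub quote: String,
    pub event_log: String,
    pub attestation: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum VerifyPayload<'a> {
    /// Platform-adaptive versioned attestation (works on TDX and SEV-SNP).
    Attestation(&'a str),
    /// Legacy TDX-only pair, used only when no attestation is present.
    LegacyQuote { quote: &'a str, event_log: &'a str },
}

/// Pick what to send to the verifier. Returns `None` when the response
/// carries neither an attestation nor a legacy quote.
pub fn verifier_payload(resp: &GetQuoteResponse) -> Option<VerifyPayload<'_>> {
    if !resp.attestation.is_empty() {
        Some(VerifyPayload::Attestation(&resp.attestation))
    } else if !resp.quote.is_empty() {
        Some(VerifyPayload::LegacyQuote {
            quote: &resp.quote,
            event_log: &resp.event_log,
        })
    } else {
        None
    }
}
```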
Extend the offline SEV-SNP fixture test to also run the verifier's full
binding path with no network: after the hardware report verifies, recompute
the launch measurement from the attestation's embedded sev_snp_measurement,
confirm HOST_DATA binds the mr_config, and assert the derived os_image_hash
(32b47673...) and HOST_DATA-bound app_id. Adds dstack-mr as a dev-dep.
The binary/PEM SEV-SNP fixtures can't carry inline SPDX headers; annotate
them in REUSE.toml as CC0-1.0 alongside the existing nitro fixtures so the
REUSE compliance check passes.
Adversarial negative tests for the SEV-SNP verification path:

dstack-mr::sev (synthetic, deterministic):
- forged hardware MEASUREMENT and HOST_DATA are rejected
- every measured launch field (ovmf/kernel/initrd hashes, cmdline, hash-table
  offset, reset eip, section gpa, vcpus, vcpu_type, guest_features) is caught
  by the measurement-equality check
- substituting a different MrConfigV3 (app/compose/instance id) breaks the
  HOST_DATA binding
- an advertised top-level os_image_hash is ignored (derived value wins)
- booting a different image cannot present an allow-listed image's inputs
- missing sev_snp_measurement / mr_config fail closed
- documents that rootfs_hash is os_image_hash-only (bound via the measured
  cmdline), so tampering it changes the derived os_image_hash rather than
  failing the measurement check

dstack-attest (real fixture, offline):
- flipping any signed report field (report_data/measurement/host_data) or the
  signature invalidates VCEK verification; zeroed/truncated reports rejected
- wrong collateral (ASK-as-VCEK, malformed VCEK) rejected
- forged measurement/host_data, tampered launch inputs, substituted mr_config
  and bogus advertised os_image_hash all handled correctly against real data
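
The "flip any signed field" style of negative test listed above can be expressed as a small generic helper. This is a sketch of the testing pattern only, not the tests added in this PR: `first_accepted_flip` is a hypothetical name, and the verifier is passed in as a closure so the helper stays independent of any attestation types.

```rust
/// Flip every single bit of `valid` in turn and run `verify` on each mutant.
/// Returns the index of the first byte where a flipped bit is still accepted,
/// or `None` if every single-bit mutation is rejected.
/// Precondition (not checked here): `verify(valid)` is expected to be true.
pub fn first_accepted_flip<F>(valid: &[u8], verify: F) -> Option<usize>
where
    F: Fn(&[u8]) -> bool,
{
    let mut buf = valid.to_vec();
    for i in 0..buf.len() {
        for bit in 0..8 {
            buf[i] ^= 1u8 << bit; // mutate one bit
            let accepted = verify(&buf);
            buf[i] ^= 1u8 << bit; // restore the original byte
            if accepted {
                return Some(i);
            }
        }
    }
    None
}
```

A strict verifier should yield `None`; a verifier that ignores part of its input is caught at the first unchecked byte.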

Derive Debug on SevImageBinding for test ergonomics.
Move the AMD SEV-SNP os_image_hash computation out of dstack-vmm into the
dstack-mr crate, and add a `dstack-mr sev-os-image-hash <image_dir>` command
that emits the value (digest.sev.txt). dstack-mr now parses metadata.json,
measures the SEV firmware (GCTX over ovmf-sev.fd), hashes kernel/initrd and
projects them through dstack_types::SevOsImageMeasurement — the single hashing
path already shared with KMS/verifier.

dstack-vmm no longer recomputes the SEV os_image_hash at deploy: Image::load
reads digest.sev.txt and make_vm_config uses it directly (failing closed if the
file is absent), mirroring how TDX uses digest.txt. The vmm `sev-os-image-hash`
subcommand is removed.
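
The fail-closed read described above might look roughly like this. It is a minimal sketch, not the actual Image::load code: `load_sev_digest` is a hypothetical helper, and the assumption that the file holds a single hex string (normalized to lowercase here) is mine.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read `digest.sev.txt` from an image directory. A missing file or a value
/// that does not look like a hex digest is an error; there is no fallback
/// that recomputes the hash.
pub fn load_sev_digest(image_dir: &Path) -> io::Result<String> {
    let path = image_dir.join("digest.sev.txt");
    // Propagates NotFound (and any other I/O error) to the caller.
    let text = fs::read_to_string(&path)?;
    let digest = text.trim();
    // Assumption: the digest is a non-empty, even-length hex string.
    let is_hex = !digest.is_empty()
        && digest.len() % 2 == 0
        && digest.bytes().all(|b| b.is_ascii_hexdigit());
    if !is_hex {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("{} is not a hex digest", path.display()),
        ));
    }
    Ok(digest.to_ascii_lowercase())
}
```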

Verified the new CLI reproduces the existing digest.sev.txt byte-for-byte
(32b47673...) on the nvidia-0.6.0.a2 image, matching the value the verifier and
CVM report.
The sev_snp_measurement launch-input document built at deploy time used vmm's
own snp_measure.rs (OVMF footer parse + GCTX). That logic is byte-for-byte the
same as dstack_mr::sev::ovmf_measurement_info (added for the os_image_hash CLI),
so delegate to it and delete the duplicate module. dstack-mr becomes a normal
vmm dependency. Output is unchanged: the measurement-doc test and its
os_image_hash projection cross-check still pass.
TeePlatform::resolve() folded an 'Auto' variant into the resolved type, so
every match on a resolved platform carried a dead Auto arm (e.g. `Tdx | Auto`
in the -machine selection). Remove the Auto variant: the config field becomes
`Option<TeePlatform>` (None = auto-detect), and CvmConfig::resolved_platform()
returns the pinned platform or TeePlatform::detect(). Matches on the resolved
platform are now exhaustive over {Tdx, AmdSevSnp} with no unreachable arm.

A back-compat deserializer still accepts the literal `platform = "auto"`
(mapped to None) so existing vmm.toml configs keep working.
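
For illustration, the shape of this refactor can be sketched as below. This is not the vmm code: the config spellings other than "auto", the `parse_platform` and `platform_label` helpers, and the injected `detect` closure (standing in for TeePlatform::detect()) are all assumptions made to keep the example self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TeePlatform {
    Tdx,
    AmdSevSnp,
}

/// Parse the `platform` config value. `None` means auto-detect; the literal
/// "auto" is still accepted for back-compat and also maps to `None`.
/// The "tdx" / "sev-snp" spellings are hypothetical.
pub fn parse_platform(value: Option<&str>) -> Result<Option<TeePlatform>, String> {
    match value {
        None | Some("auto") => Ok(None),
        Some("tdx") => Ok(Some(TeePlatform::Tdx)),
        Some("sev-snp") => Ok(Some(TeePlatform::AmdSevSnp)),
        Some(other) => Err(format!("unknown platform: {other}")),
    }
}

pub struct CvmConfig {
    pub platform: Option<TeePlatform>,
}

impl CvmConfig {
    /// Return the pinned platform if set, otherwise whatever `detect` reports.
    pub fn resolved_platform(&self, detect: impl FnOnce() -> TeePlatform) -> TeePlatform {
        self.platform.unwrap_or_else(detect)
    }
}

/// With no `Auto` variant, a match on the resolved platform is exhaustive
/// without a dead arm. The returned labels are placeholders.
pub fn platform_label(platform: TeePlatform) -> &'static str {
    match platform {
        TeePlatform::Tdx => "tdx",
        TeePlatform::AmdSevSnp => "sev-snp",
    }
}
```

Modeling "auto" as `None` rather than as a variant means the compiler, not a code reviewer, enforces that no consumer of the resolved platform has to handle an unresolved state.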
- broaden ARK/ASK/VCEK support to Milan/Genoa/Turin
- describe VMM platform auto-detection (sev_snp cpuinfo flag) vs explicit pin
- correct GetTempCaCert gating: per-RPC self BootInfo check, not a startup reject
- flag the golden-vector capture as stale (predates MrConfigV3/HOST_DATA binding)
- fix auth-simple validation command to use bun
kvinwang force-pushed the feat/amd-sev-snp-conversion branch from 64a33e8 to 439e3d5 on June 23, 2026 00:46
@kvinwang

Collaborator

Many thanks for the excellent work! I force-pushed the rebased branch with the follow-up refactors.

The main post-rebase changes are refactors to make the SNP path less ad-hoc and more platform-driven:

  • Refactored app-info decoding in dstack-attest.

    • decode_app_info_ex() now dispatches from the actual PlatformEvidence.
    • SNP app info is derived directly from the verified SNP report + mr_config / HOST_DATA binding.
    • TDX / GCP TDX / Nitro keep the existing runtime-event based flow.
    • This removes the previous mixed “app info source” plumbing and makes SNP app identity come from the platform-bound evidence path.
  • Refactored SNP KMS authorization responsibilities.

    • Removed the KMS-local SNP TCB/advisory allowlist policy.
    • KMS now validates SNP launch measurement + HOST_DATA binding, then leaves TCB/advisory policy to the external auth API.
    • The local SNP key-release control is now just an explicit boolean gate: sev_snp_key_release.
    • This keeps measurement binding fail-closed while avoiding duplicate policy logic inside KMS.
  • Refactored SNP measurement input.

    • Dropped app identity fields from MeasurementInput; app identity now comes from MrConfigV3 bound by HOST_DATA.
    • Dropped standalone rootfs_hash from SNP launch input.
    • The rootfs hash is now derived from the measured kernel cmdline (dstack.rootfs_hash=...), so rootfs identity is tied to measured launch data rather than duplicated as an unmeasured field.
  • Refactored SEV-SNP measurement code in dstack-mr.

    • dstack-mr::sev is the shared source of truth used by KMS and verifier.
    • Split OVMF footer / metadata parsing helpers.
    • Reworked OVMF SEV metadata parsing to use binrw.
    • Removed the legacy measurement-document parser path.
    • Added a golden-vector test for the OVMF parser.
  • Refactored VMM SNP launch metadata.

    • VMM now uses the same effective vCPU count in both QEMU launch and vm_config.
    • Rootfs identity is no longer carried as a standalone SNP measurement field; VMM validates/extracts it from the measured cmdline instead.
    • Shared vCPU / NUMA helper logic was moved out of the QEMU-specific path.
  • Updated SNP smoke coverage.

    • The app smoke now pins the KMS key_provider_id by deriving it from the KMS root CA public key via KMS.GetMeta.
    • This preserves the intended SNP mr_config.key_provider_id verification semantics.
    • Host API port is configurable to avoid conflicts on shared SNP hosts.
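
To make the rootfs point above concrete, here is a rough sketch of extracting the hash from the measured kernel cmdline. It is not the VMM or dstack-mr code: `rootfs_hash_from_cmdline` is a hypothetical name, and rejecting duplicate or non-hex values is my assumption about what "validates" could mean. Kernel cmdline quoting is ignored for simplicity.

```rust
/// Extract the value of `dstack.rootfs_hash=` from a kernel command line.
/// Returns `None` when the parameter is missing, empty, repeated, or not hex.
pub fn rootfs_hash_from_cmdline(cmdline: &str) -> Option<&str> {
    const KEY: &str = "dstack.rootfs_hash=";
    let mut values = cmdline
        .split_whitespace()
        .filter_map(|arg| arg.strip_prefix(KEY));
    let first = values.next()?;
    // Assumption: a repeated parameter is ambiguous, so treat it as invalid.
    if values.next().is_some() || first.is_empty() {
        return None;
    }
    // Only accept hex characters in the value.
    first.bytes().all(|b| b.is_ascii_hexdigit()).then_some(first)
}
```

Because the cmdline itself is part of the launch measurement, a value extracted this way cannot be changed without also changing the measured launch data.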

kvinwang merged commit adb1040 into Dstack-TEE:master on Jun 23, 2026
23 checks passed

3 participants