mvp vm attestation #1091

Open
jordanhendricks wants to merge 32 commits into master from jhendricks/rfd-605

Conversation

@jordanhendricks
Contributor

@jordanhendricks jordanhendricks commented Mar 27, 2026

closes #1067

TODO:

Testing notes

No boot disk

Steps to test: create an instance, stop it (or don't auto-start it), then remove the boot disk's designation as a boot disk. Send a challenge from inside the guest.

Result: attestation server used just the instance UUID for qualifying data

21:24:25.538Z INFO propolis-server (vm_state_driver): vm conf is ready = VmInstanceConf { uuid: 1f1ec2e3-c5cf-4eaf-8a19-aa25ec1f6895, boot_digest: None }

Failed boot disk

in progress

Cargo.toml Outdated
# Attestation
#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }
Member

most of the Cargo.lock weirdness comes from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency stuff we saw in Omicron comes up when building the tuf. sled-agent-client re-exports items out of propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's Weird.

i'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! we're just using one small part of the API for the RoT calls. but sled-agent and propolis are (i think?) updated in the same deployment unit so the cyclic dependency is fine.

@jordanhendricks jordanhendricks marked this pull request as ready for review April 2, 2026 00:08
@jordanhendricks
Contributor Author

I want to add some comments in the attestation module but from a code-structure perspective @iximeow and I are happy with this. Ready for review!

@jordanhendricks jordanhendricks requested a review from hawkw April 2, 2026 00:41
@jordanhendricks jordanhendricks self-assigned this Apr 2, 2026
Member

@hawkw hawkw left a comment

Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?

Member

not super important, but this string could probably be better

Contributor Author

done in 014950e

Some(backend.clone_volume())
} else {
// Disk must be read-only to be used for attestation.
slog::info!(self.log, "boot disk is not read-only");
Member

maybe this should explicitly state that this means it will not be attested?

Contributor Author

took a crack at this in 014950e

Comment on lines +42 to +118
#[derive(Debug)]
enum AttestationInitState {
Preparing {
vm_conf_send: oneshot::Sender<VmInstanceConf>,
},
/// A transient state while we're getting the initializer ready, having
/// taken `Preparing` and its `vm_conf_send`, but before we've got a
/// `JoinHandle` to track as running.
Initializing,
Running {
init_task: JoinHandle<()>,
},
}

/// This struct manages providing the requisite data for a corresponding
/// `AttestationSock` to become fully functional.
pub struct AttestationSockInit {
log: slog::Logger,
vm_conf_send: oneshot::Sender<VmInstanceConf>,
uuid: uuid::Uuid,
volume_ref: Option<crucible::Volume>,
}

impl AttestationSockInit {
    /// Do any remaining work of collecting VM RoT measurements in support
/// of this VM's attestation server.
pub async fn run(self) {
let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;

let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };

if let Some(volume) = volume_ref {
// TODO(jph): make propolis issue, link to #1078 and add a log line
// TODO: load-bearing sleep: we have a Crucible volume, but we can
// be here and chomping at the bit to get a digest calculation
// started well before the volume has been activated; in
// `propolis-server` we need to wait for at least a subsequent
// instance start. Similar to the scrub task for Crucible disks,
// delay some number of seconds in the hopes that activation is done
// promptly.
//
// This should be replaced by awaiting for some kind of actual
// "activated" signal.
tokio::time::sleep(std::time::Duration::from_secs(10)).await;

let boot_digest =
match crate::attestation::boot_digest::boot_disk_digest(
volume, &log,
)
.await
{
Ok(digest) => digest,
Err(e) => {
// a panic here is unfortunate, but helps us debug for
// now; if the digest calculation fails it may be some
// retryable issue that a guest OS would survive. but
// panicking here means we've stopped Propolis at the
// actual error, rather than noticing the
// `vm_conf_sender` having dropped elsewhere.
panic!("failed to compute boot disk digest: {e:?}");
}
};

vm_conf.boot_digest = Some(boot_digest);
} else {
slog::warn!(log, "not computing boot disk digest");
}

let send_res = vm_conf_send.send(vm_conf);
if let Err(_) = send_res {
slog::error!(
log,
"attestation server is not listening for its config?"
);
}
}
}
Member

Soo, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a JoinHandle<()> for that task. It kinda feels like this could just be a JoinHandle<VmInstanceConf> and make a bunch of this at least a bit simpler?

I'd be happy to throw together a patch that does that refactoring if it's too annoying.

Contributor Author

That's fair. The JoinHandle was from a previous iteration of how we would structure things that looked more like the way we presently handle the VNC server. I'll take a look at how hard this is to remove.

Member

Since this and also the change in this module that I suggested in #1091 (comment) are kinda just refactoring/tidying things up, I would be fine with leaving a lot of this as-is and then merge some refactoring later --- I'd be happy to open a follow-up PR after this has merged, if that makes life easier for you?

let mut buffer =
Buffer::new(this_block_count as usize, block_size as usize);

// TODO(jph): We don't want to panic in the case of a failed read. How
Contributor Author

I still need to do this and test on dublin.

Member

@hawkw hawkw left a comment

The Crucible retry stuff seems pretty much correct, I commented on some minor nitpicks. I think it's fine to defer some of the async refactoring to a subsequent PR, as there isn't anything wrong there, I just think we could maybe make the code a bit simpler. Beyond that, I think that pending whatever testing you need to do, I have no major concerns.

Comment on lines +89 to +95
slog::error!(
log,
"read failed: {e:?}.
offset={offset},
this_block_cout={this_block_count},
block_size={block_size},
end_block={end_block}"
Member

super weird formatting here, can we do something about that? also perhaps these ought to be structured fields...

Member

and also, perhaps this ought to include the retry count?

Contributor Author

both done in c096720

jordanhendricks and others added 2 commits April 5, 2026 15:06
Co-authored-by: Eliza Weisman <eliza@elizas.website>

Development

Successfully merging this pull request may close these issues.

mvp vm attestation support in propolis-server (rfd 605)

4 participants