Cargo.toml
```toml
# Attestation
#dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", branch = "jhendricks/update-sled-agent-types-versions", features = ["sled-agent"] }
dice-verifier = { git = "https://github.com/oxidecomputer/dice-util", features = ["sled-agent"] }
vm-attest = { git = "https://github.com/oxidecomputer/vm-attest", rev = "a7c2a341866e359a3126aaaa67823ec5097000cd", default-features = false }
```
Most of the Cargo.lock weirdness comes from dice-verifier -> sled-agent-client -> omicron-common (some previous rev), and that's where the later API dependency issues we saw in Omicron come up when building the TUF repo. sled-agent-client re-exports items from propolis-client, which means we end up in a situation where propolis-server depends on a different rev of propolis-client and everything's weird.
I'm not totally sure what we want or need to do about this, particularly because we're definitely not using the propolis-client-related parts of sled-agent! We're just using one small part of the API for the RoT calls. But sled-agent and propolis are (I think?) updated in the same deployment unit, so the cyclic dependency is fine.
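To see exactly which revs of propolis-client end up in the dependency graph, and which reverse-dependency chain pulls each one in, `cargo tree` can help (invocation sketch; output depends on the workspace's lockfile):

```shell
# Show every version of propolis-client in the graph and the reverse
# dependency chain that pulls each one in.
cargo tree --invert propolis-client

# Or list all crates that appear at multiple versions anywhere in the graph:
cargo tree --duplicates
```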
This fixes issues (read: panics) related to `AttestSledAgent`'s internal `rt`, `block_on`, and dropping.
Actually stop the `AttestationSock` when we stop other Propolis devices/backends, and along the way rename `tcp_attest` -> `attest_handle`.
I want to add some comments in the attestation module, but from a code-structure perspective @iximeow and I are happy with this. Ready for review!
hawkw left a comment
Some of the Tokio stuff felt a bit awkward here --- I'd be happy to open a PR against this branch changing some of the things I mentioned, if that's easier for you?
Not super important, but this string could probably be better.
```rust
    Some(backend.clone_volume())
} else {
    // Disk must be read-only to be used for attestation.
    slog::info!(self.log, "boot disk is not read-only");
```
Maybe this should explicitly state that this means the disk will not be attested?
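A possible rewording of that log line (just a sketch; exact wording up to you):

```rust
slog::info!(self.log, "boot disk is not read-only; it will not be attested");
```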
```rust
#[derive(Debug)]
enum AttestationInitState {
    Preparing {
        vm_conf_send: oneshot::Sender<VmInstanceConf>,
    },
    /// A transient state while we're getting the initializer ready, having
    /// taken `Preparing` and its `vm_conf_send`, but before we've got a
    /// `JoinHandle` to track as running.
    Initializing,
    Running {
        init_task: JoinHandle<()>,
    },
}

/// This struct manages providing the requisite data for a corresponding
/// `AttestationSock` to become fully functional.
pub struct AttestationSockInit {
    log: slog::Logger,
    vm_conf_send: oneshot::Sender<VmInstanceConf>,
    uuid: uuid::Uuid,
    volume_ref: Option<crucible::Volume>,
}

impl AttestationSockInit {
    /// Do any remaining work of collecting VM RoT measurements in support
    /// of this VM's attestation server.
    pub async fn run(self) {
        let AttestationSockInit { log, vm_conf_send, uuid, volume_ref } = self;

        let mut vm_conf = vm_attest::VmInstanceConf { uuid, boot_digest: None };

        if let Some(volume) = volume_ref {
            // TODO(jph): make propolis issue, link to #1078 and add a log line
            // TODO: load-bearing sleep: we have a Crucible volume, but we can
            // be here and chomping at the bit to get a digest calculation
            // started well before the volume has been activated; in
            // `propolis-server` we need to wait for at least a subsequent
            // instance start. Similar to the scrub task for Crucible disks,
            // delay some number of seconds in the hopes that activation is done
            // promptly.
            //
            // This should be replaced by awaiting some kind of actual
            // "activated" signal.
            tokio::time::sleep(std::time::Duration::from_secs(10)).await;

            let boot_digest =
                match crate::attestation::boot_digest::boot_disk_digest(
                    volume, &log,
                )
                .await
                {
                    Ok(digest) => digest,
                    Err(e) => {
                        // A panic here is unfortunate, but helps us debug for
                        // now; if the digest calculation fails it may be some
                        // retryable issue that a guest OS would survive. But
                        // panicking here means we've stopped Propolis at the
                        // actual error, rather than noticing the
                        // `vm_conf_sender` having been dropped elsewhere.
                        panic!("failed to compute boot disk digest: {e:?}");
                    }
                };

            vm_conf.boot_digest = Some(boot_digest);
        } else {
            slog::warn!(log, "not computing boot disk digest");
        }

        let send_res = vm_conf_send.send(vm_conf);
        if send_res.is_err() {
            slog::error!(
                log,
                "attestation server is not listening for its config?"
            );
        }
    }
}
```
So, it feels a bit funny to me that this thing is a task we spawn that, when it completes, sends a message over a oneshot channel and then exits, and then we have a `JoinHandle<()>` for that task. It kinda feels like this could just be a `JoinHandle<VmInstanceConf>` and make a bunch of this at least a bit simpler?
I'd be happy to throw together a patch that does that refactoring if it's too annoying.
That's fair. The JoinHandle was from a previous iteration of how we would structure things that looked more like the way we presently handle the VNC server. I'll take a look at how hard this is to remove.
Since this, and also the change in this module that I suggested in #1091 (comment), are kinda just refactoring/tidying things up, I would be fine with leaving a lot of this as-is and merging some refactoring later --- I'd be happy to open a follow-up PR after this has merged, if that makes life easier for you?
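A minimal sketch of the suggested shape, using `std::thread`'s `JoinHandle<T>` to stand in for tokio's (the tokio version is analogous: `tokio::spawn` returns a `JoinHandle<VmInstanceConf>` the caller can `.await` directly, with no oneshot channel). `VmInstanceConf` here is a hypothetical stand-in struct, not the real `vm_attest` type:

```rust
use std::thread;

// Hypothetical stand-in for the real `vm_attest::VmInstanceConf`.
#[derive(Debug, PartialEq)]
pub struct VmInstanceConf {
    pub uuid: u64,
    pub boot_digest: Option<Vec<u8>>,
}

// Instead of a task that sends its result over a oneshot channel and
// hands back a `JoinHandle<()>`, the init task just returns the config;
// the join handle itself carries it to whoever needs it.
pub fn spawn_init(uuid: u64) -> thread::JoinHandle<VmInstanceConf> {
    thread::spawn(move || VmInstanceConf { uuid, boot_digest: None })
}

fn main() {
    let handle = spawn_init(42);
    // With tokio this would be `handle.await.unwrap()`.
    let conf = handle.join().unwrap();
    assert_eq!(conf.uuid, 42);
}
```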
```rust
let mut buffer =
    Buffer::new(this_block_count as usize, block_size as usize);

// TODO(jph): We don't want to panic in the case of a failed read. How
```
I still need to do this and test on dublin.
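One way to avoid panicking on a failed read is a bounded retry helper; this is a hedged sketch (the helper name, attempt count, and backoff policy are invented for illustration, and the real Crucible read call would replace the closure):

```rust
use std::{thread, time::Duration};

/// Retry a fallible operation up to `max_attempts` times, sleeping
/// `delay` between attempts. Hypothetical helper, not Propolis code.
fn retry<T, E: std::fmt::Debug>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 < max_attempts => {
                attempt += 1;
                eprintln!("attempt {attempt} failed: {e:?}; retrying");
                thread::sleep(delay);
            }
            // Out of attempts: surface the error to the caller instead
            // of panicking.
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulate a read that fails twice, then succeeds on the third try.
    let mut calls = 0;
    let res = retry(5, Duration::from_millis(1), || {
        calls += 1;
        if calls < 3 { Err("transient read failure") } else { Ok(calls) }
    });
    assert_eq!(res, Ok(3));
}
```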
hawkw left a comment
The Crucible retry stuff seems pretty much correct, I commented on some minor nitpicks. I think it's fine to defer some of the async refactoring to a subsequent PR, as there isn't anything wrong there, I just think we could maybe make the code a bit simpler. Beyond that, I think that pending whatever testing you need to do, I have no major concerns.
```rust
slog::error!(
    log,
    "read failed: {e:?}.
    offset={offset},
    this_block_cout={this_block_count},
    block_size={block_size},
    end_block={end_block}"
```
Super weird formatting here, can we do something about that? Also, perhaps these ought to be structured fields...
and also, perhaps this ought to include the retry count?
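A sketch of what the structured-field version might look like, assuming slog's `msg; "key" => value` macro syntax (the `retries` variable is hypothetical, standing in for whatever retry counter ends up in scope):

```rust
slog::error!(
    log,
    "read failed";
    "error" => ?e,
    "offset" => offset,
    "this_block_count" => this_block_count,
    "block_size" => block_size,
    "end_block" => end_block,
    "retries" => retries,
);
```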
Co-authored-by: Eliza Weisman <eliza@elizas.website>
closes #1067
TODO:
Testing notes
No boot disk
Steps to test: create an instance, stop it (or don't auto-start it), then unset the boot disk as the boot disk. Send a challenge from inside the guest.
Result: the attestation server used just the instance UUID for qualifying data.
Failed boot disk
in progress