feat(patch_set): UniDiff streaming parser for multi-file patches #55
bmwill merged 5 commits into bmwill:master
Conversation
src/patches/parse.rs
Outdated
```rust
let patch = match parse_one(patch_content) {
    Ok((patch, _consumed)) => patch,
    Err(e) => return Some(Err(e.into())),
};

let abs_patch_start = self.offset + patch_start;
let operation = match extract_file_op_unidiff(patch.original_path(), patch.modified_path())
{
    Ok(op) => op,
    Err(mut e) => {
        e.set_span(abs_patch_start..abs_patch_start);
        return Some(Err(e));
    }
};
```
I wonder if we want to handle error cases a bit better. If we hit an error we don't advance self.offset or set self.finished, which, depending on how the Iterator impl is used, can lead to an infinite loop in some circumstances (e.g. `for p in patches`). So maybe we want to check the error return in the Iterator impl and, if we error out, set self.finished and then return the error. Alternatively, if we have sufficient information and can gracefully skip over the problematic section, we can return the error, advance our offset past the end of the problematic patch, and then allow the user to continue parsing subsequent patches. This case is probably more difficult to get completely correct, so for the time being we can opt for the easier thing and document that behavior.
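A minimal sketch of the first option (hypothetical names, not diffy's actual types): the iterator marks itself finished after yielding an error, so a `for p in patches` loop always terminates.

```rust
// Hypothetical sketch: an iterator that fuses itself after yielding an error,
// so callers looping over it cannot spin forever on the same failure.
struct PatchIter<'a> {
    input: &'a str,
    offset: usize,
    finished: bool,
}

impl<'a> Iterator for PatchIter<'a> {
    type Item = Result<&'a str, String>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.finished || self.offset >= self.input.len() {
            return None;
        }
        let remaining = &self.input[self.offset..];
        // Stand-in parse: each line is one "patch"; a line starting with
        // '!' simulates a parse error.
        let end = remaining.find('\n').map(|i| i + 1).unwrap_or(remaining.len());
        let chunk = &remaining[..end];
        if chunk.starts_with('!') {
            // On error, mark the iterator finished instead of leaving the
            // offset untouched -- otherwise `next` would re-yield the same
            // error forever.
            self.finished = true;
            return Some(Err(format!("bad patch at offset {}", self.offset)));
        }
        self.offset += end;
        Some(Ok(chunk.trim_end()))
    }
}
```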
```rust
Rename {
    from: Cow<'a, str>,
    to: Cow<'a, str>,
},
/// Copy a file (copy from `from` to `to`, keep `from`).
///
/// Only produced when git extended headers explicitly indicate a copy.
Copy {
    from: Cow<'a, str>,
    to: Cow<'a, str>,
},
```
I think these are presently unused, but I assume they'll be used in a follow-up change?
Yes, for parsing git diff output.
I was just a bit lazy. Could split them smarter, if needed.
feel free to leave in here, just wanted to check :)
src/patches/parse.rs
Outdated
```rust
while let Some(line) = lines.next() {
    let next_line = lines.peek().copied();

    if is_unidiff_boundary(prev_line, line, next_line) {
        if patch_start.is_some() {
            // Found the start of the next patch
            patch_end = Some(byte_offset);
            break;
        }
        // Found start of first patch
        patch_start = Some(byte_offset);
        self.found_any = true;
    }

    prev_line = Some(line);
    byte_offset += line.len();
    byte_offset += line_ending_len(&remaining[byte_offset..]);
}

// No patch found
let patch_start = patch_start?;

// If no end found, patch extends to EOF
let patch_end = patch_end.unwrap_or(remaining.len());

let patch_content = &remaining[patch_start..patch_end];

let patch = match parse_one(patch_content) {
```
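The offset bookkeeping in this loop relies on a helper that measures the line terminator so byte offsets stay exact across `\n` and `\r\n` input. A plausible implementation (assuming the name `line_ending_len` from the snippet; not necessarily diffy's actual code) is:

```rust
// Sketch of the line-terminator helper used by the offset bookkeeping above:
// returns the byte length of the line ending at the start of `s`, so that
// `byte_offset` advances past "\r\n" (2 bytes), "\n" (1 byte), or EOF (0).
fn line_ending_len(s: &str) -> usize {
    if s.starts_with("\r\n") {
        2
    } else if s.starts_with('\n') {
        1
    } else {
        0
    }
}
```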
This seems slightly inefficient to me. We do the following:
- scan lines till we find a boundary
- re-scan the lines to parse them into a patch

This means that we go over the lines at least 2x, as well as reimplementing some of the parsing logic in is_unidiff_boundary, which may be better handled together with the other parsing logic. For example, the parsing can choke on some types of input based on how is_unidiff_boundary decides to split the input:
```rust
#[test]
fn false_positive_plus_plus_in_hunk() {
    // A hunk that adds a line whose content is literally "++ foo"
    // renders in the diff as "+++ foo" (the leading "+" is the add
    // marker). The boundary detector sees a line starting with the
    // modified-header prefix "+++ " followed by the next patch's
    // "--- " header and wrongly splits the first patch's hunk in
    // two, producing a malformed first patch.
    let content = "\
--- a/file1.rs
+++ b/file1.rs
@@ -1,2 +1,2 @@
 line1
-old
+++ foo
--- a/file2.rs
+++ b/file2.rs
@@ -1 +1 @@
-a
+b
";
    let patches = Patches::parse(content, ParseOptions::unidiff())
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    assert_eq!(patches.len(), 2);
}
```
Seems like a slightly more robust setup would be to have something like the following:

```rust
let (patch, consumed) = parse_one(remaining)?;
self.offset += patch_start + consumed;
```

Where parsing of a single patch gets a bit more robust, or we do something like:

```rust
let patch_start = find_patch_start(remaining)?;
let (patch, consumed) = parse_one(&remaining[patch_start..])?;
self.offset += patch_start + consumed;
```

Where we have some function that can handle finding the start of a patch and stripping that before passing it off to parse_one.
Thoughts?
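For the second option, a hypothetical `find_patch_start` (the name comes from the suggestion above; this is only a sketch, not diffy's implementation) could simply look for the first line that begins a unidiff original-file header and leave everything else to `parse_one`:

```rust
// Hypothetical sketch of `find_patch_start`: return the byte offset of the
// first line that looks like a unidiff original-file header ("--- "), or
// None if the input contains no patch at all.
fn find_patch_start(input: &str) -> Option<usize> {
    let mut offset = 0;
    // split_inclusive keeps the trailing '\n', so offsets stay exact.
    for line in input.split_inclusive('\n') {
        if line.starts_with("--- ") {
            return Some(offset);
        }
        offset += line.len();
    }
    None
}
```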
I picked the latter one. It seems more robust and actually simplified the logic quite a bit!
See f539bbf
src/lib.rs
Outdated
```rust
mod diff;
mod merge;
mod patch;
pub mod patches;
```
I wonder if "patch_set" would be a bit more descriptive over "patches"?
It was originally patchset / PatchSet. I forget why I changed it in weihanglo@e9eb5c1. Perhaps because "set" gives an unordered sense. I am willing to change to whatever name feels better!
src/patch/parse.rs
Outdated
```rust
fn consumed(&self) -> usize {
    self.offset
}
```
This function is identical to the one above it.
src/patches/error.rs
Outdated
```rust
pub struct PatchesParseError {
    pub(crate) kind: PatchesParseErrorKind,
    span: Option<Range<usize>>,
}
```
Do you see value in adding a separate parsing error for parsing a PatchSet vs parsing a single Patch?
One other option would be to have just a single Error type and extend the PatchParseErrorKind with the few you've added below.
It feels like having clearly-scoped error types is easier for lib users to understand, rather than looking at a gigantic enum and realizing it is for binary patching, not for single-file patching. diffy developers are also lib users.
However, for real end users it probably matters less, as we have a wrapper struct already. So, some directions:
- We have different private error enums for better code organization of diffy internals, and just one top-level error struct encapsulating everything.
- Just one private error enum and one public error struct.

(I planned to add more, BTW; see https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/src/patches/error.rs#L72-L102 and https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/src/binary/error.rs#L70-L103)
src/patches/parse.rs
Outdated
```rust
if !self.found_any {
    return Some(Err(self.error(PatchesParseErrorKind::NoPatchesFound)));
}
```
What's the rationale for returning an Error on an empty iterator vs just returning None?
No valid patch in a .patch file should be an error?
For example,
https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/examples/multi_file_patch.rs#L70-L74
If we don't do this, callers can still count the patches received to determine whether they want to error out. However, they need to handle it explicitly, which is error-prone and easy to forget.
The downside of making it a hard error is that callers can't ignore it (unless we provide some error reason inspection).
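A sketch of what that "error reason inspection" escape hatch could look like (the names `ParseErrorKind` and `allow_empty` are hypothetical, chosen only to illustrate the idea): a caller who is fine with empty input filters out just the no-patches case while still propagating real failures.

```rust
// Hypothetical sketch of error-reason inspection: a caller who accepts
// empty input downgrades only NoPatchesFound to an empty, successful
// parse, while other error kinds still propagate.
#[derive(Debug, PartialEq)]
enum ParseErrorKind {
    NoPatchesFound,
    Malformed,
}

fn allow_empty(
    result: Result<Vec<String>, ParseErrorKind>,
) -> Result<Vec<String>, ParseErrorKind> {
    match result {
        // Treat "no patches at all" as an empty, successful parse.
        Err(ParseErrorKind::NoPatchesFound) => Ok(Vec::new()),
        other => other,
    }
}
```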
src/patch/mod.rs
Outdated
```rust
pub(crate) fn original_path(&self) -> Option<Cow<'a, T>> {
    self.original.as_ref().map(|f| f.0.clone())
}

pub(crate) fn modified_path(&self) -> Option<Cow<'a, T>> {
    self.modified.as_ref().map(|f| f.0.clone())
}
```
This may result in over-cloning. Since these are pub(crate), we should be able to just return an Option<&Cow<'a, T>> so that downstream consumers can clone if they absolutely need to; if they don't, we avoid an extra allocation in the cases where the Cow is the Owned variant.
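The difference is easy to demonstrate with a standalone `Cow` field (the `File` struct below is a hypothetical stand-in, not the actual `Patch` type): the borrowing accessor never allocates, and the caller decides whether to clone.

```rust
use std::borrow::Cow;

// Hypothetical holder mirroring the shape in the snippet above: the path
// lives inside an Option<(Cow<str>, ...)> and accessors expose the Cow.
struct File<'a> {
    original: Option<(Cow<'a, str>, u32)>,
}

impl<'a> File<'a> {
    // Cloning accessor: allocates whenever the Cow is the Owned variant.
    fn original_path_cloned(&self) -> Option<Cow<'a, str>> {
        self.original.as_ref().map(|f| f.0.clone())
    }

    // Borrowing accessor: never allocates; callers clone only if they must.
    fn original_path(&self) -> Option<&Cow<'a, str>> {
        self.original.as_ref().map(|f| &f.0)
    }
}
```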
The boundary detector scanned lines twice and reimplemented parsing logic that couldn't distinguish hunk content from patch headers (e.g. "+++ foo" as added content vs "+++ b/file.rs" as header). Delegate boundary detection to parse_one, which naturally knows where a patch ends via hunk line counting. parse_one now always returns consumed bytes (even on error) so the iterator can advance. bmwill#55 (comment)
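The "hunk line counting" mentioned in the commit message works because a unidiff hunk header `@@ -a,b +c,d @@` states exactly how many original (`b`) and modified (`d`) lines follow. A minimal sketch of reading those counts (`hunk_line_counts` is a hypothetical name, not diffy's API):

```rust
// Sketch: parse the counts out of a unidiff hunk header "@@ -a,b +c,d @@".
// With these counts a parser knows precisely how many lines belong to the
// hunk, so "+++ foo" inside a hunk can never be mistaken for a file header.
fn hunk_line_counts(header: &str) -> Option<(usize, usize)> {
    let rest = header.strip_prefix("@@ -")?;
    let (ranges, _) = rest.split_once(" @@")?;
    let (orig, modi) = ranges.split_once(" +")?;
    // Each range is "start,count"; a missing ",count" means a count of 1.
    let count = |range: &str| -> Option<usize> {
        match range.split_once(',') {
            Some((_, c)) => c.parse().ok(),
            None => Some(1),
        }
    };
    Some((count(orig)?, count(modi)?))
}
```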
Preparation for multi-file patches support.
We adopted this suggestion:

```rust
let patch_start = find_patch_start(remaining)?;
let (patch, consumed) = parse_one(&remaining[patch_start..])?;
self.offset += patch_start + consumed;
```

bmwill#55 (comment)
Rename patches module to patch_set and Patches struct to PatchSet. Replace boundary pre-scan with single-pass parse_one. Terminate iterator on parse error while allowing callers to continue. bmwill#55 (comment) bmwill#55 (comment) bmwill#55 (comment) bmwill#55 (comment)
We need to track the lockfile because we want to make sure our CI is using rayon@1.10.0, and I don't want to block whoever wants to use the latest rayon to run tests. (Though that doesn't matter, as dependents won't respect our lockfile.) For other reasons, see

* <https://blog.rust-lang.org/2023/08/29/committing-lockfiles/>
* <https://doc.rust-lang.org/nightly/cargo/faq.html#why-have-cargolock-in-version-control>
* <https://doc.rust-lang.org/nightly/cargo/reference/rust-version.html>

Most importantly:

> Verifying a minimum-supported Rust version (MSRV)
> that is less than the latest version of a dependency supports

And actually, during the course of this, I found that `windows-link` requires Rust 1.71. `windows-link` is a transitive dependency of `anstyle` through `windows-sys`. Our MSRV is already a lie for Windows users when enabling `color`.

```console
$ cargo generate-lockfile --config 'resolver.incompatible-rust-versions="fallback"'
    Updating crates.io index
     Locking 24 packages to latest Rust 1.70.0 compatible versions
      Adding snapbox v0.6.24 (available: v1.1.0)
      Adding windows-link v0.2.1 (requires Rust 1.71)
```
I created a separate branch showing the incremental fixes to resolve review comments: https://github.com/weihanglo/diffy/commits/patches-working. You can review this branch, which I think is easier to reason about. Each commit message has a link to the respective review comment.
The original PR branch was rebased to keep the git history clean, without too many unnecessary intermediate commits. There is no diff between these two branches.
Hope this helps the review.
The new `patch_set` module parses unified diff input that contains changes to multiple files (a multi-file patch, compared to `Patch`, which is a single-file patch).

Key types:
- `PatchSet`: streaming iterator that yields `FilePatch`es
- `FilePatch`: combines a `Patch` with a `FileOperation`
- `FileOperation`: Create, Delete, Modify, Rename, Copy
- `ParseOptions`: configures the parsing format