
feat(patch_set): UniDiff streaming parser for multi-file patches#55

Merged
bmwill merged 5 commits into bmwill:master from weihanglo:patches
Apr 11, 2026

Conversation

@weihanglo (Contributor) commented Apr 10, 2026

The new patch_set module parses unified diff input that contains changes to multiple files (a multi-file patch, as opposed to Patch, which represents a single-file patch).

Key types:

  • PatchSet: streaming iterator that yields FilePatches
  • FilePatch: combines a Patch with a FileOperation
  • FileOperation: Create, Delete, Modify, Rename, Copy
  • ParseOptions: configures parsing format
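
To illustrate how a FileOperation could be derived from the two header paths of a unified diff, here is a minimal self-contained sketch. It assumes the common unified-diff convention that `/dev/null` marks a created or deleted file; the names (`FileOp`, `extract_file_op`) are illustrative stand-ins, not diffy's actual API.

```rust
/// Illustrative stand-in for the FileOperation described above
/// (Rename/Copy omitted: they need git extended headers, not just paths).
#[derive(Debug, PartialEq)]
enum FileOp {
    Create(String),
    Delete(String),
    Modify(String),
}

/// Map the `---`/`+++` header paths onto a file operation, using the
/// convention that "/dev/null" denotes a nonexistent side.
fn extract_file_op(original: Option<&str>, modified: Option<&str>) -> Option<FileOp> {
    match (original, modified) {
        // "--- /dev/null" conventionally marks a newly created file.
        (Some("/dev/null"), Some(new)) => Some(FileOp::Create(new.to_string())),
        // "+++ /dev/null" conventionally marks a deleted file.
        (Some(old), Some("/dev/null")) => Some(FileOp::Delete(old.to_string())),
        (Some(_old), Some(new)) => Some(FileOp::Modify(new.to_string())),
        _ => None,
    }
}
```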

Comment on lines +117 to +130
let patch = match parse_one(patch_content) {
Ok((patch, _consumed)) => patch,
Err(e) => return Some(Err(e.into())),
};

let abs_patch_start = self.offset + patch_start;
let operation = match extract_file_op_unidiff(patch.original_path(), patch.modified_path())
{
Ok(op) => op,
Err(mut e) => {
e.set_span(abs_patch_start..abs_patch_start);
return Some(Err(e));
}
};
Owner

I wonder if we want to handle error cases a bit better. If we hit an error we don't advance self.offset or set self.finished, which, depending on how the Iterator impl is used, can lead to an infinite loop in some circumstances (e.g. `for p in patches`). So maybe we want to check the error return in the Iterator impl and, if we error out, set self.finished and then return the error. Alternatively, if we have sufficient information and can gracefully skip over the problematic section, we can return the error, advance our offset past the end of the problematic patch, and then allow the user to continue parsing subsequent patches. That case is probably more difficult to get completely correct, so for the time being we can opt for the easier option and document the behavior.
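
The "terminate on first error" option can be sketched with simplified stand-in types: after yielding one Err, the iterator marks itself finished so a `for p in patches` loop cannot spin forever. The struct and field names here are illustrative, not diffy's actual implementation.

```rust
/// Simplified stand-in for a streaming parser that fuses on error.
struct Items {
    inputs: Vec<Result<u32, String>>,
    offset: usize,
    finished: bool,
}

impl Iterator for Items {
    type Item = Result<u32, String>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.finished {
            return None;
        }
        let item = self.inputs.get(self.offset)?.clone();
        self.offset += 1;
        if item.is_err() {
            // After an error we can no longer trust our position,
            // so fuse the iterator instead of risking an infinite loop.
            self.finished = true;
        }
        Some(item)
    }
}
```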

Contributor Author

Fixed in aab78ce

Comment on lines +214 to +224
Rename {
from: Cow<'a, str>,
to: Cow<'a, str>,
},
/// Copy a file (copy from `from` to `to`, keep `from`).
///
/// Only produced when git extended headers explicitly indicate a copy.
Copy {
from: Cow<'a, str>,
to: Cow<'a, str>,
},
Owner

I think these are presently unused, but I assume they'll be used in a follow-up change?

Contributor Author

Yes, for parsing git diff output.

I was just a bit lazy. Could split them smarter, if needed.
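
For context, git's extended diff headers spell out renames and copies as dedicated lines ("rename from x" / "rename to y", and likewise "copy from" / "copy to"). A minimal sketch of extracting a rename pair, with a hypothetical helper name rather than diffy's actual parser:

```rust
/// Hypothetical helper: scan a git extended header block for a rename pair.
/// Returns (from, to) when both "rename from " and "rename to " lines exist.
fn parse_rename(lines: &[&str]) -> Option<(String, String)> {
    let from = lines.iter().find_map(|l| l.strip_prefix("rename from "))?;
    let to = lines.iter().find_map(|l| l.strip_prefix("rename to "))?;
    Some((from.to_string(), to.to_string()))
}
```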

Owner

Feel free to leave them in here, just wanted to check :)

Comment on lines +90 to +117
while let Some(line) = lines.next() {
let next_line = lines.peek().copied();

if is_unidiff_boundary(prev_line, line, next_line) {
if patch_start.is_some() {
// Found the start of the next patch
patch_end = Some(byte_offset);
break;
}
// Found start of first patch
patch_start = Some(byte_offset);
self.found_any = true;
}

prev_line = Some(line);
byte_offset += line.len();
byte_offset += line_ending_len(&remaining[byte_offset..]);
}

// No patch found
let patch_start = patch_start?;

// If no end found, patch extends to EOF
let patch_end = patch_end.unwrap_or(remaining.len());

let patch_content = &remaining[patch_start..patch_end];

let patch = match parse_one(patch_content) {
Owner

This seems slightly inefficient to me. We do the following:

  • scan lines until we find a boundary
  • re-scan the lines to parse them into a patch

This means we go over the lines at least twice, as well as reimplementing some of the parsing logic in is_unidiff_boundary, which might be better handled together with the rest of the parsing logic. For example, parsing can choke on some types of input depending on how is_unidiff_boundary decides to split it:

    #[test]
    fn false_positive_plus_plus_in_hunk() {
        // A hunk that adds a line whose content is literally "++ foo"
        // renders in the diff as "+++ foo" (the leading "+" is the add
        // marker). The boundary detector sees a line starting with the
        // modified-header prefix "+++ " followed by the next patch's
        // "--- " header and wrongly splits the first patch's hunk in
        // two, producing a malformed first patch.
        let content = "\
--- a/file1.rs
+++ b/file1.rs
@@ -1,2 +1,2 @@
 line1
-old
+++ foo
--- a/file2.rs
+++ b/file2.rs
@@ -1 +1 @@
-a
+b
";
        let patches = Patches::parse(content, ParseOptions::unidiff())
            .collect::<Result<Vec<_>, _>>()
            .unwrap();
        assert_eq!(patches.len(), 2);
    }

Owner

Seems like a slightly more robust setup would be to have something like the following:

  let (patch, consumed) = parse_one(remaining)?;
  self.offset += patch_start + consumed;

Where parsing of a single patch gets a bit more robust, or we do something like:

  let patch_start = find_patch_start(remaining)?;
  let (patch, consumed) = parse_one(&remaining[patch_start..])?;
  self.offset += patch_start + consumed;

Where we have some function that can handle finding the start of a patch and stripping that before passing it off to parse_one.

Thoughts?

Contributor Author

I picked the latter one. It seems more robust and actually simplified the logic quite a bit!

See f539bbf

src/lib.rs Outdated
mod diff;
mod merge;
mod patch;
pub mod patches;
Owner

I wonder if "patch_set" would be a bit more descriptive over "patches"?

Contributor Author

It was originally patchset / PatchSet. I forget why I changed it in weihanglo@e9eb5c1. Perhaps because "set" suggests an unordered collection. I'm happy to change it to whatever name feels better!

Contributor Author

Addressed in 37c73f7 and 88f1488

Comment on lines +37 to +39
fn consumed(&self) -> usize {
self.offset
}
Owner

This function is identical to the one above it.

Contributor Author

Addressed in bc7d2aa

Comment on lines +10 to +13
pub struct PatchesParseError {
pub(crate) kind: PatchesParseErrorKind,
span: Option<Range<usize>>,
}
Owner

Do you see value in adding a separate parsing error for parsing a PatchSet vs parsing a single Patch?

One other option would be to have just a single Error type and extend the PatchParseErrorKind with the few you've added below.

Contributor Author

It feels like having clearly-scoped error types is easier for lib users to understand, rather than looking at a gigantic enum and realizing it is for binary patching, not for single-file patching. diffy developers are also lib users.

However, for real end users it probably matters less, as we already have a wrapper struct. So, some possible directions:

  • Different private error enums for better code organization of diffy internals, with just one top-level error struct encapsulating everything.
  • Just one private error enum and one public error struct.

(I planned to add more BTW, see https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/src/patches/error.rs#L72-L102 and https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/src/binary/error.rs#L70-L103)

Comment on lines +152 to +154
if !self.found_any {
return Some(Err(self.error(PatchesParseErrorKind::NoPatchesFound)));
}
Owner

What's the rationale for returning an Error on an empty iterator vs just returning None?

Contributor Author

No valid patch in a .patch file should be an error?

For example,
https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/examples/multi_file_patch.rs#L70-L74
If we don't do this, the caller can still count the patches received to determine whether they want to error out. However, they would need to handle that explicitly, which is error-prone and easy to forget.

The downside of making it a hard error is that the caller can't ignore it (unless we provide some way to inspect the error reason).
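
The pattern under discussion can be condensed to a sketch: when the iterator runs out of input without ever having produced a patch, surface a hard error instead of a silent empty sequence. The function and message here are illustrative, not diffy's actual code.

```rust
/// Illustrative end-of-input check mirroring the `found_any` flag above:
/// Some(Err(..)) when nothing was ever parsed, None to end normally.
fn finish(found_any: bool) -> Option<Result<(), String>> {
    if !found_any {
        // The alternative design would `return None` here and leave the
        // caller to count how many patches were yielded.
        return Some(Err("no patches found in input".to_string()));
    }
    None
}
```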

src/patch/mod.rs Outdated
Comment on lines +72 to +78
pub(crate) fn original_path(&self) -> Option<Cow<'a, T>> {
self.original.as_ref().map(|f| f.0.clone())
}

pub(crate) fn modified_path(&self) -> Option<Cow<'a, T>> {
self.modified.as_ref().map(|f| f.0.clone())
}
Owner

This may result in over-cloning. Since these are pub(crate), we should be able to just return an Option<&Cow<'a, T>> so that downstream consumers can clone if they absolutely need to; when they don't, we avoid an extra allocation in the cases where the Cow is the Owned variant.
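
A sketch of the suggested signature change, using simplified stand-in types (the `Header` struct and its `u8` tuple field are placeholders, not diffy's real layout): hand out a borrow of the stored Cow instead of cloning it.

```rust
use std::borrow::Cow;

/// Placeholder for the real struct; the tuple mirrors the `f.0` access above.
struct Header<'a> {
    original: Option<(Cow<'a, str>, u8)>,
}

impl<'a> Header<'a> {
    // Before: `Option<Cow<'a, str>>` via `.clone()`, which may reallocate an
    // owned String. After: return a reference; callers clone only if needed.
    fn original_path(&self) -> Option<&Cow<'a, str>> {
        self.original.as_ref().map(|f| &f.0)
    }
}
```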

Contributor Author

addressed in 9f2a2b1

weihanglo added a commit to weihanglo/diffy that referenced this pull request Apr 10, 2026
The boundary detector scanned lines twice and reimplemented parsing
logic that couldn't distinguish hunk content from patch headers
(e.g. "+++ foo" as added content vs "+++ b/file.rs" as header).

Delegate boundary detection to parse_one, which naturally knows
where a patch ends via hunk line counting. parse_one now always
returns consumed bytes (even on error) so the iterator can advance.

bmwill#55 (comment)
Preparation for multi-file patches support.
weihanglo added a commit to weihanglo/diffy that referenced this pull request Apr 11, 2026
We adopted this suggestion

```rust
  let patch_start = find_patch_start(remaining)?;
  let (patch, consumed) = parse_one(&remaining[patch_start..])?;
  self.offset += patch_start + consumed;
```

bmwill#55 (comment)
Rename patches module to patch_set and Patches struct to PatchSet.
Replace boundary pre-scan with single-pass parse_one.
Terminate iterator on parse error while allowing callers to continue.

bmwill#55 (comment)
bmwill#55 (comment)
bmwill#55 (comment)
bmwill#55 (comment)
We need to track the lockfile
because we want to make sure our CI is using rayon@1.10.0,
and I don't want to block whoever wants to use the latest rayon to run tests.
(Though that doesn't matter much, as dependents won't respect our lockfile.)

For other reasons, see

* <https://blog.rust-lang.org/2023/08/29/committing-lockfiles/>
* <https://doc.rust-lang.org/nightly/cargo/faq.html#why-have-cargolock-in-version-control>
* <https://doc.rust-lang.org/nightly/cargo/reference/rust-version.html>

Most importantly:

> Verifying a minimum-supported Rust version (MSRV)
> that is less than the latest version of a dependency supports

And actually, in the course of this, I found that `windows-link` requires
Rust 1.71. `windows-link` is a transitive dependency of `anstyle`
through `windows-sys`.
Our MSRV is already a lie for Windows users when enabling `color`.

```console
$ cargo generate-lockfile --config 'resolver.incompatible-rust-versions="fallback"'
    Updating crates.io index
     Locking 24 packages to latest Rust 1.70.0 compatible versions
      Adding snapbox v0.6.24 (available: v1.1.0)
      Adding windows-link v0.2.1 (requires Rust 1.71)
```
@weihanglo (Contributor Author) left a comment

I created a separate branch showing the incremental fixes that resolve the review comments: https://github.com/weihanglo/diffy/commits/patches-working. You can review that branch, which I think is easier to reason about. Each commit message has a link to the respective review comment.

The original PR branch was rebased to keep the git history clean, without too many unnecessary intermediate commits. There is no diff between the two branches.

Hope this helps the review.

@weihanglo weihanglo changed the title feat(patches): UniDiff streaming parser for multi-file patches feat(patch_set): UniDiff streaming parser for multi-file patches Apr 11, 2026
@bmwill bmwill merged commit 486703b into bmwill:master Apr 11, 2026
8 checks passed
bmwill pushed a commit that referenced this pull request Apr 11, 2026
Rename patches module to patch_set and Patches struct to PatchSet.
Replace boundary pre-scan with single-pass parse_one.
Terminate iterator on parse error while allowing callers to continue.

#55 (comment)
#55 (comment)
#55 (comment)
#55 (comment)
@weihanglo weihanglo deleted the patches branch April 11, 2026 14:33