feat(patch_set): UniDiff streaming parser for multi-file patches #55
bmwill merged 5 commits into bmwill:master
Conversation
src/patches/parse.rs
Outdated
```rust
let patch = match parse_one(patch_content) {
    Ok((patch, _consumed)) => patch,
    Err(e) => return Some(Err(e.into())),
};

let abs_patch_start = self.offset + patch_start;
let operation = match extract_file_op_unidiff(patch.original_path(), patch.modified_path())
{
    Ok(op) => op,
    Err(mut e) => {
        e.set_span(abs_patch_start..abs_patch_start);
        return Some(Err(e));
    }
};
```
I wonder if we want to handle error cases a bit better. If we hit an error we don't advance self.offset or set self.finished, which, depending on how the Iterator impl is used, can lead to an infinite loop in some circumstances (e.g. `for p in patches`). So maybe we want to check the error return in the Iterator impl and, if we error out, set self.finished and then return the error. Alternatively, if we have sufficient information and can gracefully skip over the problematic section, we can return the error, advance our offset past the end of the problematic patch, and then allow the user to continue parsing subsequent patches. This case is probably more difficult to get completely correct, so for the time being we can opt for the easier thing and document that behavior.
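A minimal sketch of the first option (hypothetical names, not diffy's actual types): the iterator marks itself finished after yielding an error, so a `for p in patches` loop always terminates.

```rust
// Hypothetical sketch: an iterator that fuses itself after yielding an error,
// so callers looping over it cannot spin forever on the same failure.
struct PatchIter<'a> {
    input: &'a str,
    offset: usize,
    finished: bool,
}

impl<'a> Iterator for PatchIter<'a> {
    type Item = Result<&'a str, String>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.finished || self.offset >= self.input.len() {
            return None;
        }
        let remaining = &self.input[self.offset..];
        // Stand-in parse: each line is one "patch"; a line starting with
        // '!' simulates a parse error.
        let end = remaining.find('\n').map(|i| i + 1).unwrap_or(remaining.len());
        let chunk = &remaining[..end];
        if chunk.starts_with('!') {
            // On error, mark the iterator finished instead of leaving the
            // offset untouched -- otherwise `next` would re-yield the same
            // error forever.
            self.finished = true;
            return Some(Err(format!("bad patch at offset {}", self.offset)));
        }
        self.offset += end;
        Some(Ok(chunk.trim_end()))
    }
}
```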
```rust
Rename {
    from: Cow<'a, str>,
    to: Cow<'a, str>,
},
/// Copy a file (copy from `from` to `to`, keep `from`).
///
/// Only produced when git extended headers explicitly indicate a copy.
Copy {
    from: Cow<'a, str>,
    to: Cow<'a, str>,
},
```
I think these are presently unused, but I assume they'll be used in a follow-up change?
Yes, for parsing git diff output.
I was just a bit lazy. Could split them smarter, if needed.
feel free to leave in here, just wanted to check :)
src/patches/parse.rs
Outdated
```rust
while let Some(line) = lines.next() {
    let next_line = lines.peek().copied();

    if is_unidiff_boundary(prev_line, line, next_line) {
        if patch_start.is_some() {
            // Found the start of the next patch
            patch_end = Some(byte_offset);
            break;
        }
        // Found start of first patch
        patch_start = Some(byte_offset);
        self.found_any = true;
    }

    prev_line = Some(line);
    byte_offset += line.len();
    byte_offset += line_ending_len(&remaining[byte_offset..]);
}

// No patch found
let patch_start = patch_start?;

// If no end found, patch extends to EOF
let patch_end = patch_end.unwrap_or(remaining.len());

let patch_content = &remaining[patch_start..patch_end];

let patch = match parse_one(patch_content) {
```
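The offset bookkeeping in this loop relies on a helper that measures the line terminator so byte offsets stay exact across `\n` and `\r\n` input. A plausible implementation (assuming the name `line_ending_len` from the snippet; not necessarily diffy's actual code) is:

```rust
// Sketch of the line-terminator helper used by the offset bookkeeping above:
// returns the byte length of the line ending at the start of `s`, so that
// `byte_offset` advances past "\r\n" (2 bytes), "\n" (1 byte), or EOF (0).
fn line_ending_len(s: &str) -> usize {
    if s.starts_with("\r\n") {
        2
    } else if s.starts_with('\n') {
        1
    } else {
        0
    }
}
```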
This seems slightly inefficient to me. We do the following:
- scan lines till we find a boundary
- re-scan the lines to parse them into a patch

This means that we go over the lines at least 2x, as well as reimplementing some of the parsing logic in is_unidiff_boundary, which may be better handled together with the other parsing logic. For example, the parsing can choke on some types of input based on how is_unidiff_boundary decides to split the input:
```rust
#[test]
fn false_positive_plus_plus_in_hunk() {
    // A hunk that adds a line whose content is literally "++ foo"
    // renders in the diff as "+++ foo" (the leading "+" is the add
    // marker). The boundary detector sees a line starting with the
    // modified-header prefix "+++ " followed by the next patch's
    // "--- " header and wrongly splits the first patch's hunk in
    // two, producing a malformed first patch.
    let content = "\
--- a/file1.rs
+++ b/file1.rs
@@ -1,2 +1,2 @@
 line1
-old
+++ foo
--- a/file2.rs
+++ b/file2.rs
@@ -1 +1 @@
-a
+b
";
    let patches = Patches::parse(content, ParseOptions::unidiff())
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    assert_eq!(patches.len(), 2);
}
```
Seems like a slightly more robust setup would be to have something like the following:

```rust
let (patch, consumed) = parse_one(remaining)?;
self.offset += patch_start + consumed;
```

Where parsing of a single patch gets a bit more robust, or we do something like:

```rust
let patch_start = find_patch_start(remaining)?;
let (patch, consumed) = parse_one(&remaining[patch_start..])?;
self.offset += patch_start + consumed;
```

Where we have some function that can handle finding the start of a patch and stripping that before passing it off to parse_one.
Thoughts?
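For the second option, a hypothetical `find_patch_start` (the name comes from the suggestion above; this is only a sketch, not diffy's implementation) could simply look for the first line that begins a unidiff original-file header and leave everything else to `parse_one`:

```rust
// Hypothetical sketch of `find_patch_start`: return the byte offset of the
// first line that looks like a unidiff original-file header ("--- "), or
// None if the input contains no patch at all.
fn find_patch_start(input: &str) -> Option<usize> {
    let mut offset = 0;
    // split_inclusive keeps the trailing '\n', so offsets stay exact.
    for line in input.split_inclusive('\n') {
        if line.starts_with("--- ") {
            return Some(offset);
        }
        offset += line.len();
    }
    None
}
```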
I picked the latter one. It seems more robust and actually simplified the logic quite a bit!
See f539bbf
src/lib.rs
Outdated
```rust
mod diff;
mod merge;
mod patch;
pub mod patches;
```
I wonder if "patch_set" would be a bit more descriptive over "patches"?
It was originally patchset / PatchSet. I forget why I changed it in weihanglo@e9eb5c1. Perhaps because "set" gives an unordered sense. I am willing to change to whatever name feels better!
src/patch/parse.rs
Outdated
```rust
fn consumed(&self) -> usize {
    self.offset
}
```
This function is identical to the one above it.
src/patches/error.rs
Outdated
```rust
pub struct PatchesParseError {
    pub(crate) kind: PatchesParseErrorKind,
    span: Option<Range<usize>>,
}
```
Do you see value in adding a separate parsing error for parsing a PatchSet vs parsing a single Patch?
One other option would be to have just a single Error type and extend the PatchParseErrorKind with the few you've added below.
It feels like having clearly-scoped error types is easier for lib users to understand, rather than looking at a gigantic enum and realizing it is for binary patching, not for single-file patching. diffy developers are also lib users.
However, for real end users it probably matters less, as we have a wrapper struct already. So, some directions:
- We have different private error enums for better code organization of diffy internals, and just one top-level error struct encapsulating everything.
- Just one private error enum and one public error struct.

(I planned to add more, BTW; see https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/src/patches/error.rs#L72-L102 and https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/src/binary/error.rs#L70-L103)
src/patches/parse.rs
Outdated
```rust
if !self.found_any {
    return Some(Err(self.error(PatchesParseErrorKind::NoPatchesFound)));
}
```
What's the rationale for returning an Error on an empty iterator vs just returning None?
No valid patch in a .patch file should be an error?
For example,
https://github.com/weihanglo/diffy/blob/4fc7c4f0e4bc6db55da6a8eb92f4cdf20113225c/examples/multi_file_patch.rs#L70-L74
If we don't do this, callers can still count the patches received to determine whether they want to error out. However, they need to handle it explicitly, which is error-prone and easy to forget.
The downside of making it a hard error is that callers can't ignore it (unless we provide some error reason inspection).
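A sketch of what that "error reason inspection" escape hatch could look like (the names `ParseErrorKind` and `allow_empty` are hypothetical, chosen only to illustrate the idea): a caller who is fine with empty input filters out just the no-patches case while still propagating real failures.

```rust
// Hypothetical sketch of error-reason inspection: a caller who accepts
// empty input downgrades only NoPatchesFound to an empty, successful
// parse, while other error kinds still propagate.
#[derive(Debug, PartialEq)]
enum ParseErrorKind {
    NoPatchesFound,
    Malformed,
}

fn allow_empty(
    result: Result<Vec<String>, ParseErrorKind>,
) -> Result<Vec<String>, ParseErrorKind> {
    match result {
        // Treat "no patches at all" as an empty, successful parse.
        Err(ParseErrorKind::NoPatchesFound) => Ok(Vec::new()),
        other => other,
    }
}
```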
src/patch/mod.rs
Outdated
```rust
pub(crate) fn original_path(&self) -> Option<Cow<'a, T>> {
    self.original.as_ref().map(|f| f.0.clone())
}

pub(crate) fn modified_path(&self) -> Option<Cow<'a, T>> {
    self.modified.as_ref().map(|f| f.0.clone())
}
```
This may result in over-cloning. Since these are pub(crate), we should be able to just return an Option<&Cow<'a, T>> so that downstream consumers can clone if they absolutely need to; if they don't, we avoid an extra allocation in the cases where the Cow is the Owned variant.
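The difference is easy to demonstrate with a standalone `Cow` field (the `File` struct below is a hypothetical stand-in, not the actual `Patch` type): the borrowing accessor never allocates, and the caller decides whether to clone.

```rust
use std::borrow::Cow;

// Hypothetical holder mirroring the shape in the snippet above: the path
// lives inside an Option<(Cow<str>, ...)> and accessors expose the Cow.
struct File<'a> {
    original: Option<(Cow<'a, str>, u32)>,
}

impl<'a> File<'a> {
    // Cloning accessor: allocates whenever the Cow is the Owned variant.
    fn original_path_cloned(&self) -> Option<Cow<'a, str>> {
        self.original.as_ref().map(|f| f.0.clone())
    }

    // Borrowing accessor: never allocates; callers clone only if they must.
    fn original_path(&self) -> Option<&Cow<'a, str>> {
        self.original.as_ref().map(|f| &f.0)
    }
}
```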
The boundary detector scanned lines twice and reimplemented parsing logic that couldn't distinguish hunk content from patch headers (e.g. "+++ foo" as added content vs "+++ b/file.rs" as header). Delegate boundary detection to parse_one, which naturally knows where a patch ends via hunk line counting. parse_one now always returns consumed bytes (even on error) so the iterator can advance. bmwill#55 (comment)
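The "hunk line counting" mentioned in the commit message works because a unidiff hunk header `@@ -a,b +c,d @@` states exactly how many original (`b`) and modified (`d`) lines follow. A minimal sketch of reading those counts (`hunk_line_counts` is a hypothetical name, not diffy's API):

```rust
// Sketch: parse the counts out of a unidiff hunk header "@@ -a,b +c,d @@".
// With these counts a parser knows precisely how many lines belong to the
// hunk, so "+++ foo" inside a hunk can never be mistaken for a file header.
fn hunk_line_counts(header: &str) -> Option<(usize, usize)> {
    let rest = header.strip_prefix("@@ -")?;
    let (ranges, _) = rest.split_once(" @@")?;
    let (orig, modi) = ranges.split_once(" +")?;
    // Each range is "start,count"; a missing ",count" means a count of 1.
    let count = |range: &str| -> Option<usize> {
        match range.split_once(',') {
            Some((_, c)) => c.parse().ok(),
            None => Some(1),
        }
    };
    Some((count(orig)?, count(modi)?))
}
```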
Preparation for multi-file patches support.
We adopted this suggestion:

```rust
let patch_start = find_patch_start(remaining)?;
let (patch, consumed) = parse_one(&remaining[patch_start..])?;
self.offset += patch_start + consumed;
```

bmwill#55 (comment)
Rename patches module to patch_set and Patches struct to PatchSet. Replace boundary pre-scan with single-pass parse_one. Terminate iterator on parse error while allowing callers to continue. bmwill#55 (comment) bmwill#55 (comment) bmwill#55 (comment) bmwill#55 (comment)
We need to track the lockfile because we want to make sure our CI is using rayon@1.10.0, and I don't want to block whoever wants to use the latest rayon to run tests. (Though that doesn't matter, as dependents won't respect our lockfile.) For other reasons, see

* <https://blog.rust-lang.org/2023/08/29/committing-lockfiles/>
* <https://doc.rust-lang.org/nightly/cargo/faq.html#why-have-cargolock-in-version-control>
* <https://doc.rust-lang.org/nightly/cargo/reference/rust-version.html>

Most importantly:

> Verifying a minimum-supported Rust version (MSRV)
> that is less than the latest version of a dependency supports

And actually, during the course of this, I found that `windows-link` requires Rust 1.71. `windows-link` is a transitive dependency of `anstyle` through `windows-sys`. Our MSRV is already a lie for Windows users when enabling `color`.

```console
$ cargo generate-lockfile --config 'resolver.incompatible-rust-versions="fallback"'
    Updating crates.io index
     Locking 24 packages to latest Rust 1.70.0 compatible versions
      Adding snapbox v0.6.24 (available: v1.1.0)
      Adding windows-link v0.2.1 (requires Rust 1.71)
```
I created a separate branch showing the incremental fixes to resolve review comments: https://github.com/weihanglo/diffy/commits/patches-working. You can review this branch, which I think is easier to reason about. Each commit message has a link to the respective review comment.
The original PR branch was rebased to keep the git history clean, without too many unnecessary intermediate commits. There is no diff between these two branches.
Hope this helps the review.
The new `patch_set` module parses unified diff input that contains changes to multiple files (a multi-file patch, compared to `Patch`, which is a single-file patch).

Key types:
- `PatchSet`: streaming iterator that yields `FilePatch`es
- `FilePatch`: combines a `Patch` with a `FileOperation`
- `FileOperation`: Create, Delete, Modify, Rename, Copy
- `ParseOptions`: configures the parsing format