Replies: 10 comments
---
My proposal and the code I've linked here seem very similar to what was implemented in #1407, but limited to a downstream cascade. Spyglass users continue to rely on upstream cascades.
---
@CBroz1 — Good observation connecting your proposal to #1407. This connects to several threads.

**What #1407 solved (downstream).** PR #1407 implemented graph-driven restriction propagation, but only in the downstream direction.

**What's still missing (upstream).** Three use cases all need upstream graph traversal.

**The provenance angle.** There's an additional connection worth noting: the datajoint-provenance design docs' gap analysis identifies the same need. In other words, upstream graph traversal serves double duty — it supports provenance tracking as well as ergonomic upstream access.

**PostgreSQL note.** One nuance: the original pressure for UUID aliasing in Spyglass came partly from MySQL's 3072-byte primary-key index limit (Spyglass #630). PostgreSQL doesn't have this limit, so pipelines running on Postgres can keep natural composite keys without hitting engine constraints. This reduces (but doesn't eliminate) the need for surrogate keys and the join-path complexity they introduce. Pipeline versioning and other architectural reasons for UUIDs still apply.

Design:
| Method | Direction | Convergence | Question |
|---|---|---|---|
| `cascade` | downstream | OR | What's affected if this is deleted? |
| `restrict` | downstream | AND | What satisfies all these conditions? |
| `trace` | upstream | OR | What contributed to this result? |
See subsequent comments for the full `self.upstream` design inside `make()` and strict-provenance mode.
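To make the directional semantics in the table concrete, here is a minimal sketch of the two OR-style closures (`trace` and `cascade`) over a toy FK graph in plain Python. The table names and the dict-based graph are illustrative only, not DataJoint's internal representation, and `restrict`'s AND convergence is not shown.

```python
# Toy FK graph as {child: [parents]}; edges point upstream. Table names
# are illustrative, and the dict stands in for DataJoint's real graph.
parents = {
    "Session": ["Mouse"],
    "Scan": ["Session"],
    "ExtractTraces": ["Scan"],
    "Analysis": ["ExtractTraces", "Session"],
}

def trace(table):
    """Upstream closure (OR): everything that contributed to `table`."""
    seen, stack = set(), [table]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def cascade(table):
    """Downstream closure (OR): everything affected if `table` is deleted."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    seen, stack = set(), [table]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

print(sorted(trace("Analysis")))   # ['ExtractTraces', 'Mouse', 'Scan', 'Session']
print(sorted(cascade("Mouse")))    # ['Analysis', 'ExtractTraces', 'Scan', 'Session']
```

Both closures are the same reachability computation, differing only in which direction the edges are followed.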
---
To sharpen the framing above: there are two goals here, and a single mechanism can address both.

**Goal 1 (critical): Provenance of computations.** A computed table's `make()` should read only from its upstream ancestors. Today the framework defines this convention but doesn't enforce it. Enforcement means giving `make()` a sanctioned way to reach upstream data — and only upstream data.

**Goal 2 (ergonomic): Convenient upstream access syntax.** The manual join construction that Spyglass users struggle with is the other face of the same problem: reaching a field two or three joins upstream should not require spelling out the join path.

**One mechanism serves both.** An upstream restriction operator that walks the FK graph resolves both goals simultaneously.

The key insight is that provenance enforcement and user ergonomics aren't competing goals — they're the same mechanism viewed from different angles. Making upstream access easy is what makes upstream-only enforcement practical.
---
Refining the design further — this should be a single mechanism built on `dj.Diagram`.

**The Trace Diagram.** A trace walks the FK graph from a restricted table expression up to its ancestors. This is the upstream mirror of the downstream cascade used by `delete`.

**Access pre-restricted ancestor tables.** The trace provides access to pre-restricted ancestor table expressions:

```python
trace = dj.Diagram.trace(self & key)

# Access by table class
trace[Session].fetch1("session_date")
trace[ExtractTraces].to_arrays("trace")
trace[Mouse].to_pandas()
trace[Session].keys()

# Access by attribute name (when unambiguous across ancestors)
trace["mouse_id"]           # returns the pre-restricted table owning this attribute
trace["mouse_id"].fetch1()

# Ambiguous name -> error directing you to qualify with the table
trace["id"]  # Error: 'id' found in Mouse, Session. Use trace[Mouse] or trace[Session].

# Inspect
trace.counts()              # entity counts per ancestor table
```

The trace doesn't reimplement fetch — it gives you restricted table expressions, and you call the normal fetch methods on them.

**Pre-constructed inside `make()`.** The trace would be built for you before `make()` runs, so the method body can use it directly.
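The attribute-name lookup with ambiguity detection could be sketched as below. This is a standalone toy `Trace` class for illustration only — the real object would hold restricted table expressions, whereas here each "table" is just a name with a set of attributes.

```python
# Toy sketch of attribute-name resolution with ambiguity detection.
# The real trace would map names to restricted table expressions;
# here each "table" is just a name with a set of attributes.
class Trace:
    def __init__(self, restricted):
        self.restricted = restricted  # {table_name: {attribute_names}}

    def __getitem__(self, name):
        if name in self.restricted:   # lookup by table name
            return name
        owners = sorted(t for t, attrs in self.restricted.items() if name in attrs)
        if not owners:
            raise KeyError(f"{name!r} not found in any ancestor")
        if len(owners) > 1:
            raise KeyError(
                f"{name!r} found in {', '.join(owners)}. Qualify with the table."
            )
        return owners[0]              # the single owning ancestor

trace = Trace({"Mouse": {"id", "mouse_id"}, "Session": {"id", "session_date"}})
print(trace["session_date"])  # Session
```

The ambiguity error is deliberately loud: silently picking one owner would make provenance depend on dictionary ordering.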
---
Updated — the trace should support all the 2.0 fetch variants, not a single fetch method. Rather than introducing a separate trace object, this functionality belongs on the table expression itself, via indexing:

```python
def make(self, key):
    # Access pre-restricted ancestor by table class
    session_date = self[Session].fetch1("session_date")
    traces = self[ExtractTraces].to_arrays("trace")
    mouse_info = self[Mouse].to_pandas()
    trial_keys = self[Session.Trial].keys()

    # Access by attribute name (when unambiguous across ancestors)
    self["mouse_id"]           # returns the pre-restricted ancestor table that owns this attribute
    self["mouse_id"].fetch1()  # fetch from it

    # Ambiguous name -> clear error
    self["id"]  # Error: 'id' found in Mouse, Session. Use self[Mouse] or self[Session].
```

This works uniformly outside `make()` as well:

```python
# Interactive use: same syntax, same mechanics
(MyChild & {"this_id": "abc"})[Session].to_dicts()
(MyChild & {"this_id": "abc"})["mouse_id"].fetch1()
```

No new objects to learn, no separate API. The table expression knows its graph and can navigate upstream. The Diagram infrastructure (FK graph traversal, attr_map handling, cross-schema discovery from #1407) powers the resolution under the hood. In strict-provenance mode, this indexing becomes the sanctioned way to read upstream data inside `make()`.
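Under the hood, resolving `self[Session]` amounts to a semijoin: restrict the ancestor to the key values projected from the restricted child. A minimal sketch on plain dicts — the data, attribute names, and the `restrict_upstream` helper are toy constructions, not DataJoint internals:

```python
# Semijoin sketch on plain dicts: restrict an ancestor to the key values
# that appear in the (already restricted) child. Data and attribute
# names are toy examples.
def restrict_upstream(ancestor_rows, child_rows, fk_attrs):
    keys = {tuple(row[a] for a in fk_attrs) for row in child_rows}
    return [row for row in ancestor_rows if tuple(row[a] for a in fk_attrs) in keys]

sessions = [
    {"mouse_id": "m1", "session": 1, "session_date": "2024-01-01"},
    {"mouse_id": "m2", "session": 1, "session_date": "2024-01-02"},
]
scans = [{"mouse_id": "m1", "session": 1, "scan_id": 7}]

# Conceptually, restricting Session by a restricted Scan reduces to:
restricted = restrict_upstream(sessions, scans, ["mouse_id", "session"])
print(restricted)  # [{'mouse_id': 'm1', 'session': 1, 'session_date': '2024-01-01'}]
```

Chaining this step along the FK path is what lets the restriction hop multiple tables upstream.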
---
One more refinement: if the restriction key is bound to the table instance, `make()` no longer needs an explicit `key` argument:

```python
def make(self):
    key = self.key  # available as a property if needed explicitly

    # All upstream data accessible through self
    session_date = self[Session].fetch1("session_date")
    traces = self[ExtractTraces].to_arrays("trace")

    self.insert1({**self.key, "result": compute(traces)})
```

This also strengthens strict-provenance mode: the framework controls what the `make()` body can see, because all inputs flow through `self`.
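Dispatching between the one-argument and two-argument forms is straightforward with `inspect`. A hypothetical dispatcher — `call_make` and the demo classes are illustrative, not the framework's actual code:

```python
import inspect

# Hypothetical dispatcher: inspect make() and choose the calling
# convention. Class and attribute names are illustrative.
def call_make(table, key):
    params = list(inspect.signature(table.make).parameters)
    if "key" in params:      # old style: make(self, key)
        table.make(key)
    else:                    # new style: make(self), key bound on the instance
        table.key = key
        table.make()

class OldStyle:
    def make(self, key):
        self.result = ("old", key)

class NewStyle:
    def make(self):
        self.result = ("new", self.key)

t_old, t_new = OldStyle(), NewStyle()
call_make(t_old, {"id": 1})
call_make(t_new, {"id": 2})
print(t_old.result, t_new.result)  # ('old', {'id': 1}) ('new', {'id': 2})
```

Since `inspect.signature` on a bound method already excludes `self`, the check reduces to whether a `key` parameter is declared.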
---
This leaves three open design questions.

**1. Backward compatibility.** The old two-argument signature keeps working alongside the new one:

```python
# Old style — always works
def make(self, key):
    session_date = (Session & key).fetch1("session_date")
    self.insert1({**key, "result": compute(session_date)})

# New style — also always works
def make(self):
    session_date = self[Session].fetch1("session_date")
    self.insert1({**self.key, "result": compute(session_date)})
```

The framework detects which signature a `make()` declares and calls it accordingly.

**2. Strict provenance mode.** A configuration setting that enforces the upstream-only convention at runtime:

```python
dj.config["strict_provenance"] = True
```

This is an operational concern, not a schema property — the same schema definitions work in both modes. A team enables it globally in config without touching any schema constructors. Useful for enabling in production while leaving it off during development/debugging. When enabled, old-style direct fetches on table objects inside `make()` would raise an error.

**3. Pre-restricted ancestors outside `make()`.** How far the same indexing should extend to interactive use outside `make()` remains open.
---
On query operators — since `self[Table]` returns a pre-restricted `QueryExpression`, the full query algebra composes on top of it:

```python
# Old style
(Session & key).aggr(Scan & key, n='count(scan_id)')

# New style — self[Table] returns a pre-restricted QueryExpression
self[Session].aggr(self[Scan], n='count(scan_id)')
```

Both sides are independently pre-restricted by the key. Everything after the indexing is ordinary query algebra:

```python
# Aggregation
self[Session].aggr(self[Scan], n='count(scan_id)')

# Join
self[Session] * self[Scan]

# Projection
self[Session].proj('session_date', duration='timestampdiff(session_start, session_end)')

# Chained
self[Session].aggr(self[Scan] & 'scan_type="two_photon"', n='count(scan_id)')
```

Importantly, indexing only resolves ancestors:

```python
self[UnrelatedTable]  # Error: UnrelatedTable is not an ancestor of MyComputed
self[Sibling]         # Error: Sibling is not an ancestor of MyComputed
```

This is the provenance guarantee: the upstream subgraph is closed. In strict-provenance mode, everything that enters `make()` must come through this closed subgraph.
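The ancestor-only rule reduces to a membership test against the upstream closure. A toy sketch — the graph, table names, and `upstream_index` helper are hypothetical:

```python
# Toy upstream closure and ancestor-only indexing (hypothetical names).
parents = {"MyComputed": ["Session"], "Sibling": ["Session"], "Session": ["Mouse"]}

def ancestors(table):
    out, stack = set(), [table]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def upstream_index(table, target):
    # Mirrors self[Target]: only ancestors are resolvable.
    if target not in ancestors(table):
        raise KeyError(f"{target} is not an ancestor of {table}")
    return f"{target} restricted by {table}"

print(upstream_index("MyComputed", "Mouse"))  # Mouse restricted by MyComputed
```

Note that `Sibling` shares an ancestor with `MyComputed` but is not itself in the closure, so the lookup fails — exactly the closed-subgraph guarantee described above.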
---
I would be excited to have this feature in DataJoint. There are a few other features of Spyglass that might be worth mentioning related to this proposal.

All that to say that ...
---
I have a couple of shelved ideas that I never put into DataJoint issues, but instead implemented fixes for in my own work.
**Problem**
For long pipelines, my team has hit some issues with key length and the need to version pipelines. We solved each of these by 'burying' foreign keys into a single primary key alias, usually a UUID.
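One common recipe for such burying is a deterministic hash of the composite key, e.g. with `uuid.uuid5`. This sketch is illustrative only — the namespace and canonicalization are made up here, not Spyglass's actual scheme:

```python
import uuid

# Illustrative recipe for burying a composite key into one surrogate UUID
# via a deterministic uuid5 hash. The namespace and canonicalization are
# made up for this sketch, not Spyglass's actual scheme.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.pipeline")

def bury(composite_key):
    # Canonicalize to a stable string so attribute order doesn't matter.
    canonical = "/".join(f"{k}={composite_key[k]}" for k in sorted(composite_key))
    return uuid.uuid5(NAMESPACE, canonical)

key = {"mouse_id": "m1", "session": 1, "curation_id": 0}
alias = bury(key)
assert bury(dict(key)) == alias        # deterministic: same key, same UUID
assert bury({**key, "session": 2}) != alias
```

A deterministic hash (rather than a random `uuid4`) makes the alias reproducible from the natural key, which simplifies re-ingestion and debugging.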
It then became difficult for our users to figure out the right joins so they could restrict these aliasing tables based on some upstream field. The following was not intuitive for them, and even less so if the field they wanted to restrict was 2 or 3 joins upstream.
**Ideas**
I think we could address this difficulty by extracting the `cascade` functionality from `delete` here and finding a way to apply the same idea to graph functions. I took the idea a step further and introduced 'long distance' restrictions to our codebase (doc, implementation) allowing users to do the following...
See also our demo in pytests: tables, and tests.
It's slower than a join, but it's seen wide adoption among our users, and makes it much easier to explore how a single subject, for example, is populated across many different downstream tables. It's not something I would suggest for use in a production fetch, but it's been more intuitive for sorting through the table graph.
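The 'long distance' restriction can be understood as two steps: find a join path through the FK graph, then apply the restriction along it as a chain of semijoins. A toy sketch of the path search — graph, table names, and `fk_path` are illustrative, and the Spyglass implementation differs:

```python
from collections import deque

# Toy sketch: find a join path through the FK graph by BFS, the first step
# of a 'long distance' restriction. Graph and table names are illustrative;
# the Spyglass implementation differs.
parents = {"Session": ["Mouse"], "Scan": ["Session"]}

def fk_path(child, ancestor):
    queue = deque([[child]])
    while queue:
        path = queue.popleft()
        if path[-1] == ancestor:
            return path
        for p in parents.get(path[-1], []):
            queue.append(path + [p])
    raise ValueError(f"no upstream path from {child} to {ancestor}")

# The restriction would then be applied as semijoins along this path.
print(fk_path("Scan", "Mouse"))  # ['Scan', 'Session', 'Mouse']
```

The path search is what makes this slower than a hand-written join, but it is also what frees users from knowing the intermediate tables.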
To this end, it would be helpful if `assert_join_compatibility` here could be split up into an `is_join_compatible` function that returns a boolean, and another that raises the error.