Replies: 10 comments
---
My proposal and the code I've linked here seem very similar to what was implemented in #1407, but limited to a downstream cascade. Spyglass users continue to rely on upstream cascades.
---
@CBroz1 — Good observation connecting your proposal to #1407. This connects to several threads.

**What #1407 solved (downstream).** PR #1407 implemented graph-driven restriction propagation, but only in the downstream direction.

**What's still missing (upstream).** Three use cases all need upstream graph traversal.

**The provenance angle.** There's an additional connection worth noting: the datajoint-provenance design docs' gap analysis identifies the same need. In other words, upstream graph traversal serves double duty — it supports provenance tracking as well as ergonomic upstream access.

**PostgreSQL note.** One nuance: the original pressure for UUID aliasing in Spyglass came partly from MySQL's 3072-byte primary-key index limit (Spyglass #630). PostgreSQL doesn't have this limit, so pipelines running on Postgres can keep natural composite keys without hitting engine constraints. This reduces (but doesn't eliminate) the need for surrogate keys and the join-path complexity they introduce. Pipeline versioning and other architectural reasons for UUIDs still apply.

Design:
| Method | Direction | Convergence | Question |
|---|---|---|---|
| `cascade` | downstream | OR | What's affected if this is deleted? |
| `restrict` | downstream | AND | What satisfies all these conditions? |
| `trace` | upstream | OR | What contributed to this result? |
See subsequent comments for the full `self.upstream` design inside `make()` and strict-provenance mode.
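To make the directional semantics in the table concrete, here is a minimal sketch of the two OR-style closures (`trace` and `cascade`) over a toy FK graph in plain Python. The table names and the dict-based graph are illustrative only, not DataJoint's internal representation, and `restrict`'s AND convergence is not shown.

```python
# Toy FK graph as {child: [parents]}; edges point upstream. Table names
# are illustrative, and the dict stands in for DataJoint's real graph.
parents = {
    "Session": ["Mouse"],
    "Scan": ["Session"],
    "ExtractTraces": ["Scan"],
    "Analysis": ["ExtractTraces", "Session"],
}

def trace(table):
    """Upstream closure (OR): everything that contributed to `table`."""
    seen, stack = set(), [table]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def cascade(table):
    """Downstream closure (OR): everything affected if `table` is deleted."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    seen, stack = set(), [table]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

print(sorted(trace("Analysis")))   # ['ExtractTraces', 'Mouse', 'Scan', 'Session']
print(sorted(cascade("Mouse")))    # ['Analysis', 'ExtractTraces', 'Scan', 'Session']
```

Both closures are the same reachability computation, differing only in which direction the edges are followed.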
---
To sharpen the framing above: there are two goals here, and a single mechanism can address both.

**Goal 1 (critical): Provenance of computations.** A computed table's `make()` should read only from its upstream ancestors. Today the framework defines this convention but doesn't enforce it. Enforcement means giving `make()` a sanctioned way to reach upstream data — and only upstream data.

**Goal 2 (ergonomic): Convenient upstream access syntax.** The manual join construction that Spyglass users struggle with is the other face of the same problem: reaching a field two or three joins upstream should not require spelling out the join path.

**One mechanism serves both.** An upstream restriction operator that walks the FK graph resolves both goals simultaneously.

The key insight is that provenance enforcement and user ergonomics aren't competing goals — they're the same mechanism viewed from different angles. Making upstream access easy is what makes upstream-only enforcement practical.
---
Refining the design further — this should be a single mechanism built on `dj.Diagram`.

**The Trace Diagram.** A trace walks the FK graph from a restricted table expression up to its ancestors. This is the upstream mirror of the downstream cascade used by `delete`.

**Access pre-restricted ancestor tables.** The trace provides access to pre-restricted ancestor table expressions:

```python
trace = dj.Diagram.trace(self & key)

# Access by table class
trace[Session].fetch1("session_date")
trace[ExtractTraces].to_arrays("trace")
trace[Mouse].to_pandas()
trace[Session].keys()

# Access by attribute name (when unambiguous across ancestors)
trace["mouse_id"]           # returns the pre-restricted table owning this attribute
trace["mouse_id"].fetch1()

# Ambiguous name -> error directing you to qualify with the table
trace["id"]  # Error: 'id' found in Mouse, Session. Use trace[Mouse] or trace[Session].

# Inspect
trace.counts()              # entity counts per ancestor table
```

The trace doesn't reimplement fetch — it gives you restricted table expressions, and you call the normal fetch methods on them.

**Pre-constructed inside `make()`.** The trace would be built for you before `make()` runs, so the method body can use it directly.
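The attribute-name lookup with ambiguity detection could be sketched as below. This is a standalone toy `Trace` class for illustration only — the real object would hold restricted table expressions, whereas here each "table" is just a name with a set of attributes.

```python
# Toy sketch of attribute-name resolution with ambiguity detection.
# The real trace would map names to restricted table expressions;
# here each "table" is just a name with a set of attributes.
class Trace:
    def __init__(self, restricted):
        self.restricted = restricted  # {table_name: {attribute_names}}

    def __getitem__(self, name):
        if name in self.restricted:   # lookup by table name
            return name
        owners = sorted(t for t, attrs in self.restricted.items() if name in attrs)
        if not owners:
            raise KeyError(f"{name!r} not found in any ancestor")
        if len(owners) > 1:
            raise KeyError(
                f"{name!r} found in {', '.join(owners)}. Qualify with the table."
            )
        return owners[0]              # the single owning ancestor

trace = Trace({"Mouse": {"id", "mouse_id"}, "Session": {"id", "session_date"}})
print(trace["session_date"])  # Session
```

The ambiguity error is deliberately loud: silently picking one owner would make provenance depend on dictionary ordering.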
---
Updated — the trace should support all the 2.0 fetch variants, not a single fetch method. Rather than introducing a separate trace object, this functionality belongs on the table expression itself, via indexing:

```python
def make(self, key):
    # Access pre-restricted ancestor by table class
    session_date = self[Session].fetch1("session_date")
    traces = self[ExtractTraces].to_arrays("trace")
    mouse_info = self[Mouse].to_pandas()
    trial_keys = self[Session.Trial].keys()

    # Access by attribute name (when unambiguous across ancestors)
    self["mouse_id"]           # returns the pre-restricted ancestor table that owns this attribute
    self["mouse_id"].fetch1()  # fetch from it

    # Ambiguous name -> clear error
    self["id"]  # Error: 'id' found in Mouse, Session. Use self[Mouse] or self[Session].
```

This works uniformly outside `make()` as well:

```python
# Interactive use: same syntax, same mechanics
(MyChild & {"this_id": "abc"})[Session].to_dicts()
(MyChild & {"this_id": "abc"})["mouse_id"].fetch1()
```

No new objects to learn, no separate API. The table expression knows its graph and can navigate upstream. The Diagram infrastructure (FK graph traversal, attr_map handling, cross-schema discovery from #1407) powers the resolution under the hood. In strict-provenance mode, this indexing becomes the sanctioned way to read upstream data inside `make()`.
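Under the hood, resolving `self[Session]` amounts to a semijoin: restrict the ancestor to the key values projected from the restricted child. A minimal sketch on plain dicts — the data, attribute names, and the `restrict_upstream` helper are toy constructions, not DataJoint internals:

```python
# Semijoin sketch on plain dicts: restrict an ancestor to the key values
# that appear in the (already restricted) child. Data and attribute
# names are toy examples.
def restrict_upstream(ancestor_rows, child_rows, fk_attrs):
    keys = {tuple(row[a] for a in fk_attrs) for row in child_rows}
    return [row for row in ancestor_rows if tuple(row[a] for a in fk_attrs) in keys]

sessions = [
    {"mouse_id": "m1", "session": 1, "session_date": "2024-01-01"},
    {"mouse_id": "m2", "session": 1, "session_date": "2024-01-02"},
]
scans = [{"mouse_id": "m1", "session": 1, "scan_id": 7}]

# Conceptually, restricting Session by a restricted Scan reduces to:
restricted = restrict_upstream(sessions, scans, ["mouse_id", "session"])
print(restricted)  # [{'mouse_id': 'm1', 'session': 1, 'session_date': '2024-01-01'}]
```

Chaining this step along the FK path is what lets the restriction hop multiple tables upstream.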
---
One more refinement: if the restriction key is bound to the table instance, `make()` no longer needs an explicit `key` argument:

```python
def make(self):
    key = self.key  # available as a property if needed explicitly

    # All upstream data accessible through self
    session_date = self[Session].fetch1("session_date")
    traces = self[ExtractTraces].to_arrays("trace")

    self.insert1({**self.key, "result": compute(traces)})
```

This also strengthens strict-provenance mode: the framework controls what the `make()` body can see, because all inputs flow through `self`.
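Dispatching between the one-argument and two-argument forms is straightforward with `inspect`. A hypothetical dispatcher — `call_make` and the demo classes are illustrative, not the framework's actual code:

```python
import inspect

# Hypothetical dispatcher: inspect make() and choose the calling
# convention. Class and attribute names are illustrative.
def call_make(table, key):
    params = list(inspect.signature(table.make).parameters)
    if "key" in params:      # old style: make(self, key)
        table.make(key)
    else:                    # new style: make(self), key bound on the instance
        table.key = key
        table.make()

class OldStyle:
    def make(self, key):
        self.result = ("old", key)

class NewStyle:
    def make(self):
        self.result = ("new", self.key)

t_old, t_new = OldStyle(), NewStyle()
call_make(t_old, {"id": 1})
call_make(t_new, {"id": 2})
print(t_old.result, t_new.result)  # ('old', {'id': 1}) ('new', {'id': 2})
```

Since `inspect.signature` on a bound method already excludes `self`, the check reduces to whether a `key` parameter is declared.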
---
This leaves three open design questions.

**1. Backward compatibility.** The old two-argument signature keeps working alongside the new one:

```python
# Old style — always works
def make(self, key):
    session_date = (Session & key).fetch1("session_date")
    self.insert1({**key, "result": compute(session_date)})

# New style — also always works
def make(self):
    session_date = self[Session].fetch1("session_date")
    self.insert1({**self.key, "result": compute(session_date)})
```

The framework detects which signature a `make()` declares and calls it accordingly.

**2. Strict provenance mode.** A configuration setting that enforces the upstream-only convention at runtime:

```python
dj.config["strict_provenance"] = True
```

This is an operational concern, not a schema property — the same schema definitions work in both modes. A team enables it globally in config without touching any schema constructors. Useful for enabling in production while leaving it off during development/debugging. When enabled, old-style direct fetches on table objects inside `make()` would raise an error.

**3. Pre-restricted ancestors outside `make()`.** How far the same indexing should extend to interactive use outside `make()` remains open.
---
On query operators — since `self[Table]` returns a pre-restricted `QueryExpression`, the full query algebra composes on top of it:

```python
# Old style
(Session & key).aggr(Scan & key, n='count(scan_id)')

# New style — self[Table] returns a pre-restricted QueryExpression
self[Session].aggr(self[Scan], n='count(scan_id)')
```

Both sides are independently pre-restricted by the key. Everything after the indexing is ordinary query algebra:

```python
# Aggregation
self[Session].aggr(self[Scan], n='count(scan_id)')

# Join
self[Session] * self[Scan]

# Projection
self[Session].proj('session_date', duration='timestampdiff(session_start, session_end)')

# Chained
self[Session].aggr(self[Scan] & 'scan_type="two_photon"', n='count(scan_id)')
```

Importantly, indexing only resolves ancestors:

```python
self[UnrelatedTable]  # Error: UnrelatedTable is not an ancestor of MyComputed
self[Sibling]         # Error: Sibling is not an ancestor of MyComputed
```

This is the provenance guarantee: the upstream subgraph is closed. In strict-provenance mode, everything that enters `make()` must come through this closed subgraph.
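The ancestor-only rule reduces to a membership test against the upstream closure. A toy sketch — the graph, table names, and `upstream_index` helper are hypothetical:

```python
# Toy upstream closure and ancestor-only indexing (hypothetical names).
parents = {"MyComputed": ["Session"], "Sibling": ["Session"], "Session": ["Mouse"]}

def ancestors(table):
    out, stack = set(), [table]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

def upstream_index(table, target):
    # Mirrors self[Target]: only ancestors are resolvable.
    if target not in ancestors(table):
        raise KeyError(f"{target} is not an ancestor of {table}")
    return f"{target} restricted by {table}"

print(upstream_index("MyComputed", "Mouse"))  # Mouse restricted by MyComputed
```

Note that `Sibling` shares an ancestor with `MyComputed` but is not itself in the closure, so the lookup fails — exactly the closed-subgraph guarantee described above.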
---
I would be excited to have this feature in DataJoint. There are a few other features of Spyglass that might be worth mentioning related to this proposal.

All that to say that ...
---
I have a couple of shelved ideas that I never put into DataJoint issues, but instead implemented fixes for in my own work.
**Problem**
For long pipelines, my team has hit some issues with key length and the need to version pipelines. We solved each of these by 'burying' foreign keys into a single primary key alias, usually a UUID.
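One common recipe for such burying is a deterministic hash of the composite key, e.g. with `uuid.uuid5`. This sketch is illustrative only — the namespace and canonicalization are made up here, not Spyglass's actual scheme:

```python
import uuid

# Illustrative recipe for burying a composite key into one surrogate UUID
# via a deterministic uuid5 hash. The namespace and canonicalization are
# made up for this sketch, not Spyglass's actual scheme.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.pipeline")

def bury(composite_key):
    # Canonicalize to a stable string so attribute order doesn't matter.
    canonical = "/".join(f"{k}={composite_key[k]}" for k in sorted(composite_key))
    return uuid.uuid5(NAMESPACE, canonical)

key = {"mouse_id": "m1", "session": 1, "curation_id": 0}
alias = bury(key)
assert bury(dict(key)) == alias        # deterministic: same key, same UUID
assert bury({**key, "session": 2}) != alias
```

A deterministic hash (rather than a random `uuid4`) makes the alias reproducible from the natural key, which simplifies re-ingestion and debugging.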
It then became difficult for our users to figure out the right joins so they could restrict these aliasing tables based on some upstream field. The following was not intuitive for them, and even less so if the field they wanted to restrict was 2 or 3 joins upstream.
**Ideas**
I think we could address this difficulty by extracting the `cascade` functionality from `delete` here and finding a way to apply the same idea to graph functions. I took the idea a step further and introduced 'long distance' restrictions to our codebase (doc, implementation) allowing users to do the following...
See also our demo in pytests: tables, and tests.
It's slower than a join, but it's seen wide adoption among our users, and makes it much easier to explore how a single subject, for example, is populated across many different downstream tables. It's not something I would suggest for use in a production fetch, but it's been more intuitive for sorting through the table graph.
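The 'long distance' restriction can be understood as two steps: find a join path through the FK graph, then apply the restriction along it as a chain of semijoins. A toy sketch of the path search — graph, table names, and `fk_path` are illustrative, and the Spyglass implementation differs:

```python
from collections import deque

# Toy sketch: find a join path through the FK graph by BFS, the first step
# of a 'long distance' restriction. Graph and table names are illustrative;
# the Spyglass implementation differs.
parents = {"Session": ["Mouse"], "Scan": ["Session"]}

def fk_path(child, ancestor):
    queue = deque([[child]])
    while queue:
        path = queue.popleft()
        if path[-1] == ancestor:
            return path
        for p in parents.get(path[-1], []):
            queue.append(path + [p])
    raise ValueError(f"no upstream path from {child} to {ancestor}")

# The restriction would then be applied as semijoins along this path.
print(fk_path("Scan", "Mouse"))  # ['Scan', 'Session', 'Mouse']
```

The path search is what makes this slower than a hand-written join, but it is also what frees users from knowing the intermediate tables.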
To this end, it would be helpful if `assert_join_compatibility` here could be split up into an `is_join_compatible` function that returns a boolean, and another that raises the error.