
Merge TeFlow into codebase#36

Merged
Kin-Zhang merged 17 commits into KTH-RPL:main from Kin-Zhang:feature/teflow
Mar 31, 2026

Conversation

@Kin-Zhang Kin-Zhang commented Mar 11, 2026

AI Summary:

TeFlow Merged:

  • Temporal / multi-frame self-supervised learning support: Major expansion of the self-supervision loss stack in src/lossfuncs/selfsupervise.py, enabling TeFlow-style training signals (i.e., losses that rely on cross-frame consistency rather than only supervised labels).
  • Exposed/registered new loss components via src/lossfuncs/__init__.py so they can be selected/configured cleanly from training.
  • Updated training entrypoint and orchestration (train.py, src/trainer.py) to run TeFlow experiments end-to-end (config → dataset → forward → TeFlow losses → logging/metrics).
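The "automatic collection" of loss components could work roughly like the sketch below. This is purely illustrative: the module stub and the `SSL_LOSSES` name are assumptions, not the repo's actual identifiers.

```python
# Hypothetical sketch of auto-registering loss functions from a module,
# in the spirit of what src/lossfuncs/__init__.py is described as doing.
# `selfsupervise` is stubbed here; in the repo it would be a real import.
import inspect
import types

selfsupervise = types.ModuleType("selfsupervise")

def chamfer_ssl_loss(res_dict):  # illustrative loss stubs
    ...

def teflow_loss(res_dict):
    ...

selfsupervise.chamfer_ssl_loss = chamfer_ssl_loss
selfsupervise.teflow_loss = teflow_loss

# Collect every public function into a name -> fn map, so a config string
# like `loss: teflow_loss` can be resolved cleanly at training time.
SSL_LOSSES = {
    name: fn
    for name, fn in inspect.getmembers(selfsupervise, inspect.isfunction)
    if not name.startswith("_")
}

print(sorted(SSL_LOSSES))  # ['chamfer_ssl_loss', 'teflow_loss']
```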

CUDA / plugin / infrastructure updates:

  • Refactored assets/cuda/chamfer3D/__init__.py to support batched Chamfer distance computation using CUDA streams, significantly improving multi-sample throughput and GPU utilization. Added new utility methods, improved docstrings, and streamlined the interface for both single and batched inputs. Enhanced the test section for batched processing and index validation.
  • Updated assets/cuda/chamfer3D/setup.py to bump the version from 1.0.5 to 1.0.6, reflecting the new features and optimizations.
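As a concrete reference for what the batched kernel computes, here is a hedged CPU sketch of the (optionally truncated) Chamfer distance. It is illustrative only: the repo's kernel operates on torch tensors on the GPU, while this version uses plain tuples for clarity.

```python
# CPU reference sketch of truncated Chamfer distance (not the repo's code).
def chamfer(pc0, pc1, truncate_dist=-1.0):
    def sq(a, b):  # squared Euclidean distance between two 3D points
        return sum((x - y) ** 2 for x, y in zip(a, b))
    d0 = [min(sq(p, q) for q in pc1) for p in pc0]  # pc0 -> nearest in pc1
    d1 = [min(sq(q, p) for p in pc0) for q in pc1]  # pc1 -> nearest in pc0
    if truncate_dist > 0:  # drop correspondences beyond the threshold
        d0 = [d for d in d0 if d <= truncate_dist] or [0.0]
        d1 = [d for d in d1 if d <= truncate_dist] or [0.0]
    return sum(d0) / len(d0) + sum(d1) / len(d1)

print(chamfer([(0, 0, 0)], [(1, 0, 0)]))  # 2.0 (squared dist 1.0 each way)
```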

Documentation:

  • Updated top-level docs and environment notes:
    • README.md
    • assets/README.md
  • Updated the evaluation mask to exclude ground points from the evaluation score (note: this changes the validation score on the Argoverse 2 dataset).
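A minimal sketch of that mask change; the field names (`eval_mask`, `ground_mask`) are illustrative, not necessarily the AV2 loader's actual keys:

```python
# Hedged sketch: drop ground points from the evaluation mask before scoring.
def refine_eval_mask(eval_mask, ground_mask):
    """Keep a point in the eval mask only if it is not a ground point."""
    return [e and not g for e, g in zip(eval_mask, ground_mask)]

eval_mask   = [True, True, True, False]
ground_mask = [False, True, False, False]  # point 1 is ground
print(refine_eval_mask(eval_mask, ground_mask))  # [True, False, True, False]
```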

TeFlow was accepted to CVPR 2026. 🎉🎉🎉 I'm now working on releasing the code.

Please check the progress in the forked teflow branch.

Once it's ready, I will merge it into the codebase with the updated README.

Kin-Zhang and others added 8 commits January 10, 2026 18:40
…ether.

* reset the fire class usage directly.
* add easy screenshot saving with multi-view.
* sync the viewpoint across different windows.
* visualize the lidar center tf if slc is set to True.
* we found that the AV2-provided eval_mask sometimes includes ground points in the first few frames,
* and some ground points have ground-truth flows because of bounding-box labeling, etc.
* the method ranking should still be safe, though: all methods here set ground-point flow to pose_flow, so they all incur the same error when ground points are included.

the dataset changes may be reverted later, since they only apply to AV2.
* Add teflowLoss into the codebase.
* update chamfer3D with batched computation driven by CUDA streams to keep the GPU busy.

AI summary:
- Added automatic collection of self-supervised loss function names in `src/lossfuncs/__init__.py`.
- Improved documentation and structure of self-supervised loss functions in `src/lossfuncs/selfsupervise.py`.
- Refactored loss calculation logic in `src/trainer.py` to support new self-supervised loss functions.
- Introduced `ssl_loss_calculator` method for handling self-supervised losses.
- Updated training step to differentiate between self-supervised and supervised loss calculations.
- Enhanced error handling during training and validation steps to skip problematic batches.
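The supervised/self-supervised split described above can be sketched as a small dispatch; this is not the repo's actual trainer code, and all names here (`SSL_LOSSES`, `compute_loss`, the lambda losses) are assumptions for illustration.

```python
# Illustrative sketch of a training step dispatching between supervised and
# self-supervised loss calculation, in the spirit of ssl_loss_calculator.
SSL_LOSSES = {"teflowLoss", "chamferDis"}  # hypothetical registry

def compute_loss(loss_name, loss_fns, batch, preds):
    if loss_name in SSL_LOSSES:
        # Self-supervised path: loss comes from cross-frame consistency,
        # no ground-truth flow labels required.
        return loss_fns[loss_name](batch, preds)
    # Supervised path: compare predictions against labeled flow.
    return loss_fns[loss_name](preds, batch["gt_flow"])

loss_fns = {
    "l2": lambda preds, gt: sum((p - g) ** 2 for p, g in zip(preds, gt)),
    "teflowLoss": lambda batch, preds: sum(abs(p) for p in preds) / len(preds),
}
print(compute_loss("l2", loss_fns, {"gt_flow": [1.0, 2.0]}, [1.0, 3.0]))  # 1.0
```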
@Kin-Zhang Kin-Zhang added the new method label Mar 11, 2026
@Kin-Zhang Kin-Zhang linked an issue Mar 11, 2026 that may be closed by this pull request
Comment on lines +96 to +128
def batched(self,
            pc0_list: List[torch.Tensor],
            pc1_list: List[torch.Tensor],
            truncate_dist: float = -1) -> torch.Tensor:
    """Parallel Chamfer loss via B CUDA streams.

    Returns mean-over-samples: (1/B) * Σ_i [mean(dist0_i) + mean(dist1_i)].
    ~1.14× faster than serial loop on RTX 3090 @ 88K pts/sample;
    more importantly, keeps GPU busy with one sustained work block per frame.
    """
    B = len(pc0_list)
    if B == 1:
        return self.forward(pc0_list[0], pc1_list[0], truncate_dist)

    streams = self._ensure_streams(B)
    main = torch.cuda.current_stream()
    per_loss: List[torch.Tensor] = [None] * B  # type: ignore[list-item]

    for i in range(B):
        streams[i].wait_stream(main)
        with torch.cuda.stream(streams[i]):
            d0, d1, _, _ = ChamferDis.apply(pc0_list[i].contiguous(),
                                            pc1_list[i].contiguous())
            if truncate_dist <= 0:
                per_loss[i] = d0.mean() + d1.mean()
            else:
                v0, v1 = d0 <= truncate_dist, d1 <= truncate_dist
                per_loss[i] = torch.nanmean(d0[v0]) + torch.nanmean(d1[v1])

    for i in range(B):
        main.wait_stream(streams[i])

    return torch.stack(per_loss).mean()

@Kin-Zhang Kin-Zhang Mar 12, 2026

Speed Performance: Stream CUDA vs For-loop

Quick demo benchmark (1 GPU, bz=8, 312 samples):

Stream CUDA:  1.14s/it  →  Epoch 1: 46%|███████▍ | 18/39 [00:20<00:23, 1.14s/it]
For-loop:     1.29s/it  →  Epoch 1: 46%|███████▍ | 18/39 [00:23<00:27, 1.29s/it]

1.132× faster (~13.2% speedup)

Based on the previous full training run (8 GPUs, bz=16, 153,932 samples), this reduces self-supervised training time from 11 hours to ~9.5 hours.
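The quoted numbers can be sanity-checked with a quick calculation:

```python
# Sanity check on the benchmark figures quoted above.
serial, streamed = 1.29, 1.14               # seconds per iteration
speedup = serial / streamed
print(f"{speedup:.3f}x")                    # 1.132x, i.e. ~13.2% faster

full_run_hours = 11.0                       # previous 8-GPU training run
print(f"{full_run_hours / speedup:.1f} h")  # ~9.7 h, near the ~9.5 h quoted
```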

@Kin-Zhang Kin-Zhang changed the title [WIP] Merge TeFlow into codebase Merge TeFlow into codebase Mar 31, 2026
@Kin-Zhang Kin-Zhang merged commit 1c8fc5a into KTH-RPL:main Mar 31, 2026
@Kin-Zhang Kin-Zhang self-assigned this Mar 31, 2026
@Kin-Zhang Kin-Zhang deleted the feature/teflow branch March 31, 2026 16:54

Labels

new method

Projects

None yet

Development

Successfully merging this pull request may close these issues.

when TeFlow code released

1 participant