Merged
…ether.
* Reset fire class usage directly.
* Add easy screenshot saving with multi-view.
* Sync the viewpoint across different windows.
* Visualize the lidar center tf if `slc` is set to True.
* As we found, the AV2-provided eval_mask sometimes includes ground points in the first few frames, and some ground points have ground-truth flows because of bounding-box labeling, etc.
* The method trend should still be safe: every method here sets ground-point flow to pose_flow, so all methods share the same error if ground points are included. The dataset changes may be reverted later, since they only affect AV2.
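The ground-point convention described above can be sketched as follows; `apply_pose_flow_to_ground` and its argument names are hypothetical helpers for illustration, not the repository's actual API:

```python
import torch

def apply_pose_flow_to_ground(flow: torch.Tensor,
                              pose_flow: torch.Tensor,
                              ground_mask: torch.Tensor) -> torch.Tensor:
    """Replace predicted flow on ground points with the ego-motion
    (pose) flow. Since every method applies this, including ground
    points in the eval mask adds the same error term to all of them."""
    out = flow.clone()
    out[ground_mask] = pose_flow[ground_mask]
    return out
```
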
* Add teflowLoss into the codebase.
* Update chamfer3D with CUDA stream-style batched computation.

AI summary:
- Added automatic collection of self-supervised loss function names in `src/lossfuncs/__init__.py`.
- Improved documentation and structure of self-supervised loss functions in `src/lossfuncs/selfsupervise.py`.
- Refactored loss calculation logic in `src/trainer.py` to support new self-supervised loss functions.
- Introduced the `ssl_loss_calculator` method for handling self-supervised losses.
- Updated the training step to differentiate between self-supervised and supervised loss calculations.
- Enhanced error handling during training and validation steps to skip problematic batches.
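The "automatic collection" of self-supervised loss names could look like the sketch below; the `*Loss` naming convention and the `collect_ssl_losses` helper are assumptions for illustration, not the actual code in `src/lossfuncs/__init__.py`:

```python
import types

def collect_ssl_losses(module: types.ModuleType) -> dict:
    """Collect every callable in `module` whose name ends in 'Loss',
    so a new self-supervised loss registers itself simply by being
    defined (hypothetical convention)."""
    return {name: obj for name, obj in vars(module).items()
            if callable(obj) and name.endswith("Loss")}

# Usage sketch with a throwaway module standing in for src.lossfuncs
mod = types.ModuleType("fake_lossfuncs")
mod.chamferLoss = lambda est, gt: 0.0
mod.teflowLoss = lambda est, gt: 0.0
mod.helper_fn = lambda: None  # not collected: no 'Loss' suffix
```

A registry built this way lets the trainer look up a loss by the string name given in a config, instead of hard-coding each loss class.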
Update the slurm scripts and commands for TeFlow.
Kin-Zhang
commented
Mar 12, 2026
assets/cuda/chamfer3D/__init__.py (Outdated)
Comment on lines +96 to +128
```python
def batched(self,
            pc0_list: List[torch.Tensor],
            pc1_list: List[torch.Tensor],
            truncate_dist: float = -1) -> torch.Tensor:
    """Parallel Chamfer loss via B CUDA streams.

    Returns mean-over-samples: (1/B) * Σ_i [mean(dist0_i) + mean(dist1_i)].
    ~1.14× faster than serial loop on RTX 3090 @ 88K pts/sample;
    more importantly, keeps GPU busy with one sustained work block per frame.
    """
    B = len(pc0_list)
    if B == 1:
        return self.forward(pc0_list[0], pc1_list[0], truncate_dist)

    streams = self._ensure_streams(B)
    main = torch.cuda.current_stream()
    per_loss: List[torch.Tensor] = [None] * B  # type: ignore[list-item]

    for i in range(B):
        streams[i].wait_stream(main)
        with torch.cuda.stream(streams[i]):
            d0, d1, _, _ = ChamferDis.apply(pc0_list[i].contiguous(),
                                            pc1_list[i].contiguous())
            if truncate_dist <= 0:
                per_loss[i] = d0.mean() + d1.mean()
            else:
                v0, v1 = d0 <= truncate_dist, d1 <= truncate_dist
                per_loss[i] = torch.nanmean(d0[v0]) + torch.nanmean(d1[v1])

    for i in range(B):
        main.wait_stream(streams[i])

    return torch.stack(per_loss).mean()
```
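For readers without the CUDA extension, the same (truncated) Chamfer loss can be sketched as a small CPU reference. Note that `chamfer_cpu` is not part of the repository, and the use of squared nearest-neighbor distances is an assumption about what the kernel returns:

```python
import torch

def chamfer_cpu(pc0: torch.Tensor, pc1: torch.Tensor,
                truncate_dist: float = -1.0) -> torch.Tensor:
    """O(N*M)-memory reference Chamfer loss for small clouds / testing.
    Assumes squared nearest-neighbor distances, mirroring the
    d0.mean() + d1.mean() reduction used in the CUDA path."""
    d = torch.cdist(pc0, pc1) ** 2       # squared pairwise distances
    d0 = d.min(dim=1).values             # each pc0 point -> nearest in pc1
    d1 = d.min(dim=0).values             # each pc1 point -> nearest in pc0
    if truncate_dist > 0:                # drop far-away (outlier) matches
        d0, d1 = d0[d0 <= truncate_dist], d1[d1 <= truncate_dist]
    return d0.mean() + d1.mean()
```
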
Member
Author
Speed Performance: CUDA Streams vs For-loop
Quick demo benchmark (1 GPU, bz=8, 312 samples):
Stream CUDA: 1.14 s/it → Epoch 1: 46%|███████▍ | 18/39 [00:20<00:23, 1.14s/it]
For-loop: 1.29 s/it → Epoch 1: 46%|███████▍ | 18/39 [00:23<00:27, 1.29s/it]
1.132× faster (~13.2% speedup)
Based on the previous full training run (8 GPUs, bz=16, 153,932 samples), this reduces self-supervised training time from ~11 hours to ~9.5 hours.
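A note on reproducing numbers like those above: CUDA kernel launches are asynchronous, so wall-clock timing should synchronize the device around the timed region. A minimal, hypothetical helper (not part of this PR):

```python
import time
import torch

def bench(fn, *args, iters: int = 10) -> float:
    """Average seconds per call. Synchronizes before and after the timed
    loop so queued async CUDA work is actually counted; falls back to
    plain wall-clock timing on CPU-only machines."""
    fn(*args)                                  # warm-up (kernels, caches)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters
```
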
* update training script.
Kin-Zhang
commented
Mar 31, 2026
and update version to 1.0.6
AI Summary:

TeFlow merged:
- `src/lossfuncs/selfsupervise.py`: enables TeFlow-style training signals (i.e., losses that rely on cross-frame consistency rather than only supervised labels).
- `src/lossfuncs/__init__.py`: registers the new losses so they can be selected/configured cleanly from training.
- `train.py`, `src/trainer.py`: updated to run TeFlow experiments end-to-end (config → dataset → forward → TeFlow losses → logging/metrics).

CUDA / plugin / infrastructure updates:
- `assets/cuda/chamfer3D/__init__.py`: now supports batched Chamfer distance computation using CUDA streams, significantly improving multi-sample throughput and GPU utilization. Added new utility methods, improved docstrings, and streamlined the interface for both single and batched inputs. Enhanced the test section for batched processing and index validation.
- `assets/cuda/chamfer3D/setup.py`: version bumped from 1.0.5 to 1.0.6, reflecting the new features and optimizations.

Documentation:
- `README.md`
- `assets/README.md`

TeFlow got accepted by CVPR 2026. 🎉🎉🎉 Now I'm working on releasing the code.
Please check the progress in the forked teflow branch.
Once it's ready, I will merge it into the codebase with the updated README.