feat(ascend): add Ascend framework layer — runtime, type mapping, build integration #46

Open · zhangyue207 wants to merge 12 commits into master from
Force-pushed from fb9f42f to 62fb25a
[Per-backend test results posted by the author, one comment per backend (details collapsed): nv, metax, iluvatar, cambricon, moore, ascend]
feat(ascend): add Ascend framework layer — runtime, type mapping, build integration

Add Ascend platform scaffolding:

- `device_.h`: `DeviceEnabled<kAscend>` specialization
- `data_type_.h`: `toAclDtype()`, `isIntegerDtype()`
- `common.h`: `buildAclTensor()` with optional transpose
- `workspace_pool_.h`: stream-keyed workspace allocator
- `runtime_.h`: `Runtime<kAscend>` (Malloc, Free, Memcpy, Memset)
- 5 new operator base classes (AddRmsNorm, FlashAttention, Matmul, ReshapeAndCache, RotaryEmbedding)

Integrate into the CMake build system, Python binding generation (stream + optional tensor support), and the examples runtime API.
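The stream-keyed workspace allocator could look roughly like the sketch below. This is a hypothetical simplification, not the repository's actual `workspace_pool_.h`: `std::malloc` stands in for `aclrtMalloc` so it runs anywhere, and the `StreamHandle`/`Entry` names are illustrative.

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>

// Hypothetical sketch of a stream-keyed, grow-only workspace pool.
// A real implementation would call aclrtMalloc/aclrtFree on the NPU;
// std::malloc stands in here so the sketch is self-contained.
using StreamHandle = void*;

class WorkspacePool {
public:
    // Return a buffer of at least `size` bytes bound to `stream`,
    // reallocating only when the cached buffer is too small.
    void* ensure(StreamHandle stream, std::size_t size) {
        Entry& e = entries_[stream];
        if (e.size < size) {
            std::free(e.ptr);
            e.ptr = std::malloc(size);  // real impl: aclrtMalloc + error check
            assert(e.ptr != nullptr);
            e.size = size;
        }
        return e.ptr;
    }

    ~WorkspacePool() {
        for (auto& [stream, e] : entries_) std::free(e.ptr);
    }

private:
    struct Entry { void* ptr = nullptr; std::size_t size = 0; };
    std::unordered_map<StreamHandle, Entry> entries_;
};
```

Keying by stream keeps concurrent streams from sharing a workspace buffer; growing monotonically means steady-state inference does no allocation at all.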
…emove missing include

- Wrap `aclrtMemcpy` (5-arg) and `aclrtMemset` (4-arg) in lambdas to match the generic 4-arg / 3-arg calling convention used by examples.
- Assert `aclrtMalloc` return value in `WorkspacePool::ensure()`.
- Remove `ascend/gemm/kernel.h` include from `runtime_api.h` (the file does not exist until the kernels commit).
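The arity-adapting lambdas described above could be sketched as follows. The ACL functions are stubbed here so the example compiles anywhere; per the commit, the extra argument in each ACL signature is a destination-capacity bound, which the adapters satisfy by reusing `count`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-ins for the ACL runtime so the sketch is self-contained.
// The real signatures (per the commit: 5-arg / 4-arg) are
//   aclrtMemcpy(dst, destMax, src, count, kind)
//   aclrtMemset(ptr, maxCount, value, count)
using aclError = int;
enum aclrtMemcpyKind { ACL_MEMCPY_HOST_TO_HOST };

static aclError aclrtMemcpy(void* dst, size_t destMax, const void* src,
                            size_t count, aclrtMemcpyKind) {
    (void)destMax;
    std::memcpy(dst, src, count);
    return 0;
}
static aclError aclrtMemset(void* ptr, size_t maxCount, int32_t value,
                            size_t count) {
    (void)maxCount;
    std::memset(ptr, value, count);
    return 0;
}

// The generic runtime expects Memcpy(dst, src, count, kind) and
// Memset(ptr, value, count); the lambdas adapt the extra destMax/maxCount
// parameter by passing `count` as the destination capacity.
auto Memcpy = [](void* dst, const void* src, size_t count, aclrtMemcpyKind kind) {
    return aclrtMemcpy(dst, count, src, count, kind);
};
auto Memset = [](void* ptr, int32_t value, size_t count) {
    return aclrtMemset(ptr, count, value, count);
};
```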
- Add Ascend GEMM specialization using `aclnnAddmm`/`aclnnBaddbmm`.
- Add `get_npu_stream()` helper and NPU device detection in test utils.
- Add `skip_unsupported_dtype` fixture for Ascend in conftest.
- Update `runtime_api.h` with Ascend backend entry.
The `aclrtMalloc` call was the sole expression inside `assert()`, so it was compiled away in release builds (NDEBUG). This left the workspace buffer null, causing `aclnnAddmm` to return ACLNN_ERR_PARAM_NULLPTR (161001) for any operation that requires workspace (e.g. alpha != 1.0).
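This bug class is easy to reproduce. In the sketch below, `fakeMalloc` is a hypothetical stand-in for `aclrtMalloc`, and `RELEASE_ASSERT` expands to `((void)0)` exactly as `<cassert>`'s `assert` does when `NDEBUG` is defined, so both the broken and the fixed pattern live in one translation unit.

```cpp
#include <cassert>
#include <cstddef>

// In a release build (NDEBUG defined), assert(expr) expands to ((void)0)
// and `expr` is never evaluated. RELEASE_ASSERT mimics that expansion so
// the bug is demonstrable here without recompiling.
#define RELEASE_ASSERT(expr) ((void)0)

static int g_alloc_calls = 0;

// Stand-in for aclrtMalloc: records the call and hands out a static buffer.
static int fakeMalloc(void** ptr, std::size_t) {
    ++g_alloc_calls;
    static char buffer[64];
    *ptr = buffer;
    return 0;  // success, like ACL_SUCCESS
}

// Buggy: the allocation is the sole expression inside the assert, so a
// release build never allocates and the workspace pointer stays null.
void* ensure_buggy() {
    void* ws = nullptr;
    RELEASE_ASSERT(fakeMalloc(&ws, 64) == 0);
    return ws;
}

// Fixed: call unconditionally, then assert only on the saved return code.
void* ensure_fixed() {
    void* ws = nullptr;
    int rc = fakeMalloc(&ws, 64);
    assert(rc == 0);
    (void)rc;  // rc is otherwise unused when NDEBUG strips the assert
    return ws;
}
```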
`CudaCausalSoftmax` was missing `#include "cuda/runtime_utils.h"`, causing `RuntimeUtils` to be undefined.

Drop `std::forward` from the `Operator::make` nested lambda: NVCC instantiates the lambda body during SFINAE invocability checks even inside `if constexpr` false branches, causing template resolution failures. All operator constructors take parameters by value, so an lvalue pass has identical semantics.
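A minimal sketch of the factory pattern involved, with hypothetical names (`make`, a toy `Matmul`): because the constructors take parameters by value, forwarding lvalues instead of using `std::forward` changes nothing semantically; `std::forward` only saves a copy for rvalue arguments.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy operator whose constructor takes its parameter by value,
// mirroring the by-value convention the commit relies on.
struct Matmul {
    std::string name;
    explicit Matmul(std::string n) : name(std::move(n)) {}
};

// Perfect-forwarding factory in the canonical form. The NVCC workaround
// replaced std::forward<Args>(args)... with plain args..., which is
// semantically identical here since the constructor copies anyway.
template <class Op, class... Args>
std::unique_ptr<Op> make(Args&&... args) {
    return std::make_unique<Op>(std::forward<Args>(args)...);
}
```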
Upgrade base image from `nvcr.io/nvidia/pytorch:24.10-py3` (CUDA 12.6) to `25.12-py3` (CUDA 13.1), aligning CI with the local dev environment. Restore `std::forward<Args>(args)...` in `Operator::make`, as the NVCC bug that required dropping it is fixed in the newer toolkit.
`Tensor::Size` (`unsigned long`) to `int64_t` narrowing is an error on MetaX's clang-based compiler (`-Wc++11-narrowing`).
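The diagnostic comes from list-initialization: narrowing an `unsigned long` into an `int64_t` inside braces is ill-formed, and clang-based compilers reject it. A small sketch (the `shapeOf` helper is hypothetical) showing the error and the explicit-cast fix:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Brace-initializing int64_t elements from unsigned long is a narrowing
// conversion (signedness changes), which clang-based compilers such as
// MetaX's reject under -Wc++11-narrowing. An explicit static_cast both
// satisfies the compiler and documents the intent.
std::vector<int64_t> shapeOf(unsigned long rows, unsigned long cols) {
    // std::vector<int64_t> dims{rows, cols};             // error: narrowing
    std::vector<int64_t> dims{static_cast<int64_t>(rows),  // OK: explicit
                              static_cast<int64_t>(cols)};
    return dims;
}
```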
- Add blank lines between struct/class members per style guide.
- Capitalize comments and use backtick syntax for code refs in `matmul.h`.
- Move `import re` to module level in `generate_wrappers.py`.
- Add blank lines before `for`/`return` per PEP 8 in `generate_wrappers.py`.
- Replace `-k npu` with `--devices ascend` in CI config.
- Fix `ruff format` violations in `generate_wrappers.py` and `test_gemm.py`.
- Fix `ruff isort` violation: move `import re` into the stdlib group.
- Add backticks around identifiers in comments (`numel()`, `operator()`, `make()`, `torch_npu`, `uint16`/`uint32`/`uint64`).
- Add missing blank line after `if` block in `skip_unsupported_dtype`.
- Remove `.worktrees/` from project `.gitignore` (belongs in the global gitignore).
Force-pushed from 80acc8b to 7628b2f