VCK5000 ES1: kernel hangs at vx_ready_wait despite XRT AFU done handshake fix (related to #262, #263, #278) #333

# VCK5000 ES1: kernel hangs at `vx_ready_wait` despite XRT AFU done handshake fix (related to #262, #263, #278)

## Summary

On a Xilinx VCK5000 ES1 (Versal), Vortex kernels upload successfully
through the XRT runtime but hang indefinitely in `vx_ready_wait` after
`vx_start`. The same RTL passes in **simx, rtlsim, and xrtsim** (the last
with `DEBUG=3`/`DBG_TRACE_AFU`), and the xrtsim trace shows the AFU FSM
transitioning correctly
(`STATE_IDLE → STATE_INIT → STATE_RUN → STATE_DONE → STATE_IDLE`). On
real hardware, execution never completes.

This appears to be the same class of issue as **#262** (U250 hang at
`wait_for_completion`), **#263** (the explicit observation that "the
applications within the tests directory are primarily designed for
simulation environments and require significant modifications to run
correctly on actual FPGA hardware"), and **#278** (DECERR-related hang
where some users report ray24777's `+0x4000000000` AXI offset workaround
fixes it but bighead-liat reports the hang persists after applying it).

I'm filing this as a separate issue because my hardware (VCK5000 ES1
Versal) is different from #262/#263/#278 (U250 UltraScale+), I have a
specific working `STARTUP_ADDR` patch for the single-bank case that I'd
like to upstream, and the depth of diagnostic data may help maintainers
narrow down the framework issue across all these reports.

## Setup

| Item | Value |
|------|-------|
| Hardware | Xilinx VCK5000 ES1 (Versal `xilinx_vck5000-es1_gen3x16_2_202020_1`) |
| OS | Ubuntu 20.04.6 LTS |
| XRT | 2.11.648 (Branch 2021.1) |
| Vivado / Vitis | 2020.2 |
| Vortex master HEAD | `31e4765` (Oct 2025 squash, contains the `87e613d2` deadlock fix and the `fce24b95` done handshake fix) |
| Branch | `fix/vck5000-support` (master + minimal VCK5000 patches) |
| Configurations tested | XLEN=64, NUM_CORES={2,4}, NUM_WARPS={4,8}, L2 {on,off}, drain logic {add,revert} |

## Symptom

Identical hang behavior across **4 independent bitstreams** built at
different points in our iteration (`test1`, `test2`, `test4`, `test7`):

```
$ make -C tests/regression/basic run-xrt
open device connection
info: device name=xilinx_vck5000-es1_gen3x16_base_2, memory_capacity=0x100000000 bytes, memory_banks=1.
number of points: 1024
buffer size: 4096 bytes
allocate device memory
allocating bank0...
reusing bank0...
dev_src=0x10000
dev_dst=0x11000
run memcopy test
write source buffer to local memory
read destination buffer from local memory
verify result
upload time: 0 ms
download time: 0 ms
Total elapsed time: 0 ms                            ← memcopy passes (host↔BO)
run kernel test
Upload kernel binary
reusing bank0...
upload kernel argument
reusing bank0...
start execution                                     ← hangs here forever
```

`xbutil examine` reports the CU as `IDLE` with `Usage=0` indefinitely.

The hang occurs **regardless of**:
- Kernel ELF entry address (tested `0x80000000` and `0x00100000`, same hang)
- Whether drain logic is present (`test4` has it, `test7` doesn't, both hang)
- Cores/warps (2c/4w through 4c/8w with L2)
- Build vintage (4 different bitstream builds spanning weeks)

## What I verified (RTL stack is functionally correct)

Same kernel binary, same host binary, three simulator backends, all PASS:

| Backend | Result | Notes |
|---------|--------|-------|
| **simx** | ✅ PASS | Functional C++ simulator |
| **rtlsim** | ✅ PASS | Verilator, simple wrapper |
| **xrtsim with `DEBUG=3 DBG_TRACE_AFU`** | ✅ **PASS** | Verilator with **full XRT AFU wrap** (`VX_afu_wrap.sv` + `VX_afu_ctrl.sv`) |

The xrtsim trace shows the AFU FSM transitioning correctly:

```
[VXDRV] DCR_WRITE: addr=0x1, value=0x80000000   ← STARTUP_ADDR0
[VXDRV] DCR_WRITE: addr=0x2, value=0x0          ← STARTUP_ADDR1
[VXDRV] DCR_WRITE: addr=0x3, value=0x0          ← STARTUP_ARG0
[VXDRV] DCR_WRITE: addr=0x4, value=0x0          ← STARTUP_ARG1
               28657: AFU: Begin initialization              ← STATE_IDLE → STATE_INIT
               28673: AFU: Initialization completed          ← vx_reset deasserted
               28683: AFU: Begin execution                   ← vx_busy → STATE_RUN
               31445: AFU: Execution completed               ← STATE_RUN → STATE_DONE
               36561: AFU: Processor idle                    ← STATE_DONE → STATE_IDLE (after ap_done_ack)
PERF: instrs=81, cycles=801, IPC=0.101124
Test PASSED
```

This implies that the RTL source (including the `87e613d2` "fixed XRT AFU
deadlock on exit" and `fce24b95` "fixed XRT AFU done handshake" fixes
contained in the squash) is **functionally correct** for the XRT control
plane. The bug therefore most likely lies in the synthesis-to-silicon
layer, or in platform behavior that none of the simulators model.
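
For anyone comparing their own xrtsim run against the trace above, here is a minimal sketch of an automated check. It only looks for the five `AFU:` messages quoted in this issue, in that order and with increasing cycle stamps. The function name `check_afu_trace` and the regular expression are my own illustration, not part of Vortex.

```python
import re

# Messages copied from the xrtsim trace above, in the order they must appear.
EXPECTED = [
    "AFU: Begin initialization",
    "AFU: Initialization completed",
    "AFU: Begin execution",
    "AFU: Execution completed",
    "AFU: Processor idle",
]

# A trace line looks like "   28657: AFU: Begin initialization".
LINE_RE = re.compile(r"^\s*(\d+):\s*(AFU: .+?)\s*$")

def check_afu_trace(text):
    """Return (ok, events). events is a list of (cycle, message) AFU lines;
    ok is True when the messages equal EXPECTED and the cycles strictly increase."""
    events = []
    for line in text.splitlines():
        # Drop any trailing annotation such as "← STATE_IDLE → STATE_INIT".
        line = line.split("←")[0]
        m = LINE_RE.match(line)
        if m:
            events.append((int(m.group(1)), m.group(2)))
    messages = [msg for _, msg in events]
    cycles = [cyc for cyc, _ in events]
    ordered = all(a < b for a, b in zip(cycles, cycles[1:]))
    return messages == EXPECTED and ordered, events
```

A run that stops after "Begin execution" would return `ok == False`, which is the pattern I would expect if the hardware hang could be reproduced in simulation.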

## What I ruled out

| Hypothesis | Verdict | Evidence |
|------------|---------|----------|
| RTL source bug (logic error) | ❌ REJECTED | 3 simulators PASS the same RTL |
| Our `STARTUP_ADDR` patch wrong | ❌ REJECTED | Verified in 3 sims; ELF inspection confirms |
| Kernel ELF address out of bank | ❌ REJECTED | Tested `0x00100000` (1 MiB) and `0x80000000` (2 GiB); both hang the same way |
| Drain logic add/remove | ❌ REJECTED | Both `test4` (drain in) and `test7` (drain out) hang identically |
| Bitstream construction race | ❌ REJECTED | 4 independent builds, identical symptom |
| `PLATFORM_MEMORY_OFFSET` mismatch | ❌ REJECTED | `xclbinutil --info` confirms DDR4 base = `0xC000000000` matches `platforms.mk` |
| `m_axi_mem_0` connectivity | ❌ REJECTED | xclbin shows `m_axi_mem_0 → MC_NOC0 (MEM_DDR4)`, range `0xFFFFFFFF` |
| Host MMIO control plane broken | ❌ UNLIKELY | `dev_caps`, `isa_caps`, `mem_capacity` all return correct values; `CTL_AP_RESET` writes work (init succeeds) |
| Setup timing | ✅ MET | WNS=+0.204 ns (test4), +0.023 ns (test7), 0 failing endpoints |
| Hold timing | ✅ MET (thin) | WHS=+0.030 ns, 0 failing endpoints — but margin is 30 ps |

## What I tried to fix on our end

1. **`STARTUP_ADDR` for single-bank platforms** (1-line fix in
   `tests/regression/common.mk`):

   ```diff
   - STARTUP_ADDR ?= 0x180000000
   + STARTUP_ADDR ?= 0x80000000
   ```

   The default `0x180000000` (6 GiB) was tuned for multi-bank Alveo cards
   (U50/U280: 8 GiB virtual window from 32 banks; U250/U200: 64 GiB from
   4 banks of 16 GiB each), but VCK5000's single-bank `ADDR_WIDTH=32`
   topology gives a 4 GiB virtual window — `0x180000000` doesn't fit, so
   `get_bank_info()` returned `index=1` for `num_banks=1` and the
   c4bcdc5/`>=` validator rejected it.

   `0x80000000` (2 GiB) fits in every supported board's virtual window
   (VCK5000/zynquplus: 4 GiB; U50/U280: 8 GiB; U250/U200: 64 GiB), so
   this is a universally safe lowering. Linker scripts already accept
   `--defsym=STARTUP_ADDR=...` from the makefile via
   `kernel/scripts/link{32,64}.ld:12`.

   **This patch is correct** (verified in 3 sims) but **does not fix the
   FPGA hang**.

2. **`ap_reset` priority over `ap_start` in `VX_afu_ctrl.sv`** —
   prioritized ap_reset bit (4) over ap_start bit (0) on AP_CTRL writes
   to prevent a race when host writes both bits in one cycle. This patch
   is in our local `c4bcdc5` and is needed but doesn't fix the hang
   either.

3. **Drain logic in `VX_afu_wrap.sv`** — added a workaround that drives
   `m_axi_mem_*ready_a = 1` and masks `m_axi_mem_*valid_vx = 0` during
   `vx_reset` to consume stale NoC responses. We later reverted this
   because it didn't help — the hang occurs both with and without it.

4. **Lowering `STARTUP_ADDR` further** to `0x00100000` (1 MiB) to test
   whether the kernel ELF address mattered. Same hang.

5. **`PLATFORM_MEMORY_ADDR_WIDTH=34`** to extend the virtual window. **Timing
   FAILED**: WNS = `-0.400 ns`, **815 failing endpoints**. The dominant
   critical path is the L2 cache `g_tag_store[6]` BRAM read → tag
   compare → `g_data_store[6]` BRAM write enable. I don't recommend this
   setting for single-cycle cache designs.
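
The window arithmetic behind item 1 can be sketched as follows. This assumes a plain contiguous layout in which the bank index is simply the address bits above the per-bank width; the real `get_bank_info()` may compute it differently (for example with interleaving). The helper names `bank_index` and `fits` are hypothetical, and the per-bank widths are derived from the window sizes quoted in item 1.

```python
def bank_index(addr, bank_addr_width):
    """Bank number under the assumed contiguous layout: the upper address bits."""
    return addr >> bank_addr_width

def fits(addr, num_banks, bank_addr_width):
    """True when addr lies inside the num_banks * 2**bank_addr_width window."""
    return bank_index(addr, bank_addr_width) < num_banks

# (num_banks, per-bank address width), derived from the figures in item 1.
BOARDS = {
    "VCK5000":   (1, 32),   # 1 bank   x 4 GiB   = 4 GiB
    "U50/U280":  (32, 28),  # 32 banks x 256 MiB = 8 GiB
    "U250/U200": (4, 34),   # 4 banks  x 16 GiB  = 64 GiB
}

for name, (banks, width) in BOARDS.items():
    old = fits(0x180000000, banks, width)  # upstream default (6 GiB)
    new = fits(0x80000000, banks, width)   # proposed value (2 GiB)
    print(f"{name}: 0x180000000 fits={old}, 0x80000000 fits={new}")
```

Under these assumptions the old default lands in bank index 1 on the single-bank VCK5000 (out of range), while `0x80000000` stays in range on all three board families.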

## Cross-references

- **#262** (yonghun, U250, Jul 2025): identical "hangs at
  `wait_for_completion`" symptom after working around bank 0 issue.
- **#263** (yonghun, follow-up): the key observation that "in contrast,
  running the same applications with `TARGET=sim` (using xrtsim or
  verilator) does not exhibit these memory or buffer-related errors." We
  see exactly this.
- **#278** (steve+ray24777+bighead-liat, U250): the workaround
  `+ 64'h4000000000` to AXI master address calculation in
  `VX_afu_wrap.sv:289-292` resolves DECERR for some users but
  bighead-liat reports the demo "still doesn't finish execution on
  hardware (it hangs waiting for the kernel to complete)" after applying
  it — same as us. Worth noting that `xclbinutil --info` on our VCK5000
  bitstream shows `Base Address: 0xc000000000` matching our
  `PLATFORM_MEMORY_OFFSET=40'hC000000000`, so we don't seem to need the
  same offset adjustment, but the post-fix hang is identical.

## Diagnostic data

### xclbin memory configuration (test4, all bitstreams identical)

```
Memory Configuration
   Type:         MEM_DDR4
   Base Address: 0xc000000000
   Address Size: 0x200000000        # 8 GiB physical DDR
   Bank Used:    Yes

Ports
   Port:          m_axi_mem_0
   Range (bytes): 0xFFFFFFFF        # 4 GiB AXI master window
   Port Type:     addressable

Argument: MEM_0
   Port:    m_axi_mem_0
   Memory:  MC_NOC0 (MEM_DDR4)
```
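
As a sanity check on the numbers in this dump, here is a small sketch of the address mapping I believe applies. It assumes the AXI master address is simply the base address plus the kernel-visible offset; I have not confirmed that this is exactly what `VX_afu_wrap.sv` does. The constant names and `to_physical` are illustrative only.

```python
BASE = 0xC000000000       # "Base Address" in the dump
DDR_SIZE = 0x200000000    # "Address Size" (8 GiB)
PORT_RANGE = 0xFFFFFFFF   # "Range (bytes)" of m_axi_mem_0

def to_physical(offset):
    """Translate a kernel-visible offset, assuming physical = BASE + offset."""
    if not 0 <= offset <= PORT_RANGE:
        raise ValueError(f"offset {offset:#x} is outside the AXI master window")
    return BASE + offset

# The whole 4 GiB window must land inside the 8 GiB DDR region.
assert PORT_RANGE + 1 <= DDR_SIZE

for label, offset in [("dev_src", 0x10000),
                      ("dev_dst", 0x11000),
                      ("STARTUP_ADDR", 0x80000000)]:
    print(f"{label}: {offset:#x} -> {to_physical(offset):#x}")
```

The old `0x180000000` default would raise here, consistent with it not fitting the 4 GiB window.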

### Routed timing summary

```
test4 (XLEN=64, ADDR_WIDTH=32, no L2):
  WNS=+0.204 ns  TNS=0  WHS=+0.030 ns  THS=0  All constraints met.

test7 (XLEN=64, ADDR_WIDTH=32, L2 enabled):
  WNS=+0.023 ns  TNS=0  WHS=+0.030 ns  THS=0  All constraints met (very thin).

test5 (XLEN=64, ADDR_WIDTH=34, L2 enabled):  ← abandoned
  WNS=-0.400 ns  TNS=-174 ns  815 failing endpoints  DO NOT USE.
```

### dmesg around xclbin load (no errors)

```
xocl 0000:0a:00.1: icap_lock_bitstream: bitstream 7c720838-... locked, ref=1
xocl 0000:0a:00.1: kds_add_context: Client pid(...) add context CU(0xffffffff) shared(true)
xocl 0000:0a:00.1: kds_del_context: Client pid(...) del context CU(0xffffffff)
xocl 0000:0a:00.1: icap_unlock_bitstream: bitstream 7c720838-... unlocked, ref=0
```

No DMA errors, no AFU/PCIe errors during the hang.

### `strace -p <basic pid>` during hang

The host process is in a tight polling loop, reading `MMIO_CTL_ADDR` via
`xrtKernelReadRegister` (presumably waiting for the `CTL_AP_DONE` bit).
No errors, no progress.
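
For debugging, a bounded version of such a polling loop makes the hang show up as a clean timeout rather than a stuck process. The sketch below is generic and is not the Vortex runtime code: `read_ctl` is a hypothetical callable standing in for the register read, and placing the done flag at bit 1 is an assumption based on the usual `ap_ctrl_hs` register layout.

```python
import time

AP_DONE_MASK = 1 << 1  # assumption: ap_done is bit 1, as in the usual ap_ctrl_hs layout

def wait_done(read_ctl, timeout_s, poll_interval_s=0.001):
    """Poll read_ctl() until the done bit is set.

    Returns True when done is observed, False once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_ctl() & AP_DONE_MASK:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)

# Usage with a fake register that reports done on the third read.
values = iter([0x0, 0x0, 0x2])
print(wait_done(lambda: next(values), timeout_s=1.0))
```

On a timeout, the caller could dump the control register and the `xbutil examine` state before giving up.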

## Working tree state

```
modified:   hw/rtl/afu/xrt/VX_afu_wrap.sv      # drain logic reverted to upstream master state
modified:   hw/syn/xilinx/xrt/vitis.ini        # commented sp= lines added (no effect)
modified:   tests/regression/common.mk         # STARTUP_ADDR patch (proposed)
```

`hw/rtl/afu/xrt/VX_afu_wrap.sv` is **identical to upstream master**
(`origin/master`'s `31e4765`). `hw/rtl/afu/xrt/VX_afu_ctrl.sv` has only
the `ap_reset` priority fix on top of upstream. So we're effectively
testing the **latest upstream Vortex master** on VCK5000 ES1, with minimal
patches.

## What I'd like to know

1. **Has anyone successfully run a Vortex regression kernel end-to-end on
   VCK5000 (any variant) on actual hardware**, or is the platform mostly
   tested via xrtsim only?

2. **Has anyone successfully run a Vortex regression kernel end-to-end on
   any Alveo board with the current master?** The ratio of "works in
   sim" to "hangs on FPGA" reports in the issue tracker is concerning.

3. **Is there a known set of patches** (not yet upstreamed to master)
   that fixes the FPGA-side hang? I noticed the commit history was
   squashed in `31e4765` (Oct 2025); side branches like `develop`,
   `bug_fixes`, `volt`, etc. have additional commits but I couldn't
   identify which ones target this hang.

4. **Would `report_timing -hold` (equivalent to `-delay_type min`) across
   all timing corners reveal anything the routed summary missed?** Our
   WHS is +0.030 ns, which feels suspiciously thin, and `xrtsim` doesn't
   model timing.

5. **Does the AFU control plane (`s_axi_ctrl`) need any specific
   `axi_clock_converter` configuration on Versal NoC?** We rely on v++
   to insert it automatically; maybe ES1 needs something explicit.

6. **The `STARTUP_ADDR` lowering patch above** — is there any reason
   *not* to lower it to `0x80000000` for all XLEN=64 builds? It works on
   every supported board's virtual window and removes the silent
   single-bank failure mode. Happy to submit a PR if welcome.

I'm willing to share the full xrtsim trace, all four bitstreams' timing
and utilization reports, and run any additional debug builds if it would
help maintainers narrow down the issue. Thanks for reading.

---

## Files I can attach if helpful

- Full xrtsim DEBUG=3 trace (~50 KB)
- All 4 bitstreams' `impl_1_hw_bb_locked_timing_summary_routed.rpt`
- All 4 bitstreams' `impl_1_hw_bb_locked_utilization_placed.rpt`
- xclbin info dumps (`xclbinutil --info`)
- Working tree diff vs `origin/master`
- Host strace during the hang
