VCK5000 ES1: kernel hangs at vx_ready_wait despite XRT AFU done handshake fix (related to #262, #263, #278)
Summary
On Xilinx VCK5000 ES1 (Versal), Vortex kernels uploaded successfully via the
XRT runtime hang indefinitely at vx_ready_wait after vx_start. The same
RTL passes in simx, rtlsim, AND xrtsim with DEBUG=3/DBG_TRACE_AFU —
all three software simulators show the AFU FSM transitioning correctly
(STATE_IDLE → STATE_INIT → STATE_RUN → STATE_DONE → STATE_IDLE), yet
on real hardware execution never completes.
This appears to be the same class of issue as #262 (U250 hang at
wait_for_completion), #263 (the explicit observation that "the
applications within the tests directory are primarily designed for
simulation environments and require significant modifications to run
correctly on actual FPGA hardware"), and #278 (DECERR-related hang
where some users report ray24777's +0x4000000000 AXI offset workaround
fixes it but bighead-liat reports the hang persists after applying it).
I'm filing this as a separate issue because my hardware (VCK5000 ES1
Versal) is different from #262/#263/#278 (U250 UltraScale+), I have a
specific working STARTUP_ADDR patch for the single-bank case that I'd
like to upstream, and the depth of diagnostic data may help maintainers
narrow down the framework issue across all these reports.
Setup
| Item | Value |
| --- | --- |
| Hardware | Xilinx VCK5000 ES1 (Versal xilinx_vck5000-es1_gen3x16_2_202020_1) |
| OS | Ubuntu 20.04.6 LTS |
| XRT | 2.11.648 (Branch 2021.1) |
| Vivado / Vitis | 2020.2 |
| Vortex master HEAD | 31e4765 (Oct 2025 squash, contains 87e613d deadlock fix and fce24b9 done handshake fix) |
| Branch | fix/vck5000-support (master + minimal VCK5000 patches) |
| Configurations tested | XLEN=64, NUM_CORES={2,4}, NUM_WARPS={4,8}, L2 {on,off}, drain logic {add,revert} |
Symptom
Identical hang behavior across 4 independent bitstreams built at
different points in our iteration (test1, test2, test4, test7):
$ make -C tests/regression/basic run-xrt
open device connection
info: device name=xilinx_vck5000-es1_gen3x16_base_2, memory_capacity=0x100000000 bytes, memory_banks=1.
number of points: 1024
buffer size: 4096 bytes
allocate device memory
allocating bank0...
reusing bank0...
dev_src=0x10000
dev_dst=0x11000
run memcopy test
write source buffer to local memory
read destination buffer from local memory
verify result
upload time: 0 ms
download time: 0 ms
Total elapsed time: 0 ms ← memcopy passes (host↔BO)
run kernel test
Upload kernel binary
reusing bank0...
upload kernel argument
reusing bank0...
start execution ← hangs here forever
xbutil examine reports the CU as IDLE with Usage=0 indefinitely.
The hang occurs regardless of:
- Kernel ELF entry address (tested 0x80000000 and 0x00100000, same hang)
- Whether drain logic is present (test4 has it, test7 doesn't, both hang)
- Cores/warps (2c/4w through 4c/8w with L2)
- Build vintage (4 different bitstream builds spanning weeks)
What I verified (RTL stack is functionally correct)
Same kernel binary, same host binary, three simulator backends, all PASS:
| Backend | Result | Notes |
| --- | --- | --- |
| simx | ✅ PASS | Functional C++ simulator |
| rtlsim | ✅ PASS | Verilator, simple wrapper |
| xrtsim with DEBUG=3 DBG_TRACE_AFU | ✅ PASS | Verilator with full XRT AFU wrap (VX_afu_wrap.sv + VX_afu_ctrl.sv) |
The xrtsim trace shows the AFU FSM transitioning correctly:
[VXDRV] DCR_WRITE: addr=0x1, value=0x80000000 ← STARTUP_ADDR0
[VXDRV] DCR_WRITE: addr=0x2, value=0x0 ← STARTUP_ADDR1
[VXDRV] DCR_WRITE: addr=0x3, value=0x0 ← STARTUP_ARG0
[VXDRV] DCR_WRITE: addr=0x4, value=0x0 ← STARTUP_ARG1
28657: AFU: Begin initialization ← STATE_IDLE → STATE_INIT
28673: AFU: Initialization completed ← vx_reset deasserted
28683: AFU: Begin execution ← vx_busy → STATE_RUN
31445: AFU: Execution completed ← STATE_RUN → STATE_DONE
36561: AFU: Processor idle ← STATE_DONE → STATE_IDLE (after ap_done_ack)
PERF: instrs=81, cycles=801, IPC=0.101124
Test PASSED
This implies the RTL source (including the squash-included 87e613d2
"fixed XRT AFU deadlock on exit" and fce24b95 "fixed XRT AFU done
handshake" fixes) is functionally correct for the XRT control plane.
The bug therefore most likely sits in the synthesis-to-silicon layer
(implementation, platform integration, or the board itself) rather than
in the RTL logic.
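The FSM sequence in the xrtsim trace above can be checked mechanically, which may help when comparing traces across builds. The following is a minimal sketch that assumes the `N: AFU: message` line format shown in the trace; the helper names `afu_events` and `first_missing` are hypothetical and not part of Vortex.

```python
import re

# Expected AFU messages, in order, as they appear in the xrtsim trace.
EXPECTED = [
    "Begin initialization",
    "Initialization completed",
    "Begin execution",
    "Execution completed",
    "Processor idle",
]

# Matches lines such as "28657: AFU: Begin initialization".
AFU_LINE = re.compile(r"^\s*(\d+):\s*AFU:\s*(.+?)\s*$")


def afu_events(trace_text):
    """Return (cycle, message) pairs for every 'N: AFU: msg' line."""
    events = []
    for line in trace_text.splitlines():
        line = line.split("←")[0]  # drop any trailing annotation
        match = AFU_LINE.match(line)
        if match:
            events.append((int(match.group(1)), match.group(2)))
    return events


def first_missing(events):
    """Return the first expected message not seen in order, or None."""
    idx = 0
    for _, msg in events:
        if idx < len(EXPECTED) and msg == EXPECTED[idx]:
            idx += 1
    return None if idx == len(EXPECTED) else EXPECTED[idx]
```

Fed a trace that stops after "Begin execution", `first_missing` returns "Execution completed", which is the transition that would be missing if a captured trace mirrored the hardware symptom.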
What I ruled out
| Hypothesis | Verdict | Evidence |
| --- | --- | --- |
| RTL source bug (logic error) | ❌ REJECTED | 3 simulators PASS the same RTL |
| Our STARTUP_ADDR patch wrong | ❌ REJECTED | Verified in 3 sims; ELF inspection confirms |
| Kernel ELF address out of bank | ❌ REJECTED | Tested 0x100000 (1 MB) and 0x80000000 (2 GB), both hang same way |
| Drain logic add/remove | ❌ REJECTED | Both test4 (drain in) and test7 (drain out) hang identically |
| Bitstream construction race | ❌ REJECTED | 4 independent builds, identical symptom |
| PLATFORM_MEMORY_OFFSET mismatch | ❌ REJECTED | xclbinutil --info confirms DDR4 base = 0xC000000000 matches platforms.mk |
| m_axi_mem_0 connectivity | ❌ REJECTED | xclbin shows m_axi_mem_0 → MC_NOC0 (MEM_DDR4), range 0xFFFFFFFF |
| Host MMIO control plane broken | ❌ UNLIKELY | dev_caps, isa_caps, mem_capacity all return correct values; CTL_AP_RESET writes work (init succeeds) |
| Setup timing | ✅ MET | WNS=+0.204 ns (test4), +0.023 ns (test7), 0 failing endpoints |
| Hold timing | ✅ MET (thin) | WHS=+0.030 ns, 0 failing endpoints, but the margin is only 30 ps |
What I tried to fix on our end
- STARTUP_ADDR for single-bank platforms (1-line fix in
  tests/regression/common.mk):

      - STARTUP_ADDR ?= 0x180000000
      + STARTUP_ADDR ?= 0x80000000

  The default 0x180000000 (6 GiB) was tuned for multi-bank Alveo cards
  (U50/U280: 8 GiB virtual window from 32 banks; U250/U200: 64 GiB from
  4 banks of 16 GiB each), but VCK5000's single-bank ADDR_WIDTH=32
  topology gives a 4 GiB virtual window. 0x180000000 doesn't fit, so
  get_bank_info() returned index=1 for num_banks=1 and the
  c4bcdc5/>= validator rejected it.
  0x80000000 (2 GiB) fits in every supported board's virtual window
  (VCK5000/zynquplus: 4 GiB; U50/U280: 8 GiB; U250/U200: 64 GiB), so
  this is a universally safe lowering. Linker scripts already accept
  --defsym=STARTUP_ADDR=... from the makefile via
  kernel/scripts/link{32,64}.ld:12.
  This patch is correct (verified in 3 sims) but does not fix the
  FPGA hang.
- ap_reset priority over ap_start in VX_afu_ctrl.sv: prioritized the
  ap_reset bit (4) over the ap_start bit (0) on AP_CTRL writes to
  prevent a race when the host writes both bits in one cycle. This
  patch is in our local c4bcdc5 and is needed but doesn't fix the hang
  either.
- Drain logic in VX_afu_wrap.sv: added a workaround that drives
  m_axi_mem_*ready_a = 1 and masks m_axi_mem_*valid_vx = 0 during
  vx_reset to consume stale NoC responses. We later reverted this
  because it didn't help; the hang occurs both with and without it.
- Lowering STARTUP_ADDR further to 0x00100000 (1 MB) to test
  whether the kernel ELF address mattered. Same hang.
- PLATFORM_MEMORY_ADDR_WIDTH=34 to extend the virtual window. Timing
  FAILED: WNS = -0.400 ns, 815 failing endpoints, dominant
  critical path is the L2 cache g_tag_store[6] BRAM read → tag
  compare → g_data_store[6] BRAM write enable. Don't recommend this
  path for single-cycle cache designs.
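The bank arithmetic behind the STARTUP_ADDR item above can be sketched as follows. This assumes the bank index is simply the address divided by the per-bank size; that is a simplified model for illustration, not the actual get_bank_info() code, and `bank_info` and `fits` are hypothetical names. Bank counts and window sizes are the ones quoted above.

```python
GIB = 1 << 30


def bank_info(addr, bank_size):
    """Model: split a device address into (bank index, offset in bank)."""
    return addr // bank_size, addr % bank_size


def fits(addr, num_banks, bank_size):
    """True if the address lands in an existing bank (index < num_banks)."""
    index, _ = bank_info(addr, bank_size)
    return index < num_banks


# (num_banks, bank_size) per board, derived from the window sizes above.
BOARDS = {
    "VCK5000": (1, 4 * GIB),           # single bank, 4 GiB window
    "U50/U280": (32, 8 * GIB // 32),   # 8 GiB window from 32 banks
    "U250/U200": (4, 16 * GIB),        # 4 banks of 16 GiB each
}

for name, (banks, size) in BOARDS.items():
    old = fits(0x180000000, banks, size)
    new = fits(0x80000000, banks, size)
    print(f"{name}: 0x180000000 fits={old}, 0x80000000 fits={new}")
```

Under this model only the VCK5000 entry rejects 0x180000000 (index 1 with one bank), while 0x80000000 is accepted on all three.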
Diagnostic data
xclbin memory configuration (test4, all bitstreams identical)
Memory Configuration
Type: MEM_DDR4
Base Address: 0xc000000000
Address Size: 0x200000000 # 8 GiB physical DDR
Bank Used: Yes
Ports
Port: m_axi_mem_0
Range (bytes): 0xFFFFFFFF # 4 GiB AXI master window
Port Type: addressable
Argument: MEM_0
Port: m_axi_mem_0
Memory: MC_NOC0 (MEM_DDR4)
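As a cross-check of the numbers in this dump, here is a small sketch of the address arithmetic. It assumes the AXI address is the Vortex-local address plus the platform base (PLATFORM_MEMORY_OFFSET), which is my reading of the wrapper and not confirmed here; the function names are hypothetical.

```python
DDR_BASE = 0xC000000000   # xclbin "Base Address"
DDR_SIZE = 0x200000000    # xclbin "Address Size" (8 GiB)
PORT_RANGE = 0xFFFFFFFF   # xclbin m_axi_mem_0 "Range (bytes)"


def axi_address(local_addr, offset=DDR_BASE):
    """Assumed mapping: platform base plus the Vortex-local address."""
    return offset + local_addr


def in_window(local_addr, length):
    """True if [local_addr, local_addr + length) fits the port range and DDR."""
    last = local_addr + length - 1
    return local_addr >= 0 and last <= PORT_RANGE and last < DDR_SIZE


print(hex(axi_address(0x80000000)))       # 0xc080000000
print(in_window(0x80000000, 0x1000))      # True
print(in_window(0x180000000, 0x1000))     # False
```

Under that assumption a kernel linked at 0x80000000 stays inside both the 4 GiB port range and the 8 GiB DDR region, so the address map alone does not explain the hang.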
Routed timing summary
test4 (XLEN=64, ADDR_WIDTH=32, no L2):
WNS=+0.204 ns TNS=0 WHS=+0.030 ns THS=0 All constraints met.
test7 (XLEN=64, ADDR_WIDTH=32, L2 enabled):
WNS=+0.023 ns TNS=0 WHS=+0.030 ns THS=0 All constraints met (very thin).
test5 (XLEN=64, ADDR_WIDTH=34, L2 enabled): ← abandoned
WNS=-0.400 ns TNS=-174 ns 815 failing endpoints DO NOT USE.
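To compare builds quickly, the summary lines above can be classified with a few lines of Python. This sketch parses only the condensed `WNS=... WHS=...` format used in this issue, not the Vivado report files, and the 0.05 ns "thin" threshold is an arbitrary assumption of mine rather than a vendor guideline.

```python
import re

# Matches "WNS=+0.204 ns" or "WHS=+0.030 ns"; TNS/THS are ignored.
FIELD = re.compile(r"\b(WNS|WHS)\s*=\s*([+-]?\d+(?:\.\d+)?)\s*ns")


def slack(line):
    """Extract worst setup/hold slack (ns) from one summary line."""
    return {name: float(value) for name, value in FIELD.findall(line)}


def classify(line, thin_ns=0.05):
    """Return FAIL, THIN, OK, or UNKNOWN for one summary line."""
    values = slack(line)
    if not values:
        return "UNKNOWN"
    if any(v < 0 for v in values.values()):
        return "FAIL"
    if any(v < thin_ns for v in values.values()):
        return "THIN"
    return "OK"


for label, line in [
    ("test4", "WNS=+0.204 ns TNS=0 WHS=+0.030 ns THS=0"),
    ("test7", "WNS=+0.023 ns TNS=0 WHS=+0.030 ns THS=0"),
    ("test5", "WNS=-0.400 ns TNS=-174 ns 815 failing endpoints"),
]:
    print(label, classify(line))
```

With that threshold, test4 and test7 both come out as THIN because of the shared 30 ps hold slack, and test5 as FAIL.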
dmesg around xclbin load (no errors)
xocl 0000:0a:00.1: icap_lock_bitstream: bitstream 7c720838-... locked, ref=1
xocl 0000:0a:00.1: kds_add_context: Client pid(...) add context CU(0xffffffff) shared(true)
xocl 0000:0a:00.1: kds_del_context: Client pid(...) del context CU(0xffffffff)
xocl 0000:0a:00.1: icap_unlock_bitstream: bitstream 7c720838-... unlocked, ref=0
No DMA errors, no AFU/PCIe errors during the hang.
strace -p <basic pid> during hang
The host process is in a tight polling loop reading MMIO_CTL_ADDR
(presumably looking for CTL_AP_DONE bit) via xrtKernelReadRegister.
No errors, no progress.
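For debug builds, a bounded poll that reports the last control-register value is easier to work with than an endless spin. The sketch below is illustrative only: `read_ctrl` is a hypothetical callable standing in for whatever reads the control register, and the bit positions follow the standard Xilinx ap_ctrl_hs layout, which I am assuming applies to VX_afu_ctrl.sv.

```python
import time

# Assumed ap_ctrl_hs bit layout (not verified against VX_afu_ctrl.sv).
AP_START = 1 << 0
AP_DONE = 1 << 1
AP_IDLE = 1 << 2


def wait_for_done(read_ctrl, timeout_s=5.0, poll_interval_s=0.001):
    """Poll until AP_DONE is set or the timeout expires.

    Returns (done, last_value) so the caller can log the final
    register contents when the kernel never completes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        value = read_ctrl()
        if value & AP_DONE:
            return True, value
        if time.monotonic() >= deadline:
            return False, value
        time.sleep(poll_interval_s)
```

A caller would log `hex(last_value)` on timeout; a value with only the idle bit set would match the xbutil report of an IDLE CU.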
Working tree state
modified: hw/rtl/afu/xrt/VX_afu_wrap.sv # drain logic reverted to upstream master state
modified: hw/syn/xilinx/xrt/vitis.ini # commented sp= lines added (no effect)
modified: tests/regression/common.mk # STARTUP_ADDR patch (proposed)
hw/rtl/afu/xrt/VX_afu_wrap.sv is identical to upstream master
(origin/master's 31e4765). hw/rtl/afu/xrt/VX_afu_ctrl.sv has only
the ap_reset priority fix on top of upstream. So we're effectively
testing the latest upstream Vortex master on VCK5000 ES1, with minimal
patches.
What I'd like to know
- Has anyone successfully run a Vortex regression kernel end-to-end on
  VCK5000 (any variant) on actual hardware, or is the platform mostly
  tested via xrtsim only?
- Has anyone successfully run a Vortex regression kernel end-to-end on
  any Alveo board with the current master? The ratio of "works in
  sim" to "hangs on FPGA" reports in the issue tracker is concerning.
- Is there a known set of patches (not yet upstreamed to master)
  that fixes the FPGA-side hang? I noticed the commit history was
  squashed in 31e4765 (Oct 2025); side branches like develop,
  bug_fixes, volt, etc. have additional commits but I couldn't
  identify which ones target this hang.
- Would a per-corner hold report (report_timing -hold, i.e.
  -delay_type min) reveal anything the routed timing summary missed?
  Our WHS is +0.030 ns, which feels suspiciously thin, and xrtsim
  doesn't model timing.
- Does the AFU control plane (s_axi_ctrl) need any specific
  axi_clock_converter configuration on Versal NoC? We rely on v++
  to insert it automatically; maybe ES1 needs something explicit.
- The STARTUP_ADDR lowering patch above: is there any reason
  not to lower it to 0x80000000 for all XLEN=64 builds? It fits in
  every supported board's virtual window and removes the silent
  single-bank failure mode. Happy to submit a PR if welcome.
I'm willing to share the full xrtsim trace, all four bitstreams' timing
and utilization reports, and run any additional debug builds if it would
help maintainers narrow down the issue. Thanks for reading.
Files I can attach if helpful
- Full xrtsim DEBUG=3 trace (~50 KB)
- All 4 bitstreams' impl_1_hw_bb_locked_timing_summary_routed.rpt
- All 4 bitstreams' impl_1_hw_bb_locked_utilization_placed.rpt
- xclbin info dumps (xclbinutil --info)
- Working tree diff vs origin/master
- Host strace during the hang
Cross-references
- #262: ... wait_for_completion symptom after working around the
  bank 0 issue.
- #263: "... running the same applications with TARGET=sim (using
  xrtsim or verilator) does not exhibit these memory or buffer-related
  errors." We see exactly this.
- #278: ray24777's workaround of adding + 64'h4000000000 to the AXI
  master address calculation in VX_afu_wrap.sv:289-292 resolves DECERR
  for some users, but bighead-liat reports the demo "still doesn't
  finish execution on hardware (it hangs waiting for the kernel to
  complete)" after applying it, same as us. Worth noting that
  xclbinutil --info on our VCK5000 bitstream shows
  Base Address: 0xc000000000 matching our
  PLATFORM_MEMORY_OFFSET=40'hC000000000, so we don't seem to need the
  same offset adjustment, but the post-fix hang is identical.