VCK5000 ES1: kernel hangs at vx_ready_wait despite XRT AFU done handshake fix (related to #262, #263, #278) #333

Summary

On Xilinx VCK5000 ES1 (Versal), Vortex kernels upload successfully via the
XRT runtime but hang indefinitely at vx_ready_wait after vx_start. The same
RTL passes in simx, rtlsim, and xrtsim. With DEBUG=3/DBG_TRACE_AFU, the
xrtsim trace shows the AFU FSM transitioning correctly
(STATE_IDLE → STATE_INIT → STATE_RUN → STATE_DONE → STATE_IDLE), yet
on real hardware execution never completes.

This appears to be the same class of issue as #262 (U250 hang at
wait_for_completion), #263 (the explicit observation that "the
applications within the tests directory are primarily designed for
simulation environments and require significant modifications to run
correctly on actual FPGA hardware"), and #278 (DECERR-related hang
where some users report ray24777's +0x4000000000 AXI offset workaround
fixes it but bighead-liat reports the hang persists after applying it).

I'm filing this as a separate issue because my hardware (VCK5000 ES1
Versal) is different from #262/#263/#278 (U250 UltraScale+), I have a
specific working STARTUP_ADDR patch for the single-bank case that I'd
like to upstream, and the depth of diagnostic data may help maintainers
narrow down the framework issue across all these reports.

Setup

| Item | Value |
| --- | --- |
| Hardware | Xilinx VCK5000 ES1 (Versal xilinx_vck5000-es1_gen3x16_2_202020_1) |
| OS | Ubuntu 20.04.6 LTS |
| XRT | 2.11.648 (Branch 2021.1) |
| Vivado / Vitis | 2020.2 |
| Vortex | master HEAD 31e4765 (Oct 2025 squash, contains 87e613d deadlock fix and fce24b9 done handshake fix) |
| Branch | fix/vck5000-support (master + minimal VCK5000 patches) |
| Configurations tested | XLEN=64, NUM_CORES={2,4}, NUM_WARPS={4,8}, L2 {on,off}, drain logic {add,revert} |

Symptom

Identical hang behavior across 4 independent bitstreams built at
different points in our iteration (test1, test2, test4, test7):

$ make -C tests/regression/basic run-xrt
open device connection
info: device name=xilinx_vck5000-es1_gen3x16_base_2, memory_capacity=0x100000000 bytes, memory_banks=1.
number of points: 1024
buffer size: 4096 bytes
allocate device memory
allocating bank0...
reusing bank0...
dev_src=0x10000
dev_dst=0x11000
run memcopy test
write source buffer to local memory
read destination buffer from local memory
verify result
upload time: 0 ms
download time: 0 ms
Total elapsed time: 0 ms                            ← memcopy passes (host↔BO)
run kernel test
Upload kernel binary
reusing bank0...
upload kernel argument
reusing bank0...
start execution                                     ← hangs here forever

xbutil examine reports the CU as IDLE with Usage=0 indefinitely.

The hang occurs regardless of:

  • Kernel ELF entry address (tested 0x80000000 and 0x00100000, same hang)
  • Whether drain logic is present (test4 has it, test7 doesn't, both hang)
  • Cores/warps (2c/4w through 4c/8w with L2)
  • Build vintage (4 different bitstream builds spanning weeks)

What I verified (RTL stack is functionally correct)

Same kernel binary, same host binary, three simulator backends, all PASS:

| Backend | Result | Notes |
| --- | --- | --- |
| simx | ✅ PASS | Functional C++ simulator |
| rtlsim | ✅ PASS | Verilator, simple wrapper |
| xrtsim with DEBUG=3 DBG_TRACE_AFU | ✅ PASS | Verilator with full XRT AFU wrap (VX_afu_wrap.sv + VX_afu_ctrl.sv) |

The xrtsim trace shows the AFU FSM transitioning correctly:

[VXDRV] DCR_WRITE: addr=0x1, value=0x80000000   ← STARTUP_ADDR0
[VXDRV] DCR_WRITE: addr=0x2, value=0x0          ← STARTUP_ADDR1
[VXDRV] DCR_WRITE: addr=0x3, value=0x0          ← STARTUP_ARG0
[VXDRV] DCR_WRITE: addr=0x4, value=0x0          ← STARTUP_ARG1
               28657: AFU: Begin initialization              ← STATE_IDLE → STATE_INIT
               28673: AFU: Initialization completed          ← vx_reset deasserted
               28683: AFU: Begin execution                   ← vx_busy → STATE_RUN
               31445: AFU: Execution completed               ← STATE_RUN → STATE_DONE
               36561: AFU: Processor idle                    ← STATE_DONE → STATE_IDLE (after ap_done_ack)
PERF: instrs=81, cycles=801, IPC=0.101124
Test PASSED

This indicates that the RTL source (including the squash-included 87e613d2
"fixed XRT AFU deadlock on exit" and fce24b95 "fixed XRT AFU done
handshake" fixes) is functionally correct for the XRT control plane in
simulation. The bug therefore most likely sits in the synthesis-to-silicon
layer rather than in the RTL logic.
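
To compare a captured trace against the expected FSM sequence mechanically, a minimal sketch follows. It assumes the trace format shown above (`<cycle>: AFU: <message>`); the function names are hypothetical and not part of Vortex.

```python
import re

# Expected AFU messages, in order, as they appear in the xrtsim trace above.
EXPECTED = [
    "Begin initialization",
    "Initialization completed",
    "Begin execution",
    "Execution completed",
    "Processor idle",
]

AFU_LINE = re.compile(r"^\s*(\d+):\s*AFU:\s*(.+?)\s*$")

def afu_events(trace_text):
    """Return (cycle, message) pairs for every 'AFU:' line in the trace."""
    events = []
    for line in trace_text.splitlines():
        # Drop trailing annotations such as '← STATE_RUN → STATE_DONE'.
        line = line.split("←")[0]
        m = AFU_LINE.match(line)
        if m:
            events.append((int(m.group(1)), m.group(2)))
    return events

def first_missing(events):
    """Return the first expected message not seen in order, or None if all were seen."""
    pos = 0
    for _, msg in events:
        if pos < len(EXPECTED) and msg == EXPECTED[pos]:
            pos += 1
    return EXPECTED[pos] if pos < len(EXPECTED) else None
```

If a hardware-side trace (for example from an ILA capture converted to the same text format) were available, `first_missing` would name the transition at which silicon diverges from simulation.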

What I ruled out

| Hypothesis | Verdict | Evidence |
| --- | --- | --- |
| RTL source bug (logic error) | ❌ REJECTED | 3 simulators PASS the same RTL |
| Our STARTUP_ADDR patch wrong | ❌ REJECTED | Verified in 3 sims; ELF inspection confirms |
| Kernel ELF address out of bank | ❌ REJECTED | Tested 0x100000 (1 MB) and 0x80000000 (2 GB), both hang same way |
| Drain logic add/remove | ❌ REJECTED | Both test4 (drain in) and test7 (drain out) hang identically |
| Bitstream construction race | ❌ REJECTED | 4 independent builds, identical symptom |
| PLATFORM_MEMORY_OFFSET mismatch | ❌ REJECTED | xclbinutil --info confirms DDR4 base = 0xC000000000 matches platforms.mk |
| m_axi_mem_0 connectivity | ❌ REJECTED | xclbin shows m_axi_mem_0 → MC_NOC0 (MEM_DDR4), range 0xFFFFFFFF |
| Host MMIO control plane broken | ❌ UNLIKELY | dev_caps, isa_caps, mem_capacity all return correct values; CTL_AP_RESET writes work (init succeeds) |
| Setup timing | ✅ MET | WNS=+0.204 ns (test4), +0.023 ns (test7), 0 failing endpoints |
| Hold timing | ✅ MET (thin) | WHS=+0.030 ns, 0 failing endpoints, but the margin is only 30 ps |

What I tried to fix on our end

  1. STARTUP_ADDR for single-bank platforms (1-line fix in
    tests/regression/common.mk):

    - STARTUP_ADDR ?= 0x180000000
    + STARTUP_ADDR ?= 0x80000000

    The default 0x180000000 (6 GiB) was tuned for multi-bank Alveo cards
    (U50/U280: 8 GiB virtual window from 32 banks; U250/U200: 64 GiB from
    4 banks of 16 GiB each), but VCK5000's single-bank ADDR_WIDTH=32
    topology gives a 4 GiB virtual window — 0x180000000 doesn't fit, so
    get_bank_info() returned index=1 for num_banks=1 and the
    c4bcdc5/>= validator rejected it.

    0x80000000 (2 GiB) fits in every supported board's virtual window
    (VCK5000/zynquplus: 4 GiB; U50/U280: 8 GiB; U250/U200: 64 GiB), so
    this is a universally safe lowering. Linker scripts already accept
    --defsym=STARTUP_ADDR=... from the makefile via
    kernel/scripts/link{32,64}.ld:12.

    This patch is correct (verified in 3 sims) but does not fix the
    FPGA hang.

  2. ap_reset priority over ap_start in VX_afu_ctrl.sv: prioritized the
    ap_reset bit (4) over the ap_start bit (0) on AP_CTRL writes to
    prevent a race when the host writes both bits in one cycle. This patch
    is in our local c4bcdc5; it is needed but does not fix the hang
    either.

  3. Drain logic in VX_afu_wrap.sv — added a workaround that drives
    m_axi_mem_*ready_a = 1 and masks m_axi_mem_*valid_vx = 0 during
    vx_reset to consume stale NoC responses. We later reverted this
    because it didn't help — the hang occurs both with and without it.

  4. Lowering STARTUP_ADDR further to 0x00100000 (1 MB) to test
    whether the kernel ELF address mattered. Same hang.

  5. PLATFORM_MEMORY_ADDR_WIDTH=34 to extend the virtual window. Timing
    FAILED: WNS = -0.400 ns, 815 failing endpoints. The dominant
    critical path is the L2 cache g_tag_store[6] BRAM read → tag
    compare → g_data_store[6] BRAM write enable. I don't recommend this
    path for single-cycle cache designs.
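
The STARTUP_ADDR arithmetic from item 1 can be sketched as follows. This is a simplified model, not the actual get_bank_info() implementation: it assumes equal-size, contiguous (non-interleaved) banks that together span the virtual window, and the function names and the BOARDS table are hypothetical, filled in from the window sizes quoted in item 1.

```python
def bank_index(addr, num_banks, window_bits):
    """Index of the bank that would hold addr under the contiguous-bank assumption."""
    bank_size = (1 << window_bits) // num_banks
    return addr // bank_size

def fits(addr, num_banks, window_bits):
    """True if addr maps to a bank that exists (index < num_banks)."""
    return bank_index(addr, num_banks, window_bits) < num_banks

# (num_banks, window_bits) per board, from the figures quoted in item 1.
BOARDS = {
    "VCK5000": (1, 32),     # single bank, 4 GiB window
    "U50/U280": (32, 33),   # 32 banks, 8 GiB window
    "U250/U200": (4, 36),   # 4 banks of 16 GiB, 64 GiB window
}

for name, (banks, bits) in BOARDS.items():
    old_ok = fits(0x180000000, banks, bits)
    new_ok = fits(0x80000000, banks, bits)
    print(f"{name}: 0x180000000 fits={old_ok}, 0x80000000 fits={new_ok}")
```

Under this model the old default lands in bank index 1 on a one-bank platform, matching the rejection described in item 1, while 0x80000000 stays in range on all three board classes.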

Cross-references

Diagnostic data

xclbin memory configuration (test4, all bitstreams identical)

Memory Configuration
   Type:         MEM_DDR4
   Base Address: 0xc000000000
   Address Size: 0x200000000        # 8 GiB physical DDR
   Bank Used:    Yes

Ports
   Port:          m_axi_mem_0
   Range (bytes): 0xFFFFFFFF        # 4 GiB AXI master window
   Port Type:     addressable

Argument: MEM_0
   Port:    m_axi_mem_0
   Memory:  MC_NOC0 (MEM_DDR4)
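
As a sanity check on these numbers, here is a small sketch of the device-local to physical address mapping. It assumes the AFU simply adds the platform memory offset (the DDR base) to the device-local address; that is my reading of PLATFORM_MEMORY_OFFSET, not something confirmed from the RTL, and `to_physical` is a hypothetical helper.

```python
DDR_BASE = 0xC000000000   # Base Address from the xclbin dump
DDR_SIZE = 0x200000000    # Address Size from the xclbin dump (8 GiB)
AXI_WINDOW = 1 << 32      # 4 GiB m_axi_mem_0 window (ADDR_WIDTH=32)

def to_physical(local_addr):
    """Map a device-local address to a physical DDR address, or raise ValueError."""
    if not 0 <= local_addr < AXI_WINDOW:
        raise ValueError(f"{local_addr:#x} is outside the 4 GiB AXI window")
    phys = DDR_BASE + local_addr
    # The window is smaller than the DDR size, so this holds for any valid input.
    assert phys < DDR_BASE + DDR_SIZE
    return phys

print(hex(to_physical(0x10000)))      # dev_src from the run log
print(hex(to_physical(0x80000000)))   # lowered STARTUP_ADDR
```

Both the memcopy buffers and the lowered STARTUP_ADDR fall inside the DDR range under this assumption, which is consistent with memcopy passing and with the PLATFORM_MEMORY_OFFSET hypothesis being rejected.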

Routed timing summary

test4 (XLEN=64, ADDR_WIDTH=32, no L2):
  WNS=+0.204 ns  TNS=0  WHS=+0.030 ns  THS=0  All constraints met.

test7 (XLEN=64, ADDR_WIDTH=32, L2 enabled):
  WNS=+0.023 ns  TNS=0  WHS=+0.030 ns  THS=0  All constraints met (very thin).

test5 (XLEN=64, ADDR_WIDTH=34, L2 enabled):  ← abandoned
  WNS=-0.400 ns  TNS=-174 ns  815 failing endpoints  DO NOT USE.
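
For comparing many builds, a minimal sketch that pulls the slack figures out of summary lines like the ones above and flags thin margins. The 50 ps "thin" threshold is an arbitrary choice of mine for illustration, not a Vivado or Vortex recommendation.

```python
import re

SLACK = re.compile(r"\b(WNS|TNS|WHS|THS)=([+-]?\d+(?:\.\d+)?)")

def parse_slack(line):
    """Extract NAME=value pairs (values in ns) from a one-line timing summary."""
    return {name: float(value) for name, value in SLACK.findall(line)}

def classify(slack_ns, thin_ns=0.050):
    """Label a worst-slack value; thin_ns is a hypothetical margin threshold."""
    if slack_ns < 0:
        return "FAILED"
    if slack_ns < thin_ns:
        return "MET (thin)"
    return "MET"

builds = {
    "test4": "WNS=+0.204 ns  TNS=0  WHS=+0.030 ns  THS=0",
    "test7": "WNS=+0.023 ns  TNS=0  WHS=+0.030 ns  THS=0",
    "test5": "WNS=-0.400 ns  TNS=-174 ns",
}
for name, line in builds.items():
    s = parse_slack(line)
    labels = {k: classify(v) for k, v in s.items() if k in ("WNS", "WHS")}
    print(name, labels)
```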

dmesg around xclbin load (no errors)

xocl 0000:0a:00.1: icap_lock_bitstream: bitstream 7c720838-... locked, ref=1
xocl 0000:0a:00.1: kds_add_context: Client pid(...) add context CU(0xffffffff) shared(true)
xocl 0000:0a:00.1: kds_del_context: Client pid(...) del context CU(0xffffffff)
xocl 0000:0a:00.1: icap_unlock_bitstream: bitstream 7c720838-... unlocked, ref=0

No DMA errors, no AFU/PCIe errors during the hang.

strace -p <basic pid> during hang

The host process is in a tight polling loop reading MMIO_CTL_ADDR
(presumably looking for CTL_AP_DONE bit) via xrtKernelReadRegister.
No errors, no progress.
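
To make the hang observable instead of spinning forever, the polling loop can be bounded and the last control value decoded. The sketch below is illustrative only: `read_ctrl` is a hypothetical callable standing in for the MMIO read, ap_start (bit 0) and ap_reset (bit 4) come from the description above, and the ap_done/ap_idle/ap_ready positions follow the usual Xilinx ap_ctrl_hs layout, which I am assuming VX_afu_ctrl.sv matches.

```python
import time

# Bit positions in AP_CTRL; bits 1-3 are an assumption (standard ap_ctrl_hs).
AP_BITS = {"ap_start": 0, "ap_done": 1, "ap_idle": 2, "ap_ready": 3, "ap_reset": 4}

def decode_ap_ctrl(value):
    """Return the names of the AP_CTRL bits that are set in value."""
    return [name for name, bit in AP_BITS.items() if (value >> bit) & 1]

def wait_done(read_ctrl, timeout_s=5.0, poll_s=0.001,
              clock=time.monotonic, sleep=time.sleep):
    """Poll read_ctrl() until ap_done is set or timeout_s elapses.

    Returns (done, last_value) so the caller can log the final register state.
    """
    deadline = clock() + timeout_s
    while True:
        value = read_ctrl()
        if (value >> AP_BITS["ap_done"]) & 1:
            return True, value
        if clock() >= deadline:
            return False, value
        sleep(poll_s)
```

On a timeout, logging `decode_ap_ctrl(last_value)` would show whether the AFU still reports idle, which is what xbutil examine suggests is happening here.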

Working tree state

modified:   hw/rtl/afu/xrt/VX_afu_wrap.sv      # drain logic reverted to upstream master state
modified:   hw/syn/xilinx/xrt/vitis.ini        # commented sp= lines added (no effect)
modified:   tests/regression/common.mk         # STARTUP_ADDR patch (proposed)

hw/rtl/afu/xrt/VX_afu_wrap.sv is identical to upstream master
(origin/master's 31e4765). hw/rtl/afu/xrt/VX_afu_ctrl.sv has only
the ap_reset priority fix on top of upstream. So we're effectively
testing the latest upstream Vortex master on VCK5000 ES1, with minimal
patches.

What I'd like to know

  1. Has anyone successfully run a Vortex regression kernel end-to-end on
    VCK5000 (any variant) on actual hardware, or is the platform mostly
    tested via xrtsim only?

  2. Has anyone successfully run a Vortex regression kernel end-to-end on
    any Alveo board with the current master? The ratio of "works in
    sim" to "hangs on FPGA" reports in the issue tracker is concerning.

  3. Is there a known set of patches (not yet upstreamed to master)
    that fixes the FPGA-side hang? I noticed the commit history was
    squashed in 31e4765 (Oct 2025); side branches like develop,
    bug_fixes, volt, etc. have additional commits but I couldn't
    identify which ones target this hang.

  4. Would report_timing -hold -delay_type max on the corner conditions
    reveal anything STA missed? Our WHS is +0.030 ns, which feels
    suspiciously thin, and xrtsim doesn't model timing.

  5. Does the AFU control plane (s_axi_ctrl) need any specific
    axi_clock_converter configuration on Versal NoC? We rely on v++
    to insert it automatically; maybe ES1 needs something explicit.

  6. The STARTUP_ADDR lowering patch above — is there any reason
    not to lower it to 0x80000000 for all XLEN=64 builds? It works on
    every supported board's virtual window and removes the silent
    single-bank failure mode. Happy to submit a PR if welcome.

I'm willing to share the full xrtsim trace, all four bitstreams' timing
and utilization reports, and run any additional debug builds if it would
help maintainers narrow down the issue. Thanks for reading.


Files I can attach if helpful

  • Full xrtsim DEBUG=3 trace (~50 KB)
  • All 4 bitstreams' impl_1_hw_bb_locked_timing_summary_routed.rpt
  • All 4 bitstreams' impl_1_hw_bb_locked_utilization_placed.rpt
  • xclbin info dumps (xclbinutil --info)
  • Working tree diff vs origin/master
  • Host strace during the hang
