Move driver and nvrtc cython and internal layers to new generator#1972
Move driver and nvrtc cython and internal layers to new generator#1972mdboom wants to merge 20 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
|
/ok to test |
|
/ok to test |
|
/ok to test |
1 similar comment
|
/ok to test |
|
/ok to test |
|
/ok to test |
|
|
/ok to test |
|
/ok to test |
1 similar comment
|
/ok to test |
|
/ok to test |
|
/ok to test |
| cuda_bindings/cuda/bindings/_bindings/cyruntime_ptds.pxd | ||
| cuda_bindings/cuda/bindings/_bindings/cyruntime_ptds.pyx | ||
| cuda_bindings/cuda/bindings/_internal/_nvml.pyx | ||
| cuda_bindings/cuda/bindings/_internal/cufile.pyx |
There was a problem hiding this comment.
nit: we should add them here now that both are auto-gen'd?
| cuda_bindings/cuda/bindings/_internal/cufile.pyx | |
| cuda_bindings/cuda/bindings/_internal/driver.pyx | |
| cuda_bindings/cuda/bindings/_internal/nvrtc.pyx | |
| cuda_bindings/cuda/bindings/_internal/cufile.pyx |
There was a problem hiding this comment.
We only want to ignore the .pyx files that come from .pyx.in files. .pyx files that are checked into cuda-python (even if generated by cybind) should not be on this list.
There was a problem hiding this comment.
Ah @mdboom I see the confusion. Like cufile.pyx & co below, they are all auto-generated at build time too (from cufile_<platform>.pyx). It's different from the .pyx.in templates but still is generated code, hence it'd be nice to be consistent and ignore the generated files.
|
LGTM overall, @mdboom it seems the build fails at cythonization again 😛 |
Yeah, my agent found something easy, but it just fixed Python 3.15 by breaking everyone else ;) Need to work deeper. |
It turns out the exception signature needs to match exactly when declaring function pointer types (which cuda-core does in one spot). cython-gen was generating |
|
@leofang: This is passing all the tests now. I think this is good to go (and let's merge the generator-side thing at roughly the same time, if we can). |
I assume that this also helps us avoid any potential (Cython) ABI breakage? |
|
I wish we have the ABI test running in the CI 😛 Eyeball'd as much as I can. Thanks, Mike! The CI flakiness should be fixed in #2195. I retried a few times here to make it green. |
leofang
left a comment
There was a problem hiding this comment.
Re-reviewing the symbol-loading rework end-to-end against main. Two file-anchored concerns below, plus one branch-state issue:
Missing CUDA 13.3 symbols. The following are declared and loaded in _bindings/cydriver.pyx.in on main (all at cudaVersion=13030) but absent from every new file in this PR (driver_linux.pyx, driver_windows.pyx, _internal/driver.pxd, cydriver.pyx):
cuLogicalEndpoint{Create,Destroy,AddDevice,BindAddr,BindMem,Unbind,Export,Import,Query,GetLimits,IdRelease,IdReserve}— 12 functionscuStreamBeginRecaptureToGraph— 1 function
Looks like the branch was regenerated before these landed on main. Calling any of them post-merge would be a hard breakage, so this needs a rebase + regen before going in. Not a design fix.
Cancelling my prior approval until that and the inline comments are addressed.
-- Leo's bot
| _F_cuGetProcAddress_v2('cuMemHostUnregister', <void **>&__cuMemHostUnregister, 4000, ptds_mode, NULL) | ||
|
|
||
| global __cuMemcpy | ||
| _F_cuGetProcAddress_v2('cuMemcpy', <void **>&__cuMemcpy, 4000, ptds_mode, NULL) |
There was a problem hiding this comment.
For ~48 symbols (every cuMemcpy*, cuMemset*, and cuStream* core op), the old code used different cudaVersion args for the PTDS vs DEFAULT load paths — cuMemcpy for example was loaded with cudaVersion=7000 in the PTDS branch and cudaVersion=4000 in the default branch. The new unified path uses the default-mode value (4000) for both. The PTDS flag should still steer the driver to the _ptds variant, but this is a real behavior change that I don't think CI covers today.
Could we run a manual pass with CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1 on the driver range we support, and confirm the resolved function pointers actually point at the _ptds/_ptsz variants (and not the plain ones)? Easiest check would be to snapshot the __cuMemcpy / __cuMemcpyAsync / __cuMemsetD8 etc. addresses from a fixture under both modes against a known-good build.
-- Leo's bot
| _F_cuGetProcAddress_v2('cuStreamGetCtx', <void **>&__cuStreamGetCtx, 9020, ptds_mode, NULL) | ||
|
|
||
| global __cuStreamGetCtx_v2 | ||
| _F_cuGetProcAddress_v2('cuStreamGetCtx_v2', <void **>&__cuStreamGetCtx_v2, 12050, ptds_mode, NULL) |
There was a problem hiding this comment.
The new generator passes the literal _v2-suffixed name to cuGetProcAddress_v2, while the old generator passed the base name and let cudaVersion disambiguate:
- _F_cuGetProcAddress_v2('cuStreamGetCtx', &__cuStreamGetCtx_v2, 12050, PTDS, NULL) # main, _bindings/cydriver.pyx.in:799
+ _F_cuGetProcAddress_v2('cuStreamGetCtx_v2', &__cuStreamGetCtx_v2, 12050, ptds_mode, NULL) # this PRSame pattern applies to cuLaunchHostFunc_v2 (line 2090), cuCtxGetDevice_v2, cuCtxSynchronize_v2, cuMulticastBindAddr_v2, cuMulticastBindMem_v2 — six symbols in total.
The documented cuGetProcAddress contract is to pass the base name; that's the whole point of the API and what gives us versioned-symbol mapping for free. Passing the literal versioned name happens to work today (the driver likely also indexes by the literal name), but it's not the documented contract and bypasses the version arg. If a future driver tightens resolution to base-name-only, these six symbols silently become NULL and we get a hard-to-diagnose crash — exactly the breakage cuGetProcAddress was designed to prevent.
This is also the "re-creating the versioned symbol mapping" piece you flagged as missing from the new generator — could cybind be updated to emit the base name + cudaVersion and reconstruct the mapping, matching the old behavior?
-- Leo's bot
This is a continuation of the work in #1900. Now adds
driverto the mix and bothnvrtcanddriverare generated from the "real" new generator.