Add the cuda.core.experimental.system singleton #256

Merged: leofang merged 15 commits into NVIDIA:main from keenan-simpson:add_system on Dec 10, 2024

keenan-simpson commented Nov 28, 2024

Contributor

Add the cuda.core.experimental.system singleton, which is an instance of System.

closes #149
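
For readers unfamiliar with the pattern, here is a minimal sketch of a module-level singleton of the kind the description mentions. It is not taken from the PR diff: the class body, the `num_devices` property, and its placeholder value are hypothetical and only illustrate the shape of "a `system` object that is an instance of `System`".

```python
class System:
    """Hypothetical stand-in for the real class; members are illustrative only."""

    _instance = None

    def __new__(cls):
        # Reuse one shared instance so every System() call returns the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    @property
    def num_devices(self) -> int:
        # Placeholder value; a real implementation would query the driver.
        return 0


#: Module-level singleton exposing system-wide information.
system = System()
```

With this shape, callers would import `system` and read its properties directly instead of constructing `System` themselves.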

copy-pr-bot Bot commented Nov 28, 2024

Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

keenan-simpson changed the title from "Add the system class" to "Add the cuda.core.experiemental.system singleton" Nov 28, 2024
leofang added the P0 (High priority - Must do!), feature (New feature or request), and cuda.core (Everything related to the cuda.core module) labels Nov 28, 2024
leofang added this to the cuda.core beta 2 milestone Nov 28, 2024
keenan-simpson marked this pull request as ready for review November 29, 2024 00:47

leofang left a comment

Member

Thanks, Keenan! Left a few quick comments.

Comment thread cuda_core/docs/source/api_private.rst Outdated
Comment thread cuda_core/docs/source/release.md Outdated
Comment thread cuda_core/docs/source/release/0.2.0-notes.md Outdated
Comment thread cuda_core/docs/source/release/0.2.0-notes.md Outdated
Comment thread cuda_core/tests/test_system.py Outdated
keenan-simpson changed the title from "Add the cuda.core.experiemental.system singleton" to "Add the cuda.core.experimental.system singleton" Dec 3, 2024

leofang left a comment

Member

@ksimpson-work I "hacked" the docs in commit 1a3f1e6. It's not pretty but it does the job:
[Image: Screenshot 2024-12-07 1:22:45 AM]

One improvement we need to make: rewrite the docstrings as one-liners rather than full, valid Sphinx docstrings, because the autodata directive only expects simple one-liners:
https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#automatically-document-attributes-or-data
This is also why 3 warnings are raised if you build the docs locally.

Comment on lines +103 to +106
n_pops = len(lines)
lines.extend(lines_new)
for _ in range(n_pops):
    lines.pop(0)

Member

try this?

Suggested change
- n_pops = len(lines)
- lines.extend(lines_new)
- for _ in range(n_pops):
-     lines.pop(0)
+ lines.clear()
+ lines.extend(lines_new)
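
A short sketch of why both versions mutate `lines` rather than assigning a new list. Sphinx's `autodoc-process-docstring` handlers receive the docstring as a list that they are expected to modify in place, so rebinding the local name would leave the caller's list untouched. The function names below are hypothetical and are not from the PR.

```python
def replace_in_place(lines, lines_new):
    # Same effect as the suggested change: empty the caller's list, then refill it.
    lines.clear()
    lines.extend(lines_new)


def rebind_only(lines, lines_new):
    # Rebinds the local name only; the caller's list object is not modified.
    lines = list(lines_new)
    return lines


docstring = ["old line 1", "old line 2"]
replace_in_place(docstring, ["new one-liner"])

untouched = ["old line 1", "old line 2"]
rebind_only(untouched, ["new one-liner"])
```

The slice assignment `lines[:] = lines_new` is another in-place form with the same effect as `clear()` followed by `extend()`.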

keenan-simpson

Contributor Author

[Image: Screenshot 2024-12-09 120806]

keenan-simpson

Contributor Author

The latest commit fixes the build warning and properly formats the Returns field. Thanks for the hack, @leofang.

Co-authored-by: Leo Fang <leo80042@gmail.com>
keenan-simpson

Contributor Author

/ok to test

leofang left a comment

Member

Thanks!

leofang merged commit c1fea41 into NVIDIA:main Dec 10, 2024
leofang added a commit that referenced this pull request Jun 9, 2026
* CI: allow specifying custom driver versions in test matrix

Extends the DRIVER field in ci/test-matrix.yml beyond 'latest'/'earliest'
to accept an explicit version string (e.g. '580.65.06'). For Linux,
ci/tools/install_gpu_driver.sh (adapted from nv-gha-runners/vm-images
PR #256) swaps the driver in-job via nsenter when the row uses a custom
version; for Windows, ci/tools/install_gpu_driver.ps1 is split into
install + configure_driver_mode, with the install step gated on the
DRIVER value and the mode step always running.

The matrix row is routed to a 'latest' runner image when the DRIVER is
a custom version (the install scripts perform the swap themselves).
Container privileges on Linux (--privileged --pid=host) are added only
on rows with a custom DRIVER. Custom DRIVER + FLAVOR=wsl is rejected
eagerly in the compute-matrix step.
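
The routing rules above could be sketched roughly as follows. This is an illustration, not the actual compute-matrix code: the function name, the returned keys, and the version regex are assumptions made for the example.

```python
import re

# Assumed shape of an explicit driver version, e.g. "580.65.06" or "610.47".
_VERSION_RE = re.compile(r"^\d+(\.\d+)+$")


def resolve_driver(driver: str, flavor: str = "") -> dict:
    """Map a DRIVER value to a (hypothetical) per-row plan."""
    if driver in ("latest", "earliest"):
        # Stock rows use the matching runner image; no in-job install.
        return {"runner_image": driver, "install_driver": False,
                "linux_privileged_container": False}
    if not _VERSION_RE.match(driver):
        raise ValueError(f"unrecognized DRIVER value: {driver!r}")
    if flavor == "wsl":
        # Rejected eagerly, before any job is scheduled.
        raise ValueError("custom DRIVER is not supported with FLAVOR=wsl")
    # Custom versions start from a 'latest' image and swap the driver in-job.
    return {"runner_image": "latest", "install_driver": True,
            "linux_privileged_container": True}
```

Failing at matrix-computation time means an invalid row is reported once, up front, instead of as a failed test job later.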

Two existing nightly-numba-cuda rows exercise the new path:
- Linux amd64 / 13.3.0 / l4 -> 580.65.06
- Windows amd64 / 13.3.0 / l4 -> 610.47

Closes #293
Closes #1265

* CI: fix Linux driver nsenter re-exec, swap Windows version, enable ci.yml dispatch

- install_gpu_driver.sh: pipe the script body to the host-side bash via
  stdin (bash -s < "$0") instead of re-execing "$0". The script lives
  in the GH workspace mount (container-only), so the relative path
  doesn't resolve after nsenter switches the mount namespace.
  The < "$0" fd is opened before nsenter and survives the flip.
- test-matrix.yml: Windows nightly-numba-cuda row 610.47 -> 596.36
  (610.47 isn't published on the CDN; install hit 404).
- ci.yml: add workflow_dispatch: trigger so the pipeline can be
  re-run manually. The existing should-skip / detect-changes gates
  already handle non-PR events.

* CI: move 'Ensure GPU is working' after 'Install GPU driver' on Linux

So nvidia-smi validates the post-install driver state on custom-DRIVER
rows. Windows test-wheel + coverage already use Install -> Configure ->
Ensure; this brings the Linux test-wheel job into line.

* CI: flip two PR-matrix Linux rows to DRIVER=610.43.02

Exercises the custom-driver install path on every PR (not just nightly).
Both rows are amd64 / 13.3.0 / local-CTK, on l4 and rtxpro6000 -- both
in the 'open' kernel-module flavor (only Volta needs 'legacy').

* CI: restart nvidia-persistenced on Linux; poll nvidia-smi on Windows

Linux: After install_gpu_driver.sh stops nvidia-persistenced and the apt
purge removes the package, the .run installer reinstalls the systemd
service but leaves it stopped. cuda.core's test_persistence_mode_enabled
fails with NVML_ERROR_UNKNOWN on driver 610.43.02 when the daemon is
not running; explicitly start it again at the end of host_install().

Windows: configure_driver_mode.ps1's trailing 'Start-Sleep -Seconds 5'
is not enough on multi-GPU rows that are slow to come back up (observed:
2x H100 MCDM). Replace it with a poll-until-success loop on nvidia-smi
with a 60s deadline, matching the runner-team nvgha-driver.ps1 pattern.
Previously masked because every Windows row used to run the full
install pipeline; with custom-DRIVER plumbing, latest/earliest rows
skip the install and the cycle is no longer preceded by warm-up time.
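
The poll-until-success idea, sketched in Python rather than PowerShell. The helper name, its parameters, and the injected `clock`/`sleep` hooks are hypothetical. In the real script the check is an `nvidia-smi` invocation, which is replaced here by an arbitrary callable.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool],
               deadline_s: float = 60.0,
               interval_s: float = 2.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() until it returns True or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if check():
            return True
        if clock() >= end:
            # Deadline reached without a successful check.
            return False
        sleep(interval_s)
```

Unlike a fixed sleep, this returns as soon as the check succeeds and still bounds the total wait on rows that never recover.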

* CI: re-enable persistence mode after Linux driver swap

Runner-latest L4 images come up with Persistence-M=On (set somewhere in
the runner team's image setup, not in cuda-python). Our .run install
leaves it Off, which breaks cuda.core's test_persistence_mode_enabled
on driver 610.43.02 -- the test calls device.is_persistence_mode_enabled
= False on a device that already reports False, and 610.43.02 returns
NVML_ERROR_UNKNOWN for that no-op set.

Restore the runner baseline by calling `nvidia-smi -pm 1` at the end of
host_install() (sets the kernel persistence flag directly via NVML).
Also daemon-reload + start nvidia-persistenced.service best-effort so
tools that look for the daemon find it; `set -x` around this trailing
block so the next run's log confirms which lines fired.

* CI: preserve SUID bit when refreshing container nvidia binaries

refresh_container_libs() used 'cp -f --remove-destination' (verbatim
from the runner team's nvgha-driver), which without -p/--preserve
strips the SUID/SGID bits on the destination. /usr/bin/nvidia-modprobe
ships 4755 and NVML's state-changing calls (e.g.
nvmlDeviceSetPersistenceMode) route through it; once SUID is gone the
container-side call returns NVML_ERROR_UNKNOWN, which is what cuda.core's
test_persistence_mode_enabled was hitting.

Add a stat diagnostic line at the end of refresh_container_libs() so
the next CI log records nvidia-modprobe's post-refresh mode.

* CI: exec nvidia-persistenced directly after Linux driver swap

The `--silent --no-questions` .run installer drops
/usr/bin/nvidia-persistenced but does not reliably install a usable
systemd unit, so `systemctl start nvidia-persistenced.service` was a
no-op (verified in CI logs: `+ true` after the start). With the daemon down, the
/run/nvidia-persistenced/socket bind-mounted into the test container
is stale, and NVML state-changing calls (e.g.
nvmlDeviceSetPersistenceMode) made by root inside the container
return NVML_ERROR_UNKNOWN -- which is what cuda.core's
test_persistence_mode_enabled has been failing on.

Verified on ComputeLab with the same driver (610.43.02), same GPU
arch (Ada L40S), root in container: with the daemon up, the SET call
returns NVML_SUCCESS; with the daemon down it returns UnknownError.

Fix: exec /usr/bin/nvidia-persistenced directly. The binary
self-daemonizes and creates the socket on its own. (Same latent gap
exists in nv-gha-runners/vm-images' nvgha-driver; will flag upstream.)

* CI: pass --user root to nvidia-persistenced after Linux driver swap

nvidia-persistenced defaults to `--user nvidia-persistenced`, which
our apt-purge of `nvidia-compute-utils-*` removed. Without that user
the daemon's setuid(3) post-fork fails and the process exits silently
-- the `nvidia-smi -pm 1` right after sees Persistence-M briefly On
(daemon held it), then it flips back to Off (daemon gone), and the
test container's NVML SET call later returns NVML_ERROR_UNKNOWN.

Pass --user root so the daemon doesn't depend on a user account that
the purge deleted. Also add a `pgrep nvidia-persistenced` + `ls -la
/run/nvidia-persistenced/` diagnostic so the next CI log proves the
daemon is alive when the test starts.

* CI: add fast-feedback probe-driver-swap job (workflow_dispatch only)

Allocates one L4 GPU + privileged container, runs install_gpu_driver.sh
with DRIVER=610.43.02, then drives nvmlDeviceSetPersistenceMode via
raw ctypes -- the exact NVML call that cuda.core's
test_persistence_mode_enabled exercises. Exits 1 on
NVML_ERROR_UNKNOWN so the smoke test fails loudly when the install
path leaves the daemon dead.

Total runtime ~5 min vs ~30 min for the full test matrix.

Triggered by workflow_dispatch only -- this is an opt-in debugging
job, not regular PR or nightly traffic.

* CI: drop workflow_dispatch gate on probe-driver-swap so it runs on every PR

* CI: stop refresh_container_libs from clobbering /run/nvidia-persistenced

refresh_container_libs() walks /proc/self/mountinfo for entries
containing 'nvidia' or 'libcuda'. /run/nvidia-persistenced/socket
matches that pattern and was being umount'd + cp'd over -- which
breaks the container's view of the daemon's IPC socket (the
container ends up with a 0-link unlinked socket inode instead of
the live host one). Without a working socket, NVML state-changing
calls inside the container return NVML_ERROR_UNKNOWN -- which is
exactly what cuda.core's test_persistence_mode_enabled was hitting.

Restrict the refresh to /usr/(bin|lib) so it only touches the
actual binaries + shared libraries that change version with the
driver swap. /dev/nvidia*, /proc/driver/nvidia, /run/nvidia-*,
/tmp/nvidia-mps are all left as the toolkit set them up.
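
A rough Python sketch of the narrowed filter (the real script is shell). It assumes the standard `/proc/self/mountinfo` layout, in which the fifth whitespace-separated field is the mount point, and it matches on the mount point only. The function name and the sample entries are made up for illustration.

```python
import re
from typing import List

_DRIVER_RE = re.compile(r"nvidia|libcuda")     # the original, broad match
_REFRESH_RE = re.compile(r"^/usr/(bin|lib)")   # the added restriction


def mount_points_to_refresh(mountinfo_text: str) -> List[str]:
    """Return driver-related mount points that live under /usr/(bin|lib)."""
    selected = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mount_point = fields[4]
        if _DRIVER_RE.search(mount_point) and _REFRESH_RE.match(mount_point):
            selected.append(mount_point)
    return selected


# Made-up sample entries, not copied from a real runner.
SAMPLE = """\
100 90 8:1 /usr/bin/nvidia-smi /usr/bin/nvidia-smi ro - ext4 /dev/sda1 rw
101 90 8:1 /usr/lib/libcuda.so.1 /usr/lib/libcuda.so.1 ro - ext4 /dev/sda1 rw
102 90 0:25 /nvidia-persistenced/socket /run/nvidia-persistenced/socket rw - tmpfs tmpfs rw
103 90 0:6 /nvidia0 /dev/nvidia0 rw - devtmpfs udev rw
"""
```

On the sample, only the two `/usr` entries are selected. The socket and device-node mounts are left as they were.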

Same latent gap exists in nv-gha-runners/vm-images' nvgha-driver;
their CUDA-runtime validation workload never queries the daemon
socket so they haven't surfaced it.

* CI: take down nvidia-persistenced via pkill, not systemctl

The packaged nvidia-persistenced.service has
`RuntimeDirectory=nvidia-persistenced`, which makes systemd `unlink()`
/run/nvidia-persistenced/ when the unit stops. The container has that
directory bind-mounted from the host as of container-start time. When
systemd removes the inode and our subsequent
`/usr/bin/nvidia-persistenced --user root` call re-creates it, the
container's bind mount is stranded on the deleted inode -- its
/run/nvidia-persistenced/socket shows up with link count 0 and NVML
state-changing calls return NVML_ERROR_UNKNOWN.

`pkill -TERM nvidia-persistenced` sends SIGTERM directly to the
daemon, which exits cleanly without involving systemd's
RuntimeDirectory cleanup. The host dir keeps its inode across the
swap; the container's bind mount stays valid; the new daemon's
socket is visible to in-container NVML clients.

* CI: re-bind /run/nvidia-persistenced into container after driver swap

The container's bind mount of /run/nvidia-persistenced/ is taken at
container-start time and pinned to the host directory's then-current
inode. Across the install the host directory gets recreated under a
fresh inode (the daemon's shutdown + restart cycle replaces it), and
the container is stranded on the deleted inode -- socket file shows
up with link count 0 inside the container, NVML state-changing calls
return NVML_ERROR_UNKNOWN.

After refresh_container_libs, umount the stale bind, mkdir the local
mount point if missing, and re-bind from
/proc/1/root/run/nvidia-persistenced (the host's current view via the
privileged container's host-pid-ns access). CAP_SYS_ADMIN is required,
which custom-DRIVER rows already have via --privileged --pid=host.
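
The stranded-inode behaviour can be reproduced in miniature without containers. The sketch below is an analogy, not the CI code: an open file descriptor stands in for the bind mount (both stay pinned to the original inode), and re-opening the path stands in for the re-bind. It assumes a POSIX filesystem, and all names are hypothetical.

```python
import os
import tempfile


def demonstrate_stale_handle() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "socket-placeholder")
        with open(path, "w") as f:
            f.write("old")
        # Pin the original inode, like the bind mount taken at container start.
        stale_fd = os.open(path, os.O_RDONLY)
        try:
            # Remove and recreate the path, like the daemon restart cycle.
            os.unlink(path)
            with open(path, "w") as f:
                f.write("new")
            stale_links = os.fstat(stale_fd).st_nlink
            stale_data = os.read(stale_fd, 16).decode()
            # Resolving the path again reaches the live file, like the re-bind.
            with open(path) as f:
                fresh_data = f.read()
        finally:
            os.close(stale_fd)
    return {"stale_links": stale_links,
            "stale_data": stale_data,
            "fresh_data": fresh_data}
```

The old handle reports a link count of 0 and still sees the old contents, while a fresh lookup of the same path sees the new file.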

* CI: drop install_gpu_driver.sh experiments that turned out non-load-bearing

- Revert `pkill -TERM nvidia-persistenced` to `systemctl stop`; pkill
  alone didn't prevent the host dir's inode from flipping, the re-bind
  of /run/nvidia-persistenced/ is what restores the container's view.
- Drop `nvidia-smi -pm 1`; the test exercises NVML's set call, which
  succeeds once the daemon socket is reachable regardless of current
  Persistence-M state.
- Trim `set -x` blocks and `pgrep`/`ls -la`/`stat` diagnostics that
  served their purpose during debugging.

Keeps the load-bearing changes (nsenter bash -s, /usr/(bin|lib)
refresh filter, exec nvidia-persistenced --user root, the
/run/nvidia-persistenced re-bind, cp --preserve=mode) and brings the
diff against Justin's nvgha-driver back down to the strict minimum.

* Revert: remove the probe-driver-swap fast-feedback job

Added in a3f1573 for fast iteration on install_gpu_driver.sh; no
longer needed now that the script has stabilized.

* CI: address Mike's review comments on PR 2176

- ci.yml: `workflow_dispatch:` -> `workflow_dispatch: {}` so the empty
  mapping reads as intentional rather than ambiguous YAML.
- test-wheel-linux.yml: declare `util-linux` in `Install dependencies`
  instead of running a second apt-get inline; util-linux ships in
  ubuntu:22.04 by default so this is mostly belt-and-suspenders, but
  it removes the redundant apt-get call.
- install_gpu_driver.sh: drop `2>/dev/null` on `systemctl stop` so
  real errors surface (`|| true` keeps the script non-fatal). The
  redirect was inherited verbatim from nv-gha-runners/vm-images PR 256
  with no specific need.
Labels

cuda.core: Everything related to the cuda.core module
feature: New feature or request
P0: High priority - Must do!

Development

Successfully merging this pull request may close these issues.

Add cuda.core.system

2 participants