Leaving Xorg fixed everything I'd blamed on NVIDIA

You may have read my last post, How to stop your Ubuntu laptop from freezing after it eats your whole RAM. I mentioned there that the NVIDIA GeForce GTX 1050 Ti deserved its own article, so here we are. As a reminder: I’m using a Dell G3 3579 laptop, bought 8 years ago. Old, but good enough once I got rid of its problems.

For as long as I can remember, I’ve had a lot of problems with NVIDIA Optimus, starting back with my netbook (do you remember those?), an Asus 1215N. It never worked on Linux as well as it did on Windows. Configuring it and making it at least usable took me hours and was always a trip through hell. I was never fully satisfied.

After buying the Dell G3 I disabled the NVIDIA GPU in the BIOS. I worked that way for about a year, until I wanted to use an external screen and my problems came back. The Dell’s HDMI port only worked with the NVIDIA card enabled, so I had to turn it back on in the BIOS. Shortly afterwards the computer started getting loud, and the heat wore down the battery’s capacity. My laptop became a desktop PC, permanently connected to AC. I knew it was the NVIDIA GPU’s fault. I tried a lot of potential solutions: upgrading drivers, using unofficial drivers, the Bumblebee project, managing the NVIDIA GPU with Primus, aggressive power management. They all promised a lot, but none of them ever fully satisfied me.

To help myself and my computer, every 3 months I cleaned it inside and changed the thermal paste and pads. Btw, this is quite a good practice — but to be honest, I’d never have done it at these intervals without those problems.

A few years later I decided to bring my laptop’s primary function back to life, so I replaced the battery. Unfortunately, after about a year the heat killed the capacity again. I started seriously thinking about buying a new machine.

A few months ago I started digging one more time, looking for an easy solution to a completely different problem: after plugging in HDMI, unplugging it, or suspending the computer, the machine took a long time to show anything besides a mouse cursor on a black screen. It also hung very often along the way. Luckily, the solution turned out to be right there, ready to use.

What was in the logs?

Checking journalctl:

# last event per boot
boot -1 ended: (NO CLEAN SHUTDOWN MARKER — likely hard reset)
boot -2 ended: (NO CLEAN SHUTDOWN MARKER — likely hard reset)
boot -3 ended: (NO CLEAN SHUTDOWN MARKER — likely hard reset)
boot -4 ended: (NO CLEAN SHUTDOWN MARKER — likely hard reset)
boot -5 ended: (NO CLEAN SHUTDOWN MARKER — likely hard reset)
boot -6 ended: (NO CLEAN SHUTDOWN MARKER — likely hard reset)
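For anyone who wants to produce a summary like this on their own machine, here is a minimal sketch. It is not the exact script behind the output above. It assumes the journal is persistent (so earlier boots are still on disk) and that a clean shutdown leaves one of a few typical systemd messages near the end of a boot's log; both function names and the marker pattern are my own choices.

```shell
# Classify how a boot ended from the tail of its journal (read on stdin).
# ASSUMPTION: a clean shutdown leaves one of these systemd messages near the
# end of the log; the exact wording can differ between systemd versions.
classify_boot_end() {
  if grep -Eq 'Reached target.*(Shutdown|Power-Off|Reboot)|systemd-shutdown|Journal stopped'; then
    echo "clean shutdown"
  else
    echo "NO CLEAN SHUTDOWN MARKER (likely hard reset)"
  fi
}

# Walk the previous N boots (default 6) and print one verdict per boot.
summarize_boots() {
  n="${1:-6}"
  i=1
  while [ "$i" -le "$n" ]; do
    verdict=$(journalctl -b "-$i" -n 50 --no-pager 2>/dev/null | classify_boot_end)
    echo "boot -$i ended: $verdict"
    i=$((i + 1))
  done
}
```

Running summarize_boots 6 prints one line per boot. A missing marker is only a hint, since a journal that was cut short for other reasons looks the same.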

The last 6 boots all ended with a hard reset… this shows the scale of the problem. Let’s check Xorg.log (yes, I was still using Xorg):

[129515.894] NVIDIA(GPU-0): LG Electronics LG ULTRAWIDE: connected
[129515.938] NVIDIA(GPU-0): LG Electronics LG ULTRAWIDE: connected  ← 44ms later
[129516.412] NVIDIA(GPU-0): LG Electronics LG ULTRAWIDE: connected  ← 474ms
[129516.455] NVIDIA(GPU-0): LG Electronics LG ULTRAWIDE: connected
[129520.246] NVIDIA(GPU-0): LG Electronics LG ULTRAWIDE: connected
 …and 10 more in 5 seconds

A storm of hotplug events. The NVIDIA Xorg driver received 15 identical events in 5 seconds, and for each one it reconfigured the display. Somewhere in that crowd it appears to have hit a race condition with the PRIME bridge, which is what hung the whole rendering pipeline.
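Counting a storm like this by eye gets tedious, so here is a small sketch that does it with a sliding window. It assumes each log line starts with a bracketed timestamp in seconds, as in the excerpt above, and that a hotplug shows up as a line containing ": connected". The function name is mine.

```shell
# Read an Xorg log on stdin and print the largest number of "connected"
# lines that fall within any WINDOW-second span (default 5 seconds).
max_hotplug_burst() {
  window="${1:-5}"
  LC_ALL=C awk -v window="$window" '
    /: connected/ && match($0, /\[ *[0-9]+\.[0-9]+\]/) {
      ts = substr($0, RSTART + 1, RLENGTH - 2) + 0   # timestamp without brackets
      times[++n] = ts
      # drop events from the front that are older than the window
      while (ts - times[head + 1] > window) head++
      if (n - head > best) best = n - head
    }
    END { print best + 0 }
  '
}
```

Feed it the log with something like max_hotplug_burst 5 < Xorg.0.log (the log path depends on your setup). More than one or two events per plug-in is already suspicious.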

First step to heaven — Wayland and on-demand

One command:

sudo prime-select on-demand

Now log out, click the cog icon on Ubuntu’s login screen and choose Wayland. That’s it. I expected HDMI to stop working, but nothing of the sort happened. Unplug it, plug it back in, and it comes up faster than a blink — I’m in heaven. Problem solved.
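After logging back in, it is worth confirming that both changes took. The sketch below checks the XDG_SESSION_TYPE environment variable, which the login manager sets, and asks prime-select for the active profile. The function names are my own, and the prime-select part is skipped if the tool is not installed.

```shell
# Succeeds when the current login session is Wayland.
# XDG_SESSION_TYPE is set by the login manager ("wayland" or "x11").
session_is_wayland() {
  [ "${XDG_SESSION_TYPE:-}" = "wayland" ]
}

# Print a two-line report: session type and the selected PRIME profile.
report_graphics_setup() {
  if session_is_wayland; then
    echo "session: wayland"
  else
    echo "session: ${XDG_SESSION_TYPE:-unknown}"
  fi
  if command -v prime-select >/dev/null 2>&1; then
    echo "prime profile: $(prime-select query)"
  else
    echo "prime profile: prime-select not found"
  fi
}
```

If report_graphics_setup still says x11, the cog on the login screen was probably left on the Xorg entry.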

Why that fixed it

The freezes were never really about HDMI. They were about which GPU was in charge — and on X11 I didn’t get to choose.

The HDMI port on this laptop is wired to the NVIDIA chip, not the Intel one. Under Xorg, the only way to make it work was to run the NVIDIA driver as the primary PRIME provider. The moment I tried an Intel-only setup with the NVIDIA driver out of the way, HDMI stopped working properly. So prime-select nvidia was never a preference — it was the price of having an external screen at all.

And that mode is exactly what broke. With NVIDIA primary it renders everything, including the desktop on the laptop’s own panel, and the display side runs through the X.Org NVIDIA driver — decades-old X11 code. The built-in screen is wired to Intel, so even the internal display was routed dGPU → PRIME → Intel scanout. Every HDMI hotplug landed in that NVIDIA X driver, which tried to reconfigure the whole layout for each of the fifteen connect events, and somewhere in that pile it hit the race the logs pointed to and rendering hung.

before · X11, NVIDIA primary:     HDMI hotplug → NVIDIA DDX → reconfig per event → hang
after  · Wayland, Intel primary:  HDMI hotplug → kernel DRM event → mutter handles it

prime-select on-demand plus Wayland breaks the bind. The Intel iGPU becomes the primary, display-driving GPU, and the NVIDIA card drops to offload — powered up only when something needs it, including lighting up that HDMI-attached output. Mutter, GNOME’s Wayland compositor, does its own mode-setting and hotplug handling straight through the kernel’s DRM/KMS, with no X.Org server and no NVIDIA DDX in the path. I get the external screen without making NVIDIA primary — the thing X11 never let me do. A monitor going in or out is now just a kernel DRM event the compositor absorbs. No legacy X driver to choke on it, no PRIME bridge to race.
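You can look at the kernel's side of this directly. Each DRM connector has a status file in sysfs, and the sketch below just lists them. The root directory is a parameter only so the function can be pointed at a test directory; connector names differ from machine to machine, and the function name is my own.

```shell
# Print "connector: status" for every DRM connector under a sysfs root.
# Defaults to /sys/class/drm, where the kernel exposes one directory per
# connector (for example card0-HDMI-A-1) with a "status" file inside.
list_drm_connectors() {
  root="${1:-/sys/class/drm}"
  for status_file in "$root"/card*-*/status; do
    [ -r "$status_file" ] || continue
    connector=$(basename "$(dirname "$status_file")")
    echo "$connector: $(cat "$status_file")"
  done
}
```

Run list_drm_connectors before and after plugging in the monitor. To watch the hotplug events themselves as they arrive, udevadm monitor --kernel --subsystem-match=drm should show one per change.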

What about power consumption?

Well, not good. Turning my computer back into a laptop would probably frustrate me more than just accepting the current state. New battery, about an hour and a half of work off AC. Not terrible, but I expected more. Let’s see. I ran this command:

nvidia-smi -q -d POWER,PERFORMANCE,CLOCK,TEMPERATURE

The interesting part is here:

    Performance State                                  : P8
    ...
    Module Power Readings
        Average Power Draw                             : N/A
        Instantaneous Power Draw                       : N/A
        Current Power Limit                            : N/A
        Requested Power Limit                          : N/A
        Default Power Limit                            : N/A
        Min Power Limit                                : N/A
        Max Power Limit                                : N/A

And the next command shows the current power consumption:

upower -i $(upower -e | grep BAT)

And the part we care about:

energy-rate:         33,3288 W

Clearly, the power consumption is too high for a laptop at idle.
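To put a number like that in perspective, it helps to turn it into hours. The sketch below pulls energy-rate out of upower's output, converting the decimal comma my locale prints into a decimal point, and divides a battery capacity by it. Both function names are mine, and any capacity you pass in is your own figure, not something from this article.

```shell
# Read `upower -i ...` output on stdin and print the energy-rate in watts,
# converting a locale decimal comma (33,3288) to a decimal point (33.3288).
energy_rate_watts() {
  awk '/energy-rate:/ { gsub(",", ".", $2); print $2; exit }'
}

# Rough runtime in hours for a battery of WH watt-hours drained at W watts.
hours_on_battery() {
  LC_ALL=C awk -v wh="$1" -v w="$2" 'BEGIN {
    if (w + 0 <= 0) { print "n/a"; exit }
    printf "%.1f\n", wh / w
  }'
}
```

For example, upower -i $(upower -e | grep BAT) | energy_rate_watts gives the current draw, and hours_on_battery 50 33.3288 prints 1.5 for a hypothetical 50 Wh battery.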

What are P-states and D-states?

P-states describe how hard the GPU works while it’s awake: P0 is maximum performance, and P8 is the idle state this card settles into, with clocks dropped but the silicon still powered. D-states are a separate axis: they are the PCI power states of the device as a whole. D0 is fully on. D3hot is suspended but still has power applied. D3cold means power to the device has been cut, about as close to zero watts as the hardware gets. What I want is D3cold. What I’m stuck with is D0 + P8: idle in the only way the driver knows how, but never actually off.
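As a quick reference, here is a tiny lookup for the D-state strings the kernel reports for a PCI device. The descriptions are my own one-line summaries of the standard PCI power states, and the function name is made up for this sketch.

```shell
# Translate a PCI D-state name into a one-line description.
describe_dstate() {
  case "$1" in
    D0)     echo "fully powered and operational" ;;
    D1|D2)  echo "intermediate low-power state (optional, rarely used)" ;;
    D3hot)  echo "suspended, but main power still applied" ;;
    D3cold) echo "suspended with main power removed" ;;
    *)      echo "unknown" ;;
  esac
}
```

On this laptop the GPU lives at 0000:01:00.0, so describe_dstate "$(cat /sys/bus/pci/devices/0000:01:00.0/power_state)" describes its current state. The P-state side can be read with nvidia-smi --query-gpu=pstate --format=csv,noheader.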

Second step — let the driver suspend the GPU itself

So I want the card deeper than P8. I was debugging this with an AI assistant (Claude, as it happens), and it told me about Fine-Grained Dynamic Power Management — NVIDIA’s feature for runtime-suspending the GPU when nothing is using it — and that my GP107M was on a “limited support” list for it. The driver README ships right there in /usr/share/doc/nvidia-driver-580/, so the claim was at least checkable, and it sounded plausible enough. It’s only a module flag — set it explicitly, reboot, and see what happens. Worth a try.

It’s one module option, one file:

  # /etc/modprobe.d/nvidia-power-management.conf
  options nvidia "NVreg_DynamicPowerManagement=0x02"

Then rebuild the initramfs (NVIDIA loads early, so the option has to be there from the start) and reboot:

  sudo update-initramfs -u
  sudo reboot

After the reboot, the param had stuck:

grep DynamicPower /proc/driver/nvidia/params
  DynamicPowerManagement: 2
  DynamicPowerManagementVideoMemoryThreshold: 200

The driver took the option. So I checked the draw again:

upower -i $(upower -e | grep BAT) | grep energy-rate

Result:

energy-rate:         14,9264 W

From 33 W down to 15. I’d cut idle power by more than half with one flag and a reboot, on a chip that was only supposed to have “limited support” for the feature. I closed the laptop feeling clever.

I was wrong about why.

Third step — the wall, and what’s on the other side of it

Here’s the first crack: the dGPU was still sitting in D0. If the flag had really enabled runtime suspend, the card should have spent at least some time asleep. I can check directly — every PCI device in /sys reports its current power state and how long it has spent in each:

# 0x10de is NVIDIA's PCI vendor ID; skip every other device
for d in /sys/bus/pci/devices/*; do
  [ "$(cat "$d/vendor" 2>/dev/null)" = "0x10de" ] || continue
  bdf=$(basename "$d")
  echo "=== $bdf ==="
  echo "runtime_status: $(cat "$d/power/runtime_status")"
  echo "power_state:    $(cat "$d/power_state")"
  echo "active_time:    $(cat "$d/power/runtime_active_time") ms"
  echo "suspended_time: $(cat "$d/power/runtime_suspended_time") ms"
done

Output, after 28 hours of uptime:

=== 0000:01:00.0 ===     ← the GPU itself
runtime_status: active
power_state:    D0
active_time:    102395816 ms
suspended_time: 0 ms

=== 0000:01:00.1 ===     ← HDMI audio on the same chip
runtime_status: suspended
power_state:    D3hot
active_time:    24123 ms
suspended_time: 102371729 ms

Same chip, opposite outcome

Read that twice. The audio function and the GPU sit on the same physical chip, with the same power/control: auto. (One caveat: on Linux the HDMI audio function is normally driven by the kernel’s snd_hda_intel rather than the nvidia module, so the two functions don’t strictly share a driver.) The audio function has been asleep for 28 hours straight, woken up for 24 seconds total. The GPU itself, in those same 28 hours, has not been suspended for a single millisecond. Zero.
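The arithmetic behind those numbers is simple enough to script. This sketch converts the millisecond counters to hours and works out what share of the tracked time a device spent suspended; the function names are mine.

```shell
# Percentage of tracked time a device spent runtime-suspended,
# given runtime_active_time and runtime_suspended_time in milliseconds.
suspended_percent() {
  LC_ALL=C awk -v a="$1" -v s="$2" 'BEGIN {
    total = a + s
    if (total == 0) { print "0.00"; exit }
    printf "%.2f\n", 100 * s / total
  }'
}

# Convert milliseconds to hours with one decimal place.
ms_to_hours() {
  LC_ALL=C awk -v ms="$1" 'BEGIN { printf "%.1f\n", ms / 3600000 }'
}
```

With the values above, ms_to_hours 102395816 gives 28.4, suspended_percent 24123 102371729 gives 99.98 for the audio function, and suspended_percent 102395816 0 gives 0.00 for the GPU.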

At first I blamed userspace. There are obvious suspects holding the card busy:

sudo lsof /dev/nvidia* | awk 'NR==1 || /nvidia[0-9]|nvidiactl|nvidia-modeset/'

There they are — gnome-shell, Xwayland, brave, all keeping /dev/nvidia-modeset open permanently under Wayland. On hardware that supports runtime suspend, a single open handle is enough to block it, so this looked like the answer. There’s even an irony to it: Wayland is the same thing that rescued me in step one, which would make the fix for the freezes the very thing keeping the card awake.
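The raw lsof output repeats each process once per open handle, which makes it noisy. Here is a sketch that boils it down to one line per process and device node. It assumes lsof's default column layout, with the command name first and the file name last, and the function name is my own.

```shell
# Read lsof output on stdin and print each distinct "command device" pair
# for handles on /dev/nvidia* nodes, skipping the header line.
nvidia_holders() {
  awk 'NR > 1 && $NF ~ /^\/dev\/nvidia/ { seen[$1 " " $NF] = 1 }
       END { for (k in seen) print k }' | LC_ALL=C sort
}
```

Use it as sudo lsof /dev/nvidia* | nvidia_holders. Note that lsof truncates long command names by default, so gnome-shell may show up shortened.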

Tidy theory. It has a hole. The audio function sits on the same chip and still reaches D3hot, so part of this package can clearly suspend while those handles are held. And the flag that supposedly turned all of this on does nothing, which I only worked out by reading the README myself instead of trusting what I’d been told.

Fine-Grained DPM, and NVIDIA runtime suspend in general, needs Turing or newer on a Coffee Lake or newer chipset. Pascal isn’t on the list: not the “limited support” the assistant had promised, not on it at all. On a pre-Turing chip the driver accepts NVreg_DynamicPowerManagement and then ignores it. The 2 in params means the string parsed, nothing more.
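There is a more direct way to ask than reading params. As far as I can tell from the README's runtime D3 chapter, the driver exposes a per-GPU power file under /proc/driver/nvidia/gpus/ with a "Runtime D3 status" line. Treat that path and line format as an assumption to verify against your driver version; the function name below is mine.

```shell
# Print the driver's own "Runtime D3 status" for one GPU.
# ASSUMPTION: the proprietary driver provides a "power" file under
# /proc/driver/nvidia/gpus/<pci-address>/ with a "Runtime D3 status:" line.
runtime_d3_status() {
  power_file="${1:-/proc/driver/nvidia/gpus/0000:01:00.0/power}"
  if [ -r "$power_file" ]; then
    sed -n 's/^Runtime D3 status:[[:space:]]*//p' "$power_file"
  else
    echo "unavailable"
  fi
}
```

Calling runtime_d3_status with no argument reads the file for the GPU at 0000:01:00.0. Unlike params, which only echoes back what was requested, this is the driver reporting what it actually does.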

So I tested the one thing that settles it: removed the flag, rebooted, and measured idle on battery under the same conditions, with it and without it.

# with the flag    → ~13 W idle, GPU in D0, suspended_time 0
# without the flag → ~14 W idle, GPU in D0, suspended_time 0

Identical, give or take a watt of noise. The flag changes nothing. Which means the drop I was so pleased about came from the reboot, not the option: I’d changed two things at once and credited the wrong one. The card was never going to suspend either way.
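A single energy-rate reading jumps around, so a fair comparison needs an average over several samples per configuration. Here is a minimal sketch of that. The sample count and interval are arbitrary defaults and the function names are mine.

```shell
# Average wattage readings from stdin, one number per line.
# Accepts a decimal comma or a decimal point.
average_watts() {
  LC_ALL=C awk '{ gsub(",", ".", $1); sum += $1; n++ }
       END { if (n) printf "%.1f\n", sum / n; else print "no samples" }'
}

# Print COUNT energy-rate readings, INTERVAL seconds apart.
sample_energy_rate() {
  count="${1:-10}"
  interval="${2:-30}"
  i=0
  while [ "$i" -lt "$count" ]; do
    upower -i "$(upower -e | grep BAT)" | awk '/energy-rate:/ { print $2 }'
    sleep "$interval"
    i=$((i + 1))
  done
}
```

Run sample_energy_rate 10 30 | average_watts once with the flag and once without, on battery, with the same programs open and the same screen brightness. Change one thing per run.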

I still tried to force it. I wrote a udev rule that, on AC unplug, would drop autosuspend_delay_ms to 100 ms and shove the card down fast. Except reading that file in /sys/bus/pci/devices/0000:01:00.0/power/ just returns Input/output error — and strace shows the open succeeds while the read fails, which pins it to the attribute’s own show callback. That turned out to be the kernel being precise, not broken: autosuspend_delay_ms only returns a value once a driver has actually enabled autosuspend on the device, and returns EIO until it does. Runtime PM here is even set to auto and enabled — but the NVIDIA driver never switches autosuspend on, so there’s no delay to read, let alone tune. Same wall as the flag, one layer down: a dial for a mechanism this GPU doesn’t run. The rule was dead before it ran.
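If you script against these sysfs files, it pays to treat a failed read as data instead of letting it abort the script. A minimal sketch, with a function name of my own choosing:

```shell
# Read one attribute from a device's power/ directory in sysfs.
# A failed read (such as the EIO described above, or a missing file)
# is reported as "unreadable" instead of stopping the script.
read_power_attr() {
  dev_dir="$1"
  attr="$2"
  if value=$(cat "$dev_dir/power/$attr" 2>/dev/null); then
    echo "$attr: $value"
  else
    echo "$attr: unreadable"
  fi
}
```

For example, loop over control, runtime_status and autosuspend_delay_ms with /sys/bus/pci/devices/0000:01:00.0 as the device directory. An unreadable autosuspend_delay_ms next to control: auto is the same situation described above.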

It was never a config I was missing. It’s a GPU generation I don’t have.

The held handles aren’t the wall. They’d matter on a card that can do runtime D3 — this one can’t. D3cold (which would also need full ACPI cooperation) isn’t one flag away — it’s one silicon generation away. RTD3 was never built for GP107M, so the card stays in P8 idle and never goes dark, whatever I set.

I deleted the udev rule. It was elegant and completely useless on this hardware.

So where does that leave the laptop? The dGPU never fully powers off — it can’t, not on Pascal — but I stopped caring, because that was never the thing making the machine unusable. The freezes were: six hard resets in six boots, every resume a gamble. Those are gone — resume, suspend, HDMI in and out, all clean. And the thread running under all of it — the freezes, the dGPU-as-primary layout, the heat — was the old X11 + NVIDIA setup. Xorg was the villain the whole time. Leaving it for Wayland fixed the problem I actually had.

A subjective note

I never logged temperatures or battery life carefully back when things were bad, so take this part as impression, not measurement. But the difference in daily use isn’t subtle.

The laptop used to run hot enough to feel through the chassis, with the fans pinned. And on a G3 the fans don’t pin discreetly — this machine is notorious for them spinning up hard and late, as the GPU climbs toward the upper end of its range. A healthy GTX 1050 Ti idles around 40 °C. The chip is rated to run into the 80s and only throttles up near the mid-90s. Mine was loud, so it was living somewhere up there. Now the dGPU sits at ~42 °C and the fans are barely audible.

Battery off AC went from about an hour and a half to close to four. None of that came from the power-management flag I burned an evening on — it came from getting off Xorg. That’s the whole lesson, really: I spent the effort chasing a config knob, and the actual fix was abandoning a stack I should have left years ago.
