Each time I try AMD graphics, something is fucked for me. Back in the fglrx days the driver just sucked, so I used Nvidia. Then I had an AMD card right around when they finally had open-source drivers, but it was still buggy as hell. So I went with Nvidia again (first a GTX 790, then a GTX 1060). In the meantime I got a new work notebook where I also went with an AMD APU, and for a long time I had driver crashes when I was in video calls and it had to decode multiple streams. That thankfully stabilized with Linux 6.4.

Since sooo many people in the community swear by AMD, I thought “dammit, let’s try it again for my new desktop” and got a 7800rx … and I have to reboot ~5 times until I finally make it to a running X server or Wayland session. Apparently I am hit by this problem (at least I hope so). But that doesn’t even read nice … the fix seems to be to revert another fix for power management. So I either have a mostly non-booting card or suboptimal power management.

I start to regret having chosen AMD … again :-/ I seem to be cursed.

  • topperharlie
    5 months ago

    oh man, reading the comments fills me with fear, as I just ordered a new computer after stretching my old laptop for 8 years or so. I was super close to getting an AMD but went with Nvidia in the end… but so much bad juju in the comments for Nvidia too…

    • @aksdb@lemmy.worldOP
      5 months ago

      Hmm, interesting idea. I need to investigate that. The dmesg output is full of amdgpu irq errors, but of course that could also happen with an issue on the board.

      I would rule out a generic hardware issue, since 1) I get graphics during boot up until it needs to do a modeswitch (I guess) and 2) it works fine so far on Windows.

      I did have a similar issue after the first boot on Windows as well and assumed so far that the modeswitch after the initial driver install caused the problem. But Windows likely also installed chipset drivers at that time, so PCIe could be a possibility. Then again… I know that Windows reloads graphics drivers on-the-fly… but chipset drivers? Probably not. Which would speak against that theory.

      • @acockworkorange@mander.xyz
        5 months ago

        I have no clue how Linux initiates the communication with a PCIe board, and whether the amdgpu driver would take care of that. But hardware excluded, some misconfiguration on the driver’s part could be present. Good luck!

  • @heartsofwar@lemmy.world
    5 months ago

    You’re riding the edge too close. Fedora 39 hasn’t even moved to a 6.7 kernel yet – they’re on 6.6.14-200.

    If you’re running a newer kernel than the latest released Fedora, you better be a Linux guru or you’re gonna pay with pain, and that’s coming from someone with 23+ years of experience running / working on Linux, and I have an AMD RX 7900 XTX.

    • @aksdb@lemmy.worldOP
      5 months ago

      I did live like this with all my intel/nvidia systems just fine, though. If AMD tends to have bugs like this, they still seem to suffer from the same shitty software development attitude as they did back in the fglrx days… with the added advantage that people from the community can now firefight some of the problems. For a product I paid a few hundred euros for I expect some quality assurance for its driver development - that seems to work with nvidia.

      • @heartsofwar@lemmy.world
        5 months ago (edited)

        I’ve done some work with AMD and Nvidia that I shan’t disclose more of, but to be totally honest / transparent, my experiences with the internal workings of both were kind of eye opening in a not so good kind of way; however, that isn’t to say I distrust them or their work, because I could say the same about several prominent tech companies that most individuals would ordinarily think the best of. At the end of the day, I don’t think my experiences are 100% representative of an entire company, but after being in the industry for 23+ years… you kind of learn to stay away from that BCBS: if you know you know.

    • @aksdb@lemmy.worldOP
      5 months ago

      Not really. I’ve used Linux as my main driver since about 2006. My Intel laptops and the mentioned Nvidia GPUs work fine.

  • @the_q@lemmy.world
    5 months ago

    “I keep hitting myself in the foot with a hammer! Why does the hammer keep doing this to me!”

    • @aksdb@lemmy.worldOP
      5 months ago

      If other people apparently have no problems, it can’t be AMD in general. Never trying AMD because of a bad experience 10 years ago would also be extremely unwise. “We’ve always done it like this around here” is how you fall behind the times.

  • Blaster M
    5 months ago

    RX 6700 XT here… once I refreshed the thermal pads and the thermal paste, it works great in Windows and Linuxes… Ubuntu, Mint, Fedora, Bazzite (Immutable Fedora but for gaming), it had no issues with the amdgpu driver builtin on any of them.

    • @aksdb@lemmy.worldOP
      5 months ago

      It’s a completely new card, so I will not fiddle around with it. Also it runs almost flawlessly on Windows (aside from a similar crash on the very first boot during driver install).

  • @c10l@lemmy.world
    5 months ago

    Run sudo dmesg | grep amdgpu and look for errors.

    You may have a firmware file missing, for instance. If that’s the case, it’s an easy fix - just download the firmware files from the kernel tree and put them wherever your system wants them.

    This is how I do it on Debian but it should be easy enough to adapt to whatever distribution you’re using (it might be exactly the same tbh): https://blog.c10l.cc/09122023-debian-gaming#firmware
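
For the missing-firmware case specifically, the kernel's generic firmware loader typically logs a line of the form "Direct firmware load for <file> failed with error <n>". Here is a minimal Python sketch that pulls the file names out of dmesg text. The helper name `missing_firmware` and the sample input (including the file name `amdgpu/example_fw.bin`) are invented for illustration, and a driver may report a missing file in other ways too, so an empty result is not proof that all firmware is present.

```python
import re

# Message format used by the kernel's generic firmware loader.
FW_FAIL = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def missing_firmware(dmesg_text):
    """Return the sorted, de-duplicated firmware files the loader failed to find."""
    return sorted({match.group(1) for match in FW_FAIL.finditer(dmesg_text)})


# Illustrative input only; the firmware file name is invented.
sample = (
    "[    3.10] amdgpu 0000:0b:00.0: Direct firmware load for "
    "amdgpu/example_fw.bin failed with error -2\n"
    "[    3.20] amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!\n"
)
print(missing_firmware(sample))
```

On a real system you would pass in the text printed by `sudo dmesg` instead of `sample`.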

    • @aksdb@lemmy.worldOP
      5 months ago

      Thanks for the idea!

      dmesg shows the same errors as in the referenced bug ticket. So I don’t think missing firmware is the issue. I would not be surprised however, if the problem in general is a combination of amdgpu and firmware behavior. (IMO the hardware should not crash as hard as it does, so the firmware seems to be a bit wonky too)

  • @AMDIsOurLord@lemmy.ml
    5 months ago

    It could be your monitor or even the monitor cable. I have this monitor which absolutely fucking refuses to work with AMD over HDMI. If you have inexplicable system sleep issues, black screen issues, startup issues, etc., it could be the monitor at fault.

    • @aksdb@lemmy.worldOP
      5 months ago

      Thanks for the suggestion!

      While it’s a possibility, I think it’s unlikely, since the machine works fine with Windows. I also compiled the tkg 6.7.2 kernel which includes the revert-patch for the offending change and so far the machine booted three times without issues, so it seems to fit.

      • @AMDIsOurLord@lemmy.ml
        5 months ago

        That doesn’t rule out the possibility of display issues tho, back when I had the faulty monitor it was much more severe under Linux, I never managed to track it down tho (using AMD hardware for over 10 years now, this one issue busted my nuts pretty hard)

        If you have a TV or something, at least try it to rule out possible outside factors

        • @aksdb@lemmy.worldOP
          5 months ago

          It can’t hurt. I’ll grab another display and another cable and try a few combinations. Thanks!

  • Captain Janeway
    5 months ago

    I’ve had similar issues. I don’t understand the love for AMD. My whole rig is AMD, but it’s constantly having GPU crashes. All games run at high FPS and my CPU temps seem nominal. But the games will crash. Everything from RimWorld to Baldurs Gate 3. They all run pinned at 60fps but randomly crash. I’ve tried a thousand different configurations and drivers. I’ve tried Ubuntu and Linux Mint. I’m now just accepting that I can’t rely on it as a gaming rig. I like that AMD is trying to be progressive with open source drivers but the quality doesn’t seem to be there. My next rig might be Nvidia and Intel. But we will see.

    • @bazsy@lemmy.world
      5 months ago

      Did you check the system logs to see what caused it?

      Many things can result in seemingly random crashes. Any overclock (including XMP and EXPO), an undervolt, or even a BIOS version can be problematic.

      I would check first if it’s stable on Windows.

      • Captain Janeway
        5 months ago

        It’s not stable on Windows either. But I haven’t looked at logs because I didn’t really know what - or how - to check.

        • @bazsy@lemmy.world
          5 months ago

          Most distros use systemd and its logging solution, journald. You can use journalctl to read the logs from around the time of the crash, e.g.:

          • journalctl -S -5m shows the last 5 minutes. Use this when a game crashes but the system keeps running and does not reboot.
          • journalctl -b -1 -S -10m shows entries from the previous boot that were logged within the last 10 minutes. Use this right after a crash that froze the whole system and forced a reboot.

          Look for red lines (errors) and what wrote them. AMD GPU faults usually have the ‘amdgpu’ mentioned, memory errors could appear as ‘protection fault’.
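
The manual scan described above can also be sketched in a few lines of Python. This is only a rough illustration: the function name `triage`, the labels, and the sample lines are assumptions made up for this example; only the two search strings ('amdgpu' and 'protection fault') come from the advice above.

```python
# Hints from the advice above: which substring suggests which kind of fault.
HINTS = {
    "amdgpu": "GPU fault",
    "protection fault": "possible memory error",
}


def triage(journal_text):
    """Return (label, line) pairs for journal lines that contain a known hint."""
    findings = []
    for line in journal_text.splitlines():
        lowered = line.lower()
        for needle, label in HINTS.items():
            if needle in lowered:
                findings.append((label, line))
    return findings


# Illustrative input; the second and third lines are invented.
sample = "\n".join([
    "Feb 04 16:47:40 computer kernel: amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!",
    "Feb 04 16:48:02 computer kernel: general protection fault in example",
    "Feb 04 16:48:05 computer systemd[1]: Started an unrelated service.",
])
for label, line in triage(sample):
    print(f"{label}: {line}")
```

On a real system, `sample` would be replaced by the text that journalctl prints (adding `--no-pager` keeps it from opening a pager).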

          • Captain Janeway
            5 months ago

            journalctl -S -5m

            Looks like these are the errors I’m seeing. I know it’s not helpful to just drop this in the chat, but I’m doing it for posterity (and to let you know your comment did in fact help me)!

            Feb 04 16:47:40 computer kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
            Feb 04 16:47:40 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=17063130, emitted seq=17063132
            Feb 04 16:47:40 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 161654 thread redDispatcher9 pid 161668
            Feb 04 16:47:40 computer kernel: amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
            Feb 04 16:47:40 computer kernel: amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
            Feb 04 16:47:40 computer kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
            Feb 04 16:47:40 computer kernel: amdgpu 0000:0b:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
            Feb 04 16:47:40 computer kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
            Feb 04 16:47:40 computer kernel: [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
            
            • @bazsy@lemmy.world
              5 months ago (edited)

              Happy to help! Though you are right, this is a rather generic error that doesn’t help much; it just confirms that the GPU is the issue.

              At this point it could be a driver issue since there are similar open bug reports. A hardware problem is still possible since you previously said that it’s unstable on Windows too, and power related issues can also lead to this error message.

              • Captain Janeway
                5 months ago (edited)

                EDIT: Tentative solution: CoreCtrl

                CoreCtrl allowed me to underclock my Radeon 5600XT GPU (currently set to 800MHz for the GPU and 500MHz for the memory). I say “tentative” because this problem has been persistent for years, but I’ve been running Cyberpunk for 1 hour at 60FPS on High settings (and mostly 60FPS on Ultra, but I had some FPS drops). Even if this solution isn’t 100% perfect, I think some combination of changing the GPU values is probably going to make my rig much more functional.

                I found CoreCtrl based on a Reddit thread last night but didn’t have time to test it until this evening after work. Seems to have made a world of a difference.


                Yeah I’ve tried just about every feasible kernel parameter for the amdgpu module, updated my kernel to 6.2 on Linux Mint, and tried several different BIOS settings. My system runs everything reasonably. Even Cyberpunk 2077 is generally at 60FPS. But after about 5 minutes of gaming on Cyberpunk 2077, it crashes. Other games last longer, which is why I use Cyberpunk 2077 to stress test my system.

                These are my system specs:

                • PSU: 850 Watt 80 PLUS Gold Fully Modular ATX
                • CPU: AMD Ryzen 7 2700 Eight-Core Processor × 8
                • GPU: Radeon 5600XT
                • RAM: G.Skill DDR4-3600 CL16-19-19-39 1.35V (2x16GB = 32GB total system memory)
                • SSD: Samsung (MZ-V7E500BW) 970 EVO SSD 500GB - M.2 NVMe
                • MOBO: Asus x470 Pro
                • Other: TP-Link AC1200 PCIe WiFi Card for PC (Archer T5E) - Bluetooth 4.2, Dual Band Wireless Network Card installed in PCIEx1_3 which seems like it could be a variable I should remove, but I’ve tried removing it and didn’t see any changes in behavior. I’ve tried various PCIEx1_* slots with similar results.

                I don’t really see where I might be going wrong here. I bought this all ~4 years ago and I’ve always had these intermittent crashes. It’s admittedly worse on Linux, but it still occurred on Windows.

                Anyways, I spent about 5 hours last night reading bug forums, testing various amdgpu mod parameters, settings in my BIOS, and even re-configuring my fans to provide (potentially) more optimal cooling. None of this really made a difference. I run two 1080p monitors (not exactly breaking the bank here). I had a lot of hope regarding one forum about ring gfx_1.0.0 errors related to how AMD reads the GPU in Linux. My graphics card is detected as: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] and apparently some machines used to accidentally use the total allocated memory for 5700XT instead of the 5600XT. This resulted in some form of corrupt memory allocation. That sort of behavior would make sense for my system since it runs well, but just fails suddenly.

                Other errors I’ve seen are:

                Feb 04 20:17:01 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=116669, emitted seq=116671
                Feb 04 20:17:01 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 3668 thread redDispatcher12 pid 3684
                ...
                Feb 04 20:26:16 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=34068, emitted seq=34071
                Feb 04 20:26:16 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4208 thread redDispatcher13 pid 4232
                Feb 04 20:26:17 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:77:crtc-0] hw_done or flip_done timed out
                ...
                Feb 04 21:00:43 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.3.0 timeout, signaled seq=3085, emitted seq=3086
                Feb 04 21:00:43 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 3771 thread redDispatcher8 pid 3783
                ...
                Feb 04 22:28:50 computer kernel: [drm:amdgpu_device_ip_early_init [amdgpu]] *ERROR* early_init of IP block  failed -19
                Feb 04 22:28:50 computer kernel: [drm:amdgpu_device_ip_early_init [amdgpu]] *ERROR* early_init of IP block  failed -19
                Feb 04 22:36:57 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=171774, emitted seq=171776
                Feb 04 22:36:57 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4122 thread redDispatcher5 pid 4131
                ...
                Feb 04 22:45:46 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:77:crtc-0] hw_done or flip_done timed out
                Feb 04 22:45:56 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:80:crtc-1] hw_done or flip_done timed out
                Feb 04 22:46:19 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=123, emitted seq=124
                Feb 04 22:46:19 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4187 thread redDispatcher8 pid 4202
                ...
                Feb 04 23:49:45 computer kernel: [drm:gfx_v10_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
                Feb 04 23:49:45 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=435155, emitted seq=435157
                Feb 04 23:49:45 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 3668 thread redDispatcher12 pid 3690
                ...
                Feb 04 23:58:58 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=66268, emitted seq=66270
                Feb 04 23:58:58 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GameThread pid 4180 thread redDispatcher11 pid 4196
                Feb 04 23:58:58 computer kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:77:crtc-0] hw_done or flip_done timed out
                

                ^ These are all errors which occurred from various tests of amdgpu module settings and/or BIOS settings. The common thread is some form of ring XXXX timeout.
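
To see which rings time out most often across a longer log, the lines can be tallied. A rough Python sketch, assuming the message format shown in the excerpts above; `count_ring_timeouts` is a made-up helper name, not part of any tool:

```python
import re
from collections import Counter

# Matches e.g. "*ERROR* ring gfx_0.0.0 timeout" and captures the ring name.
RING_TIMEOUT = re.compile(r"\*ERROR\* ring (\S+) timeout")


def count_ring_timeouts(log_text):
    """Count amdgpu ring timeout messages per ring name."""
    return Counter(RING_TIMEOUT.findall(log_text))


# Three lines copied from the log excerpts above.
sample = "\n".join([
    "Feb 04 20:17:01 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=116669, emitted seq=116671",
    "Feb 04 21:00:43 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.3.0 timeout, signaled seq=3085, emitted seq=3086",
    "Feb 04 23:58:58 computer kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=66268, emitted seq=66270",
])
for ring, hits in count_ring_timeouts(sample).most_common():
    print(ring, hits)
```

Lines such as "ring kiq_2.1.0 test failed" are deliberately not counted, since they are a consequence of the reset rather than a timeout of their own.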

                These two threads seemed like my best chance, but their proposed solutions didn’t help:

                1. https://bugzilla.kernel.org/show_bug.cgi?id=201957
                2. https://bugzilla.kernel.org/show_bug.cgi?id=202665#c7
      • @iopq@lemmy.world
        5 months ago

        My issue was the GPU fan and the PSU fan would blow into each other. I opened the PSU and reversed the fan

        • @cevn@lemmy.world
          5 months ago

          Hah, I would not expect that to kill it. Maybe a small build. The other day I was switching the cards and realized my CPU fan and case fan were both disconnected, idk how the hell it was running without overheating… except I always have the side of the case off because the 3080 will shut me down otherwise.

          • @iopq@lemmy.world
            5 months ago

            Yeah, but having the fans off just means the heat is passively dissipated. Having another fan blow the hot air back in is worse since it just stays there

      • Captain Janeway
        5 months ago

        What does that mean? Genuinely don’t know what it means that it runs Wayland.

  • @CarlosCheddar@lemmy.world
    5 months ago

    On EndeavourOS I haven’t had issues with a Vega 64 and now with a 6800XT. I followed the AMD GPU guides from the Arch wiki to get everything up and running, but that was back when I started the build with the Vega 64. After the upgrade I didn’t even need to touch anything and all non-anti-cheat games work quite well. Maybe I got lucky though.

  • @StefanT@lemmy.world
    5 months ago

    I have been using an AMD 7900rx with an AMD 7950x processor for almost a year with Gnome / Wayland on Arch. No problems up to now. Yes, I am a gamer too.

    As others said it depends on the distribution you use.

  • @cevn@lemmy.world
    5 months ago

    I just got a 7600XT. My only complaint is that it isn’t pushing quite enough frames, so I would need something beefier, but then I would also lose G-Sync because of my monitors, so I will probably simply return it and go back to the 3080. The lower TDP and thermals were quite nice though, and Wayland was much less buggy. No crashes, I’m on Ubuntu tho.

          • @cevn@lemmy.world
            5 months ago

            Blah, I kinda tried, but no dice yet, only managed to stop my suspend from working. I have modprobe/nvidia.conf and with the tmpfile option, updated initramfs, added the services… but only my monitors turn off. I can probably live without it for now though.

  • @Samueru@lemmy.world
    5 months ago

    I have a similar story with an RX 580. I replaced my GTX 1060 3GB with an 8GB RX 580, mostly because the 3GB of VRAM were an issue for BeamNG.

    Now I can’t record my 3 displays with the RX 580, it just fails when trying to do so, and recording 2 displays results in constant encoder overloads, something the 1060 had no issues with at all. My colors are also off when recording and I have no idea why; it even happens when recording with the CPU:

    https://bbs.archlinux.org/viewtopic.php?id=292196

    Also kernel 6.6 broke the power reporting on all Polaris GPUs; thankfully that was fixed recently in kernel 6.7.2, but holy shit it took like 6 months to fix that.

    • @aksdb@lemmy.worldOP
      5 months ago

      I probably shouldn’t have read tests and forums, but simply searched for crashes and open bugs to get a feeling for what I am getting into. Then again I also read from people with very ugly problems with nvidia, so it’s not a really good measure.

      I really want AMD to be good; they offer more VRAM where nvidia always seems to cheap out in pretty suspicious ways. Then again nvidia seems to be more power efficient.

      • @Samueru@lemmy.world
        5 months ago

        My time with Nvidia on Linux had 0 issues in performance or usability.

        The only sort of issue that I had was that the GTX 1060 drew 20W at idle when using the 3 displays; this was a bug that Nvidia fixed for the RTX 20 series and newer cards but never fixed for Pascal lol.

        But even on BeamNG, there was a period where the native Linux version didn’t work on Mesa while it worked on Nvidia. To be fair to AMD, this was because the Vulkan implementation of BeamNG is horrible, and right now it does not work on either lol.

  • @Fredol@lemmy.world
    5 months ago

    The most bug-free GPU experience I have had with Linux is an Nvidia GPU + KDE on X11 with the compositor disabled. Pure bliss. I’ve had a 6700XT and it was terrible too, now I have a 4070. For my laptops, the Intel iGPU works decently well with KDE on Wayland, but there are a few bugs, like having to clear the GPU cache of some apps (vscode) quite often.

    • @aksdb@lemmy.worldOP
      5 months ago

      At least with my 1060 compositing wasn’t an issue. But true, I rarely used Wayland. Do you have specific issues when compositing is enabled or do you just prefer the simpler rendering?

      • @Fredol@lemmy.world
        5 months ago

        I prefer it without for the aesthetics but also for functionality: compositing on X11 with multiple monitors of different refresh rates is still broken; everything becomes locked at 60 Hz instead of the max for each monitor.