Linux - Strange system lock up

Solved by Bingus_Bongus

I recently upgraded to kernel 6.6.13 from 6.5.0-5 (which seems odd to me, whatever, not a kernel dev), and ever since then I've been getting a strange type of internal "latency", as I would best describe it.

 

It seems to happen under the following and sometimes specific conditions:

-Uptime is nearing the 7-10 hour mark per Neofetch

-The system time is 12-2am (I work and study late)

-The system is sometimes idling at this point, during those same early-morning hours

 

These conditions seem to be the only ones I can count on, as my overall application usage and general use case have been about the same for the past two years. Generally nothing that I do on the computer has changed very much. My use cases are mostly gaming, software development, virtualization and general Chromium-based browsing.
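A minimal sketch for checking the uptime condition above without Neofetch, assuming a Linux system where `/proc/uptime` is readable. The function names and the 7-10 hour defaults are only illustrative and come from the conditions listed above, not from any tool.

```python
def parse_uptime_hours(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime into hours since boot."""
    # /proc/uptime holds two numbers: seconds since boot, then idle seconds.
    seconds_since_boot = float(proc_uptime_text.split()[0])
    return seconds_since_boot / 3600.0


def in_suspect_window(hours: float, low: float = 7.0, high: float = 10.0) -> bool:
    """True if uptime falls inside the window where the lockups tend to start."""
    return low <= hours <= high
```

To use it on a live system, pass in the text read from `/proc/uptime`, for example `parse_uptime_hours(open("/proc/uptime").read())`.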

 

My system specifications are as follows (per Neofetch, plus some extra details afterwards):

 

OS: Debian GNU/Linux trixie/sid x86_64
Kernel: 6.6.15-x64v3-xanmod1
Uptime: 12 mins (just restarted after this whole issue happened again)
Packages: 3077 (dpkg)
Shell: bash 5.2.21
Resolution: 2560x1440, 1440x900, 1920x1080, 1080x1920
DE: Plasma 5.27.10
WM: KWin
Theme: [Plasma], Breeze [GTK2/3]
Icons: candy-icons [Plasma], candy-icons [GTK2/3]
Terminal: konsole
CPU: AMD Ryzen 9 5950X (32) @ 3.400GHz
GPU: AMD ATI Radeon RX 7900 XT/7900 XTX
Memory: 7704MiB / 128724MiB

 

Motherboard: ASUS TUF Gaming X570-Plus WiFi
 

When it comes to my operating system, I chose this one because I wanted somewhat more up-to-date software/firmware to run the 7900 XTX (Mesa, OpenGL, etc). Generally everything has been running great outside of issues like this, which either get resolved or push me to move back down to Debian 12 (Stable) again.

 

Switching to the XanMod kernel was my attempt to fix this issue by using a non-standard kernel. As of posting it has clearly not worked, since the issue happened again while running it.

 

When it comes to memory, I have 128GiB of DDR4 3600 CL18 from G.Skill.

 

All hardware has passed diagnostics per my own purchased copy of PCDR Service Center 16 on a bootable USB.

 

Onto the issue itself: it seems similar to Minecraft server tick lag, where you can still move around, but interacting with things or moving objects does not work. For example, when the issue happens, small tasks never finish. Running Neofetch in a terminal freezes halfway through and never finishes printing its information. The process stays around and tries to finish, but it never does, and it even shows in the task listing as idling. Other instances of this are the following:

 

-Shutting down my system resulted in the system hanging and stopping the same systemd services/daemons multiple times. The computer never made it to full power off before I held the power button.

-YouTube videos specifically will stop and then endlessly buffer, and reloading the page doesn't help. Interestingly enough, other webpages work just fine; it's just YouTube that has issues.

-Closing my browser completely to try and fix this results in the browser then refusing to open. The process crashes each time and Brave never opens again until the system is fully hard-restarted.

-Terminal commands don't work. When running something like "sudo apt update", it never gets as far as the password prompt; it gets stuck at that step.

 

These are some of the largest, if not the largest, indicators/symptoms that my system needs to be reset to be usable again.

 

What I have tried so far in an attempt to fix the issue:

-XanMod kernel - has not worked.

-Loading into a Linux Mint bootable environment from a USB and running fsck -AR to scan all drives for filesystem errors, to which everything reported without error.

 

What I have not tried:

-BIOS/UEFI firmware update

 

These are all of the details I can remember right now as it's 3am. If questions are asked or I remember something else, I'll post it.

 

For now, thanks; any input or suggestions are appreciated.

It sounds like things work but get interrupted randomly, like a program overflow: closing programs doesn't work, running programs get interrupted too, and issuing a command works once but then gets interrupted again.

 

You can try a live Linux environment and run YouTube for a while to see if it gets interrupted the same way as on your main system.

I'm jank tinkerer if it works then it works.

Regardless of compatibility 🐧🖖

11 hours ago, C2dan88 said:

When you get the lag, try using journalctl to examine kernel or system process errors.

https://www.digitalocean.com/community/tutorials/how-to-use-journalctl-to-view-and-manipulate-systemd-logs
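As a rough illustration of that suggestion, here is a minimal sketch that builds a `journalctl` command line for kernel messages at warning level or worse from the previous boot. The helper name `build_journalctl_cmd` and its defaults are hypothetical; the options used (`--dmesg`, `--boot`, `--priority`, `--since`, `--until`, `--no-pager`) are standard journalctl options. Reading an earlier boot assumes the journal is stored persistently.

```python
from typing import List, Optional


def build_journalctl_cmd(
    boot: int = -1,
    priority: str = "warning",
    since: Optional[str] = None,
    until: Optional[str] = None,
) -> List[str]:
    """Build a journalctl argument list for kernel messages from one boot."""
    cmd = [
        "journalctl",
        "--no-pager",             # print straight to stdout, no pager
        "--dmesg",                # kernel messages only (same as -k)
        f"--boot={boot}",         # 0 = current boot, -1 = previous boot
        f"--priority={priority}", # this level and anything more severe
    ]
    # Optional time window, e.g. "2024-01-01 22:50:00" (placeholder value).
    if since:
        cmd.append(f"--since={since}")
    if until:
        cmd.append(f"--until={until}")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`, possibly with `sudo` depending on journal permissions.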

 

 

Checking the logs from last night when it was happening shows a lot of red and even more yellow, I'll paste the ones that look like the problem(s) to me:

 

firststarts.thumb.png.63b2c2ca36953055facc8fb744bd85bb.png

 

The issue seems to begin around 10:55PM - 11PM with this first big chunk of yellow warning text.

 

firstredtext.thumb.png.91c44003bf81aa41d43f56af43334d8c.png

 

Shortly after that the first red error messages appear, at this point everything on my system is still working AFAIK.

 

secondredtext.thumb.png.599732f8d5ecdafa60cb8f48822930be.png

 

Then more appear a few minutes later.

 

chunk1.thumb.png.0dd51210f14c1bf2b4b7b34fc87fb34a.png

 

It then starts reporting filesystem errors on my Timeshift backup partition, where I have Timeshift set to make a system backup every hour and to keep three snapshots.

 

Screenshot_20240206_192802.thumb.png.0aa69f0cf701007d919ecf7f3e7aeced.png

 

Screenshot_20240206_192836.thumb.png.300fd593b6c56873fdac50baa630f28a.png

 

Screenshot_20240206_192905.thumb.png.145bf28f688f9c6fe50979ce7f76f287.png

 

It just keeps going at this point, I'm still doing school work here and things are running okay, still don't notice anything.

 

Screenshot_20240206_193031.thumb.png.58a378cb6ed1f5d40001c1d526714f00.png

 

Here though, after midnight, is when it seems to really go off the deep end. I'm working out in my room at this point, as I have Bluetooth headphones and a YouTube video playing for music. Technically the system starts idling while I'm working out; however, I still come over every 5 minutes or so to skip around in the video or go to another one, so the system shouldn't be going to sleep at all.

 

Screenshot_20240206_193113.thumb.png.4c3d2b1fa3afb9be1deb7945bc2802a0.png

 

It looks to keep going for about that entire hour until 1AM. I'm still hitting the iron at this point and the music is still going, so I don't notice anything.

 

Screenshot_20240206_193205.thumb.png.1d089b13e1ecfdce406a3037c482bf99.png

 

Here is the real meat and potatoes: 2am is when I stop working out and start getting ready for bed and recovery, and I just want to relax. The big error here that I think is causing the symptoms I initially described is the first line reporting the error message:

"watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [rsync:40965]"

 

This error message continues for about 1024s (roughly 17 minutes), which lines up with the 10-15 minutes that I spent trying to get my system to fully unfreeze and/or trying to reload YouTube videos, which strangely are the only pages that have issues. All other pages, like my school one, work just fine; it seems to be interactive media like YT that breaks down and buffers endlessly.
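For anyone sifting through a long log for these lines, below is a minimal sketch that pulls the CPU number, stuck time, process name and PID out of a soft lockup message. The regular expression is written against the one message quoted above, so treating it as the general format is an assumption; `SoftLockup` and `parse_soft_lockup` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class SoftLockup(NamedTuple):
    cpu: int
    stuck_seconds: int
    process: str
    pid: int


# Matches e.g. "soft lockup - CPU#2 stuck for 26s! [rsync:40965]".
# The greedy (.+) keeps any colons in the process name and splits on the last one.
_SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")


def parse_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the parsed fields, or None if the line is not a soft lockup report."""
    match = _SOFT_LOCKUP.search(line)
    if match is None:
        return None
    cpu, seconds, process, pid = match.groups()
    return SoftLockup(int(cpu), int(seconds), process, int(pid))
```

Running this over every line of the kernel log and grouping by `process` would show whether it is always the same process (here rsync, which Timeshift uses for its snapshots) that gets stuck.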

 

02:23 is about the point at which I press the reset button on my desktop and it takes me back to GRUB to load my OS again. Checking the logs after that, there isn't any yellow or red text. Even now as I'm typing this reply everything is working, and the latest logs show no errors of any kind. This always seems to happen when the uptime is at 7-10 hours of use, or when the system has been idling and/or some unknown process tries to start and slowly starts failing.

 

This type of stuff is generally why I have two NVMe drives for my operating system, one for root and one for home. In case something comes along and I have to reinstall I can just keep my home directory and all of my app settings. I'm just worried that if it is a hardware problem, reinstalling to Debian stable (12) won't solve anything.

 

Thanks for your reply, hope this is useful.

Never mind everyone, it seems I've found the problem: all of my system memory is faulty.

 

I invested in my own form of PCDR Service Center 16. Here's a picture of $800 down the drain.

 

Thanks G.Skill, I really appreciate it. I do hope you're ready to give me a warranty though.

 

IMG_20240206_203655_181.thumb.jpg.c52d9c733eedd31602e590ce60a960ea.jpg
