TrueNAS system keeps disconnecting from network after a while

I have a DIY NAS that runs TrueNAS-13.0-U6.1. I put this system together a while ago and noticed that after it has been running for a while (a few days, for example), the web GUI becomes unavailable and the device no longer shows up as connected on the router.

After reading a bit about this, the usual recommendation seems to be to use a different NIC rather than the (supposedly crappy) Realtek one on the motherboard, and to get an Intel one if I'm only going to use a gigabit connection. So I bought a new Intel EXPI9301CTBLK Gigabit PRO 1000CT card, which was one of the recommended ones in an article I found. The issue still persists, although it doesn't seem to be as severe as before (it was still fine after a few days, and I only checked it again after around 2 weeks of running). I don't know how to test this properly, so I'm not sure when exactly the network went down.

I tried to reboot it by connecting a keyboard, but the system didn't react to it for some reason, so in this scenario the only option I have AFAIK is to press the reset button on the case.

The recommendation is to run the NAS constantly, as that is better for the hard drives and lets all the automated checks (SMART, scrub) run regularly. I get it. But if it keeps disconnecting, I don't think having to force-restart it quite regularly is great for the system. What if I restart it while a scrub is running? Is there any way to check whether a scheduled task is running right now while the web GUI is not available? The console doesn't really show anything, especially when you can't interact with it (it doesn't react to the keyboard).

Any ideas what can cause this issue and how to solve it?

Some info about the system:
- ASRock B450 Steel Legend motherboard
- AMD Ryzen 5 1400 CPU
- 16GB (2x 8GB) unbuffered ECC RAM
- Boot drive is a 128GB M.2 NVMe drive (TS128GMTE110S)
- 14 hard drives in total: 3 vdevs of 4 drives each (RAIDZ1), plus 2 spares. 8 drives are connected to an HBA, 4 to the motherboard, and the 2 spares to a simple dual PCIe x1 SATA card. They are a mix of WD Purple and Red drives (none of them are SMR drives).
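
On the question of whether a scrub was running when the machine froze: one possible approach is a small heartbeat script run periodically on the NAS (for example from a cron task), so that after a forced reset the last log line shows roughly when the system stopped and what ZFS was doing. This is only a sketch. It assumes Python 3 is available on the NAS, relies on `zpool status` printing a `scan:` line containing "in progress" during a scrub or resilver, and the function names and log path are made up for illustration.

```python
import subprocess
from datetime import datetime


def active_scans(status_text):
    """Map pool name -> scan description for pools with a scrub/resilver in progress."""
    active = {}
    pool = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("pool:"):
            pool = line.split(":", 1)[1].strip()
        elif line.startswith("scan:") and "in progress" in line and pool:
            active[pool] = line.split(":", 1)[1].strip()
    return active


def heartbeat_line(status_text, when):
    """Build one timestamped log line summarising running scrubs/resilvers."""
    scans = active_scans(status_text)
    detail = "; ".join(f"{p}: {s}" for p, s in sorted(scans.items()))
    return f"{when.isoformat(timespec='seconds')} alive, {detail or 'no scrub or resilver running'}"


def write_heartbeat(logfile):
    """Run `zpool status` and append a heartbeat line to logfile."""
    out = subprocess.run(
        ["zpool", "status"], capture_output=True, text=True, check=True
    ).stdout
    with open(logfile, "a") as fh:
        fh.write(heartbeat_line(out, datetime.now()) + "\n")


# Example call (hypothetical path): write_heartbeat("/mnt/tank/heartbeat.log")
```

If the hang also stalls disk writes, the log will simply stop at the last successful heartbeat, which still narrows down when the problem started.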

Have you verified that the cable going from the device to the router has not gone bad?

I don't even know how to check such a thing. Generally everything works for a while, but typically after a few days it just stops working. Recently I also tried connecting a USB keyboard to the system while it was in this funny state, and it wasn't recognised, so it's probably a system-related issue rather than the cable (the exact same keyboard is recognised when I plug it in after a few minutes of running, i.e. while the system still works properly).
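
One way to find out when the NAS drops off the network is to poll it from another machine and log every state change with a timestamp. The following is a minimal sketch, not a tested monitoring tool. It assumes the web GUI listens on TCP port 80, and the IP address, file name and function names are placeholders.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def format_event(up, when):
    """Build one log line, e.g. '2024-01-11T04:08:00 DOWN'."""
    return f"{when.isoformat(timespec='seconds')} {'UP' if up else 'DOWN'}"


def watch(host, logfile, interval=60, checks=None, probe=is_reachable):
    """Poll host and append a line to logfile whenever its state changes.

    checks=None polls forever; an integer limits the number of polls.
    """
    last = None
    done = 0
    while checks is None or done < checks:
        up = probe(host)
        if up != last:
            with open(logfile, "a") as fh:
                fh.write(format_event(up, datetime.now()) + "\n")
            last = up
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)


# Example call (hypothetical address): watch("192.168.1.50", "nas_uptime.log")
```

A log that shows the NAS going DOWN while the cable and switch port stay untouched would point away from the cable and towards the machine itself.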

This still looks like a hardware issue, most probably the RAM, processor or motherboard. To check whether the hardware is at fault, you could run TrueNAS with the NICs and all disks on another platform, or install another operating system on this machine and leave it powered on for a few days. 🤔

On 1/11/2024 at 4:08 AM, gyorfitam said:

I have a DIY NAS that runs TrueNAS-13.0-U6.1. [...] Any ideas what can cause this issue and how to solve it?

When you added the new NIC, did you disable the built-in Realtek one in the BIOS?

The problem with Realtek is that it causes a panic crash that affects networking at the kernel level, if I remember correctly.

7 hours ago, m9x3mos said:

When you added the new NIC, did you disable the built-in Realtek one in the BIOS?

The problem with Realtek is that it causes a panic crash that affects networking at the kernel level, if I remember correctly.

I didn't but I can try. Planning to report back at some point if and how I managed to make it more reliable :).

4 hours ago, gyorfitam said:

I didn't but I can try. Planning to report back at some point if and how I managed to make it more reliable :).

Cool. Please keep me posted. I know I have always had issues with Realtek and even swapping kernels and drivers didn't really help. 

So the entire machine is non-responsive, not just networking? It could be an issue with the NIC, but I'd guess it's something else in the hardware if you can't even control it with a local keyboard. Have you run memtest on the RAM? Definitely do that. Also look through the logs (dmesg, /var/log/, etc.) to see if there are any clues. You could maybe play around with CPU sleep states to see whether going into the deeper sleep states causes issues, but that can be a bit hard depending on the hardware.
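
As a rough aid for going through the logs, something like the following could flag lines worth a closer look. It is a sketch only: the keyword list is an assumption about what hardware trouble might look like in kernel messages (it is not exhaustive and not taken from TrueNAS documentation), and the function name is made up.

```python
import re

# Assumed keywords, not an authoritative list: strings that may accompany
# kernel panics, machine-check errors or NIC watchdog resets.
PATTERNS = ("panic", "mca:", "machine check", "watchdog timeout")


def suspicious_lines(log_text, patterns=PATTERNS):
    """Return (line_number, line) pairs whose text matches any pattern, case-insensitively."""
    rx = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), 1)
        if rx.search(line)
    ]


# Example call (path may differ on your system):
# with open("/var/log/messages", errors="replace") as fh:
#     for number, line in suspicious_lines(fh.read()):
#         print(number, line)
```

After a hard freeze the most interesting entries are usually the last ones written before the reset, so it helps to compare the matches against the time of the outage.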

  • 3 weeks later...
On 1/20/2024 at 6:53 PM, scrumptious starfruit said:

So the entire machine is non-responsive, not just networking? It could be an issue with the NIC, but I'd guess it's something else in the hardware if you can't even control it with a local keyboard. Have you run memtest on the RAM? Definitely do that. Also look through the logs (dmesg, /var/log/, etc.) to see if there are any clues. You could maybe play around with CPU sleep states to see whether going into the deeper sleep states causes issues, but that can be a bit hard depending on the hardware.

I also brought up this topic on the TrueNAS forum, and they told me to check whether the RAM is overclocked and to look at the CPU power-saving settings. The RAM and CPU are not overclocked now, and I turned off all the CPU power-saving features. It has been running for more than 2 weeks and is still running great. So thanks for the help, it looks like this issue is solved.

Was it overclocked before and now it is not? If so, it might be worth experimenting with turning some of the power-saving features back on (slowly over time if there are several, one at a time, so that if one introduces instability you know which one). Depending on which power-saving features are disabled, the system could end up drawing quite a bit more power than it used to. Of course, it depends on your situation whether burning some extra electricity is a problem worth solving.

Anyway, glad you figured it out!

6 hours ago, scrumptious starfruit said:

Was it overclocked before and now it is not? If so, it might be worth experimenting with turning some of the power-saving features back on (slowly over time if there are several, one at a time, so that if one introduces instability you know which one). Depending on which power-saving features are disabled, the system could end up drawing quite a bit more power than it used to. Of course, it depends on your situation whether burning some extra electricity is a problem worth solving.

Anyway, glad you figured it out!

I don't think it was overclocked; I just made sure it's not. They said the system is more stable this way.
