File corruption on new hdd

Sza

Hi,

About half a year ago I bought a 14TB Seagate drive (ST14000NM001G) for Plex.

A few weeks ago, problems such as stuttering and playback skipping ahead by minutes started to appear when streaming movies.

 

I checked the dmesg logs and found this appearing 1-2 times a day:

 

[Sat May 28 00:01:17 2022] ata3.00: exception Emask 0x50 SAct 0x40000 SErr 0x4090800 action 0xe frozen
[Sat May 28 00:01:17 2022] ata3.00: irq_stat 0x00400040, connection status changed
[Sat May 28 00:01:17 2022] ata3: SError: { HostInt PHYRdyChg 10B8B DevExch }
[Sat May 28 00:01:17 2022] ata3.00: failed command: READ FPDMA QUEUED
[Sat May 28 00:01:17 2022] ata3.00: cmd 60/00:90:00:dc:e6/02:00:5b:03:00/40 tag 18 ncq dma 262144 in
                                    res 40/00:90:00:dc:e6/00:00:5b:03:00/40 Emask 0x50 (ATA bus error)
[Sat May 28 00:01:17 2022] ata3.00: status: { DRDY }
[Sat May 28 00:01:17 2022] ata3: hard resetting link
[Sat May 28 00:01:22 2022] ata3: link is slow to respond, please be patient (ready=0)
[Sat May 28 00:01:27 2022] ata3: COMRESET failed (errno=-16)
[Sat May 28 00:01:27 2022] ata3: hard resetting link
[Sat May 28 00:01:32 2022] ata3: link is slow to respond, please be patient (ready=0)
[Sat May 28 00:01:33 2022] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[Sat May 28 00:01:33 2022] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.PRT2._GTF.DSSP], AE_NOT_FOUND (20200925/psargs-330)
[Sat May 28 00:01:33 2022] ACPI Error: Aborting method \_SB.PCI0.SAT0.PRT2._GTF due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[Sat May 28 00:01:33 2022] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.PRT2._GTF.DSSP], AE_NOT_FOUND (20200925/psargs-330)
[Sat May 28 00:01:33 2022] ACPI Error: Aborting method \_SB.PCI0.SAT0.PRT2._GTF due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
[Sat May 28 00:01:33 2022] ata3.00: configured for UDMA/133
[Sat May 28 00:01:33 2022] ata3: EH complete

 

After this, I installed a brand new SATA cable and used a different port on the motherboard. This did not fix the problem.

 

I checked for bad sectors but haven't found any:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   076   064   044    Pre-fail  Always       -       41407842
  3 Spin_Up_Time            0x0003   096   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       215
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail  Always       -       121130438
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4536
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       210
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       8590065666
190 Airflow_Temperature_Cel 0x0022   065   049   040    Old_age   Always       -       35 (Min/Max 34/36)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       204
193 Load_Cycle_Count        0x0032   087   087   000    Old_age   Always       -       26948
194 Temperature_Celsius     0x0022   035   040   000    Old_age   Always       -       35 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       7
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2779h+23m+29.028s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       41886643024
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       132461587666
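
A note on reading the large raw values in this table: Seagate drives are widely reported to pack several counters into one 48-bit raw field, so the decimal number smartctl prints can look alarming when it is not. A minimal sketch, assuming that commonly cited layout (three 16-bit counters for Command_Timeout, and an error count in the upper 16 bits with an operation count in the lower 32 bits for Raw_Read_Error_Rate and Seek_Error_Rate). The layout is an assumption and the helper names are made up for illustration:

```python
def split_raw16(raw: int) -> tuple[int, int, int]:
    """Split a 48-bit raw value into three 16-bit words, high word first."""
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF

def split_error_rate(raw: int) -> tuple[int, int]:
    """Split a 48-bit raw value into (errors, operations).

    Assumed layout: upper 16 bits = error count, lower 32 bits = operation count.
    """
    return (raw >> 32) & 0xFFFF, raw & 0xFFFFFFFF

print(split_raw16(8590065666))      # Command_Timeout raw value from the table
print(split_error_rate(41407842))   # Raw_Read_Error_Rate raw value
print(split_error_rate(121130438))  # Seek_Error_Rate raw value
```

Under that assumption, Command_Timeout is three small counters of 2 each rather than 8.5 billion timeouts, and both error-rate attributes show zero errors. smartctl can print the same three words directly with `-v 188,raw16`. The 7 in UDMA_CRC_Error_Count is a plain count of CRC errors on the SATA link.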

 

When I ran fsck, it reported some files with multiply-claimed blocks. When I answered yes to apply the fix, it froze for a day, so I restarted the system.

 

The drive has the newest available firmware.

 

Can anyone help with what I can do to fix it? The streaming errors looked like the drive has faulty sectors, but after testing no SMART error is reported (so I can't RMA it this way).

 

Thanks


That Command_Timeout value looks very bad for a half-year-old drive. Definitely not a good sign. Try replacing the SATA cable just to rule that out; if that doesn't fix the issue, contact Seagate. I have a 10TB IronWolf with four and a half years of power-on time that has a bit less than half of that amount, and it still got kicked out by ZFS.


3 hours ago, jagdtigger said:

That Command_Timeout value looks very bad for a half-year-old drive. Definitely not a good sign. Try replacing the SATA cable just to rule that out; if that doesn't fix the issue, contact Seagate. I have a 10TB IronWolf with four and a half years of power-on time that has a bit less than half of that amount, and it still got kicked out by ZFS.

I tried replacing the power and SATA data cables and used a different port on the motherboard, but it did not help.

 

Also, it was working fine for months. Can a cable fail just by itself?

The other thing I don't understand is why the kernel falls back to SATA 2 and sometimes even SATA 1 (1.5 Gbps). I have to restart the system to get back to full-speed SATA 3.
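
One way to see when the link renegotiates to a lower speed is to pull the `SATA link up` lines out of the kernel log. A minimal sketch, assuming log lines in the format shown earlier in this thread; `link_speeds` is a hypothetical helper, not an existing tool:

```python
import re

# Matches lines such as:
# "[Sat May 28 00:01:33 2022] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)"
LINK_UP = re.compile(r"(ata\d+): SATA link up (\d+(?:\.\d+)?) Gbps")

def link_speeds(lines):
    """Yield (port, speed_in_gbps) for every 'SATA link up' line."""
    for line in lines:
        match = LINK_UP.search(line)
        if match:
            yield match.group(1), float(match.group(2))

# Two lines copied from the dmesg excerpt in the first post
sample = [
    "[Sat May 28 00:01:27 2022] ata3: hard resetting link",
    "[Sat May 28 00:01:33 2022] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)",
]
print(list(link_speeds(sample)))
```

On a live system the input could be the saved output of `dmesg`. In the excerpt from the first post the port comes back at 3.0 Gbps after the reset, which is already one step below the 6.0 Gbps a SATA 3 link normally negotiates. The kernel's error handler typically lowers the link speed limit after repeated link errors, which would explain why full speed only returns after a restart.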

 

 


26 minutes ago, Sza said:

The other thing I don't understand is why the kernel falls back to SATA 2 and sometimes even SATA 1 (1.5 Gbps). I have to restart the system to get back to full-speed SATA 3.

Well, if no other drive does this and the SATA cable isn't a garbage-tier one, the only explanation is that the drive has some serious issues. If you can test it in a different system, with a SATA cable you haven't used with it before, that should show pretty accurately whether it's the HDD or something in the system it was in. If the HDD produces the same symptoms there too, we can safely conclude that the HDD is faulty and you have to contact Seagate.

 

26 minutes ago, Sza said:

Also, it was working fine for months. Can a cable fail just by itself?

In a few months I'd say no, but you can't know for sure until you replace it 😉. These cables are pretty inexpensive, and it doesn't hurt to have one or two lying around.


On 5/28/2022 at 7:57 PM, jagdtigger said:

Well, if no other drive does this and the SATA cable isn't a garbage-tier one, the only explanation is that the drive has some serious issues. If you can test it in a different system, with a SATA cable you haven't used with it before, that should show pretty accurately whether it's the HDD or something in the system it was in. If the HDD produces the same symptoms there too, we can safely conclude that the HDD is faulty and you have to contact Seagate.

 

In a few months I'd say no, but you can't know for sure until you replace it 😉. These cables are pretty inexpensive, and it doesn't hurt to have one or two lying around.

Hi,

I contacted Seagate support and they told me to put the HDD into a Windows machine and run their SeaTools diagnostics toolkit. The short tests all completed with a pass. The long tests are taking forever to complete, and I can't have the machine running for days just for HDD tests. I tried using the drive by installing some games on it and playing them, and everything looks fine. I don't know what to do now; maybe I'll install it back into the NAS and start over with a fresh filesystem.


2 hours ago, Sza said:

The long tests are taking forever to complete, and I can't have the machine running for days just for HDD tests.

Which is more important, your data or the few € you save on electricity by not doing the test?

