
Persistent errors in ZFS but no errors in SMART tests

Hello,

Most often after a reboot (but not always), one drive, always the same one, shows errors and goes into the FAULTED state in ZFS.

But the extended SMART test shows that the drive is OK.

Hardware-wise, the server consists of:

  • DeskMini X300
  • Ryzen 5 5600G
  • 64GB DDR4 3200
  • ZFS pool:
    • 2x Samsung PM883 2TB (SATA)
    • 1x Samsung 980 Pro 2TB (NVMe)

Software:

  • Debian 12

Does this mean the drive is dying?

zpool status:

  pool: Data_VM
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: resilvered 108M in 00:00:01 with 0 errors on Thu Oct 12 12:25:00 2023
config:

	NAME                           STATE     READ WRITE CKSUM
	Data_VM                        DEGRADED     0     0     0
	  raidz1-0                     DEGRADED     0     0     0
	    wwn-0x5002538e1060b7fc     ONLINE       0     0     0
	    wwn-0x5002538e1060b81c     FAULTED      1    17     0  too many errors
	    nvme-eui.002538b931b1cd1b  ONLINE       0     0     0

errors: No known data errors

But SMART shows no issues (short and extended tests completed):

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LH1T9HMLT-00003
Serial Number:    S5MNNA0N611269
LU WWN Device Id: 5 002538 e1060b81c
Firmware Version: HXT76F3Q
User Capacity:    1,920,383,410,176 bytes [1.92 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Oct 12 12:31:30 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 100) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       22278
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       87
177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       18
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       5815
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   057   045   000    Old_age   Always       -       43
194 Temperature_Celsius     0x0022   057   045   000    Old_age   Always       -       43 (Min/Max 22/55)
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       33
202 Exception_Mode_Status   0x0033   100   100   010    Pre-fail  Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       55
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       25457707401
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       208681355372
243 SATA_Downshift_Ct       0x0032   099   099   000    Old_age   Always       -       5
244 Thermal_Throttle_St     0x0032   100   100   000    Old_age   Always       -       0
245 Timed_Workld_Media_Wear 0x0032   100   100   000    Old_age   Always       -       65535
246 Timed_Workld_RdWr_Ratio 0x0032   100   100   000    Old_age   Always       -       65535
247 Timed_Workld_Timer      0x0032   100   100   000    Old_age   Always       -       65535
251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       25694887040

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22270         -
# 2  Short offline       Completed without error       00%     22237         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
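
As a quick sanity check that this SMART report belongs to the faulted pool member, the "LU WWN Device Id" printed by smartctl can be compared with the wwn-0x... name shown by zpool status. A minimal Python sketch, assuming the usual Linux /dev/disk/by-id naming; the helper name wwn_to_by_id is made up for illustration:

```python
def wwn_to_by_id(lu_wwn):
    """Turn smartctl's spaced 'LU WWN Device Id' into a wwn-0x... style name."""
    # smartctl prints the WWN in three space-separated groups; by-id names
    # use the same hex digits joined together after a 'wwn-0x' prefix.
    return "wwn-0x" + "".join(lu_wwn.split()).lower()


faulted_member = "wwn-0x5002538e1060b81c"   # from the zpool status output
name = wwn_to_by_id("5 002538 e1060b81c")   # from the smartctl output
print(name == faulted_member)
```

If the two names match, the SMART data and the FAULTED entry describe the same physical drive.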

 

PC Specs - AMD Ryzen 7 3700X - Asrock AB350 ITX - 64GB DDR4-3600MHz - Geforce GTX 1080 - Samsung 960Pro - Monsterlabo's "The First" - Corsair SF450


1 hour ago, Nord1ing said:
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       33
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       55

Track these two values and see if they increase. It could be that the SSD isn't getting a proper, clean power-off for some reason, so data in the DRAM cache isn't being written to NAND, which causes corruption and gets the device marked as bad.
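
One way to track them is to save the attribute table (for example the text printed by `smartctl -A` for the drive) at two points in time and compare the raw values. A rough Python sketch; the function names are hypothetical, and it assumes the raw value is the tenth whitespace-separated column, as in the table posted above:

```python
# Attribute IDs singled out above: 199 (CRC_Error_Count), 235 (POR_Recovery_Count).
WATCHED_IDS = {199, 235}


def parse_raw_values(smart_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for the watched IDs."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in watched:
            values[fields[1]] = int(fields[9])
    return values


def increased(before, after):
    """Return {name: delta} for attributes whose raw value grew."""
    return {name: after[name] - before[name]
            for name in before
            if name in after and after[name] > before[name]}


sample = """\
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       33
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       55
"""
baseline = parse_raw_values(sample)
print(baseline)
```

Calling `increased(baseline, later_snapshot)` after a few reboots would list only the counters that went up.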

 

1 hour ago, Nord1ing said:
12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       87
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       55

Also, the ratio of power-on recoveries to clean power-ons is quite bad. Compare these against your other PM883.
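
To put a number on that ratio, here is a small Python sketch using the two raw values quoted above. It assumes (this is not confirmed by Samsung documentation here) that every power-on recovery is also counted in Power_Cycle_Count; the function name is made up for illustration:

```python
def unclean_share(power_cycles, por_recoveries):
    """Fraction of power cycles that needed a power-on recovery."""
    if power_cycles <= 0:
        raise ValueError("power_cycles must be positive")
    return por_recoveries / power_cycles


# Raw values from the quoted attributes: 12 Power_Cycle_Count = 87,
# 235 POR_Recovery_Count = 55.
share = unclean_share(87, 55)
print(f"{share:.0%} of power cycles were followed by a recovery")
```

Running the same calculation on the other PM883 gives a like-for-like comparison between the two drives.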

 

The other thing to try is replacing the SATA cable; it's rare, but a bad cable can cause this.


36 minutes ago, leadeater said:

The other thing to try is replacing the SATA cable

CRC Error count > 0 makes me think this is the issue. I've seen it a few times, actually.


This “smells” like a SATA cable/backplane issue to me.

Looking to buy GTX690, other multi-GPU cards, or single-slot graphics cards: 

 


Well, after testing, it looks like the issue is with the DeskMini itself, or the memory?

After replacing the SSD and the SATA cable and swapping the SSDs' positions, both the second Samsung PM883 and a spare MX500 show errors (read and write, from time to time) when connected to the same SATA port on the DeskMini motherboard 😕

The first one does not show any errors when connected to the pool via USB.


 

Full memtest in progress.


Unfortunately, Memtest86 RAM testing showed no errors.

As a last test step, I put the RAM and drives into my older DeskMini A300 with a Ryzen 3400G, and so far there are 0 errors in the pool (400GB file read/write test).

😕

So could it be a board issue with the X300?


Sounds like it. Could be a bad solder joint on the SATA port, or damage to the PCB.


After some additional testing:

  • The SSD is 100% OK: there are no errors when it is connected to the DeskMini via USB and attached to the pool
  • With the 3400G on this X300: no issues
  • With a second, known-working 5600G from another PC on this X300: same R/W issues
    • Both 5600Gs are from different batches

After an exchange with ASRock tech support, they "forwarded feedback to our BIOS department if they heard about similar issue and/or have some more ideas to check with"

They advise testing under Windows. Is it possible to mount a ZFS pool in Windows 10? 🤔

 

P.S.: @leadeater is it possible to move topic to troubleshooting section (https://linustechtips.com/forum/46-troubleshooting/)?

 
