
Having trouble with mdadm RAID 0

Hi everyone! I have a home server running Ubuntu Server 20.04.1 LTS, mainly as a Samba share plus some other stuff. For a large part of the Samba share I've been running a RAID 0 made of 3 old HDDs, coming to 1.8TB of storage. Recently I changed one of the drives: I removed the array, formatted the drives, etc. and built a new one. I use this array to store backups made with Aomei Backupper.

The problem is that Backupper now gives me very frequent errors about not being able to write the file, and indeed, when I try to write anything to that share, Windows gives me a read-only error and I have to reset the server. Also, one time I tried to copy a big file to test this: Windows copied about 1% of it and then dropped to 0MB/s for about 5 minutes, returned to 15 or 20 MB/s (normal is 120MB/s for my connection) for a few minutes, and dropped again. Very strange behaviour, and it only affected the specific share where the RAID is mounted.

I made a skdump of each drive, which I'll attach to the post, and every one of them has an "Overall Status: BAD_SECTOR", so I tried to run

sudo badblocks -v /dev/sdx > badsectorsx.txt

for every single drive and got the results in 3 .txt files. But now I cannot run fsck or e2fsck on each drive because it does nothing, and trying to fsck the md device gives me this:

server@server:~$ sudo fsck -p /dev/md127
fsck from util-linux 2.34
/dev/md127: clean, 704/122077184 files, 164898792/488284800 blocks

What can I do to repair those bad sectors or to make the array work OK again? Also, I just noticed that the filesystem reports 700GB in use when I have barely anything in those folders!
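As a quick sanity check on the "700GB" figure, the block counts from the fsck line above can be converted to bytes. This is only a sketch: the 4096-byte block size is an assumption (the fsck output does not print it), and `blocks_to_gb` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Convert a filesystem block count to decimal gigabytes.
# ASSUMPTION: 4096-byte blocks; the fsck output above does not state the block size.
blocks_to_gb() {
  local blocks=$1 block_size=${2:-4096}
  echo $(( blocks * block_size / 1000000000 ))
}

# Block counts copied from the fsck output for /dev/md127
echo "used:  $(blocks_to_gb 164898792) GB"
echo "total: $(blocks_to_gb 488284800) GB"
```

Under that assumption this prints 675 GB used out of 2000 GB, so the roughly 700GB figure matches what the filesystem itself records as used blocks, spread over only 704 files.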

 

[Two attached screenshots, both named imagen.png]

 

Here are the skdumps:

server@server:~$ sudo skdump /dev/sda
Device: sat16:/dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
Size: 476940 MiB
Model: [WDC WD5000AAKX-00ERMA0]
Serial: [WD-WCC2EDE45999]
Firmware: [15.01H15]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was completed without error.]
Total Time To Complete Off-Line Data Collection: 8400 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: yes
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 85 min
Conveyance Self-Test Polling Time: 5 min
Bad Sectors: 10 sectors
Powered On: 2.9 years
Power Cycles: 4131
Average Powered On Per Power Cycle: 6.2 h
Temperature: 39.0 C
Attribute Parsing Verification: Good
Overall Status: BAD_SECTOR
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good Good/Past
  1 raw-read-error-rate         200   200    51   2           0x020000000000 prefail online  yes  yes
  3 spin-up-time                184   140    21   1.8 s       0xff0600000000 prefail online  yes  yes
  4 start-stop-count             96    96     0   4170        0x4a1000000000 old-age online  n/a  n/a
  5 reallocated-sector-count    200   200   140   0 sectors   0x000000000000 prefail online  yes  yes
  7 seek-error-rate             200   200     0   0           0x000000000000 old-age online  n/a  n/a
  9 power-on-hours               65    65     0   2.9 years   0x366400000000 old-age online  n/a  n/a
 10 spin-retry-count            100   100     0   0           0x000000000000 old-age online  n/a  n/a
 11 calibration-retry-count     100   100     0   0           0x000000000000 old-age online  n/a  n/a
 12 power-cycle-count            96    96     0   4131        0x231000000000 old-age online  n/a  n/a
192 power-off-retract-count     199   199     0   949         0xb50300000000 old-age online  n/a  n/a
193 load-cycle-count            199   199     0   3257        0xb90c00000000 old-age online  n/a  n/a
194 temperature-celsius-2       104    92     0   39.0 C      0x270000000000 old-age online  n/a  n/a
196 reallocated-event-count     200   200     0   0           0x000000000000 old-age online  n/a  n/a
197 current-pending-sector      200   200     0   10 sectors  0x0a0000000000 old-age online  n/a  n/a
198 offline-uncorrectable       200   200     0   0 sectors   0x000000000000 old-age offline n/a  n/a
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a  n/a
200 multi-zone-error-rate       200   200     0   0           0x000000000000 old-age offline n/a  n/a
server@server:~$ sudo skdump /dev/sdb
Device: sat16:/dev/sdb
Type: 16 Byte SCSI ATA SAT Passthru
Size: 476940 MiB
Model: [TOSHIBA DT01ACA050]
Serial: [745GDZEAS]
Firmware: [MS1OA750]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was suspended by an interrupting command from host.]
Total Time To Complete Off-Line Data Collection: 3571 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: no
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 1 min
Extended Self-Test Polling Time: 60 min
Conveyance Self-Test Polling Time: 0 min
Bad Sectors: 27 sectors
Powered On: 1.8 years
Power Cycles: 862
Average Powered On Per Power Cycle: 18.0 h
Temperature: 39.0 C
Attribute Parsing Verification: Good
Overall Status: BAD_SECTOR
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good Good/Past
  1 raw-read-error-rate         100   100    16   0           0x000000000000 prefail online  yes  yes
  2 throughput-performance      142   142    54   n/a         0x470000000000 prefail offline yes  yes
  3 spin-up-time                131   131    24   168 ms      0xa800b6000300 prefail online  yes  yes
  4 start-stop-count            100   100     0   2317        0x0d0900000000 old-age online  n/a  n/a
  5 reallocated-sector-count    100   100     5   27 sectors  0x1b0000000000 prefail online  yes  yes
  7 seek-error-rate             100   100    67   0           0x000000000000 prefail online  yes  yes
  8 seek-time-performance       115   115    20   n/a         0x220000000000 prefail offline yes  yes
  9 power-on-hours               98    98     0   1.8 years   0x713c00000000 old-age online  n/a  n/a
 10 spin-retry-count            100   100    60   0           0x000000000000 prefail online  yes  yes
 12 power-cycle-count           100   100     0   862         0x5e0300000000 old-age online  n/a  n/a
192 power-off-retract-count      98    98     0   2448        0x900900000000 old-age online  n/a  n/a
193 load-cycle-count             98    98     0   2448        0x900900000000 old-age online  n/a  n/a
194 temperature-celsius-2       153   153     0   39.0 C      0x27000e002e00 old-age online  n/a  n/a
196 reallocated-event-count     100   100     0   35          0x230000000000 old-age online  n/a  n/a
197 current-pending-sector      100   100     0   0 sectors   0x000000000000 old-age online  n/a  n/a
198 offline-uncorrectable       100   100     0   0 sectors   0x000000000000 old-age offline n/a  n/a
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a  n/a
server@server:~$ sudo skdump /dev/sdd
Device: sat16:/dev/sdd
Type: 16 Byte SCSI ATA SAT Passthru
Size: 953869 MiB
Model: [Hitachi HDS721010CLA332]
Serial: [JP2940HZ2B30JC]
Firmware: [JP4OA3GH]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was suspended by an interrupting command from host.]
Total Time To Complete Off-Line Data Collection: 9455 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: no
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 158 min
Conveyance Self-Test Polling Time: 0 min
Bad Sectors: 1310965 sectors
Powered On: 5.6 years
Power Cycles: 9861
Average Powered On Per Power Cycle: 5.0 h
Temperature: 43.0 C
Attribute Parsing Verification: Good
Overall Status: BAD_SECTOR
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good Good/Past
  1 raw-read-error-rate         100    85    16   0           0x000000000000 prefail online  yes  yes
  2 throughput-performance      137   100    54   n/a         0x5a0000000000 prefail online  yes  yes
  3 spin-up-time                137   100    24   250 ms      0xfa002e010600 prefail online  yes  yes
  4 start-stop-count             97    97     0   14109       0x1d3700000000 old-age online  n/a  n/a
  5 reallocated-sector-count     96    47     5   1310740 sectors 0x140014000000 prefail online  yes  yes
  7 seek-error-rate             100   100    67   0           0x000000000000 prefail online  yes  yes
  8 seek-time-performance       140   100    20   n/a         0x1e0000000000 prefail offline yes  yes
  9 power-on-hours               93    93     0   5.6 years   0xf9bf00000000 old-age online  n/a  n/a
 10 spin-retry-count            100   100    60   0           0x000000000000 prefail online  yes  yes
 12 power-cycle-count            98    98     0   9861        0x852600000000 old-age online  n/a  n/a
183 runtime-bad-block-total     100   100     0   0           0x000000000000 old-age online  n/a  n/a
184 end-to-end-error            100   100    97   0           0x000000000000 prefail online  yes  yes
185 attribute-185                93    93     0   n/a         0xffff07000000 old-age online  n/a  n/a
187 reported-uncorrect            1     1     0   68350 sectors 0xfe0a01000000 old-age online  n/a  n/a
188 command-timeout             100    83     0   8752798774  0x361cb5090200 old-age online  n/a  n/a
189 high-fly-writes             100   100     0   0           0x000000000000 old-age online  n/a  n/a
190 airflow-temperature-celsius  57    43     0   43.0 C      0x2b002b2a0000 old-age online  n/a  n/a
192 power-off-retract-count      88    88     0   14487       0x973800000000 old-age online  n/a  n/a
193 load-cycle-count             88    88     0   14487       0x973800000000 old-age online  n/a  n/a
194 temperature-celsius-2       139   120     0   43.0 C      0x2b000e003900 old-age online  n/a  n/a
196 reallocated-event-count     100   100     0   20          0x140000000000 old-age online  n/a  n/a
197 current-pending-sector       97    46     0   225 sectors 0xe10000000000 old-age online  n/a  n/a
198 offline-uncorrectable       100    46     0   0 sectors   0x000000000000 old-age offline n/a  n/a
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a  n/a
server@server:~$

Thank you all in advance!


You have 2 drives with current pending sectors; those are bad.

I'd really try to replace those if possible. It's only going to get worse.

But if you can, do a full zero wipe with dd first. That can often fix those sectors, as the drive will reallocate them itself.
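A zero wipe along those lines could look roughly like the sketch below. It destroys everything on the target drive, and the RAID 0 would have to be unmounted and stopped first. `/dev/sdX` is a placeholder, and the `run`, `zero_wipe` and `smart_after` helpers are made-up names. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to really run it.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Overwrite one whole drive with zeros. THIS DESTROYS ALL DATA ON THAT DRIVE.
zero_wipe() {
  # When really running, refuse anything that is not a block device.
  if [ "$DRY_RUN" != 1 ] && [ ! -b "$1" ]; then
    echo "not a block device: $1" >&2
    return 1
  fi
  run sudo dd if=/dev/zero of="$1" bs=1M status=progress
}

# Dump the SMART attributes again to see whether current-pending-sector changed.
smart_after() {
  run sudo skdump "$1"
}

zero_wipe /dev/sdX     # /dev/sdX is a placeholder, not a real device name
smart_after /dev/sdX
```

After a real run, the thing to compare is the current-pending-sector and reallocated-sector-count rows before and after the wipe.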

 

 


1 hour ago, Juanmacaam said:

Bad Sectors: 1310965 sectors Powered On: 5.6 years Power Cycles: 9861

Wow.

 

The 3rd one is definitely gone. The other 2 aren't in great shape either.
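One thing worth checking on that huge number: the Raw column in the skdump output is written least-significant byte first (for example, 0x0a0000000000 is shown as 10 sectors). A small sketch for decoding it follows. `raw_to_int` is a made-up helper, and reading the value as two packed 16-bit fields is only a guess, not something the tool or the drive vendor confirms here.

```shell
#!/usr/bin/env bash
# Decode a skdump "Raw" value (hex bytes, least-significant byte first) into an integer.
raw_to_int() {
  local hex=${1#0x} out="" i
  for (( i = 0; i < ${#hex}; i += 2 )); do
    out="${hex:i:2}${out}"      # prepend each byte, which reverses the byte order
  done
  echo $(( 16#$out ))
}

v=$(raw_to_int 0x140014000000)   # attribute 5 (reallocated-sector-count) on /dev/sdd
echo "whole value: $v"
# GUESS: the drive may pack two 16-bit counters into this one raw value.
echo "low 16 bits: $(( v & 0xFFFF )), next 16 bits: $(( (v >> 16) & 0xFFFF ))"
```

The whole value decodes to 1310740, which is exactly 20 × 65536 + 20, and 20 also happens to be the reallocated-event-count on the same drive. So the real reallocated count may be much smaller than the headline figure. Even so, the 225 pending sectors and the reported-uncorrect row still point to a drive that should be replaced.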

 

  

1 hour ago, Juanmacaam said:

for every single drive and got the results in 3 .txt files, but now I cannot perform a fsck or e2fsck on each drive because it does nothing and trying to fsck the md device gives me this

You need -f to force a check.

But this checks the filesystem, not the underlying hardware. See post above for good recommendations.
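For reference, a forced check of the md device could look roughly like this. It is a sketch that assumes the filesystem lives directly on /dev/md127 (as in the fsck output quoted earlier) and can be unmounted while the check runs. The `run` and `force_fsck` helpers are made-up names, and by default nothing is executed, only printed.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to really run it.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Unmount the filesystem, then force a full check even though it is marked clean.
force_fsck() {
  run sudo umount "$1"     # fails harmlessly if the device is not mounted
  run sudo fsck -f "$1"    # -f is passed on to e2fsck and forces the check
}

force_fsck /dev/md127
```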


5 hours ago, Electronics Wizardy said:

But if you can, do a full zero wipe with dd first, that can often fix those sectors as the drive will reallocate them itself.

4 hours ago, Kilrah said:

The 3rd one is definitely gone. The other 2 aren't in great shape either.

So, should I try to dd 'em first, or should I just go and replace the Hitachi one and dd the other two? Yeah, I guess that disk is very old, I bought it circa 2011 😅 Even though the SMART data shows that, the disk was working well until a few weeks ago. Is it possible that it went bad in so little time? Thank you both!

 


1 hour ago, Juanmacaam said:

So, should I try to dd 'em first, or should I just go and replace the Hitachi one and dd the other two? Yeah, I guess that disk is very old, I bought it circa 2011 😅 Even though the SMART data shows that, the disk was working well until a few weeks ago. Is it possible that it went bad in so little time? Thank you both!

 

Yeah, HDDs can go from fine to bad very quickly.

dd them separately, one dd per drive. You can run them all at the same time if you want.
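Running the wipes at the same time could be sketched like this, one dd per drive in the background. The device names are placeholders, `run` and `wipe_all` are made-up helpers, and it assumes sudo credentials are already cached (for example via `sudo -v`). Everything on the listed drives is destroyed, so by default the script only prints what it would do.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to really run it.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Start one zero wipe per drive in the background, then wait for all of them.
wipe_all() {
  local dev
  for dev in "$@"; do
    # When really running, skip anything that is not a block device.
    [ "$DRY_RUN" = 1 ] || [ -b "$dev" ] || continue
    run sudo dd if=/dev/zero of="$dev" bs=1M &
  done
  wait
}

# The array has to be unmounted and stopped before its member drives are wiped.
run sudo umount /dev/md127
run sudo mdadm --stop /dev/md127
wipe_all /dev/sdX /dev/sdY /dev/sdZ   # placeholders, not real device names
```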

 

 
