
Hello All!

I am stumped by an issue on a client's server that resulted in 2 days of data corruption. I suspect the fault may have happened in the RAID controller itself, but honestly I am grasping at straws at the moment. An important note is that this server backs up nightly and runs Firebird.

In the middle of the day on Tuesday, the client suddenly started getting an authentication error. We performed a clean restart of the database server. At that point the vendor was contacted, as the day's data was deemed corrupted. We worked to restore Monday's data, which was also corrupted in what appears to be the same fashion. We were able to get a good restore from Sunday night's backup, but this causes serious issues: almost two days of information are lost, not to mention the future implications if we can't rely on the setup. The facts as I know them:

This server has 3 drives set up in RAID5. It is tested weekly, and it was noticed that there was a drive failure on the Friday prior. Note: This drive was known bad almost 48 full hours before the last "good" restore image. It was set to be swapped the following weekend when the device could be offline for the rebuilding of the array.

 

There was also a setting enabled that allows data/drive caching; it was supposed to be disabled.

Going back into the vendor logs, it looks like there may have been read/write errors dating back to February.

I am starting to wonder if I have a failing controller on my hands, but I am not sure of my next steps. The vendor is being spectacularly unhelpful, simply stating that we should have used their backup service. I fail to see how that would have helped, since it is likely that data would have been corrupted as well, but that is neither here nor there.

Any guidance you can provide would be greatly appreciated.

https://linustechtips.com/topic/1214257-raid5-bug/

What RAID card and hardware? What OS?

 

3 minutes ago, GeneralTek said:

Going back into the vendor logs, it looks like there may have been read/write errors dating back to February.

What does the vendor supply? The software?

Are there logs in the OS showing any errors?

Do the drives report any errors?

 

 


The software is reporting a database corruption. Here is the play by play as reported by the vendor's IT:

1) The Medical.gdb file was corrupt after local IT performed a measured and clean restart of the practice OP Database Server due to an RDP licensing issue.

2) Many attempts were made by myself and [Additional Tech] to repair and restore the medical database file. After working close to 2 hours, it was determined that the medical.gdb file was beyond repair.

3) I worked with [General TEK], local IT, to restore Monday night's backup of the Medical database file. That file was also deemed corrupt and beyond repair.

4) [General TEK] had to restore the Medical database file from this past Sunday night, June 21, 2020. The practice has lost all their patient data from Monday, June 22 to Tuesday, June 23, 2020 due to the corrupt and damaged Medical.gdb database file.

5) I am still reviewing log files and possible root cause as to what could have damaged and corrupted the Medical database file.


4 minutes ago, GeneralTek said:

The drive monitor was reporting 1 drive as unresponsive. I test weekly, and all three drives were up and "healthy" the week prior.

What drive monitor? What OS?

What hardware?

Are the drives reporting any issues?

What filesystem? What does a disk check show?

 

A lot of things can induce corruption, but a bad RAID card or disk seems the most likely.


Hard Disk Sentinel on Windows Server 2017.

The controller is a P410i. I should note I only took over the contract about 3 months ago. I did not do any of the initial setup, and I am still finding things I would never have done.

 

The two remaining drives are coming up as healthy. Drive 3 is "not responding".

 


2 minutes ago, GeneralTek said:

 

The two remaining drives are coming up as healthy. Drive 3 is "not responding".

If you had a bad drive, that might have caused your issue. There are some cases where a bad drive can screw up a RAID array.
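To show one way that can happen, here is a minimal Python sketch of generic RAID 5 XOR parity. It is not the P410i's actual firmware logic, and the block contents and the `xor_blocks` helper are made up for illustration. The point is that once one drive is gone, the missing block is rebuilt purely from the survivors, so an unreported bad read on a surviving drive flows straight into the rebuilt data.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-drive RAID 5: two data blocks plus parity.
d1 = b"PATIENT "
d2 = b"RECORDS!"
parity = xor_blocks(d1, d2)

# The drive holding d2 stops responding, so the array runs degraded and
# every read of d2 is rebuilt from the surviving data block and the parity.
rebuilt = xor_blocks(d1, parity)
print(rebuilt == d2)  # the rebuild is exact while the survivors are good

# Now assume a surviving drive silently returns one flipped bit in d1.
bad_d1 = bytes([d1[0] ^ 0x01]) + d1[1:]
bad_rebuilt = xor_blocks(bad_d1, parity)
print(bad_rebuilt == d2)  # the rebuilt block no longer matches the original
```

With all three drives present the controller at least has redundancy to work with; in a degraded array there is nothing left to cross-check against.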

 

2 minutes ago, GeneralTek said:

Hard Disk Sentinel on Windows Server 2017.

Does this show the status of the actual drives or just the virtual disk presented to the OS?

 

 

 

If I had to make a few guesses of what could have been done better:

-keep better logs so disks get replaced quickly if errors are showing; I think you said the logs showed issues for a while.

-run patrol reads on the RAID card if you're not already doing it; it makes hidden errors on the disks much rarer.

-I'd probably stay away from RAID 5 in the future and go with 6 or 10.

 


I don’t know why any business would run straight RAID 5 for critical production data. It is only 1 more drive away from RAID 10, which is significantly better for databases and easier to rebuild, especially when the use case means data backups rapidly lose their usefulness after only a few days.
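As a rough check on the "1 more drive" point, here is a small Python sketch using the textbook capacity formulas for each level. The `usable_drives` function name is just for illustration, and capacity is counted in whole drives because the thread does not give drive sizes.

```python
def usable_drives(level: str, n: int) -> int:
    """Return how many drives' worth of space is usable with n equal drives."""
    if level == "raid5" and n >= 3:
        return n - 1   # one drive's worth of parity
    if level == "raid6" and n >= 4:
        return n - 2   # two drives' worth of parity
    if level == "raid10" and n >= 4 and n % 2 == 0:
        return n // 2  # every drive is mirrored
    raise ValueError(f"unsupported layout: {level} with {n} drives")


# Drive failures each level is guaranteed to survive. RAID 10 can survive
# more than one only if the failures land in different mirror pairs.
guaranteed_failures = {"raid5": 1, "raid6": 2, "raid10": 1}

for level, n in [("raid5", 3), ("raid10", 4), ("raid6", 4)]:
    print(level, n, "drives:", usable_drives(level, n), "usable,",
          "survives", guaranteed_failures[level], "failure(s) guaranteed")
```

So a 4-drive RAID 10 or RAID 6 gives the same usable space as the current 3-drive RAID 5, at the cost of one extra drive.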



The P410 is an HP controller. Do you have the HP software installed? You can get reporting and additional command-line troubleshooting options out of it. It's in the Service Pack for ProLiant that HP distributes. You can install it on non-HP devices and it works fine if you just install the software. If you get the full service pack disk, you will need to locate and extract just the SSA software.

It's called HP Smart Storage Administrator.

It might give you more logging, information and capabilities than what you were able to provide above, to work out exactly what's wrong.
