Operating Systems/hardware - How do you counter bit flips, and how is the instruction cycle maintained perfectly?

Gat Pelsinger

I know that bit flips are fairly common in networking, and that's why your system checks the file's hash (SHA checksum or whatever that process is called, idk) to verify the file was transferred properly, or whether there were bit flips or packet losses. But how do you counter such bit flips while your PC is running? I remember bit flips in hardware being common way back when the CPU was in its earliest stages, perhaps even mechanical. But I never hear about bit flips in CPUs, or memory, or in any computer component at all.

 

How is the hardware/software built so perfectly that there is never a single bit that gets accidentally flipped and crashes the system?

 

And also, how is it that the instruction cycle is maintained so perfectly? The OS programmer doesn't know which program you are going to run at which point, so nothing is pre-programmed; it's a cycle that keeps running on its own. With the CPU running billions of instructions per second, how is it that nothing runs out of sync and crashes the system? What are the chances?

Microsoft owns my soul.

 

Also, Dell is evil, but HP kinda nice.

The solution to unexpected bit flips and corruption is always the same: parity bits and checksums. Depending on the number of parity bits, it's possible to correct errors as well, at the cost of making the data larger.
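As a toy illustration of the idea (not any particular hardware scheme), a single even-parity bit can detect, but not locate, a one-bit error in a word:

```python
def parity(bits):
    """Even parity: 1 if the number of set bits is odd, so the
    stored word (bits + parity bit) always has an even popcount."""
    return sum(bits) % 2

def check(bits, p):
    """True if no error is detected. Any single flip is caught,
    but two flips cancel out and slip through unnoticed."""
    return (sum(bits) + p) % 2 == 0

data = [1, 0, 1, 1, 0, 0, 1, 0]
p = parity(data)
assert check(data, p)          # clean word passes

data[3] ^= 1                   # simulate a cosmic-ray bit flip
assert not check(data, p)      # the single flip is detected
```

Detection alone is enough when you can re-read or re-request the data; correcting on the spot needs more parity bits, which is where codes like Hamming/ECC come in.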

9 minutes ago, Hensen Juang said:

How is the hardware/software built so perfectly that there is never a single bit that gets accidentally flipped and crashes the system?

Hardware running mission-critical software that can't afford any amount of corruption uses systems like ECC RAM to correct bit flips. For storage you can use RAID and other parity-based systems, together with regular backups.
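To give a feel for how ECC-style correction works (this is the classic Hamming(7,4) textbook code, not the exact scheme any particular ECC RAM uses), three parity bits spread over four data bits let you pinpoint and flip back any single corrupted bit:

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword: [p1, p2, d1, p3, d2, d3, d4].
    Each parity bit covers the positions whose index has that bit set."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parities; the syndrome spells out the (1-indexed)
    position of a single flipped bit, or 0 if the word is clean."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    c = c[:]
    if pos:
        c[pos - 1] ^= 1        # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]   # extract d1..d4

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                           # corrupt any one bit...
assert hamming74_correct(word) == [1, 0, 1, 1]   # ...and it's repaired
```

Real ECC DIMMs use wider codes (e.g. 8 check bits per 64 data bits) that correct single-bit and detect double-bit errors, but the syndrome idea is the same.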

 

As for why your computer doesn't often spontaneously crash, even though individual bits can and do get corrupted relatively often, the main reason is that a single flipped bit is usually not enough to cause a critical error that stops your computer from running. As software has become larger and more complex, single points of failure have been largely eliminated; does it really matter if, during one of billions of cycles per second, one operation gives you an incorrect result? Most of the time, not really. It's still possible that a bit flip will outright crash your system, but it's highly unlikely.

 

On a side note, modern hardware is actually more susceptible to bit flipping than older hardware, since modern electronics are so small that it takes very little energy to change their state. This is why space tech often uses older manufacturing nodes: in space it's both critical that all calculations be correct and more likely that a bit will flip, due to the higher amount of radiation.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

37 minutes ago, Hensen Juang said:

And also, how is it that the instruction cycle is maintained so perfectly? The OS programmer doesn't know which program you are going to run at which point, so nothing is pre-programmed; it's a cycle that keeps running on its own.

Processors have hardware timers running at a known speed that the OS can refer to. That's how we got past the 80s problem of programs running too fast as processors got faster.
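A sketch of the idea in Python: instead of counting loop iterations (which would run faster on a faster CPU), software paces itself against a monotonic clock, which is ultimately backed by those hardware timers:

```python
import time

def run_ticks(n, tick_seconds=0.01):
    """Run n fixed-rate ticks paced by the monotonic clock,
    independent of how fast the CPU executes the loop body."""
    next_tick = time.monotonic()
    for _ in range(n):
        # ... one step of game/simulation logic would go here ...
        next_tick += tick_seconds
        delay = next_tick - time.monotonic()
        if delay > 0:
            time.sleep(delay)   # idle until the next scheduled tick

start = time.monotonic()
run_ticks(10, 0.01)
elapsed = time.monotonic() - start   # ~0.1 s regardless of CPU speed
```

The 80s-era programs that broke on faster machines effectively used the loop counter itself as the clock, which is exactly the dependency this pattern removes.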

F@H
Desktop: i9-13900K, ASUS Z790-E, 64GB DDR5-6000 CL36, RTX3080, 2TB MP600 Pro XT, 2TB SX8200Pro, 2x16TB Ironwolf RAID0, Corsair HX1200, Antec Vortex 360 AIO, Thermaltake Versa H25 TG, Samsung 4K curved 49" TV, 23" secondary, Mountain Everest Max

Mobile SFF rig: i9-9900K, Noctua NH-L9i, Asrock Z390 Phantom ITX-AC, 32GB, GTX1070, 2x1TB SX8200Pro RAID0, 2x5TB 2.5" HDD RAID0, Athena 500W Flex (Noctua fan), Custom 4.7l 3D printed case

 

Asus Zenbook UM325UA, Ryzen 7 5700u, 16GB, 1TB, OLED

 

GPD Win 2

10 hours ago, Sauron said:

Hardware running mission-critical software that can't afford any amount of corruption uses systems like ECC RAM to correct bit flips. For storage you can use RAID and other parity-based systems, together with regular backups.

For really mission-critical and consequential stuff, there's also a tendency to run the same operation twice and compare the results (usually by having two of the same chip inside, running in lockstep).
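A software sketch of that idea (dual modular redundancy: compute twice and compare; add a third copy and you can vote the fault away instead of just detecting it):

```python
def dmr(f, x):
    """Run f twice; raise if the results disagree (detects a transient fault)."""
    a, b = f(x), f(x)
    if a != b:
        raise RuntimeError("lockstep mismatch: transient fault detected")
    return a

def tmr(f, x):
    """Triple modular redundancy: a majority vote masks one faulty result."""
    results = [f(x) for _ in range(3)]
    return max(set(results), key=results.count)

# Simulate a transient fault that corrupts exactly the second execution.
calls = {"n": 0}
def flaky_square(x):
    calls["n"] += 1
    return x * x + (1 if calls["n"] == 2 else 0)

assert tmr(flaky_square, 4) == 16   # results [16, 17, 16] -> majority wins
```

In real lockstep chips the comparison happens in hardware every cycle, but the detect-vs-mask trade-off between the two variants is the same.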

 

To @Hensen Juang: I'd guess that most of the time a bit flip doesn't affect parts of the computer that matter, especially when you consider how much space the code actually occupies and how infrequently most of it is called. As an example, I had a corrupted Windows install (a hard-drive failure with tons of read issues); the OS booted, got to the desktop, and usually ran for a few seconds before completely crashing. It's all about where the bit flips happen, and I'm not sure Windows or Linux does much to compensate for them (then again, I haven't really studied it).

 

I also had a faulty stick of RAM that would randomly flip bits. The thing was stable anywhere from 1 minute to 1 hour, while a memory check showed issues within a few minutes. So I think that sort of shows that a lot of the code isn't important enough for most people to notice when it gets corrupted.

 

Data in transit also gets CRC checks, e.g. on the link between the CPU and RAM, so the receiver knows about bad data and can handle it as it chooses. HDDs and SSDs have CRC checks to detect bad data as well.
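To show what a CRC buys you, here's a small sketch using CRC-32 from Python's standard library (hardware links use various CRC polynomials and widths, but the detect-and-retry logic is the same):

```python
import zlib

payload = bytearray(b"some block of data headed for RAM or disk")
crc = zlib.crc32(bytes(payload))     # checksum stored/sent alongside the data

payload[5] ^= 0b00000100             # a single bit flips in transit...
assert zlib.crc32(bytes(payload)) != crc   # ...and the receiver sees the
                                           # mismatch, so it can retry/re-read
```

The CRC doesn't fix anything by itself; it just turns silent corruption into a detectable event the hardware can act on.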

 

Although, with that said, when you have, let's say, sketchy power, or you live near a radio tower, you are more likely to see computers crashing because of events like bit flips.

 

Here's a Google research paper on the subject:

https://research.google/pubs/pub35162/

Doing some napkin math from the Google numbers, I came to about 2-10 flipped bits in a 24-hour period for a system... which, given the size of the code, makes it a pretty small target for a bit flip to hit anything important.

3735928559 - Beware of the dead beef

In networking, it's part of the protocol (TCP/IP): data is arranged in packets or "frames", and each packet contains a checksum (a code generated from the data in the packet, the "payload"). If a bit is changed in the packet, the checksum won't match, and the receiving side can ask the sender to transmit that packet again.
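For the curious, the checksum in TCP and UDP headers is a 16-bit ones'-complement sum over 16-bit words (RFC 1071). A minimal Python version of that computation (the real stack also sums a pseudo-header, omitted here for brevity):

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: ones'-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

packet = b"hello, world"
csum = internet_checksum(packet)
# Receiver-side trick: summing the data together with its checksum
# comes out to zero when nothing was corrupted.
assert internet_checksum(packet + csum.to_bytes(2, "big")) == 0
```

If any bit in the packet flips, the recomputed sum no longer cancels out, which is what triggers the retransmission described above.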

This protocol overhead and the arrangement into frames is one of the reasons you never really achieve the maximum speed of your network card with a regular file transfer: on a 1 Gbps (125 MB/s) link you would get maybe around 118 MB/s at most, because the rest goes to the headers, checksums, and framing of each small packet passing through the network cable.
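Rough napkin math on where that overhead comes from (assumptions: a standard 1500-byte Ethernet MTU and bare IPv4 + TCP headers; real numbers shift with jumbo frames, header options, and so on):

```python
LINK_BYTES_PER_S = 1_000_000_000 / 8   # 1 Gbps = 125 MB/s raw on the wire

MTU     = 1500                         # max IP packet per Ethernet frame
HEADERS = 20 + 20                      # IPv4 header + TCP header, no options
ON_WIRE = 8 + 14 + MTU + 4 + 12        # preamble + Ethernet header + payload
                                       #   + FCS (the frame's CRC) + inter-frame gap

payload_per_frame = MTU - HEADERS      # 1460 bytes of actual file data per frame
goodput = LINK_BYTES_PER_S * payload_per_frame / ON_WIRE
print(round(goodput / 1e6, 1))         # prints 118.7 (MB/s of real data)
```

So even before any retransmissions, roughly 5% of the raw link rate is spent on framing and headers rather than your file.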

Other protocols give up this error detection and retransmission because they don't need it. The UDP protocol, for example, doesn't retransmit, because it's assumed it will be used for streaming data like video, where if one packet is corrupted the video player can recover on the next video frame or after a few seconds, and preserving every byte isn't critical. Likewise in video games: your game can receive other players' positions and map changes over UDP, and if you lose a small packet, the game can synchronize later.

 

Hard drives also arrange data in sectors on the platters and store extra error-correction information alongside the data. This AnandTech article gives a good explanation; it's part of a piece about the transition from 512-byte sectors to 4096-byte sectors (which stores less ECC information overall but lets drives use it more effectively, so it's a net win): Western Digital's Advanced Format: The 4K Sector Transition Begins

 

On 6/30/2023 at 8:56 PM, Hensen Juang said:

With the CPU running billions of instructions per second, how is it that nothing runs out of sync and crashes the system? What are the chances?

Overclocking and undervolting can have that effect. The manufacturer will always put CPUs into a range where they can run reliably; this practice is called binning.

ಠ_ಠ
