Error control
Both in networking and on computers and their storage devices, there is a family of techniques called error control. There is no one best strategy, because the different mechanisms that use the data respond differently to errors. Indeed, there are different kinds of error that can have different effects on the same system.
The core technology is error detection, of which the most common case is finding that a bit has changed in transmission or storage, corrupting the data structure containing it. In some cases, such as Voice over Internet Protocol, it is quite adequate to discard, silently, an occasional unit containing errored bits. When the digitized voice is converted back to sound, the human ear is quite tolerant of occasional interruptions. Hearing is much less tolerant of variation in the delay between quanta of sound, but delay variations are not generally considered within the scope of error control.
If the application cannot tolerate bit errors, the next question is how to correct the error. The most common method, at least in networking, is to retransmit the data received in error, until it is received correctly or other mechanisms determine that the communications channel is unusable.
Error detection
Perhaps the most basic error detection method is parity checking. Assume that the units of information come in groups of 7 data bits. On transmission, the sender counts the number of "one" bits in the data. If that total is even, the 8th bit, the "parity bit", is set to one, assuming "odd parity" is the default, so that the full group of 8 bits always carries an odd number of ones. If the total is already odd, the parity bit is set to zero.
At the other end of the channel, the receiver counts the "one" bits in the data, computes the parity bit they should produce, and compares it to the parity bit received. If the two do not match, the entire group of 8 bits is assumed to be errored. This also covers the contingency that the data bits are actually correct but the parity bit itself was corrupted.
Simple parity has fairly basic limitations. It can detect a single bit error, or any odd number of bit errors, but if two bits are changed, the parity will remain the same and the error will not be detected.
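The parity scheme and its two-bit blind spot can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the code assumes the odd-parity convention in which the full 8-bit group, parity bit included, must contain an odd number of ones.

```python
def odd_parity_bit(data_bits):
    """Return the bit that makes the total count of ones odd."""
    ones = sum(data_bits)
    return 0 if ones % 2 == 1 else 1


def encode(data_bits):
    """Append the parity bit to 7 data bits, giving an 8-bit group."""
    if len(data_bits) != 7:
        raise ValueError("expected 7 data bits")
    return list(data_bits) + [odd_parity_bit(data_bits)]


def check(group):
    """Return True if the 8-bit group has an odd number of ones."""
    return sum(group) % 2 == 1


data = [1, 0, 1, 1, 0, 0, 1]   # four ones, so the parity bit is 1
sent = encode(data)

one_error = sent.copy()
one_error[2] ^= 1              # flip a single bit

two_errors = sent.copy()
two_errors[2] ^= 1             # flip one bit...
two_errors[5] ^= 1             # ...and a second one

print(check(sent))        # True: group accepted
print(check(one_error))   # False: single error detected
print(check(two_errors))  # True: double error slips through
```

The last line is the limitation described above: two flipped bits restore the original parity, so the receiver accepts a corrupted group.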
There are a variety of more powerful error detection algorithms, which produce error-checking fields longer than one bit; 16- and 32-bit fields are common. Depending on the particular mechanism, the entire unit of data found to be in error may be discarded. Alternatively, some methods can reconstruct the correct information; to do that, additional redundant error control bits must be sent with the data. There is a constant tradeoff between the overhead imposed by sending, with every unit of data, enough information to reconstruct the correct data, and simply having the errored data retransmitted.
Different strategies apply to storage devices and communications networks. If there was an error in writing disk data, or if the disk became corrupted, repeated reads will still return the bad data. In networks, however, it is entirely likely that the error took place in transmission, and retransmitting may well deliver the correct information.
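As one illustration of a 32-bit error-checking field, the sketch below appends a CRC-32 value to each unit of data using Python's standard `zlib.crc32`. The choice of CRC-32 is an assumption made for the example, since the text does not name a particular algorithm, and the `frame` and `verify` helpers and the 4-byte trailer layout are likewise invented here.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 32-bit CRC (big-endian) to the payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def verify(received: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    if len(received) < 4:
        return False
    payload, field = received[:-4], received[-4:]
    return (zlib.crc32(payload) & 0xFFFFFFFF) == int.from_bytes(field, "big")


good = frame(b"error control")

damaged = bytearray(good)
damaged[0] ^= 0x03          # flip two bits, which simple parity would miss

print(verify(good))            # True
print(verify(bytes(damaged)))  # False: the unit would be discarded
```

The cost is four extra bytes on every unit, which is the overhead side of the tradeoff; this field only detects the error, so the receiver still has to discard the unit and rely on some other mechanism to recover it.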
Error correction
As mentioned, there are a number of methods of correcting errors, each of which has its own performance tradeoffs.
Retransmission
One of the most basic retransmission methods is called "stop and wait", or "ACK-NAK", ACK standing for acknowledgement and NAK for negative acknowledgement. In this method, the data are sent with an error-detecting field. The transmitter will not send another unit of data until it receives a positive acknowledgement that the data were received correctly. A given error-correcting protocol may or may not have a negative acknowledgement, which is rarely used.
Even with explicit acknowledgement systems, the transmitter can start a timer when it sends the data. If the timer expires and the data has not been acknowledged, it needs to be retransmitted. Conceivably, if the NAK could be delivered much faster than the transmit acknowledgement timer expiration, there might be a performance benefit, but the transmitter still has to have a timer to cover against the contingency of an ACK or NAK being dropped in the return path. Transmission Control Protocol is a common example where only positive acknowledgements are sent.
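The stop-and-wait loop with a retransmission timer can be sketched as follows. This is a simplified model, not any particular protocol: the `channel` callback is a hypothetical stand-in for the link, returning True when an ACK arrives before the timer expires and False on a timeout, and `max_tries` is an assumed limit after which the channel is declared unusable.

```python
def stop_and_wait(units, channel, max_tries=5):
    """Send each unit and wait for its ACK before sending the next.

    Returns the total number of transmissions, retransmissions included.
    """
    transmissions = 0
    for seq, unit in enumerate(units):
        for attempt in range(1, max_tries + 1):
            transmissions += 1
            if channel(seq, unit, attempt):
                break            # positive acknowledgement received
            # Timer expired with no ACK: loop around and retransmit.
        else:
            raise ConnectionError(f"unit {seq} never acknowledged")
    return transmissions


def lossy_channel(seq, unit, attempt):
    # Assumed behaviour for the demo: unit 1 is lost on its first attempt.
    return not (seq == 1 and attempt == 1)


print(stop_and_wait(["a", "b", "c"], lossy_channel))  # 4
```

Three units cost four transmissions because the second one timed out once. No NAK is modelled; the timer alone covers both a lost unit and a lost acknowledgement.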
Stop-and-wait is inherently inefficient, even if there is traffic flow in both directions, since the sender has to wait for the data to be transmitted and checked, and then for the response to be transmitted, before it can send anything more. There are several techniques, which can be used in combination, to increase efficiency. All require that the units of information be numbered, with a separate sequence number space for each direction of transmission.
Assuming a TCP connection from A to B, as B receives traffic from A and sends its own data to A, the messages it sends can contain an acknowledgement of the number of units of data that have been successfully received. This "piggybacking" technique allows simultaneous flow of data and acknowledgements.
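Piggybacking can be illustrated with a toy pair of endpoints, each keeping its own sequence number space. This is a deliberately simplified sketch: the `Segment` and `Endpoint` names are invented, each segment carries exactly one numbered unit, and real TCP numbers bytes rather than units and has many more header fields.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # number of the unit carried in this segment
    ack: int        # next unit number the sender expects from its peer
    payload: str


class Endpoint:
    def __init__(self):
        self.next_seq = 0     # sequence space for our own data
        self.expected = 0     # next unit we expect from the peer
        self.acked_up_to = 0  # how many of our units the peer has confirmed

    def send(self, payload):
        # Every outgoing data segment also carries the current acknowledgement.
        seg = Segment(self.next_seq, self.expected, payload)
        self.next_seq += 1
        return seg

    def receive(self, seg):
        if seg.seq == self.expected:   # accept only the in-order unit
            self.expected += 1
        self.acked_up_to = max(self.acked_up_to, seg.ack)


a, b = Endpoint(), Endpoint()
b.receive(a.send("a0"))
b.receive(a.send("a1"))

reply = b.send("b0")      # B's own data, with the ACK riding along
a.receive(reply)

print(reply)              # Segment(seq=0, ack=2, payload='b0')
print(a.acked_up_to)      # 2
```

B never sends a separate acknowledgement message: its single data segment tells A that both of A's units arrived.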
Redundant transmission
Since having bad data written to storage may not be recoverable, the constantly dropping cost-per-bit of storage can make it reasonable to write to a Redundant Array of Inexpensive Disks (RAID) system. There are a great number of variants on RAID, some intended for protection against errors and others for performance enhancement. The latter technique, called striping, treats two or more physical disks as if they were one logical volume; striping allows fast computer processors to do concurrent reading and writing to several slower disks.
Striping writes different information to different media. Mirroring writes more than one copy of the same data to multiple media, protecting the physical-level information from individual failures. Just as in sending error-checking or correcting fields with data across a network, these methods create metadata that, for example, shows what parts of a virtual file exist in which physical locations on multiple disks, or where the backup copy or copies reside. Of course, the metadata is itself critical, and it needs thorough protection.
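The difference between striping and mirroring, and the metadata each depends on, can be sketched with in-memory lists standing in for disks. The helper names and the round-robin layout are assumptions made for illustration and do not correspond to any specific RAID level's on-disk format.

```python
def stripe(blocks, n_disks):
    """Spread blocks round-robin over n_disks.

    Returns the disks plus the layout metadata: for each logical block,
    the (disk index, position on that disk) where it was written.
    """
    disks = [[] for _ in range(n_disks)]
    layout = []
    for i, block in enumerate(blocks):
        d = i % n_disks
        layout.append((d, len(disks[d])))
        disks[d].append(block)
    return disks, layout


def read_striped(disks, layout):
    """Reassemble the logical volume; impossible without the layout."""
    return [disks[d][pos] for d, pos in layout]


def mirror(blocks, n_copies=2):
    """Write a full copy of every block to each of n_copies disks."""
    return [list(blocks) for _ in range(n_copies)]


blocks = ["b0", "b1", "b2", "b3", "b4"]

disks, layout = stripe(blocks, 2)
print(disks)    # [['b0', 'b2', 'b4'], ['b1', 'b3']]
print(layout)   # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2)]

copies = mirror(blocks)
copies[0][1] = None          # simulate damage to one copy of block 1
print(copies[1][1])          # b1: the other copy is still readable
```

Striping alone stores each block exactly once, so it gives speed but no protection; mirroring doubles the storage used in exchange for a surviving copy.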
In networking, some critical applications use multiple physical transmission paths to send the same data. In the Service Specific Connection Oriented Protocol (SSCOP), used for internal telephony networks carrying the Signaling System 7 control information, there are no single points of hardware failure. Everything is at least duplicated.
SSCOP, like LAP-B and TCP, can correct errors by retransmission. If one link of a pair stops working, then retransmission is the only alternative. With two working links, however, the receiver can look at the error checking fields of two received frames, and if one passes the error detector but the other does not, no retransmission is needed; the receiver keeps the good copy and throws away the bad. Even if both were correct, one would still be discarded.
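The receiver's choice between two copies of the same frame can be sketched as below. This does not reproduce SSCOP's actual frame format or procedures: the CRC-32 trailer computed with `zlib.crc32` and the helper names are assumptions for the example, and a link that delivered nothing is represented by `None`.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Payload followed by a 4-byte CRC-32 trailer."""
    return payload + (zlib.crc32(payload) & 0xFFFFFFFF).to_bytes(4, "big")


def frame_ok(frame: bytes) -> bool:
    """True if the trailer matches a CRC recomputed over the payload."""
    if len(frame) < 4:
        return False
    crc = zlib.crc32(frame[:-4]) & 0xFFFFFFFF
    return crc == int.from_bytes(frame[-4:], "big")


def select(copy_a, copy_b):
    """Return the payload of the first copy that passes the check.

    None means neither copy is usable, so retransmission is needed.
    """
    for frame in (copy_a, copy_b):
        if frame is not None and frame_ok(frame):
            return frame[:-4]
    return None


good = make_frame(b"signaling message")
bad = bytearray(good)
bad[3] ^= 0x10                     # one bit damaged on the first link
bad = bytes(bad)

print(select(bad, good))    # b'signaling message': no retransmission
print(select(good, good))   # b'signaling message': second copy discarded
print(select(bad, None))    # None: fall back to retransmission
```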
Forward error correction
Another technique used in both networking and storage is forward error correction (FEC). In FEC, the error-checking field is larger than needed for simple error detection. An FEC coding system can typically detect errors affecting (N+1) bits, but correct errors of only N bits. The algorithms involved are quite complex, but a simplified example might illustrate. Assume a square array of M×M bits. Form one set of checksums, one for each vertical column, and an independent set, one for each horizontal row. A single bit in error causes exactly one column check and one row check to fail, and the errored bit lies at their intersection. Its correct value is the value that will, with the other bits, produce the correct error detection codes.
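The row-and-column example can be made concrete with a small sketch. It assumes the simplest possible check, one even-parity bit per row and per column, and it assumes the check bits themselves arrive undamaged; the function names are invented here. With these checks a single flipped bit makes exactly one row check and one column check fail, and their intersection locates the bit.

```python
def parities(grid):
    """Return (row parities, column parities) of a square bit array."""
    rows = [sum(r) % 2 for r in grid]
    cols = [sum(c) % 2 for c in zip(*grid)]
    return rows, cols


def correct_single_error(grid, sent_rows, sent_cols):
    """Return a repaired copy of grid, or raise if repair is impossible."""
    rows, cols = parities(grid)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, sent_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, sent_cols)) if a != b]
    fixed = [row[:] for row in grid]
    if not bad_rows and not bad_cols:
        return fixed                          # nothing to repair
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1  # flip the bit at the intersection
        return fixed
    raise ValueError("error detected but not correctable")


original = [[1, 0, 1],
            [0, 1, 0],
            [1, 1, 1]]
sent_rows, sent_cols = parities(original)

received = [row[:] for row in original]
received[1][2] ^= 1                           # one bit flipped in transit

repaired = correct_single_error(received, sent_rows, sent_cols)
print(repaired == original)                   # True
```

If two bits in the same row are flipped, the row check still passes while two column checks fail, so the function raises instead of guessing: the damage is detected but cannot be corrected, which matches the detect (N+1), correct N pattern for N = 1.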
FEC is commonly used in high-speed modems, where the transmission error rate is high enough to justify the overhead. Another application is where retransmission is effectively impossible, as from a deep space probe that is light-minutes or more from the receiving station.