Re: Computer troubles...
You really have to consider how the errors are likely to be introduced.
Do you want to detect if the memory device has been removed? Failed completely? Do bits "age"? If so, do they age-to-0 or age-to-1? Are there failure mechanisms inside the device that cause groups of bits to fail (e.g., a shared sense amplifier)?
Optical media, for example, assumes there will be scratches that will take out MANY "bits" -- but that they will be related IN SPACE (cuz scratches are "contiguous events" across the surface).
The other problem is that someone may just inherit a checksum algorithm in a body of code and not consider how the hardware may have changed/evolved in the time since the algorithm was first designed/selected. So, the failure modes that it was designed to protect against might no longer be valid and, in fact, the algorithm may be ill-advised for the current hardware implementation!
I was tasked with making some changes to a product that relied on nonvolatile memory (BBSRAM -- battery-backed SRAM) to hold accounting data (i.e., money). The twit who had designed the system assumed that storing the data in triplicate would buy him reliability. In theory, it would let him detect and correct any single bit error in a replicated datum!
E.g., if the three copies of bit #29 of a datum appear as (1,1,1), then you can probably assume it represents a '1'. Likewise, (0,0,0) to represent a '0'. The sets (0,0,1), (0,1,0) and (1,0,0) all suggest a '0' datum that has degraded and should be corrected to (0,0,0). Likewise, (1,1,0), (1,0,1) and (0,1,1) suggest a '1' datum that has been degraded and should be corrected to (1,1,1).
Fine: with three copies of each bit, the Hamming distance between the two valid patterns (000 and 111) is 3, which only allows a single bit error per position to be detected and corrected.
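Done right, that per-bit vote is cheap. A minimal sketch in C (the name vote3 and the word width are my own choices, not from the actual product):

    #include <stdint.h>

    /* Bitwise 2-of-3 majority vote across three copies of a word.
     * Each result bit is 1 iff at least two of the three copies
     * have a 1 in that position. */
    static uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

Applied per word, this decides every bit position independently, so it corrects a single flipped bit in EACH position -- no matter how many positions were hit.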
But when you treat larger data (e.g., long words) as composite entities in your check algorithm INSTEAD OF as groups of bits, you throw away most of the benefit of this redundancy.
E.g., if you have the nybbles '9', '9' and '1' (1001, 1001, 0001), you can note that 1 != 9 but 9 == 9 so the '1' should be corrected to a '9' -- a single bit was in error.
OTOH, if you have '9', '8' and '1', you might say 9 != 8, 8 != 1, 9 != 1 and, therefore, you have no way to recover (no two nybbles are the same!). If, instead, you treat this as groups of 4 bits and notice that you really have two single bit errors (1001, 1000, 0001), then you can correct both of them. But, not if you treat the nybbles as the raw data! I.e., the twit had naively treated the nybbles as the data and could thus only correct a small subset of errors.
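To make the contrast concrete, here is a toy sketch (my own code, assuming the nybbles sit in uint8_t; vote_words and vote_bits are hypothetical names, not the product's routines) showing the word-level vote failing on '9'/'8'/'1' while the bitwise vote recovers the '9':

    #include <stdint.h>
    #include <stdio.h>

    /* Word-level vote: succeeds only if at least two copies agree. */
    static int vote_words(uint8_t a, uint8_t b, uint8_t c, uint8_t *out)
    {
        if (a == b || a == c) { *out = a; return 0; }
        if (b == c)           { *out = b; return 0; }
        return -1;  /* no two copies agree: unrecoverable at this level */
    }

    /* Bitwise vote: each bit position is decided independently. */
    static uint8_t vote_bits(uint8_t a, uint8_t b, uint8_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

    int main(void)
    {
        uint8_t a = 0x9, b = 0x8, c = 0x1;  /* 1001, 1000, 0001 */
        uint8_t w;

        if (vote_words(a, b, c, &w) != 0)
            printf("word-level vote: unrecoverable\n");

        printf("bitwise vote: 0x%X\n", vote_bits(a, b, c));  /* prints 0x9 */
        return 0;
    }

Same stored bits, same two single-bit errors; the only difference is the granularity at which the vote is taken.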