Error correction Codes

Error correction codes give us the ability to identify when
there has been a mistake, and potentially to correct that mistake.  Why do we
care?  Following reason:

    Dynamic memory cells use a very small capacitor to store or not store a
    charge.  Originally these capacitors held many millions of electrons, but
    over the years as the size of the devices has shrunk, as has the number
    of electrons stored.  And?  Think of throwing a rock into a large body of
    water.  Little happens, right?  A splash, but the overall water level does
    not change measurably.

    Now imagine you throw the same rock into a puddle on the sidewalk.  It is
    obviously possible for the rock to knock all the water out of the puddle,
    leaving nothing but a wet rock.  This is analogous to what is happening
    in semiconductors - the capacitors are getting so small that they only
    store a few thousand electrons, and if some form of radiation comes along
    (playing the part of the rock in the above example) the stored charge is
    completely lost.  This happens more and more in current technologies.  
    Sometimes the radiation comes from the packaging itself!

How do we deal with this?  Think about a regular binary code:

000
001
010
011
100
101
110
111

Suppose we had bit pattern 100, and the bottom bit changed.  The resulting bit
pattern would be 101, which is a valid code word, so we would have no way to
tell something bad has happened.  

IN ORDER TO MAKE ERROR DETECTION WORK, we need both *valid* and *invalid* code
words.  All we have to do is ensure all valid code words require at least 2 
bits to change to move between them.  (Remember Hamming distance?  The Hamming 
distance is the number of bits that change between two patterns.  The Hamming 
distance between 1001 and 1110, for example, is 3.  Between 1001 and 0110, 4.)  
What we are saying is we have to add enough bits to the code so that it is 
possible to identify invalid code words.  

Example - Take the above 3-bit set of code words, add 1 more bit.  Set this bit
to be 1 if the XOR of all the bits in the word is a 1, and set it to a zero if
the XOR is a zero.  What do we have?

0000
0011
0101
0110
1001
1010
1100
1111

The extra bit added is called a Parity Bit, and in this case we are calculating
"even" parity (the number of bits in the final word is an even number).  If we
were doing odd parity, the extra bit would be set to make the count of 1's in
the word an odd number.

Why is this useful?  Let's go back to the previous example.  We had a 100 which
changed to 101.  Well, in this case, 100 would be 1001, and so if it changed to
1011, we can see that this is not a valid code word.  It does not have even
parity, so I know something has gone wrong.  I can't work backward from 1011 -
I don't know which original bit pattern had a bit go bad, all I know is
something went wrong.

Parity bits are used extensively in computers.  Modems used to have this
problem a lot - you had to get the parity set right on both ends or
communications wouldn't work.

We can draw it something like this:  (A and B are valid code words - in class
we drew a circle around each with the circles touching.  All the points on the 
circle around A represent code words that have a Hamming distance 1 from A, 
all the points on the circle around B have hamming distance 1 from B).

    A    X    B
    .    .    .

The point (X) where the circles touch is a code word that is reachable by A and 
B but not a valid code word.  In our above example, A would be 1001, X would be
1011, and B could be 1010, or 1111, or 0011, etc.  (All of these patterns are 1
away from 1011.)

So making the Hamming distance between valid code words at least two will allow
us to do Single Error Detection (SED).  What if we make the distance 3?
Picture would look like this:  (Circle around A goes through W, circle around B
goes through Y).

    A    W    Y    B
    .    .    .    .

By making the distance three, then if a single bit changes we can identify
which valid code word it came from and correct it.  This approach will allow us
to identify and correct single errors (SEC, Single Error Correction).  The
problem is, what if a double error occurs?  If we have bit pattern A and two
bits change, then it will map to pattern Y, and we will incorrectly correct
into B.

If we make the Hamming distance 4, like this, with the circles still going
around A through W and around B through Y:

    A    W    X    Y    B
    .    .    .    .    .

Then if a single bit in pattern A changes, we will be able to detect and
correct it, and if two bits in A change, it will map to point X, which is
not in either A or B's sphere of influence.  Thus, with this approach we can
correct single errors and detect double errors (SECDED, Single Error Correction
Double Error Detection).

One way to think of this, if you are a science fan, is to think about
gravitational wells.  Imagine A and B are planets of equal mass.  Then there
will be a point half-way between them that will have equal gravitational pull,
and thus there would be no way to know which planet a body at that point came
from.  This would be point X.  Locations inside each planetary well would fall
back to that planet (points W, Y).

How do we ensure valid code words have a hamming distance of 3 between them?
By adding parity bits.  In the case of a 4-bit data word, we can do it like
this:

 7   6   5   4   3   2   1	Bit position
d3  d2  d1  C2  d0  C1  C0 	Where d=data, C=Check (parity)

C0 will be the parity bit over bits 3,5,7
C1 will be the parity bit over bits 3,6,7
C2 will be the parity bit over bits 5,6,7

Where did this come from, you ask?  Well, for this to work each data bit must 
be covered (checked) by at least 2 parity bits.  It doesn't matter which parity
bits cover which bits, but if you do it this way it has some really slick
advantages which we will demonstrate below.

Which bits are checking which?  Well, C0 is in location 1, and it
checks all the bits (except for itself) with bit 1 high (3,5,7,9,11,13, etc.)
C1 is in location 2, and it checks all bits (except for itself) with bit 2 high
(3,6,7,10,11,14,15), C2 checks all bits (except itself) with bit 4 high
(5,6,7,12,13,14,15), etc.

0000 (0) 
0001 (1)
0010 (2)
0011 (3)
0100 (4)
0101 (5)
0110 (6)
0111 (7)
1000 (8)
1001 (9)
1010 (10)
1011 (11)
1100 (12)
1101 (13)
1110 (14)
1111 (15)

How does this all work?  Here's an example:

Some circuit generates the valid code word 

0 1 1 0

We want to send it to memory, but memory has been acting up so we are going to
use SEC to help ensure we get the right value back eventually. 

Encoding proceeds as follows:  We need to generate a C0, C1 and C2 to send
along to memory with the data bits.  So we start with 

 7   6   5   4   3   2   1	Bit position
d3  d2  d1  C2  d0  C1  C0 	Where d=data, C=Check (parity)

 0   1   1       0

We calculate C0 by doing even parity over bits 3,5,7, which in this case is 
0 1 0.  Thus, C0 = 1.

We calculate C1 by doing even parity over bits 3,6,7, which in this case is 
0 1 0.  Thus, C1 = 1.

We calculate C2 by doing even parity over bits 5,6,7, which in this case is 
1 1 0.  Thus, C2 = 0.

(Note the parity bits are only calculated over data bits).

Plugging these new parity bit values into the code, we wind up with the pattern

 0   1   1   0   0   1   1

Which gets sent to memory.  This is the encoded version of the data word 0110.

Some time later, we fetch this value from memory, and we get

 0   0   1   0   0   1   1

Is this a valid code word?  Check by recreating the check bits and see if they
match the ones we got:

We calculate C0 by doing even parity over bits 3,5,7, which in this case is 
0 1 0.  Thus, C0 = 1.

We calculate C1 by doing even parity over bits 3,6,7, which in this case is 
0 0 0.  Thus, C1 = 0.

We calculate C2 by doing even parity over bits 5,6,7, which in this case is 
1 0 0.  Thus, C2 = 1.

This is the slick part - because of the way we have laid this out, we can now
XOR the newly generated check bits with the check bits in the retrieved word,
and the result will tell us exactly where in the word the error occured.
Watch:


    0   0   1   0   0   1   1
XOR             1       0   1
-----------------------------
                1       1   0

This says location 6 was the spot that changed.  Bingo!  Just like magic, if
you look at the original and this one, you will see that in fact location 6
(data bit d2) was the one that changed.  (Is that cool or what?  :-)

This is why the numbers of the bits run from 1 to 7, instead of the usual 0-6.
We need to be able to reserve a bit pattern that indicates that there were no
errors.  When the recreated check bits match the original check bits, then the
result of the XOR is 0, so we use that bit position to indicate no errors.
(Note there is no location 0 in the layout).

Another example - take our original pattern 

 0   1   1   0   0   1   1

and change bit 2 (for example).  

 0   1   1   0   0   0   1

What happens?  Generating check bits for this pattern produces C0=1, C1=1 and
C2=0.  The XOR

    0   1   1   0   0   0   1
XOR             0       1   1
-----------------------------
                0       1   0


indicates location 2 is the culprit, which in fact it is!

(Editor's note - when *checking* the returned value, you can actually collapse 
the last two steps.  If you calculate 

C0 = 1,3,5,7,9, ...
C1 = 2,3,6,7,10,11, ...
C2 = 4,5,6,7,12,13,14,15, ...

then the resulting value (C2,C1,C0) will be the location of the error.  If you 
think about it, you are just combining the two steps - instead of generating
the check bits and then XORing them with the ones that are there, you are doing
both calculations at the same time.  If this makes sense, great!  If it
doesn't, don't worry about it.)

How many parity bits are needed?  Well, the equation is the following:

2^K - 1 >= M+K, where M=data and K=check bits.  This makes sense if you think
about it - if you are going to use the check bits to point at the location that
has a bit bad, then you have to have enough parity bits to point at all the
locations.  If K=3, then 2^K - 1 = 7, and M can be up to 4.  What if you have
M=5?  5+3 = 8, and while it is possible for 3 bits to point at 8 things,
remember that one of the patterns has to be reserved for the "no error"
indicator (the all-zero pattern).  So if you have 8 data bits, for example, you
would need 4 check bits (8+4=12, and 4 check bits can point at 15 errors).  

Why is this included in the combinational circuit part of the class?
Because it is obviously just a bunch of XOR's.  You take the data word
generated by the processor, run it through a fixed set of XOR gates to generate
the check bits, then ship the result off to memory.  When a word is received
from memory, you repeat the process with an extra set of XOR's at the end to
compare the fetched parity bits from the newly created ones, and use the result
of that XOR to flip the appropriate bit (unless it is the all-clear pattern, of
course).

Some technical details - in order to detect T errors, the distance between 
valid codewords is 2T.  In order to correct T errors, the distance between
valid codewords has to be 2T+1.  In order to detect (T+1) errors and correct T
errors, the distance has to be 2T+1+1.

In the examples we did, T was one.  Thus the distance between valid codewords
for Singe Error Detection was 2, for SEC was 3, and for SECDED was 4.  If you
wanted to detect double errors, T would be 2 and the distance between valid
code words would be 4 (2T).  (Note in SECDED we are detecting double errors,
and sure enough the distance is 4!)  If you wanted to correct double errors,
the distance would be 5, etc.

Where do these circuits exist in real life?  Think about space travel - when
you shoot a satellite into space, it becomes extremely difficult to pop out and
replace a faulty part.  Therefore, you want to build in as much correction as
you can afford, so that if a part fails you can still proceed.  This is the
case for any circuitry that is placed in a "hostile environment" - robots that
are used to run around in a Nuclear plant, for example, can benefit a lot from
error correction.  Also, certain machines that prize speed over all else
(CRAY-class machines) don't want to design for the worst-case path, they want
to design for the best-case path.  They want to run right on the hairy edge of
correctness, and by adding circuitry to correct things when they inevitably
fail, they don't have to stop the machine all the time.  The CRAY has a 64-bit
data word, but actually uses 72 bits per word with 8 bits/word of SECDED.