Tuesday, September 27, 2016

Check digit: the poor man's error correction code

The Israeli ID number and the Luhn Algorithm


The Israeli ID number, assigned to every Israeli resident or citizen, has 9 digits: the leftmost 8 digits are the actual number, whereas the rightmost digit is a check digit.
The check digit is computed using the Luhn algorithm, which works as follows:
1. Multiply the digits in even positions (counting from the left, starting at 1) by 2.
2. Add the digits in the odd positions (not including the check digit itself) to the digit sums of the products from (1) (e.g. 18 contributes 1 + 8 = 9).
3. Set the 9th digit to be the number required to complete the sum from (2) to the next multiple of 10.

For example


The number 99035612 will be processed as follows:
9 - odd place, leave as is -> 9
9 - even place, multiply by 2 and take sum of digits -> 18 -> 9
0 - odd place, leave as is -> 0
3 - even place, multiply by 2 -> 6
5 - odd place, leave as is -> 5
6 - even place, multiply by 2 and take sum of digits -> 12 -> 3
1 - odd place, leave as is -> 1
2 - even place, multiply by 2 -> 4

Sum all the above to obtain 37, which requires adding 3 to complete it to the next multiple of 10; thus 3 is the check digit.
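A minimal Python sketch of this computation (the function name is mine, not anything official):

```python
def israeli_id_check_digit(first_8_digits: str) -> int:
    """Check digit for the 8 leading digits of an Israeli ID (Luhn-style)."""
    total = 0
    for i, ch in enumerate(first_8_digits):
        d = int(ch)
        if i % 2 == 1:       # even position when counting places from 1
            d *= 2
            if d > 9:        # take the digit sum, e.g. 18 -> 1 + 8 = 9
                d -= 9
        total += d
    return (10 - total % 10) % 10

assert israeli_id_check_digit("99035612") == 3   # the worked example above
```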

[Image: Had this been a valid ID, the check digit would have been 2 (not 9). Image source: nbn.org.il]

Why all the fuss?


The most obvious idea for a check digit would be the sum of the digits modulo 10, so why all this odd-even business? A plain digit sum is unchanged by reordering the digits, so it misses every transposition; the doubling of alternate digits is intended to catch the very common transcription mistake of transposing two adjacent digits, and along the way it also catches most twin-digit mistakes (i.e. aa -> bb). In addition, it is simple, and therefore realistically usable by clerks in the pre-computer age.
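As a tiny illustration (the helper name is mine):

```python
def digit_sum_check(number: str) -> int:
    # The "obvious" check digit: plain sum of digits modulo 10.
    return sum(int(ch) for ch in number) % 10

# Swapping adjacent digits never changes a plain digit sum,
# so this check digit misses every transposition.
assert digit_sum_check("12345678") == digit_sum_check("12435678")
```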


The Luhn Algorithm's "unfixable" fault


Interestingly, the Luhn algorithm cannot detect all transpositions of two adjacent digits. Specifically, the sequences 09 and 90 both contribute 9 to the sum computed in step (2) so this transposition will not cause a change in the check digit and will not be detected.
So I grabbed a piece of paper and tried to come up with a similar algorithm that would catch *all* transpositions, only to find out that, if we restrict ourselves to similar algorithms (to preserve the simplicity of the computation), there is no such solution.
To be precise: let $f:\{0,...,9\}\rightarrow\{0,...,9\}$ be a permutation; then there exist $a \neq b \in \{0,...,9\}$ such that $f(a) + b \equiv a + f(b) \pmod{10}$.
My friend, Lior Yanovski, suggested an elegant proof for this:
Assume, for the sake of contradiction, that $a\neq b \Rightarrow f(a) + b \not\equiv a + f(b) \pmod{10}$.
It follows that for $a \neq b$ we have $f(a) - a \not\equiv f(b) - b \pmod{10}$, which means the function $g(x) := f(x) - x \pmod{10}$ is injective, and hence a permutation of $\{0,...,9\}$.
If we sum $g(x)$ over $x$ we obtain $\sum_x f(x) - \sum_x x \equiv 0 \pmod{10}$, simply because both $f$ and the identity are permutations of $\{0,...,9\}$, each summing to 45. On the other hand, if $g$ were indeed a permutation, summing its values would likewise give $45 \equiv 5 \pmod{10}$, a contradiction!
So if we were to look for a function to replace the "multiply by 2 and take the sum of digits" step of the Luhn algorithm, while keeping the rest of the algorithm's structure and catching all adjacent-digit transpositions, we would not find one.
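This is also easy to confirm by brute force over all $10!$ permutations; a quick sketch (takes a minute or so to run):

```python
from itertools import permutations

# For every permutation f of {0,...,9}, verify that g(x) = f(x) - x (mod 10)
# repeats a value, i.e. there exist a != b with f(a) + b == a + f(b) (mod 10),
# so the transposition of a and b would go undetected.
def misses_some_transposition(f) -> bool:
    return len({(f[x] - x) % 10 for x in range(10)}) < 10

assert all(misses_some_transposition(f) for f in permutations(range(10)))
print("No replacement for 'multiply by 2 and sum digits' catches everything.")
```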


[Image: Hans Peter Luhn, 1896-1964. Image source: asist.org]


Fix #0: Find only transpositions


We could multiply every digit by its position and take the sum modulo 10, i.e. define the check digit to be $\sum_{i=1}^{n} i \cdot a_i \pmod{10}$. This detects all adjacent transpositions, but it does a very poor job of detecting single-digit errors. Since 10 shares a common factor with many of the weights in $1,...,9$, multiplying by such a weight and reducing modulo 10 discards information. For example, $5\cdot x \pmod{10}$ is always either 0 or 5 (for integer $x$), which means a random error in the fifth digit changes the check digit only about half the time, and similar, milder, problems arise for digits in even positions.
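A sketch of this scheme, together with the weakness just described (the function name is mine):

```python
def positional_check_digit(number: str) -> int:
    # Check digit = sum of position * digit, positions counted from 1, mod 10.
    return sum(i * int(ch) for i, ch in enumerate(number, start=1)) % 10

# An adjacent transposition shifts the sum by (a - b), so it is caught...
assert positional_check_digit("1234") != positional_check_digit("1324")
# ...but changing the 5th digit by an even amount goes unnoticed,
# since 5 times an even difference is a multiple of 10.
assert positional_check_digit("123450") == positional_check_digit("123470")
```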

Fix #1: Abandon base 10


A closer inspection reveals that the proof of the Luhn algorithm's fault relies heavily on the fact that 10 is even: the expression $\sum_{i=1}^{n} i \pmod{n}$ equals $\frac{n}{2}$ for even $n$ but $0$ for odd $n$. Had we used an odd base for the entire computation, things would be different, though perhaps less intuitive to compute by hand.
Indeed, some check digit systems, such as the one used in ISBN-10 (the 10-digit book numbering standard), employ a similar trick, with 11 as the modulus of the computation.
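The point of 11 is that it is prime, so every positional weight is invertible modulo 11 and no weight discards information the way 5 does modulo 10. For instance, the ISBN-10 check character is a weighted sum modulo 11, with the leftover value 10 written as 'X'; a rough sketch:

```python
def isbn10_check_char(first_9_digits: str) -> str:
    # Weighted sum of the first nine digits, reduced modulo 11;
    # the value 10 cannot be a single digit, so it is written as 'X'.
    total = sum(i * int(ch) for i, ch in enumerate(first_9_digits, start=1))
    check = total % 11
    return "X" if check == 10 else str(check)

assert isbn10_check_char("030640615") == "2"   # e.g. ISBN 0-306-40615-2
```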


Fix #2: Abandon simplicity


If we allow yet more complex computations, then there are algorithms that produce a check digit detecting all single-digit errors as well as all transpositions of adjacent digits. The oldest is probably the Verhoeff algorithm, from 1969. Verhoeff, who is now retired and an avid creator of math-inspired sculptures, took an empirical approach, analyzing data from the Dutch postal service and classifying and counting common errors. He went on to create a code that not only detects all adjacent-digit transpositions and single-digit errors, but also catches errors based on phonetic similarities in Dutch, such as confusing 15 (vijftien) with 50 (vijftig).

[Image: Jacobus Verhoeff, with his sculptures. Image source: groene.nl]
More recently (in 2004), H. Michael Damm proposed an algorithm that is slightly less complicated (but only very slightly) and has similar qualities.
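I won't reproduce Damm's actual table from memory, but the table-driven procedure itself is short. A rough sketch, assuming `table` is a suitable 10x10 lookup table (for Damm, one describing a totally anti-symmetric quasigroup) whose rows are permutations of 0..9:

```python
def table_check_digit(number: str, table) -> int:
    # Fold the digits through the table, starting from an interim value of 0.
    interim = 0
    for ch in number:
        interim = table[interim][int(ch)]
    # The check digit is the one that folds the interim value back to 0;
    # it exists and is unique because each row is a permutation of 0..9.
    return table[interim].index(0)

def is_valid(number_with_check_digit: str, table) -> bool:
    interim = 0
    for ch in number_with_check_digit:
        interim = table[interim][int(ch)]
    return interim == 0
```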



Some tests


Running some quick scripts in Python, I wanted to find out how well the Luhn algorithm, the Damm algorithm (I skipped implementing Verhoeff's), and the algorithm proposed in Fix #0 above fare. I also added two more algorithms for fun:

  • Naive - $\left|\sum_{i=1}^{n-1} (a_i - a_{i+1})\right| \pmod{10}$.
  • RandomTable - the Damm algorithm is built around a table describing a quasigroup of order 10, so I wondered how well a random table would do: I took 10 random permutations of 0..9, stacked one above the other, and used the result the same way the Damm algorithm uses its table (sketched right after this list).
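A rough sketch of the RandomTable idea, using the same digit-by-digit scan as in the Damm-style sketch above:

```python
import random

# RandomTable: 10 random permutations of 0..9 stacked as rows, then used
# exactly like the Damm-style table scan above.
random_table = [random.sample(range(10), 10) for _ in range(10)]

def random_table_check_digit(number: str, table=random_table) -> int:
    interim = 0
    for ch in number:
        interim = table[interim][int(ch)]
    return table[interim].index(0)   # the digit that folds the scan back to 0
```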

The errors introduced were:

  • Adjacent transposition: 12345 -> 12435
  • One-over transposition: 12345 -> 14325
  • Random 2-digit transposition: 12345 -> 42315
  • Single digit random error: 12345 -> 19345
  • Two-digit random errors: 12345 -> 22346
  • 3, 4, 5... 9 random errors.
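The miss rates were collected with a harness along these lines (a simplified sketch rather than the original scripts; `check_fn` stands for any of the check-digit functions above, and `corrupt_fn` for one of the error models, such as the adjacent transposition shown):

```python
import random

def miss_rate(check_fn, corrupt_fn, digits=8, trials=100_000) -> float:
    """Fraction of actual errors that leave the check digit unchanged."""
    misses = changed = 0
    for _ in range(trials):
        number = "".join(random.choice("0123456789") for _ in range(digits))
        corrupted = corrupt_fn(number)
        if corrupted == number:      # e.g. a transposition of two equal digits
            continue
        changed += 1
        if check_fn(corrupted) == check_fn(number):
            misses += 1
    return misses / changed

def adjacent_transposition(number: str) -> str:
    # Swap a random pair of adjacent digits.
    i = random.randrange(len(number) - 1)
    chars = list(number)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# e.g. miss_rate(israeli_id_check_digit, adjacent_transposition)
```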

The table shows the miss rate. Note that 10% is the magic number they all converge to, given enough noise. This is because if we choose a random function $f:\{0,...,9\}^n\rightarrow\{0,...,9\}$, then for two random inputs $x \neq y \in \{0,...,9\}^n$ we clearly have $Pr(f(x) \neq f(y)) = 0.9$. A random function on random data has a 10% miss rate, so a useful algorithm tries to do significantly better than 10% on specific error types, without causing spikes in the miss rate for other errors.



error \ algo             Luhn    Damm    Fix #0    Naive    RandomTable
Adjacent transposition     2%      0%       0%      70%      34%
One-over transposition    87%     10%       9%      63%      22%
Random transposition      45%      8%      10%      53%      27%
1-digit error              0%      0%      10%      67%      36%
2-digit error             10%     10%      10%      44%      25%
3-digit error             10%     10%      10%      27%      18%
4-digit error             10%     10%      10%      17%      14%
5-digit error             10%     10%      10%      12%      11%
6-digit error             10%     10%      10%      10%      11%
7-digit error             10%     10%      10%      10%      10%
8-digit error             10%     10%      10%      10%      10%
9-digit error             10%     10%      10%      10%      10%



Error correction codes for edit distance?


So, while this is a pre-computing-age problem that is by now just a toy problem, it highlights an interesting issue: standard error-correction codes treat a very specific domain, where errors are modeled as noisy coordinates of a vector, so the relevant notion of distance is simply the Hamming distance. Are there good codes for domains where the distance is more complicated, such as the Levenshtein (edit) distance?


It turns out that once we tweak the noise model, we quickly step into the realm of open problems. For the Binary Erasure Channel, which models noise as erasing bits (the receiver knows which positions were lost), we've known the capacity ever since it was introduced; for the Deletion Channel, which models noise as removing bits from the input string with no indication of where, the capacity is still an open problem.
To summarize: edit-distance correction codes are an interesting idea, but I doubt we'll see much progress on them soon, as even the fundamental questions of information theory turn out to be extremely challenging in this domain.