Programming Praxis: Pearson Hashing

29 May 2018

Here is an implementation in Racket of the Pearson hash exercise from programming praxis. It's based on a paper by Peter K. Pearson from 1990 [1].

To compute the Pearson hash of a message, each byte of the message is subsequently added to the accumulated hash modulo 256 and looked up in a substitution table. The initial hash value is set to zero. The substitution table is a random permutation of 8-bit numbers. It is also possible to use XOR instead of addition modulo 256. Here is a procedure for generating random substitution tables using Racket's shuffle:

(define (generate-substitution-table)
  (define original-table (range 256))
  (list->vector (shuffle original-table)))

(define (print-table tbl)
  (define row 0)
  (for ([val tbl]
        [i (vector-length tbl)])
    (printf "~a " (~a val #:width 3 #:align 'right))
    (when (zero? (modulo (+ i 1) 16))
      (set! row (+ row 1))
      (printf "\n"))))

(random-seed 0)
(define substitution-table (generate-substitution-table))
(print-table substitution-table)
 80 245 227  73 137 130 182 255 122  18 200 204   5 150 133 125 
 14 123  30 209  16  93 139 110 244 111 213 102  21 135  26   8 
214  91  13  87  36 235 226  67 246 145  48 188 208 219  66 180 
 25 131 233  42 193  90 172 239  29  52 101 184 170  55 241   7 
134 194  77 196  43 177 232   4 234  20 140 112   2 173  12  37 
147  81 142 178  89  53  32  46  38 141 109 169  84  41  35  86 
156 126  39 155  70 253 222  63 237  47   3 162  97 199 212 151 
197  19 132 176 252 157 242 113  94 247 159 190 189 107  83  58 
114 161 223  92  40 211  74 215   0  71 236 119  99  54  85 202 
191 124   1  56 149 165 186 105  34 243 121 206  33 250 217  49 
103 220 183  27 118 195  72  62  22   9 166 128 205 181 117  45 
216   6 192  76 231 136  79  31 164 221 163 129 185 228  98  68 
 11 144  50 249  59 160  61  60 116  17  23 168 153  57 143  95 
 88 198  24  51 146  64 138  65 225  69 152 108 218  78 154 201 
187  75 240 148  82 203  96 207 171 179 254 127 248 174 115  44 
106 210 251  15 100 229 238 120 167 175 224  10 104 230  28 158 

This algorithm will compute a 8-bit hash for any message. To generate 16-bit hashes, the same message is hashed again, but this time, 1 is added to the first byte modulo 256. The second hash byte is appended to the first one. This can be extended to compute hashes longer than one byte. Though, after 256 bytes, the hash bytes will start repeating.

(define (pearson-hash-n-bytes message tbl fn n)
  (define result (make-vector n))
  (for ([i n])
    (vector-set! result i
                 (for/fold ([sum i]) ([byte message])
                   (vector-ref tbl (fn sum byte)))))
  result)

(define (+mod256 a b) (modulo (+ a b) 256))

Let's compute the hash for some messages.

(define message #"Programming Praxis")

(pearson-hash-n-bytes message substitution-table bitwise-xor 1)
#(11)

And here is computation of Pearson hashes side by side using two procedures: addition modulo 256 and XOR.

(for ([n-bytes (range 1 11)])
  (printf "~a: ~a ~a~n" 
          (~a #:min-width 2 #:align 'right n-bytes)
          (~a #:min-width 40
              (pearson-hash-n-bytes message substitution-table +mod256 n-bytes))
          (pearson-hash-n-bytes message substitution-table bitwise-xor n-bytes)))
 1: #(20)                                    #(11)
 2: #(20 101)                                #(11 95)
 3: #(20 101 29)                             #(11 95 132)
 4: #(20 101 29 175)                         #(11 95 132 159)
 5: #(20 101 29 175 14)                      #(11 95 132 159 35)
 6: #(20 101 29 175 14 75)                   #(11 95 132 159 35 14)
 7: #(20 101 29 175 14 75 170)               #(11 95 132 159 35 14 155)
 8: #(20 101 29 175 14 75 170 6)             #(11 95 132 159 35 14 155 219)
 9: #(20 101 29 175 14 75 170 6 179)         #(11 95 132 159 35 14 155 219 86)
10: #(20 101 29 175 14 75 170 6 179 67)      #(11 95 132 159 35 14 155 219 86 85)

A sequence of numbers is ideal for representing a hash value, but is nicer to read in string form. Here is a procedure for converting sequences of numbers into strings of hex numbers.

(define (hash-vector->string hash-vec)
  (for/fold ([result ""]) ([byte hash-vec])
    (~a result (~r byte #:base 16 #:min-width 2 #:pad-string "0"))))

(printf "~a\n" (hash-vector->string #(1 2 3)))
(printf "~a\n" (hash-vector->string (range 0 32)))
010203
000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f

References

[1] Peter K. Pearson. Fast hashing of variable-length text strings. Commun. ACM, 33(6):677–680, June 1990. [ DOI | http ]