Hashing

Notation:

  • $U$: a large universe of “keys” (i.e., keys come from some large universe $U$).
  • $S$: the actual dictionary, some subset of this universe.
    • There is some set $S ⊆ U$ of keys we actually care about (which may be static or dynamic).
    • $N = |S|$. Think of $N$ as much smaller than $|U|$.
  • $M$: the size of an array $A$, together with a hash function $h: U \rightarrow \{0, ..., M - 1\}$.
    • Given an element $x$, the idea of hashing is to store it in $A[h(x)]$.
    • A collision happens when $h(x) = h(y)$ for distinct keys $x \neq y$.
  • $H$: A randomized algorithm for constructing hash functions $h$, used in universal hashing

Operations/Complexity (assuming each bucket of $A$ is a linked list; a minimal sketch follows the list):

  • add(x): add the key x to $S$. $\rightarrow O(1)$
  • query(q): is the key $q ∈ S$? $\rightarrow O(length(A[h(q)]))$
  • delete(x): remove the key x from $S$. $\rightarrow O(length(A[h(x)]))$
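
A minimal Python sketch of this chaining setup, assuming a user-supplied hash function $h$ (the class and method names below are illustrative, not from the notes):

```python
class ChainedHashTable:
    """Hash table with separate chaining: slot A[i] holds a list of keys."""

    def __init__(self, M, h):
        self.M = M                        # table size
        self.h = h                        # hash function h: U -> {0, ..., M-1}
        self.A = [[] for _ in range(M)]   # each slot plays the role of a linked list

    def add(self, x):
        """O(1): append x to its bucket (assumes x is not already present)."""
        self.A[self.h(x)].append(x)

    def query(self, q):
        """O(length(A[h(q)])): scan the bucket for q."""
        return q in self.A[self.h(q)]

    def delete(self, x):
        """O(length(A[h(x)])): scan the bucket and remove x if present."""
        bucket = self.A[self.h(x)]
        if x in bucket:
            bucket.remove(x)

# Example usage with a deliberately naive hash function:
T = ChainedHashTable(M=8, h=lambda x: x % 8)
T.add(10); T.add(18)
print(T.query(18), T.query(3))   # True False
```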

Cases to Analyze

  • Static case: don’t care about adding and removing keys, we just care about fast query times
  • Dynamic case: keys are inserted over time; here we focus on the insertion-only case.

Desired properties

  • We do not have too many collisions
  • We would like our scheme to achieve the first property (not too many collisions) without needing the table size $M$ to be much larger than the number of elements $N$, i.e., $M = O(N)$.
  • The function $h$ is fast to compute.
  • Intuition: We can’t just use a random number generator to decide where the next element goes, because then we would never be able to find it again. So we want $h$ to be “pseudorandom” in some formal sense.

Key Idea

  • Use randomization in our construction of $h$
  • What we will show: for any sequence of insert and lookup operations (we won’t need to assume the set $S$ of elements inserted is random), if we pick $h$ in this probabilistic way, the performance of $h$ on this sequence will be good in expectation.

Claim 1

For any hash function $h$, if $|U| ≥ (N −1)M +1$, there exists a set $S$ of $N$ elements that all hash to the same location.

  • Proof: by the pigeonhole principle. Since $|U| ≥ (N − 1)M + 1$, some location in $\{0, ..., M − 1\}$ must receive at least $N$ elements of $U$ under $h$; take $S$ to be $N$ of them.

Universal Hashing

A randomized algorithm $H$ for constructing hash functions $h: U → \{0, ..., M − 1\}$ is universal if for all $x \neq y \in U$, we have $Pr_{h \leftarrow H} [h(x) = h(y)] \leq \frac{1}{M}$.

  • Calculation method: for every pair of distinct keys $a, b$, check that $\frac{|\{h \in H : h(a) = h(b)\}|}{|H|} \leq \frac{1}{M}$
    • $M$: the number of possible hash values.
    • $|H|$: the number of hash functions in the family.
    • $|\{h \in H : h(a) = h(b)\}|$: the number of hash functions in $H$ that cause $a$ and $b$ to collide.

We also say that a set $H$ of hash functions is a universal hash function family if the procedure “choose $h ∈ H$ at random” is universal. (Here we are identifying the set of functions with the uniform distribution over the set.)
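
The calculation method above can be checked mechanically for a small family. Below is a hedged Python sketch: the family used, $h_{a,b}(x) = ((ax + b) \bmod p) \bmod M$ with prime $p$, $a \in \{1, ..., p-1\}$, $b \in \{0, ..., p-1\}$, is the classic Carter–Wegman construction, which is not discussed in these notes but is known to be universal.

```python
from itertools import combinations

def is_universal(H, U, M):
    """Brute-force check: for every pair x != y in U, the fraction of
    functions in H that make them collide must be at most 1/M."""
    for x, y in combinations(U, 2):
        colliding = sum(1 for h in H if h(x) == h(y))
        if colliding / len(H) > 1 / M:
            return False
    return True

# Tiny instance: universe {0, ..., 6}, p = 7 (prime), M = 3 buckets.
# Carter-Wegman family (standard construction, not from the notes).
p, M = 7, 3
U = range(p)
H = [lambda x, a=a, b=b: ((a * x + b) % p) % M
     for a in range(1, p) for b in range(p)]
print(is_universal(H, U, M))   # expected: True
```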

Theorem 1

If $H$ is universal, then for any set $S ⊆ U$ of size $N$ and any $x ∈ U$ (e.g., a key that we might want to look up), if we construct $h$ at random according to $H$, the expected number of collisions between $x$ and other elements in $S$ is at most $N/M$.
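
Proof sketch (not spelled out in the notes, but immediate from linearity of expectation): each $y \in S \setminus \{x\}$ collides with $x$ with probability at most $1/M$, so

$$E_{h \leftarrow H}\Big[\big|\{y \in S \setminus \{x\} : h(y) = h(x)\}\big|\Big] = \sum_{y \in S \setminus \{x\}} Pr_{h \leftarrow H}[h(y) = h(x)] \leq \frac{N}{M}.$$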

  • So when $M = Ω(N)$, the expected cost of query(q) and delete(x) becomes constant.

Corollary 1

If $H$ is universal, then for any sequence of $L$ insert, lookup, and delete operations in which there are at most $M$ elements in the system at any one time, the expected total cost of the $L$ operations for a random $h ∈ H$ is only $O(L)$ (viewing the time to compute $h$ as constant).

Example of Constructing Universal Hash:

  • Notations:
    • keys $x$ are $u$ bits long
    • hash values are $b$ bits long $\rightarrow M = 2^b$
    • $h$: a random $b$-by-$u$ 0/1 matrix
    • hash function: $h(x) = hx$, where computing $hx$ means
      • adding up (mod 2) the columns of $h$ indicated by the 1 bits in $x$.
  • Analysis:
    • Take an arbitrary pair of distinct keys $x \neq y$; they must differ in at least one bit, say the $i$-th. WLOG say $x_i = 1$ and $y_i = 0$.
    • Condition on all columns of $h$ except the $i$-th: $h(y)$ is then fixed, and each of the $2^b$ possible settings of the $i$-th column gives a different value of $h(x)$ $\rightarrow$ there is exactly a $\frac{1}{2^b} = \frac{1}{M}$ chance that $h(x) = h(y)$.
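
A hedged Python sketch of this matrix construction (the function name is illustrative). The matrix is stored column by column, with each column packed into a $b$-bit integer so that “adding columns mod 2” is just XOR:

```python
import random

def random_matrix_hash(u, b, rng=random):
    """Pick h as a random b-by-u 0/1 matrix and return the function
    x -> hx, computed by XOR-ing the columns selected by the 1 bits of x."""
    columns = [rng.getrandbits(b) for _ in range(u)]   # each column is a b-bit integer

    def h(x):
        out = 0
        for i in range(u):
            if (x >> i) & 1:         # the i-th bit of x selects the i-th column
                out ^= columns[i]    # vector addition mod 2 == bitwise XOR
        return out                   # a value in {0, ..., 2^b - 1} = {0, ..., M - 1}

    return h

# Example: hash 32-bit keys into M = 2^10 buckets.
h = random_matrix_hash(u=32, b=10)
print(h(12345), h(54321))
```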

Another definition: $\ell$-universal or $\ell$-wise-independent hash function

A hash function family is $\ell$-universal if

  • For every set of $\ell$ distinct keys $x_1, x_2, ..., x_{\ell}$
  • And every set of values $v_1, v_2, ..., v_{\ell} \in \{0, ..., M - 1\}$

$Pr_{h \leftarrow H} [h(x_1) = v_1~AND~h(x_2) = v_2~AND~...~AND~h(x_{\ell}) = v_{\ell}] = 1/M^{\ell}$.

  • If $H$ is 2-universal then it is universal (see the derivation below).
  • The above matrix-based hash family was not 2-universal since the hash functions all mapped 0 to 0.
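
The derivation behind the first bullet (a short calculation, not written out in the notes): sum the 2-universal guarantee over the $M$ possible shared values,

$$Pr_{h \leftarrow H}[h(x) = h(y)] = \sum_{v=0}^{M-1} Pr_{h \leftarrow H}[h(x) = v~AND~h(y) = v] = M \cdot \frac{1}{M^2} = \frac{1}{M}.$$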

Perfect Hashing

If we fix the set $S$ (the dictionary), can we find a hash function $h$ such that all lookups are constant-time?

  • We say a hash function is perfect for $S$ if all lookups involve $O(1)$ work.

$O(N^2)$ space solution

  • Claim: Let $H$ be universal and $M = N^2$. Then $Pr_{h \leftarrow H} [\text{there is a collision in } S] < 1/2$, so a random $h$ is collision-free on $S$ with probability greater than $1/2$.
  • Proof:
    • There are $\binom{N}{2}$ pairs of distinct keys $x, y$ in $S$.
    • Since $H$ is universal, $Pr[h(x) = h(y)] \leq 1/M$ for each pair.
    • By linearity of expectation, $E[\text{number of colliding pairs}] \leq \binom{N}{2}/M = \frac{N(N-1)}{2N^2} < \frac{1}{2}$.
    • By Markov’s inequality, $Pr[\text{at least one collision}] < \frac{1}{2}$.
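
Since a random $h$ from a universal family is collision-free on $S$ with probability greater than $1/2$, a perfect hash function for a fixed $S$ can be found by simple retrying, with fewer than 2 attempts expected. A hedged Python sketch, reusing the Carter–Wegman-style family from earlier (the helper names here are illustrative):

```python
import random

def random_cw_hash(p, M, rng=random):
    """One random function from the Carter-Wegman family
    h_{a,b}(x) = ((a*x + b) mod p) mod M  (p prime, p > all keys)."""
    a = rng.randrange(1, p)
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % M

def find_perfect_hash(S, p):
    """Retry random h with M = N^2 until no two keys of S collide.
    Expected number of attempts is < 2 by the claim above."""
    M = len(S) ** 2
    while True:
        h = random_cw_hash(p, M)
        if len({h(x) for x in S}) == len(S):   # all keys land in distinct slots
            return h, M

S = [3, 14, 15, 92, 65, 35]
h, M = find_perfect_hash(S, p=101)             # 101 is prime and exceeds every key
print(sorted(h(x) for x in S))                 # 6 distinct slots in {0, ..., 35}
```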

$O(N)$ space solution
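
  • One standard approach (the Fredman–Komlós–Szemerédi two-level scheme, not worked out in these notes): first hash $S$ into a table of size $M = N$ with a universal $h$; then, for each bucket $i$ containing $n_i$ keys, build a secondary collision-free table of size $n_i^2$ using the $O(N^2)$-space idea above.
  • One can show $E[\sum_i n_i^2] = O(N)$ for a universal first-level $h$, so the total space is $O(N)$ in expectation while a lookup still takes $O(1)$ time (two hash evaluations).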
