Hashing
Notation:
- $U$: a large universe of “keys”,
- or "Keys come from some large universe $U$"".
- $S$: The actual dictionary is some subset of this universe.
- There is some set $S$ in $U$ of keys we actually care about (which may be static or dynamic).
- $N = |S|$. Think of $N$ as much smaller than the size of U.
- $M$: An array $A$ of some size $M$, where we use the hash function $h: U \rightarrow {0, ..., M - 1}$
- Given an element $x$, the idea of hashing is we want to store it in $A[h(x)]$.
- Collision happenes when $h(x) = h(y)$
- $H$: A randomized algorithm for constructing hash functions $h$, used in universal hashing
Operations/Complexity Assumed Linked-list Structure:
add(x)
: add the keyx
to $S$. $\rightarrow O(1)$query(q)
: is the keyq
$∈ S$? $\rightarrow O(length(A[h(x)]))$delete(x)
: remove the keyx
from $S$. $\rightarrow O(length(A[h(x)]))$
Cases to Analyze
- Static case: don’t care about adding and removing keys, we just care about fast query times
- Dynamic case: we just add keys, the insertion-only case.
Desired properties
- We do not have too many collisions
- We would like our scheme to achieve property (1) without needing the table size $M$ to be much larger than the number of elements $N$, $M = O(N)$.
- The function $h$ is fast to compute.
- Intuition: We can’t just use a random number generator to decide where the next element goes because then we would never be able to find it again. So, we want h to be something “pseudorandom” in some formal sense.
Key Idea
- Use randomization in our construction of $h$
- What we will show is that for any sequence of insert and lookup operations (we won’t need to assume the set S of elements inserted is random), if we pick $h$ in this probabilistic way, the performance of $h$ on this sequence will be good in expectation.
Claim 1
For any hash function $h$, if $|U| ≥ (N −1)M +1$, there exists a set $S$ of $N$ elements that all hash to the same location.
- Proved by the pigeon-hole principle.
Universal Hashing
A randomized algorithm $H$ for constructing hash functions $h: U → {0, . . . , M − 1}$ is universal if for all $x \neq y \in U$, we have $Pr_{h \leftarrow H} [h(x) = h(y)] \leq \frac{1}{M}$
- Calculation method: check $\frac{|h \in H | h(a) = h(b)|}{|H|} \leq \frac{1}{m}$
- $m$: number of all the possible value of $h(x)$
- $|h|$: number of hash function in hash family
- $|h \in H | h(a) = h(b)|$: the number of hash functions in H that cause a and b to collide.
We also say that a set $H$ of hash functions is a universal hash function family if the procedure “choose $h ∈ H$ at random” is universal. * (Here we are identifying the set of functions with the uniform distribution over the set.)
Theorem 1
If $H$ is universal,then for any set $S ⊆ U$ of size $N$,for any $x ∈ U$ (e.g.,that we might want to lookup), if we construct $h$ at random according to $H$, the expected number of collisions between $x$ and other elements in $S$ is at most $N/M$.
- expected cost of
query(q)
,delete(x)
will become constant!!Corollary 1
If $H$ is universal then for any sequence of $L$insert
,lookup
, anddelete
operations in which there are at most $M$ elements in the system at any one time, the expected total cost of the $L$ operations for a random $h ∈ H$ is only $O(L)$ (viewing the time to compute $h$ as constant).
Example of Constructing Universal Hash:
- Notations:
- keys ($x$) is $u$-bits long
- hash values is $b$-bits long $\rightarrow M = 2^b$
- $h$: $u$-by-$b$ 0/1 matrix
- hash function = $hx$, where $hx$ means
- Adding some of the columns of $h$ (doing vector addition mod 2) where the 1 bits in x indicate which ones to add.
- Analysis:
- Take an arbitrary pair of keys $x, y$ such that $x \neq y$, and they only differ in the $i$-th bit.
- Each of the $b$ different settings of the $i$-th column gives a different value of $h(y)$ and $h(x) \rightarrow$ there is exactly a $\frac{1}{2^b} = \frac{1}{M}$ chance that $h(x) = h(y)$.
Another definition: $\ell$-universal or $\ell$-wise-independent hash function
A hash function family is $\ell$-universal if
- For every set of $\ell$ distinct keys $x1, x_2, ..., x{\ell}$
- And every set of values $v1,v_2, ..., v{\ell} \in {0,...,M - 1}$
$Pr{h \leftarrow H} [h(x_1) = v_1~AND~h(x_2) = v_2~AND~...~AND~h(x{\ell}) = v_{\ell}] = 1/M^{\ell}$.
- If $H$ is 2-universal then it is universal.
- The above matrix-based hash family was not 2-universal since the hash functions all mapped 0 to 0.
Perfect Hashing
if we fix the set $S$ (the dictionary), can we find a hash function $h$ such that all lookups are constant-time?
- We say a hash function is perfect for $S$ if all lookups involve $O(1)$ work.
$O(N^2)$ space solution
- Claim: Let $H$ be universal and $M = N^2$, then $Pr_{h \leftarrow H} (no~collisions~in~S) \leq 1/2$.
- Proof:
- For each pair of keys
- $Exist~Choose(n, 2)$ $x, y$ pairs of keys
- Since $H$ is universal, $Pr[h(x) = h(y)] \leq 1/M$
- $E[x~collide~with~y] \leq Choose(n, 2)/M = \frac{N(N-1)}{2N^2} < \frac{1}{2}$
$O(N)$ space solution