Hashing

Notation:

$U$: a large universe of “keys”,
- or "Keys come from some large universe $U$"".
$S$: The actual dictionary is some subset of this universe.
- There is some set $S$ in $U$ of keys we actually care about (which may be static or dynamic).
- $N = |S|$. Think of $N$ as much smaller than the size of U.
$M$: An array $A$ of some size $M$, where we use the hash function $h: U \rightarrow {0, ..., M - 1}$
- Given an element $x$, the idea of hashing is we want to store it in $A[h(x)]$.
- Collision happenes when $h(x) = h(y)$
$H$: A randomized algorithm for constructing hash functions $h$, used in universal hashing

Operations/Complexity Assumed Linked-list Structure:

add(x): add the key x to $S$. $\rightarrow O(1)$
query(q): is the key q $∈ S$? $\rightarrow O(length(A[h(x)]))$
delete(x): remove the key x from $S$. $\rightarrow O(length(A[h(x)]))$

Cases to Analyze

Static case: don’t care about adding and removing keys, we just care about fast query times
Dynamic case: we just add keys, the insertion-only case.

Desired properties

We do not have too many collisions
We would like our scheme to achieve property (1) without needing the table size $M$ to be much larger than the number of elements $N$, $M = O(N)$.
The function $h$ is fast to compute.
Intuition: We can’t just use a random number generator to decide where the next element goes because then we would never be able to find it again. So, we want h to be something “pseudorandom” in some formal sense.

Key Idea

Use randomization in our construction of $h$
What we will show is that for any sequence of insert and lookup operations (we won’t need to assume the set S of elements inserted is random), if we pick $h$ in this probabilistic way, the performance of $h$ on this sequence will be good in expectation.

Claim 1

For any hash function $h$, if $|U| ≥ (N −1)M +1$, there exists a set $S$ of $N$ elements that all hash to the same location.

Proved by the pigeon-hole principle.

Universal Hashing

A randomized algorithm $H$ for constructing hash functions $h: U → {0, . . . , M − 1}$ is universal if for all $x \neq y \in U$, we have $Pr_{h \leftarrow H} [h(x) = h(y)] \leq \frac{1}{M}$

Calculation method: check $\frac{|h \in H | h(a) = h(b)|}{|H|} \leq \frac{1}{m}$

$m$: number of all the possible value of $h(x)$

$|h|$: number of hash function in hash family

$|h \in H | h(a) = h(b)|$: the number of hash functions in H that cause a and b to collide.

We also say that a set $H$ of hash functions is a universal hash function family if the procedure “choose $h ∈ H$ at random” is universal. * (Here we are identifying the set of functions with the uniform distribution over the set.)

Theorem 1

If $H$ is universal,then for any set $S ⊆ U$ of size $N$,for any $x ∈ U$ (e.g.,that we might want to lookup), if we construct $h$ at random according to $H$, the expected number of collisions between $x$ and other elements in $S$ is at most $N/M$.

expected cost of query(q), delete(x) will become constant!!
Corollary 1
If $H$ is universal then for any sequence of $L$ insert, lookup, and delete operations in which there are at most $M$ elements in the system at any one time, the expected total cost of the $L$ operations for a random $h ∈ H$ is only $O(L)$ (viewing the time to compute $h$ as constant).

Example of Constructing Universal Hash:

Notations:
- keys ($x$) is $u$-bits long
- hash values is $b$-bits long $\rightarrow M = 2^b$
- $h$: $u$-by-$b$ 0/1 matrix
- hash function = $hx$, where $hx$ means
  - Adding some of the columns of $h$ (doing vector addition mod 2) where the 1 bits in x indicate which ones to add.
Analysis:
- Take an arbitrary pair of keys $x, y$ such that $x \neq y$, and they only differ in the $i$-th bit.
- Each of the $b$ different settings of the $i$-th column gives a different value of $h(y)$ and $h(x) \rightarrow$ there is exactly a $\frac{1}{2^b} = \frac{1}{M}$ chance that $h(x) = h(y)$.

Another definition: $\ell$-universal or $\ell$-wise-independent hash function

A hash function family is $\ell$-universal if

For every set of $\ell$ distinct keys $x1, x_2, ..., x{\ell}$

And every set of values $v1,v_2, ..., v{\ell} \in {0,...,M - 1}$

$Pr{h \leftarrow H} [h(x_1) = v_1~AND~h(x_2) = v_2~AND~...~AND~h(x{\ell}) = v_{\ell}] = 1/M^{\ell}$.

If $H$ is 2-universal then it is universal.

The above matrix-based hash family was not 2-universal since the hash functions all mapped 0 to 0.

Perfect Hashing

if we fix the set $S$ (the dictionary), can we find a hash function $h$ such that all lookups are constant-time?

We say a hash function is perfect for $S$ if all lookups involve $O(1)$ work.

$O(N^2)$ space solution

Claim: Let $H$ be universal and $M = N^2$, then $Pr_{h \leftarrow H} (no~collisions~in~S) \leq 1/2$.
Proof:
- For each pair of keys
- $Exist~Choose(n, 2)$ $x, y$ pairs of keys
- Since $H$ is universal, $Pr[h(x) = h(y)] \leq 1/M$
- $E[x~collide~with~y] \leq Choose(n, 2)/M = \frac{N(N-1)}{2N^2} < \frac{1}{2}$
  $O(N)$ space solution

7. Hashing

Hashing

Notation:

Operations/Complexity Assumed Linked-list Structure:

Cases to Analyze

Desired properties

Key Idea

Claim 1

Universal Hashing

Theorem 1

Corollary 1

Example of Constructing Universal Hash:

Another definition: $\ell$-universal or $\ell$-wise-independent hash function

Perfect Hashing

$O(N^2)$ space solution

$O(N)$ space solution

results matching ""

No results matching ""