Streaming Algorithms
A model of computing where the amount of space we have is much less than the amount of data we examine, and hence we can store only a “summary” or “sketch” of the data.
- The question is: which functions of the input stream can you compute with what amount of time and space?
- Approximate solutions: in several cases, it will be impossible to compute the function exactly using small space. Hence we’ll explore the trade-offs between approximation and space.
- Hashing: this will be a very powerful technique.
Notation:
$a_1, a_2, ..., a_t$: the stream elements - We assume each stream element is from alphabet $Σ$ and takes $b$ bits to represent.
- $a_{[1:t]} = ⟨a_1, a_2, . . . , a_t⟩$.
Problem: Finding $ε$-Heavy Hitters
- Notation
- We have a data stream with elements $a_1, a_2, . . . , a_t$ seen by time $t$.
- Think of these elements as “arriving” and being added to a multiset $S_t$, with $S_0 = \phi, S1 = {a_1}, ...,S_i = {a_1,a_2,...,a_i}$, etc.
- $count_t(e) = {i ∈ {1,2,...,t} | a_i = e}$
- The number of times element $e$ has been seen in the stream so far.
- The multiplicity of $e$ in $S_t$.
- Call element $e ∈ Σ$ is called a $ε$-heavy hitter at time $t$ if $count_t(e) > εt$.
- That is, $e$ constitutes strictly more than $ε$ fraction of the elements that arrive by time $t$.
- Goal
- Given an threshold $ε ∈ [0, 1]$, maintain a data structure that is capable of outputting $ε$-heavy-hitters.
- At any point in time, we can query the data structure, and it outputs a set of at most $1/ε$ elements that contains all the $ε$-heavy hitters.
- Difficulty
- Note that as time passes, the set of frequent elements may change completely, so an algorithm would have to be adaptive.
- We cannot keep track of the counts for all the elements we’ve seen so far, there may be a lot of different elements that appear over time, and we have limited space.
- Crucial note:
- it’s OK to output “false positives” but we are not allowed “false nega- tives”.
- I.e., we’re not allowed to miss any heavy-hitters, but we could output non-heavy-hitters.
- (Since we only output $1/ε$ elements, there can be $≤ 1/ε$-false positives.)
Finding a Majority Element (Simplified Question 1)
The problem of finding a 0.5-heavy-hitter, a.k.a. a majority element, an element that occurs (strictly) more than half the time.
memory, counter = None, 0 When element a_t arrives if counter == 0: memory, counter = a_t, 1 else: if a_t == memory counter += 1 else counter -= 1 discard a_t