Streaming Algorithms

A model of computing where the amount of space we have is much less than the amount of data we examine, and hence we can store only a “summary” or “sketch” of the data.

  • The question is: which functions of the input stream can you compute with what amount of time and space?
    • Approximate solutions: in several cases, it will be impossible to compute the function exactly using small space. Hence we’ll explore the trade-offs between approximation and space.
    • Hashing: this will be a very powerful technique.

      Notation:

      $a_1, a_2, ..., a_t$: the stream elements
    • We assume each stream element is from alphabet $Σ$ and takes $b$ bits to represent.
    • $a_{[1:t]} = ⟨a_1, a_2, . . . , a_t⟩$.

Problem: Finding $ε$-Heavy Hitters

  • Notation
    • We have a data stream with elements $a_1, a_2, . . . , a_t$ seen by time $t$.
    • Think of these elements as “arriving” and being added to a multiset $S_t$, with $S_0 = \phi, S1 = {a_1}, ...,S_i = {a_1,a_2,...,a_i}$, etc.
    • $count_t(e) = {i ∈ {1,2,...,t} | a_i = e}$
      • The number of times element $e$ has been seen in the stream so far.
      • The multiplicity of $e$ in $S_t$.
    • Call element $e ∈ Σ$ is called a $ε$-heavy hitter at time $t$ if $count_t(e) > εt$.
      • That is, $e$ constitutes strictly more than $ε$ fraction of the elements that arrive by time $t$.
  • Goal
    • Given an threshold $ε ∈ [0, 1]$, maintain a data structure that is capable of outputting $ε$-heavy-hitters.
    • At any point in time, we can query the data structure, and it outputs a set of at most $1/ε$ elements that contains all the $ε$-heavy hitters.
  • Difficulty
    • Note that as time passes, the set of frequent elements may change completely, so an algorithm would have to be adaptive.
    • We cannot keep track of the counts for all the elements we’ve seen so far, there may be a lot of different elements that appear over time, and we have limited space.
  • Crucial note:
    • it’s OK to output “false positives” but we are not allowed “false nega- tives”.
    • I.e., we’re not allowed to miss any heavy-hitters, but we could output non-heavy-hitters.
    • (Since we only output $1/ε$ elements, there can be $≤ 1/ε$-false positives.)

      Finding a Majority Element (Simplified Question 1)

      The problem of finding a 0.5-heavy-hitter, a.k.a. a majority element, an element that occurs (strictly) more than half the time.

      memory, counter = None, 0
      When element a_t arrives
      if counter == 0:
      memory, counter = a_t, 1
      else:
      if a_t == memory
        counter += 1
      else
        counter -= 1
        discard a_t
      

results matching ""

    No results matching ""