Le Monde des Utilisateurs de L'Analyse de Données

Numéro 36

 
 

On Reservoir Sampling with Deletions. Rainer GEMULLA, Wolfgang LEHNER, Peter J. HAAS.
La revue MODULAD, numéro 36, Juillet 2007

Abstract:
Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we describe a recently proposed method for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions.
Our scheme, called “random pairing” (RP), maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm in [1] when the insertions and deletions correspond to a moving window over a data stream. We also prove that it is not possible to “resize” a bounded-size random sample upwards without accessing the base data.


Keywords: Reservoir sampling, Sample maintenance, Synopsis.

Download paper : On Reservoir Sampling with Deletions