Abstract

This disclosure describes techniques to partition a stream of N items into groups approximately of size S under the constraint that a slight alteration in the stream changes the partitions only slightly. Per the techniques, the items are sorted by a hash of their identifier. The length N of the stream is estimated using, e.g., the Flajolet-Martin algorithm. The number of bits B of the hash to be considered is given by B = floor(log2(N/S)), where S is the target partition size. A point in the stream is assigned to a bucket identified by the length-B prefix of its hash. Some advantages of the described techniques include low memory requirement; ability to look up in constant time the bucket associated with a point; consistency in partitioning; flexibility to be deployed in both streaming and distributed settings; etc. Partitioning of items (also known as points or objects) of a dataset finds application in building nearest-neighbor graphs (graphs that exhibit high homophily between their nodes) by fusing together different, potentially weak, measures of similarity.

Creative Commons License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.

Share

COinS