Abstract

In many contexts, e.g., training machine learning models, a large dataset whose data points are classified as either positive or negative must be sampled. This disclosure describes techniques to sample such a dataset using a MapReduce strategy. The sampling is distributed, reads the dataset only once, and captures a minimum number of each type of data point without changing the ratio between positive and negative data points. During the map phase, each data point is mapped one-to-one to a partial result that includes counts of positive and negative data points and sets of sampled positive and negative data points. During the reduce phase, pairs of partial results are iteratively combined into one until a single partial result remains, which becomes the final result. Both the map and reduce phases are performed in a distributed manner. The ratio of positive to negative data points in the dataset need not be known in advance.
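The map and reduce steps described above can be illustrated with a minimal single-process sketch. The abstract does not specify how two partial results are combined, so the merge rule below (a count-weighted draw from the two sample sets, capped at a per-bucket size `k`) is an assumption made for illustration. The names `Partial`, `map_point`, `merge_bucket`, and `combine` are likewise hypothetical. Because the exact positive and negative counts travel with every partial result, the true ratio can be recovered from the final result without knowing it in advance.

```python
import random
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class Partial:
    """Partial result: exact counts plus a bounded sample per bucket."""
    pos_count: int = 0
    neg_count: int = 0
    pos_sample: list = field(default_factory=list)
    neg_sample: list = field(default_factory=list)


def map_point(point, is_positive):
    """Map phase: one data point becomes one partial result."""
    if is_positive:
        return Partial(1, 0, [point], [])
    return Partial(0, 1, [], [point])


def merge_bucket(sample_a, n_a, sample_b, n_b, k, rng):
    """Merge two samples of one bucket into at most k points.

    Assumed rule (not taken from the disclosure): each slot is filled
    from side A with probability n_a / (n_a + n_b), where n_a and n_b
    are the counts of points not yet accounted for on each side.
    Requires each input sample to hold min(k, count) points.
    """
    a, b = list(sample_a), list(sample_b)
    merged = []
    while len(merged) < k and n_a + n_b > 0:
        if rng.random() * (n_a + n_b) < n_a:
            merged.append(a.pop(rng.randrange(len(a))))
            n_a -= 1
        else:
            merged.append(b.pop(rng.randrange(len(b))))
            n_b -= 1
    return merged


def combine(x, y, k, rng):
    """Reduce phase: combine two partial results into one."""
    return Partial(
        x.pos_count + y.pos_count,
        x.neg_count + y.neg_count,
        merge_bucket(x.pos_sample, x.pos_count, y.pos_sample, y.pos_count, k, rng),
        merge_bucket(x.neg_sample, x.neg_count, y.neg_sample, y.neg_count, k, rng),
    )


# Toy dataset (hypothetical): 1000 points, every tenth one is positive.
rng = random.Random(0)
partials = [map_point(i, i % 10 == 0) for i in range(1000)]
final = reduce(lambda x, y: combine(x, y, 20, rng), partials)
print(final.pos_count, final.neg_count, len(final.pos_sample), len(final.neg_sample))
```

In this sketch the reduction runs left to right in one process. Since `combine` takes any two partial results, the same function could in principle be applied pairwise across workers, which is the property a distributed reduce phase relies on.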

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Farnsworth, Richard Steven, "Distributed multi-bucket sampling", Technical Disclosure Commons, (January 09, 2020)
https://www.tdcommons.org/dpubs_series/2855

Technical Disclosure Commons

Defensive Publications Series

Distributed multi-bucket sampling
