Many tasks, such as training machine learning models, require sampling a large dataset whose data points are each classified as either positive or negative. This disclosure describes techniques to sample such a dataset using a MapReduce strategy. The sampling is distributed, reads the dataset only once, and captures a minimum number of data points of each type without changing the ratio between positive and negative data points. During the map phase, each data point is mapped one-to-one to a partial result that holds counts of positive and negative data points together with sets of sampled positive and negative data points. During the reduce phase, pairs of partial results are combined into one, iteratively, until a single partial result remains; that partial result is the final result. Both phases are performed in a distributed manner. The ratio of positive to negative data points in the dataset need not be known in advance.
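The map and reduce phases described above can be sketched roughly as follows. This is a minimal single-process illustration, not the disclosure's implementation: the names `Partial`, `map_point`, `reduce_partials`, and the per-class cap `k` are hypothetical, and the merge rule (drawing from each side in proportion to the number of data points it represents) is one assumed way to keep each per-class sample uniform. The counts are carried alongside the samples, so the overall positive-to-negative ratio can be read from the final result.

```python
import random
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class Partial:
    """Hypothetical partial result: per-class counts plus per-class samples."""
    pos_count: int = 0
    neg_count: int = 0
    pos_sample: list = field(default_factory=list)
    neg_sample: list = field(default_factory=list)


def map_point(point, is_positive):
    """Map phase: one data point becomes one partial result."""
    if is_positive:
        return Partial(pos_count=1, pos_sample=[point])
    return Partial(neg_count=1, neg_sample=[point])


def _merge_samples(a, n_a, b, n_b, k, rng):
    """Merge two uniform samples into one of size min(k, n_a + n_b).

    n_a and n_b are the numbers of data points each sample represents.
    Each draw picks a side with probability proportional to the points
    that side still represents (an assumed merge rule).
    """
    a, b = list(a), list(b)  # copy so the inputs are not mutated
    out = []
    while len(out) < k and n_a + n_b > 0:
        if rng.random() * (n_a + n_b) < n_a:
            out.append(a.pop(rng.randrange(len(a))))
            n_a -= 1
        else:
            out.append(b.pop(rng.randrange(len(b))))
            n_b -= 1
    return out


def reduce_partials(x, y, k, rng):
    """Reduce phase: combine two partial results into one."""
    return Partial(
        pos_count=x.pos_count + y.pos_count,
        neg_count=x.neg_count + y.neg_count,
        pos_sample=_merge_samples(x.pos_sample, x.pos_count,
                                  y.pos_sample, y.pos_count, k, rng),
        neg_sample=_merge_samples(x.neg_sample, x.neg_count,
                                  y.neg_sample, y.neg_count, k, rng),
    )


# Usage: integers stand in for data points; multiples of 10 are "positive".
rng = random.Random(0)
partials = [map_point(i, i % 10 == 0) for i in range(1000)]
final = reduce(lambda x, y: reduce_partials(x, y, 10, rng), partials)
print(final.pos_count, final.neg_count, len(final.pos_sample), len(final.neg_sample))
```

The fold here is sequential only for brevity. Because `reduce_partials` takes any two partial results, the same function could in principle be applied to pairs in parallel across workers until one partial result remains.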
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Farnsworth, Richard Steven, "Distributed multi-bucket sampling", Technical Disclosure Commons, (January 09, 2020)