Abstract
Techniques are disclosed for learned selection of a column encoder using column statistics. A feature vector is computed in a single pass over a column and includes base statistics (e.g., sparsity, cardinality, dominant frequency, bit-width percentiles, run-length and delta statistics, entropy, sortedness, and data type) and interaction features (e.g., sparsity clusteredness, bit-width gap, delta-entropy ratio, dominant run length, outlier fraction, and delta uniformity). A benchmark harness evaluates candidate encoders on column samples and labels each sample with an encoder selected by a composite score combining decode performance, compression performance, and encode performance using configurable weights. A compact neural network maps the feature vector to an encoder distribution for sub-microsecond inference at encode time. Drift is detected using a normalized L2 distance between recent and training feature means, enabling selective benchmarking and fine-tuning.
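As a rough illustration of two steps named in the abstract, the following Python sketch shows one way a composite score could label a sample and one way a normalized L2 drift distance could be computed. It is a minimal sketch under stated assumptions, not the disclosed implementation: the weight values, the function names (composite_score, label_sample, normalized_l2_drift), the assumption that each metric is pre-normalized to [0, 1] with higher being better, and the choice to normalize by the training standard deviation and the feature count are all hypothetical.

```python
import math

# Hypothetical default weights; the abstract says only that they are configurable.
DEFAULT_WEIGHTS = {"decode": 0.5, "compress": 0.3, "encode": 0.2}


def composite_score(metrics, weights=DEFAULT_WEIGHTS):
    """Weighted sum of one encoder's metrics.

    Assumes each metric is already normalized to [0, 1], higher is better.
    """
    return sum(weights[name] * metrics[name] for name in weights)


def label_sample(results, weights=DEFAULT_WEIGHTS):
    """Return the name of the encoder with the highest composite score.

    `results` maps encoder name -> {"decode": ..., "compress": ..., "encode": ...}.
    """
    return max(results, key=lambda enc: composite_score(results[enc], weights))


def normalized_l2_drift(recent_mean, train_mean, train_std, eps=1e-9):
    """Distance between recent and training feature means.

    Assumed normalization: each difference is divided by the training
    standard deviation, and the squared sum is divided by the feature count,
    so the value is comparable across feature-vector sizes.
    """
    if not (len(recent_mean) == len(train_mean) == len(train_std)):
        raise ValueError("feature vectors must have equal length")
    if not train_mean:
        raise ValueError("feature vectors must be non-empty")
    squared = sum(
        ((r - t) / (s + eps)) ** 2
        for r, t, s in zip(recent_mean, train_mean, train_std)
    )
    return math.sqrt(squared / len(train_mean))
```

With these definitions, changing the weights can change the label for the same benchmark results, which is the point of making them configurable. A drift value above some chosen threshold (the threshold itself is not given in the abstract) would trigger the selective benchmarking and fine-tuning described above.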
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Learned Encoder Selection via Column Statistics Feature Extraction for Adaptive Columnar Data Encoding", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10768