When pairwise dot products are computed between input embedding vectors and then used in further computation, the number of dot products grows quadratically with the number of embedding vectors. This growth can become an efficiency bottleneck and degrade the performance of machine learning models. This disclosure describes techniques to obtain a compressed dot product matrix from sparse input embeddings. The input embeddings are first transformed into compressed embeddings, which are then used to compute the compressed dot product matrix. The compressed embeddings are generated using a weights matrix that is randomly initialized and learned jointly with the rest of the model. To further improve performance, attention weights derived from the input embeddings can be used as the weights matrix. In addition, a high-level representation of the input embeddings can be obtained and combined with a low-level representation. The described compression techniques can improve model accuracy, as measured by normalized entropy, and can improve model execution efficiency. By reducing the size of the dot product matrix, the described techniques reduce computational complexity.
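
The following is a minimal sketch of the compression idea, written in plain Python so that it is self-contained. It rests on several assumptions that the disclosure does not spell out. The n input embeddings form an (n x d) matrix E. A (k x n) weights matrix W with k < n produces k compressed embeddings C = W E. The compressed dot product is taken here as C C^T (k x k) rather than the full E E^T (n x n). The attention variant uses hypothetical learnable query vectors with a row-wise softmax over Q E^T / sqrt(d). All function names (`random_weights`, `attention_weights`, `compressed_dot`) are illustrative, and training of W or the queries is not shown.

```python
import math
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python product of an (m x p) matrix and a (p x q) matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def random_weights(k: int, n: int, seed: int = 0) -> Matrix:
    """Randomly initialized (k x n) weights matrix; a real model would learn it."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(k)]


def attention_weights(queries: Matrix, emb: Matrix) -> Matrix:
    """Hypothetical attention-derived (k x n) weights: row-wise softmax of Q E^T / sqrt(d)."""
    d = len(emb[0])
    weights = []
    for row in matmul(queries, transpose(emb)):
        scaled = [s / math.sqrt(d) for s in row]
        top = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def compressed_dot(emb: Matrix, weights: Matrix) -> Matrix:
    """Compress n embeddings to k (C = W E), then return the (k x k) matrix C C^T."""
    compressed = matmul(weights, emb)
    return matmul(compressed, transpose(compressed))


if __name__ == "__main__":
    n, d, k = 64, 16, 8
    rng = random.Random(1)
    emb = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

    full = matmul(emb, transpose(emb))                 # n x n = 64 x 64 entries
    small = compressed_dot(emb, random_weights(k, n))  # k x k = 8 x 8 entries

    queries = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(k)]
    small_attn = compressed_dot(emb, attention_weights(queries, emb))

    print(len(full) * len(full[0]), len(small) * len(small[0]))  # 4096 64
```

With n = 64 embeddings and k = 8 compressed embeddings, the dot product matrix shrinks from 4,096 to 64 entries, so its cost scales with k squared instead of n squared. Another plausible reading of "compressed dot product" is the (n x k) product E C^T, which keeps a dot product between every original embedding and every compressed one. That form is linear in n for a fixed k.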

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.