Certain automatic neural-network design methods not only minimize prediction error but also shrink or prune the network to reduce inference latency. Targeting inference latency directly is difficult; hence, FLOP count is often used as a proxy for it. However, FLOP count correlates only loosely with actual inference latency.

This disclosure describes techniques for direct computation or measurement of targeted costs such as inference latency, energy consumption, throughput, model size, etc. By integrating such targeted costs into design procedures, high-performance neural networks of low inference latency, model size, and energy consumption can be obtained. The techniques find application in domains where fast, low-power neural networks are advantageous, e.g., image classification, language translation, optical character recognition, self-driving cars, interactive augmented/virtual reality, etc.
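As a minimal sketch of the idea, a measured cost (here, wall-clock latency of a candidate model) can be folded into the search objective instead of a FLOP-count proxy. The function and parameter names below (`measure_latency`, `search_objective`, `target_latency_s`, `weight`) are illustrative assumptions, not part of the disclosure:

```python
import time

def measure_latency(fn, *args, warmup=3, runs=10):
    """Return average wall-clock seconds per call of fn(*args).

    A direct measurement like this, on the target hardware, replaces
    FLOP count as the latency estimate used during the search.
    """
    # Warm-up calls to stabilize caches before timing.
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) / runs

def search_objective(task_loss, latency_s, target_latency_s=0.01, weight=1.0):
    """Combine prediction loss with a measured-latency penalty.

    Uses a hinge penalty: candidates under the latency budget are
    ranked purely by task loss; slower candidates pay a linear cost.
    """
    return task_loss + weight * max(0.0, latency_s - target_latency_s)

# Example: score a stand-in "candidate model" (a plain callable here).
candidate = lambda: sum(i * i for i in range(10_000))
latency = measure_latency(candidate)
score = search_objective(task_loss=0.42, latency_s=latency)
```

The same pattern applies to other measured costs (energy, model size): each is measured or computed directly and added to the objective with its own weight.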

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.