Inventor(s)

Abstract

Existing profiling tools for machine learning accelerators often lack the granularity needed to diagnose performance bottlenecks within collective communication kernels: they typically report only that a kernel is slow, without pinpointing the specific cause. A method for in-kernel telemetry uses a shared-memory first-in-first-out (FIFO) queue established between a host central processing unit (CPU) and an accelerator device. The accelerator kernel is instrumented with lightweight hooks that write fine-grained telemetry events, such as data arrival timestamps from peer accelerators, directly to the FIFO without blocking execution. A dedicated process on the host asynchronously polls the FIFO to consume the telemetry data. This approach provides real-time, logic-level visibility into sub-kernel events, allowing for the precise identification of performance anomalies such as straggler devices or congested interconnects. Because the shared-memory buffer is lock-free, telemetry can be collected with minimal performance impact on the accelerator’s high-bandwidth data path.

Keywords: In-kernel telemetry, ML accelerators, shared memory FIFO, asynchronous data extraction, sub-kernel event measurement, low-overhead profiling
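
The lock-free FIFO described in the abstract could be sketched roughly as the single-producer, single-consumer ring buffer below. This is an illustrative host-side C++ model only, not the disclosed implementation: the names TelemetryEvent, TelemetryFifo, try_push, try_pop and dropped, the event fields, and the drop-when-full policy are all assumptions made for the example. In a real system the buffer would live in memory mapped into both the host and the accelerator, and the producer side would be written in the accelerator's kernel language.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical event layout; the disclosure does not specify one.
struct TelemetryEvent {
    std::uint32_t event_id;   // kind of event, e.g. "chunk arrived from peer"
    std::uint32_t peer_rank;  // peer accelerator the event refers to
    std::uint64_t timestamp;  // clock value sampled when the event occurred
};

// Single-producer / single-consumer ring buffer (assumed design).
// The producer stands in for the instrumented kernel hook,
// the consumer for the host polling process.
template <std::size_t Capacity>
class TelemetryFifo {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Never waits: if the queue is full the event is
    // dropped and counted, so the data path is not stalled.
    bool try_push(const TelemetryEvent& ev) {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        const std::uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) {
            dropped_.fetch_add(1, std::memory_order_relaxed);
            return false;
        }
        slots_[head & (Capacity - 1)] = ev;
        // Release: the slot contents become visible before the new head.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false immediately when no event is pending.
    bool try_pop(TelemetryEvent& out) {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        if (tail == head) {
            return false;
        }
        out = slots_[tail & (Capacity - 1)];
        // Release: the slot is handed back only after it has been copied out.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Number of events discarded because the queue was full.
    std::uint64_t dropped() const {
        return dropped_.load(std::memory_order_relaxed);
    }

private:
    std::array<TelemetryEvent, Capacity> slots_{};
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::uint64_t> head_{0};
    alignas(64) std::atomic<std::uint64_t> tail_{0};
    alignas(64) std::atomic<std::uint64_t> dropped_{0};
};
```

In this sketch the producer writes only head_ and the consumer writes only tail_, so neither side takes a lock or waits on the other. Dropping events when the queue is full is one possible way to keep the hook non-blocking; the dropped counter lets the host detect that the trace has gaps.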

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.