Abstract

Enterprise data centers employ a distributed block storage architecture for their Tier 1 workloads. Business-critical applications are hosted on servers with local compute and memory resources, but the storage is centralized within distributed storage devices. Although a distributed block storage architecture helps an enterprise data center use storage efficiently, it makes troubleshooting an application slowdown more difficult and more manual. To address these challenges, techniques are presented herein that support a data-driven, algorithmic approach to pinpointing the root cause of a storage slowdown (e.g., a host, a storage area network (SAN), or a storage array). The techniques are operable even when no obvious errors are present (e.g., when storage access is sick but not dead). They leverage three latency metrics, Exchange Completion Time (ECT), Data Access Latency (DAL), and Host Response Latency (HRL), to pinpoint the root cause of a storage slowdown in a distributed block storage architecture that uses Small Computer System Interface (SCSI) or Non-Volatile Memory Express (NVMe) over any transport (e.g., Fibre Channel (FC), FC over Ethernet (FCoE), Internet Small Computer Systems Interface (iSCSI), NVMe over Transmission Control Protocol (NVMe/TCP), remote direct memory access (RDMA) over Converged Ethernet (RoCE), etc.).

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
