Abstract

Each machine in a distributed computing network runs a low-level storage system that provides remote procedure call (RPC) access for reading and writing files on the machine's local disk or flash storage. Distributed computing networks also support a library that facilitates the extraction of data flows between clients and storage systems. Attributing low-level storage operations to higher-level semantic operations is difficult because, seen from the low-level storage system, RPCs are agnostic of high-level storage. This disclosure describes techniques to accurately attribute low-level storage costs to high-level storage operations in RPC call trees by maximizing a Jaccard similarity coefficient between two lists, e.g., a list of ancestor spans of each span in a trace associated with the low-level storage system, and a list of ancestor spans of each span in a trace associated with the library that facilitates the extraction of data flows between clients and storage systems. The described techniques find application in data governance and can be used to accurately estimate the resource usage associated with storage operations without changes to logging or tracing logic.
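The matching step can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the disclosed implementation: the `Span` record, its `kind` labels ("storage", "library", "other"), and the function names are hypothetical, and the sketch assumes that spans from both systems appear in one trace tree keyed by span ID. For each low-level storage span, it selects the library span whose set of ancestor spans has the highest Jaccard similarity coefficient, |A ∩ B| / |A ∪ B|, with the storage span's set of ancestors.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Span:
    # Hypothetical span record; field names are assumptions for illustration.
    span_id: str
    parent_id: Optional[str]
    kind: str  # "storage", "library", or "other"


def ancestors(span_id: str, spans: Dict[str, Span]) -> Set[str]:
    """Collect the IDs of all ancestors of a span by walking parent links."""
    found: Set[str] = set()
    parent = spans[span_id].parent_id
    # Stop at the root, at a parent missing from the trace, or on a cycle.
    while parent is not None and parent in spans and parent not in found:
        found.add(parent)
        parent = spans[parent].parent_id
    return found


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity coefficient; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attribute(spans: Dict[str, Span]) -> Dict[str, Optional[str]]:
    """Map each storage span to the library span with the most similar ancestry."""
    library_ids = [s.span_id for s in spans.values() if s.kind == "library"]
    result: Dict[str, Optional[str]] = {}
    for span in spans.values():
        if span.kind != "storage":
            continue
        own = ancestors(span.span_id, spans)
        # Ties resolve to the first library span encountered.
        result[span.span_id] = max(
            library_ids,
            key=lambda lid: jaccard(own, ancestors(lid, spans)),
            default=None,
        )
    return result


# Two requests under one root, each with one library span and one storage RPC.
example = {
    s.span_id: s
    for s in [
        Span("root", None, "other"),
        Span("req1", "root", "other"),
        Span("req2", "root", "other"),
        Span("lib1", "req1", "library"),
        Span("rpc1", "req1", "storage"),
        Span("lib2", "req2", "library"),
        Span("rpc2", "req2", "storage"),
    ]
}
attribution = attribute(example)
```

In this toy trace, `rpc1` and `lib1` share the ancestors `req1` and `root`, while `rpc1` and `lib2` share only `root`, so each storage span is attributed to the library span issued under the same request. Resource usage recorded on a storage span could then be charged to the high-level operation behind the matched library span.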

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
