Abstract
This document discloses two co-operating mechanisms for preserving privacy when documents are processed by a cloud-hosted large language model (LLM) through a security gateway. The first mechanism is a **provenance-bound, per-file pseudonymisation token**: each detected sensitive value in a source document is replaced, before egress to the cloud LLM, with an opaque token derived by truncating a keyed hash (HMAC-SHA-256 under a deployment secret) of the concatenation of (i) the cryptographic hash of the *entire source document* and (ii) the value itself. Because the per-file salt is the source-document hash, the same value produces *different* tokens in different documents, which defeats cross-file correlation and known-text attacks, while remaining consistent *within* a document so the LLM can still reason coherently. The same binding makes the tokenised output and its separately-held token-to-value mapping file **provenance- and splice-verifiable**: a token carrying a document-hash salt that does not match the document it appears in fails verification, so cross-file mixing, splicing, or substitution of a mapping is detectable. The second mechanism is **field-role-aware routing**: a detected value is classified by the role the downstream task requires of it. *Reference-only* identifiers (names, e-mail addresses, account IDs), which are values the model only needs to refer to and never to compute over, receive opaque tokens and are sent to the cloud LLM. *Operate-on* fields (monetary amounts, dates, IBANs and similar values the model must compute, validate, or order over) are **not** sent to the cloud as opaque tokens, because an opaque blob in a slot where the model expects a typed value induces the model to hallucinate a plausible value and reason over the fabrication; instead, such fields are kept or aggregated, or the whole document is routed to a **local** model.
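The token derivation and its verification property can be illustrated with a minimal Python sketch. Only the overall construction (truncated HMAC-SHA-256, under a deployment secret, over the document hash concatenated with the value) comes from the text above. The use of SHA-256 for the document hash, the 16-hex-character truncation, the `TOK_` prefix, UTF-8 encoding, and the function names `document_hash`, `make_token`, and `verify_mapping` are all assumptions made for illustration.

```python
import hashlib
import hmac

TOKEN_HEX_LEN = 16      # assumed truncation length; not specified in the disclosure
TOKEN_PREFIX = "TOK_"   # assumed marker so tokens are recognisable in text


def document_hash(document: bytes) -> bytes:
    """Hash of the entire source document (SHA-256 is an assumption)."""
    return hashlib.sha256(document).digest()


def make_token(secret: bytes, doc_hash: bytes, value: str) -> str:
    """Truncated HMAC-SHA-256 over (document hash || value).

    The document hash has a fixed length (32 bytes here), so plain
    concatenation with the value is unambiguous.
    """
    mac = hmac.new(secret, doc_hash + value.encode("utf-8"), hashlib.sha256)
    return TOKEN_PREFIX + mac.hexdigest()[:TOKEN_HEX_LEN]


def verify_mapping(secret: bytes, doc_hash: bytes, mapping: dict[str, str]) -> bool:
    """Check that every token in a token-to-value mapping was derived
    from the given document hash; a mapping spliced in from another
    document fails because its tokens were salted with a different hash."""
    return all(
        hmac.compare_digest(token, make_token(secret, doc_hash, value))
        for token, value in mapping.items()
    )


if __name__ == "__main__":
    secret = b"deployment-secret (placeholder)"
    doc_a = document_hash(b"Invoice for alice@example.com")
    doc_b = document_hash(b"Contract mentioning alice@example.com")

    tok_a = make_token(secret, doc_a, "alice@example.com")
    tok_b = make_token(secret, doc_b, "alice@example.com")

    print(tok_a == make_token(secret, doc_a, "alice@example.com"))  # same document: consistent
    print(tok_a == tok_b)                                           # different documents: differ
    print(verify_mapping(secret, doc_b, {tok_a: "alice@example.com"}))  # spliced mapping: rejected
```

In this sketch the verifier needs the deployment secret, the source-document hash, and the mapping; how those are stored and distributed is outside what the sketch assumes.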
The surrounding tokenise-on-egress / vault-the-mapping / reconstitute-on-the-response-path pattern is treated as known background; the disclosed contribution is the combination of the **document-hash provenance binding** with the **hallucination-motivated field-role split**.
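The field-role split described above can likewise be sketched in Python. The two roles and the example field types are taken from the abstract; the `Role` enum, the type labels, the `classify` and `route_document` names, and the choice to treat unknown field types as operate-on are hypothetical. Of the three options named for operate-on fields (keep, aggregate, or route the whole document locally), this sketch implements only whole-document local routing.

```python
from enum import Enum


class Role(Enum):
    REFERENCE_ONLY = "reference_only"  # model only refers to the value
    OPERATE_ON = "operate_on"          # model must compute, validate, or order over it


# Example field types from the abstract; the label strings are assumptions.
REFERENCE_ONLY_TYPES = {"name", "email", "account_id"}
OPERATE_ON_TYPES = {"amount", "date", "iban"}


def classify(field_type: str) -> Role:
    """Map a detected field type to the role the downstream task needs."""
    if field_type in REFERENCE_ONLY_TYPES:
        return Role.REFERENCE_ONLY
    # Assumption: unknown types are handled like operate-on fields, so they
    # are never sent to the cloud model as opaque tokens.
    return Role.OPERATE_ON


def route_document(detections: list[tuple[str, str]]) -> str:
    """Return "cloud" if every detected value is reference-only (and can
    therefore be tokenised), otherwise "local"."""
    roles = {classify(field_type) for field_type, _value in detections}
    return "local" if Role.OPERATE_ON in roles else "cloud"


if __name__ == "__main__":
    print(route_document([("name", "Ana Silva"), ("email", "ana@example.com")]))
    print(route_document([("name", "Ana Silva"), ("amount", "1250.00")]))
```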
**Keywords:** pseudonymisation; tokenisation; cloud LLM privacy; large language model security gateway; provenance binding; document hash salt; per-file salt; HMAC-SHA-256; splice detection; tamper-evident mapping; cross-file correlation defence; known-text attack; field-role routing; hybrid local/cloud inference; hallucination mitigation; operate-on field; reference-only identifier; quasi-identifier; data loss prevention; PII tokenisation; reversible token mapping; GDPR Article 4(5) pseudonymisation.
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Rosado, Tiago, "Provenance-Bound Per-File Pseudonymisation Tokens and Field-Role-Aware Routing for Privacy-Preserving Use of Cloud Large Language Models", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10605