Abstract
Decentralized, multi-agent artificial intelligence (AI) systems can introduce structural vulnerabilities, such as emergent toxicity and alignment erosion, that existing safety paradigms may not fully address. This disclosure describes a multi-agent safety architecture that can treat safety as a geometric and cryptographic property of a network's communication and consensus fabric. The method can involve mathematically defining a safety concept as a subspace within a model’s activation manifold and deriving an orthogonal projection operator from that subspace. The architecture has three parts: the operator may filter agent states to help mitigate harmful emergent content; a cryptographic egress layer with low-rank proofs can verify that the filter was applied before messages are broadcast; and a topological consensus mechanism may adjudicate disputes based on proximity to a geometric safety anchor. Together, these components can provide a distributed framework to help mitigate risks including emergent toxicity, alignment erosion, and Sybil dominance in continuously learning AI swarms.
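The projection step named in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: that the concept subspace is spanned by a small set of direction vectors in activation space, and that filtering means projecting an agent state onto the orthogonal complement of that subspace (P = I - U U^T, with U an orthonormal basis). The function names, the 4-dimensional vectors, and the variable names are hypothetical and chosen only for illustration; the cryptographic egress layer and the consensus mechanism are not modeled.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(state, basis):
    """Apply P = I - U U^T to `state`; `basis` must be orthonormal."""
    filtered = list(state)
    for b in basis:
        c = dot(state, b)
        filtered = [fi - c * bi for fi, bi in zip(filtered, b)]
    return filtered


# Hypothetical concept directions (assumption: these span the subspace to remove).
concept_directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormalize(concept_directions)

# Hypothetical agent state before filtering.
agent_state = [0.5, -2.0, 3.0, 1.0]
filtered_state = project_out(agent_state, basis)
```

Under these assumptions the filtered state has no component along any concept direction, and applying the filter a second time leaves it unchanged, which is the defining property of a projection.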
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Mohbe, Neel, "Multi-Agent Safety Architecture Using Geometric Projections and Cryptographic Egress Control", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10229