A plain transformer model typically uses only one piece of metadata, the position encoding, directly within its architecture. As a result, deployments of transformers often rely on expensive and complex external scaffolding before or after output generation to mitigate issues such as hallucination and irrelevance. This disclosure describes techniques to incorporate a variety of metadata types into the native architecture of transformer models. The additional signals can help avoid hallucinations, improve relevance, and minimize the need for expensive external scaffolding. Transformer operation can be generalized to incorporate diverse metadata in several ways, such as adding a metadata embedding layer, conditioning self-attention on the metadata, conditioning with gated self-attention, or employing a different encoder-decoder architecture. Different types of metadata can improve the quality of the transformer's output and reduce hallucinations in different ways. The techniques described in this disclosure can also support multimodal data, such as images, audio, video, and text, with the metadata representing the specific modality.
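The simplest of the options above, a metadata embedding layer, can be sketched as follows. This is an illustrative sketch only, not the disclosed implementation: the vocabulary size, embedding dimension, metadata vocabulary, and the choice to sum the three embeddings (mirroring how position encodings are commonly added) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, N_META, SEQ, D = 100, 4, 5, 8  # illustrative sizes, not from the disclosure

# Learned lookup tables (randomly initialized here for illustration).
tok_emb = rng.normal(size=(VOCAB, D))   # token embedding table
pos_emb = rng.normal(size=(SEQ, D))     # position embedding table
meta_emb = rng.normal(size=(N_META, D)) # metadata table, e.g. 0=text, 1=image, 2=audio, 3=video

def embed(token_ids, meta_ids):
    """Sum token, position, and metadata embeddings for each input position,
    so metadata enters the model alongside the position encoding."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions] + meta_emb[meta_ids]

tokens = np.array([7, 42, 3, 99, 0])
meta = np.array([0, 0, 1, 1, 0])  # per-position metadata (here, a modality tag)
x = embed(tokens, meta)           # x has shape (5, 8), ready for the transformer layers
```

In this formulation the metadata embedding is trained jointly with the rest of the model, so downstream self-attention layers can use the extra signal without any change to their own structure; the other options listed (conditioned or gated self-attention) would instead modify the attention computation itself.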

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.