Abstract
This paper describes an approach to prosodic alignment in automatic dubbing that uses gradient descent optimization. When the speech in a video is replaced with speech in a different language, the dubbed speech must be synchronized with the original speaker's cadence of phrases and pauses. The technique frames prosodic alignment as an optimization problem solvable through standard machine learning (ML) workflows: the fitness function is expressed using numeric libraries, and gradient descent serves directly as the solver.
This approach handles differences in speech pacing, word-order inversions, and pauses that arise during translation while keeping the timing of the dubbed content natural. The technique permits arbitrary loss functions rather than only, for example, constrained quadratic terms, which offers additional optimization opportunities and allows existing ML infrastructure to be leveraged for scalable deployment on accelerated architectures.
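The formulation above can be illustrated with a minimal sketch. It is not the disclosure's implementation: the function names (`loss`, `numeric_grad`, `align`), the loss terms, and the weights `w_rate` and `w_overlap` are assumptions chosen for illustration. Each dubbed phrase gets a start time and a duration; the loss pulls each phrase toward the source phrase's time window, penalizes deviation from the phrase's natural spoken duration, and penalizes overlap with the next phrase. Central finite differences stand in for the automatic differentiation that a numeric library would normally provide.

```python
def loss(params, src, natural, w_rate=0.5, w_overlap=5.0):
    """Illustrative alignment loss (assumed form, not from the disclosure).

    params  : flat list [start_0, dur_0, start_1, dur_1, ...] in seconds
    src     : list of (start, end) windows of the source phrases
    natural : natural (unstretched) duration of each dubbed phrase
    """
    total = 0.0
    n = len(src)
    for i in range(n):
        start, dur = params[2 * i], params[2 * i + 1]
        src_start, src_end = src[i]
        # Keep the dubbed phrase close to the source phrase's window.
        total += (start - src_start) ** 2 + (start + dur - src_end) ** 2
        # Penalize speeding up or slowing down relative to natural pace.
        total += w_rate * (dur / natural[i] - 1.0) ** 2
        # Penalize a phrase running into the next one.
        if i + 1 < n:
            overlap = start + dur - params[2 * (i + 1)]
            total += w_overlap * max(0.0, overlap) ** 2
    return total


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient (a stand-in for autodiff)."""
    grad = []
    for k in range(len(x)):
        hi = x[:k] + [x[k] + eps] + x[k + 1:]
        lo = x[:k] + [x[k] - eps] + x[k + 1:]
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def align(src, natural, steps=3000, lr=0.01):
    """Plain gradient descent on the loss above."""
    # Initialize each phrase at its source start with its natural duration.
    params = []
    for (src_start, _), dur in zip(src, natural):
        params += [src_start, dur]

    def objective(p):
        return loss(p, src, natural)

    for _ in range(steps):
        grad = numeric_grad(objective, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

Calling `align([(0.0, 1.0), (1.1, 2.0)], [1.4, 0.8])` returns a flat list of start times and durations for two phrases. Because the solver only needs a gradient, any of the loss terms could be swapped for a non-quadratic alternative without changing `align`.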
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Sriram, KB, "Prosodic Alignment via Gradient Descent", Technical Disclosure Commons, (April 02, 2025)
https://www.tdcommons.org/dpubs_series/7960