This disclosure relates to contextualized transcription that incorporates non-verbal communication. Transcription applications today typically take input from audio-capturing endpoints such as microphones, apply voice/speech recognition algorithms, and then perform speech-to-text transformation to transcribe the user's speech. A transcript produced this way omits non-verbal cues, which can carry contextual value. This disclosure proposes methods that combine video and audio inputs into a single transcription: camera AI infers non-verbal inputs from video and adds them to the transcript as context, complementing the audio transcription. The camera AI looks for common gestures, motions, and activities and adds them to the transcription, and it can learn new behaviors of participants.
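
One way to picture the combination of audio and video inputs is as a merge of two timestamped streams. The sketch below is illustrative only and is not the disclosed method: it assumes speech recognition and camera AI have already produced their outputs, and the names `SpeechSegment`, `GestureEvent`, and `merge_transcript` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpeechSegment:
    """Hypothetical output of the speech-to-text stage."""
    start: float   # seconds from the start of the session
    speaker: str
    text: str


@dataclass
class GestureEvent:
    """Hypothetical output of the camera AI stage."""
    start: float       # seconds from the start of the session
    participant: str
    label: str         # e.g. "nods", "raises hand"


def merge_transcript(speech, gestures):
    """Interleave speech and non-verbal events by timestamp.

    Non-verbal events are rendered in square brackets so they are
    distinguishable from spoken text. When timestamps tie, speech
    is listed before the gesture.
    """
    entries = []
    for s in speech:
        entries.append((s.start, 0, f"{s.speaker}: {s.text}"))
    for g in gestures:
        entries.append((g.start, 1, f"[{g.participant} {g.label}]"))
    entries.sort(key=lambda e: (e[0], e[1]))
    return [f"{t:.1f}s {line}" for t, _, line in entries]


if __name__ == "__main__":
    speech = [
        SpeechSegment(0.0, "Alice", "Do we agree?"),
        SpeechSegment(3.0, "Bob", "Yes."),
    ]
    gestures = [GestureEvent(1.5, "Carol", "nods")]
    for line in merge_transcript(speech, gestures):
        print(line)
```

In this example, Carol's nod appears between the two spoken lines, recording an agreement that an audio-only transcript would miss.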

