TransVIP

Advanced Speech to Speech Translation System with Voice and Isochrony Preservation

Overview

Figure: Overview of our speech-to-speech translation framework, which consists of 1) a joint encoder-decoder model that translates speech into target text and coarse-grained speech tokens; 2) a non-autoregressive acoustic model that adds acoustic details; and 3) a codec model that converts the discrete speech tokens back to a waveform.

We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion yet supports end-to-end inference through joint probability. Furthermore, we propose two separate encoders that preserve the speaker’s voice characteristics and the isochrony of the source speech during translation, making the system well suited to scenarios such as video dubbing.
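The three-stage data flow described above can be sketched in a few lines of Python. This is only an illustrative toy, not the TransVIP implementation: every function name, the constants `HOP` and `NUM_LAYERS`, and the arithmetic inside each stage are hypothetical stand-ins for trained models. The sketch only shows which stage consumes which output.

```python
from dataclasses import dataclass
from typing import List, Tuple

HOP = 4          # hypothetical: waveform samples per speech token
NUM_LAYERS = 3   # hypothetical: number of fine-grained codebook layers


@dataclass
class Stage1Output:
    text: str                 # translated target text
    coarse_tokens: List[int]  # coarse-grained speech tokens


def joint_translate(source_wave: List[float]) -> Stage1Output:
    """Stage 1 stand-in: a real model would decode text and coarse tokens jointly."""
    n_tokens = len(source_wave) // HOP
    # Toy rule: quantize the mean of each HOP-sample frame to an integer in [0, 9].
    tokens = [
        min(9, int(abs(sum(source_wave[i * HOP:(i + 1) * HOP]) / HOP) * 10))
        for i in range(n_tokens)
    ]
    return Stage1Output(text="<translated text>", coarse_tokens=tokens)


def acoustic_model(coarse_tokens: List[int]) -> List[List[int]]:
    """Stage 2 stand-in: produce fine-grained tokens for every position at once."""
    # Non-autoregressive in spirit: each output depends only on the coarse
    # input, never on previously generated fine tokens.
    return [[(tok + layer) % 10 for tok in coarse_tokens]
            for layer in range(1, NUM_LAYERS + 1)]


def codec_decode(coarse_tokens: List[int],
                 fine_tokens: List[List[int]]) -> List[float]:
    """Stage 3 stand-in: map the token stack at each position back to samples."""
    wave: List[float] = []
    for pos, tok in enumerate(coarse_tokens):
        total = tok + sum(layer[pos] for layer in fine_tokens)
        level = total / (10.0 * (NUM_LAYERS + 1))  # normalize into [0, 1)
        wave.extend([level] * HOP)
    return wave


def translate_speech(source_wave: List[float]) -> Tuple[str, List[float]]:
    """Chain the three stages: translation, acoustic detail, waveform decoding."""
    stage1 = joint_translate(source_wave)
    fine = acoustic_model(stage1.coarse_tokens)
    return stage1.text, codec_decode(stage1.coarse_tokens, fine)


text, wave = translate_speech([0.5] * 16)
print(text, len(wave))
```

In this toy the output length simply mirrors the input length. That is a simplification chosen for the sketch and should not be read as how TransVIP enforces isochrony.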

Video Dubbing

Audio Clips

Audio in the CVSS-T column is synthesized from the ground-truth (GT) translation label. The source speech and text labels come from the CoVoST2 dataset.

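Columns: Source | CVSS-T | TransVIP | GT Text Label / TransVIP Text Out

In each pair below, the first line is the GT text label and the second is the TransVIP text output.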

—————————————————-

Before contacting us, check out our frequently asked questions.

Before contacting us, let's look at the questions we have.

—————————————————-

The Court of Auditors told you that.

The Court of Auditors told you that.

—————————————————-

The Government understood it.

The Government understood it.

—————————————————-

if they come or not.

Whether he's coming or not.

—————————————————-

The Front de gauche member of parliaments don’t understand you.

The parliamentarians of the front-left do not understand you.

—————————————————-

It is true, and that’s an absolute positive step forward.

It's accurate! And it's a completely positive step.

—————————————————-