Good quality with 7-8 second generation times, using about 4.5 GB of VRAM #610

Pandaily591 · 2023-09-25T05:33:15Z

I've been tweaking Tortoise for a little while and got something I'm happy with.

You can find the code here:
https://github.com/Pandaily591/OnlySpeakTTS

Essentially:

Reduce the batch size to 1, regardless of VRAM.
Keep all models loaded in memory.
Skip loading the model that picks the best output, since there is only one candidate.
Reduce autoregressive samples to 1 and diffusion iterations to 7-12.
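
For anyone who wants the settings above in one place, here is a rough sketch. It is not taken from the OnlySpeakTTS code. `FastSettings` is a hypothetical helper, and the keyword names (`autoregressive_batch_size`, `num_autoregressive_samples`, `diffusion_iterations`, `k`) are assumed to match what upstream tortoise-tts accepts, so check them against the version you run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastSettings:
    """Hypothetical container for the low-latency settings listed above."""

    autoregressive_batch_size: int = 1   # batch size 1, regardless of VRAM
    num_autoregressive_samples: int = 1  # a single candidate
    diffusion_iterations: int = 10       # the post suggests 7-12
    k: int = 1                           # one output, so nothing to rank

    def __post_init__(self):
        # Reject values outside the range suggested in the post.
        if not 7 <= self.diffusion_iterations <= 12:
            raise ValueError("diffusion_iterations should be between 7 and 12")

    def constructor_kwargs(self):
        # Assumed to be passed when the TTS object is created.
        return {"autoregressive_batch_size": self.autoregressive_batch_size}

    def tts_kwargs(self):
        # Assumed to be passed on each generation call.
        return {
            "num_autoregressive_samples": self.num_autoregressive_samples,
            "diffusion_iterations": self.diffusion_iterations,
            "k": self.k,
        }


settings = FastSettings()
print(settings.constructor_kwargs())
print(settings.tts_kwargs())
```

If those names match your install, the two dicts would be unpacked into the constructor and the generation call respectively. Creating the TTS object once and reusing it is what keeps the models loaded in memory.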

There are some other tweaks as well, and some remaining issues that can be resolved.

Generating a voice from clips involves some randomization, so you may get a bad voice.
Keep generating voices and testing them until you find one that performs well on different sentences.
Save that voice's tensors to files and load them in the future, instead of generating new ones each time.
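
The save-and-reload step can be sketched roughly like this. It is not the OnlySpeakTTS implementation. `load_or_create_voice` and the `generate` callable are hypothetical names, and the sketch uses `pickle` from the standard library so it runs without PyTorch. For real tensors, `torch.save` and `torch.load` would be the more usual choice.

```python
import pickle
from pathlib import Path


def load_or_create_voice(path, generate):
    """Return the voice stored at `path`, generating and saving it if missing.

    `generate` is a hypothetical zero-argument callable that returns the
    voice data (for example, conditioning tensors computed from clips).
    """
    path = Path(path)
    if path.exists():
        # Reuse the voice that was already tested and kept.
        # Only unpickle files you created yourself.
        with path.open("rb") as f:
            return pickle.load(f)

    voice = generate()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(voice, f)
    return voice
```

Since generation is randomized, you would only write the file once you have checked the voice on a few different sentences. Deleting the file forces a new voice on the next call.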

There is an example video if you're curious.
