
Hoping that future versions can support the CTranslate2 inference framework :) #2

Open
YLQY opened this issue Oct 20, 2023 · 12 comments

Comments

@YLQY

YLQY commented Oct 20, 2023

Hello, I am very happy that the UnitySentis demo has finally arrived; it was a pleasant surprise. I am a speech recognition algorithm engineer, and a problem we run into constantly is model inference speed, that is, real-time performance. Large models such as ChatGPT and whisper-large-v2 are very popular right now, and I would like to use them in UnitySentis. We have tested the ONNX inference speed of Whisper on Linux, and unfortunately it is still too slow to be usable. That leaves two options. The first is to shrink the model, for example by using whisper-small or whisper-base, but this costs some accuracy. The second is an inference library we were happy to find, CTranslate2 (https://github.com/OpenNMT/CTranslate2). We only need to export the model to the format CTranslate2 requires, and with its acceleration the real-time factor (RTF) of whisper-large-v2 improves significantly, dropping as low as 0.03 on an A30 machine; speed on CPU also improves significantly. CTranslate2 also supports most of the current mainstream Transformer models, so it would be great if UnitySentis could draw on or integrate CTranslate2 in a future version. :)
I hope UnitySentis keeps getting better and better :)

Here are some links you may want to use
https://github.com/OpenNMT/CTranslate2
https://opennmt.net/CTranslate2
https://github.com/guillaumekln/faster-whisper/tree/master
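
For readers unfamiliar with the metric: the real-time factor (RTF) quoted above is processing time divided by audio duration, so an RTF of 0.03 means one minute of audio is transcribed in about 1.8 seconds. A minimal Python sketch of the arithmetic follows; the function name and the sample numbers are illustrative and not taken from CTranslate2 or Sentis.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical measurement: 60 s of audio transcribed in 1.8 s.
rtf = real_time_factor(1.8, 60.0)
print(f"RTF = {rtf:.2f}")
```

An RTF below 1.0 means transcription runs faster than the audio plays, which is the minimum requirement for real-time use.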

@LiutaurasVysniauskasUnity

LiutaurasVysniauskasUnity commented Dec 5, 2023

Thank you for the suggestion! This is known internally as Task 257

@elephantpanda

Hi, thanks. We now have a Whisper-Tiny model example on Hugging Face. Please try it out and give us your feedback. Thanks for the links, we'll check them out.

@YLQY
Author

YLQY commented Feb 23, 2024

@pauldog Hi, thanks for the Whisper-Tiny model example on Hugging Face.
I tried the example at the following link:
https://huggingface.co/unity/sentis-whisper-tiny
I have a few questions.


  1. In the file https://huggingface.co/unity/sentis-whisper-tiny/blob/main/RunWhisper.cs, line 215, an error is reported in my Visual Studio. I want to know the encoding of this file.
    image

Then I tried the following:
image

The decoded result for the sample file is wrong. When I decode the provided sample audio, the output is [Music], which is not what the audio contains:
https://huggingface.co/unity/sentis-whisper-tiny/blob/main/answering-machine16kHz.wav
image

Then I tried Chinese transcription by changing the language special token ID to the Chinese one, and got some output that could not be decoded.
image
image

I feel like these exceptions are related to the encoding of RunWhisper.cs, line 215


  2. I want to know how to obtain the sentis files (AudioDecoder_Tiny.sentis, AudioEncoder_Tiny.sentis) used in the sample :)

I tried fine-tuning the whisper-tiny model, as well as small, large-v2, and others. I then used optimum-cli from Hugging Face to export the model to ONNX, and converted the ONNX model to UnitySentis. For the tiny model, I replaced AudioDecoder_Tiny.sentis and AudioEncoder_Tiny.sentis in the sample with my own files.
This is the result after I run
optimum-cli export onnx --model ./whisper-tiny --task automatic-speech-recognition whisper-tiny-onnx/
This is my original whisper-tiny model
https://huggingface.co/openai/whisper-tiny
image
image
But when I run it I get the following results
image
So I cannot get my fine-tuned model to work in the example.

@elephantpanda

Hi, when I download the file and put it in Wordpad I get:

bool IsWhiteSpace(char c) { return !(('!' <= c && c <= '~') || ('¡' <= c && c <= '¬') || ('®' <= c && c <= 'ÿ')); }
If you copy the text from the webpage, it doesn't work. We could look into fixing that. I think this should work for all countries. Otherwise, you can compare against the standard extended ASCII set of codes from 0...255: https://www.ascii-code.com/

For your second question: you could change the names of the inputs on line 157 to the names of the inputs in your ONNX file. It may just have been exported with different names.

@elephantpanda

elephantpanda commented Feb 23, 2024

Hi, when I download the CS file and open it in WordPad I get:
bool IsWhiteSpace(char c) { return !(('!' <= c && c <= '~') || ('¡' <= c && c <= '¬') || ('®' <= c && c <= 'ÿ')); }
It doesn't work if you copy it off the webpage. It should work for all languages. If not, you can compare it against the extended ASCII table; all codes are in the range 0..255 (0x0000...0x00FF). This is probably not the most elegant bit of code, and I'm open to suggestions to simplify it.

For your second question, you can change the input names on lines 157-162 to the names your ONNX file was exported with. So you would change these to "encoder_hidden_states" and "input_ids".

Also, note that some languages may work better than others. It depends on the training data that was used.
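
One possible simplification, sketched here in Python rather than the C# of RunWhisper.cs, is to compare numeric code points instead of literal characters. The check then no longer depends on how the source file's encoding is interpreted, which appears to be what goes wrong when the text is copied from the webpage. The function name mirrors the C# one, and the hexadecimal bounds are the code points of '!', '~', '¡', '¬', '®' and 'ÿ'.

```python
# Code-point ranges that the original check treats as non-whitespace.
# 0x21-0x7E is '!'..'~', 0xA1-0xAC is '¡'..'¬', 0xAE-0xFF is '®'..'ÿ'.
NON_WHITESPACE_RANGES = ((0x21, 0x7E), (0xA1, 0xAC), (0xAE, 0xFF))


def is_whitespace(c: str) -> bool:
    """Return True when the character falls outside all three ranges."""
    code_point = ord(c)
    return not any(low <= code_point <= high for low, high in NON_WHITESPACE_RANGES)


print(is_whitespace(" "), is_whitespace("A"), is_whitespace("\u00ad"))
```

The same idea carries over to C# by comparing the character against integer constants.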

@YLQY
Author

YLQY commented Feb 28, 2024

@pauldog
Hi, thank you very much for your help :)
I have solved both problems using the method you described.
Regarding the garbled characters, I ran the following experiment:
The sample code decodes the Whisper output one token at a time.

I found that in Chinese, a single character may be made up of two or three tokens. So if the accumulated Whisper output is decoded as a whole rather than token by token, the garbled characters disappear. Below are my modified code and a screenshot of it running.

image
image
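
The effect described above can be reproduced outside Unity. Whisper's tokenizer works on UTF-8 bytes, so a Chinese character (three bytes) can be split across tokens, and decoding each token on its own yields invalid fragments. A Python sketch, assuming each token has already been mapped to its raw bytes; the byte split below is made up for illustration and is not real Whisper token output:

```python
import codecs


def decode_per_token(token_bytes):
    """Decode each token separately; split characters become U+FFFD."""
    return "".join(t.decode("utf-8", errors="replace") for t in token_bytes)


def decode_accumulated(token_bytes):
    """Join all bytes first, then decode once."""
    return b"".join(token_bytes).decode("utf-8", errors="replace")


def decode_streaming(token_bytes):
    """Decode incrementally, buffering incomplete characters between tokens."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(t) for t in token_bytes]
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# "你好" is six UTF-8 bytes; here they are split across three hypothetical tokens.
tokens = [b"\xe4\xbd", b"\xa0\xe5", b"\xa5\xbd"]
print(decode_per_token(tokens))    # garbled
print(decode_accumulated(tokens))  # 你好
print(decode_streaming(tokens))    # 你好
```

The streaming variant may suit live transcription, since it emits each character as soon as its last byte arrives instead of re-decoding the whole history.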

@YLQY
Author

YLQY commented Mar 14, 2024

@pauldog
Hello, can you share how you obtained the LogMelSepctro.sentis model? How did you get its original ONNX model? I am trying to implement Whisper's feature extraction in C++, but it does not run once packaged for Android. I am still trying to work out whether the problem is the DLL file. This is a bit difficult for me :(

Here is what I have so far. I have verified that it runs on Windows, but it does not run on Android.
https://github.com/YLQY/FBankInterface

With Optimum, I can only generate the encoder and decoder models.

@YLQY
Author

YLQY commented Mar 15, 2024

@pauldog
Hello, I have successfully run the offline whisper-tiny model on the android side. Many thanks to UnitySentis and the examples given.

I will share this demo later

The following is the direction I want to take next. Based on my experience so far, recognition quality needs to improve. Besides fine-tuning the Whisper model, there is also quantization, which can greatly reduce the model size. Here is the size of my tiny model before and after quantization.
image
(As you can see, the model size is significantly reduced.)

I use huggingface/optimum to quantize the model.
Here are the commands I use:

optimum-cli export onnx --model ./whisper-tiny/ --task automatic-speech-recognition whisper-tiny-onnx
optimum-cli onnxruntime quantize --arm64 --onnx_model whisper-tiny-onnx -o qu_whisper-tiny-onnx

But when I imported the quantized model, UnitySentis reported an error. The error message is below. It seems that some operators are not supported.
image

Does UnitySentis not support these operators yet, or is there something wrong with how I converted and quantized the model?

If this problem can be solved, then in our experience a model at least the size of whisper-small (244 M parameters) can deliver a good offline experience.

Here are some links that may be helpful:
https://huggingface.co/docs/optimum/index
https://github.com/huggingface/optimum
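
As a rough guide to why quantization matters for running whisper-small on a device, weight storage scales with parameter count times bytes per parameter. The Python sketch below assumes every weight is stored at the same precision and ignores file-format overhead, so real exported files will differ (in practice not every tensor is quantized). The 244 M figure comes from the comment above; 39 M is the published parameter count for whisper-tiny.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Approximate weight storage in MB (10**6 bytes) at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for name, params in [("whisper-tiny", 39_000_000), ("whisper-small", 244_000_000)]:
    sizes = ", ".join(
        f"{dtype}: {weight_megabytes(params, dtype):.0f} MB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

Under these assumptions, int8 weights take about a quarter of the float32 footprint.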

@elephantpanda

Hi @YLQY. You converted them correctly. I can tell you that the Sentis team is working on quantization support. The first step will come in the next version, 1.4.0, which is due out in the next couple of weeks.

@YLQY
Author

YLQY commented Mar 16, 2024

@pauldog
Hello, this is my demo. I found another problem: if I test about 4 audio clips in a row, the app crashes for no apparent reason. Do you have any idea why? Could it be that the tensors are not being released?

@YLQY
Author

YLQY commented Apr 11, 2024

Hello @pauldog, I am very happy to see that UnitySentis has reached version 1.4. I couldn't wait to try it out, and I also found the code update on Hugging Face. However, the quantized ONNX model I converted with the method above still cannot be loaded. The error message is below; I hope it is helpful to you.

image

@PaulBirdUnity

Hi, you are right, some aspects of quantization are still not supported, such as QuantizeLinear. Apologies. If you have an un-quantized model, you can quantize the weights. There is an example of this among the samples in the Package Manager.
Please also feel free to report this issue here: https://discussions.unity.com/c/ai-beta/sentis/10 as this is the main place where we log our issues.
You can also vote and leave a message at https://unity.com/roadmap/unity-ai/sentis by clicking the box labelled "quantization for performance" to request it and send a comment.
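
For context, QuantizeLinear is the ONNX operator that maps floating-point values to 8-bit integers using a scale and a zero point, and DequantizeLinear maps them back. Below is a scalar Python sketch of the uint8 case as described in the ONNX operator specification. The helper names and sample values are illustrative, and this is not Sentis code.

```python
def quantize_linear(x: float, scale: float, zero_point: int) -> int:
    """uint8 QuantizeLinear: round half to even, add zero point, saturate."""
    q = round(x / scale) + zero_point  # Python's round() rounds halves to even
    return max(0, min(255, q))


def dequantize_linear(q: int, scale: float, zero_point: int) -> float:
    """DequantizeLinear: recover an approximation of the original value."""
    return (q - zero_point) * scale


# Hypothetical per-tensor parameters chosen for illustration.
scale, zero_point = 0.02, 128
for x in (-1.0, 0.0, 0.5, 10.0):
    q = quantize_linear(x, scale, zero_point)
    print(x, "->", q, "->", dequantize_linear(q, scale, zero_point))
```

Values outside the representable range, such as 10.0 here, saturate at the uint8 limit and cannot be recovered after dequantization.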
