Merge pull request #409 from uezo/support-nijivoice-speech-synthesizer

Add support for NijiVoice as a speech synthesizer
uezo · Dec 12, 2024 · d7854ac · d7854ac
2 parents 50be0d0 + 6c68894
commit d7854ac
Show file tree

Hide file tree

Showing 4 changed files with 208 additions and 4 deletions.
diff --git a/README.ja.md b/README.ja.md
@@ -10,7 +10,7 @@ ChatdollKitは、お好みの3Dモデルを使って音声対話可能なチャ
 
 - **生成AI対応**: ChatGPT、Anthropic Claude、Google Gemini Pro、Difyなど、複数のLLMをサポートし、ファンクションコーリング（ChatGPT/Gemini）やマルチモーダル機能にも対応
 - **3Dモデル表現**: 発話とモーションの同期、表情やアニメーションの自律制御、瞬きや口の動きの同期をサポート
-- **対話制御**: 音声認識と音声合成（OpenAI、Azure、Google、Watson、VOICEVOX、VOICEROIDなど）の統合、対話状態（コンテキスト）の管理、意図抽出とトピックのルーティング、ウェイクワード検出をサポート
+- **対話制御**: 音声認識と音声合成（OpenAI、Azure、Google、Watson、VOICEVOX / AivisSpeech、Style-Bert-VITS2、にじボイスなど）の統合、対話状態（コンテキスト）の管理、意図抽出とトピックのルーティング、ウェイクワード検出をサポート
 - **マルチプラットフォーム**: Windows、Mac、Linux、iOS、Android、およびその他のUnityサポートプラットフォーム（VR、AR、WebGLを含む）に対応
 
 
@@ -414,7 +414,7 @@ ChatdollKitはこのCoTの手法に、`<thinking> ~ </thinking>`の中身を読
 
 ## 🗣️ Speech Synthesizer (Text-to-Speech)
 
-音声合成サービスとしてクラウドサービスとして提供されるGoogle、Azure、OpenAI、Watsonをサポートするほか、キャラクターとしてより魅力的な音声を提供するVOICEVOX、VOICEROID、Style-Bert-VITS2をサポートします。
+音声合成サービスとしてクラウドサービスとして提供されるGoogle、Azure、OpenAI、Watsonをサポートするほか、キャラクターとしてより魅力的な音声を提供するVOICEVOX / AivisSpeech、VOICEROID、Style-Bert-VITS2, にじボイスをサポートします。
 音声合成サービスを使用するには、`ChatdollKit/Scripts/SpeechSynthesizer`の各サービス名が含まれる`SpeechSynthesizer`をAIAvatarオブジェクトにアタッチして、`IsEnabled`にチェックを入れてください。すでに他のSpeechSynthesizerがアタッチされている場合、使用しないSpeechSynthesizerの`IsEnabled`はチェックを外す必要がある点に注意してください。
 
 アタッチしたSpeechSynthesizerには、APIキーやエンドポイントなどのパラメーターをインスペクター上で設定することができます。これらのパラメーターの意味や設定すべき値等については各TTSサービス・製品のAPIリファレンスを参照してください。

diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@
 
 - **Generative AI Native**: Supports multiple LLMs like ChatGPT, Anthropic Claude, Google Gemini Pro, Dify, and others, with function calling (ChatGPT/Gemini) and multimodal capabilities.
 - **3D model expression**: Synchronizes speech and motion, controls facial expressions and animations autonomously, supports blinking and lip-sync.
-- **Dialog control**: Integrates Speech-to-Text and Text-to-Speech (OpenAI, Azure, Google, Watson, VOICEVOX, VOICEROID, etc.), manages dialog state (context), extracts intents and routes topics, supports wakeword detection.
+- **Dialog control**: Integrates Speech-to-Text and Text-to-Speech (OpenAI, Azure, Google, VOICEVOX / AivisSpeech, Style-Bert-VITS2, NijiVoice etc.), manages dialog state (context), extracts intents and routes topics, supports wakeword detection.
 - **Multi platforms**: Compatible with Windows, Mac, Linux, iOS, Android, and other Unity-supported platforms, including VR, AR, and WebGL.
 
 
@@ -420,7 +420,7 @@ You can customize the tag by setting a preferred word (e.g., "reason") as the `T
 
 ## 🗣️ Speech Synthesizer (Text-to-Speech)
 
-We support cloud-based speech synthesis services such as Google, Azure, OpenAI, and Watson, in addition to VOICEVOX, VOICEROID, and Style-Bert-VITS2 for more characterful and engaging voices. To use a speech synthesis service, attach `SpeechSynthesizer` from `ChatdollKit/Scripts/SpeechListener` to the AIAvatar object and check the `IsEnabled` box. If other `SpeechSynthesizer` components are attached, make sure to uncheck the `IsEnabled` box for those not in use.
+We support cloud-based speech synthesis services such as Google, Azure, OpenAI, and Watson, in addition to VOICEVOX / AivisSpeech, VOICEROID, Style-Bert-VITS2, and NijiVoice for more characterful and engaging voices. To use a speech synthesis service, attach `SpeechSynthesizer` from `ChatdollKit/Scripts/SpeechListener` to the AIAvatar object and check the `IsEnabled` box. If other `SpeechSynthesizer` components are attached, make sure to uncheck the `IsEnabled` box for those not in use.
 
 You can configure parameters like API keys and endpoints on the attached `SpeechSynthesizer` in the inspector. For more details of these parameters, refer to the API references of TTS services.
 

diff --git a/Scripts/SpeechSynthesizer/NijiVoiceSpeechSynthesizer.cs b/Scripts/SpeechSynthesizer/NijiVoiceSpeechSynthesizer.cs
@@ -0,0 +1,193 @@
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Threading;
+using UnityEngine;
+using UnityEngine.Networking;
+using Cysharp.Threading.Tasks;
+using ChatdollKit.IO;
+using ChatdollKit.Network;
+
+namespace ChatdollKit.SpeechSynthesizer
+{
+    public class NijiVoiceSpeechSynthesizer : SpeechSynthesizerBase
+    {
+        public bool _IsEnabled = true;
+        public override bool IsEnabled
+        {
+            get
+            {
+                return _IsEnabled;
+            }
+            set
+            {
+                _IsEnabled = value;
+            }
+        }
+
+        public string EndpointUrl;
+        public string ApiKey;
+
+        [Header("Voice Settings")]
+        public string VoiceActorId = "dba2fa0e-f750-43ad-b9f6-d5aeaea7dc16";
+        public float Speed = 1.0f;
+        [SerializeField]
+        private AudioType audioType = AudioType.WAV;
+
+        public List<VoiceModelSpeed> VoiceModelSpeeds;
+
+        [SerializeField]
+        protected bool printSupportedSpeakers;
+
+        private ChatdollHttp client;
+
+        private void Start()
+        {
+            client = new ChatdollHttp(Timeout);
+            if (printSupportedSpeakers)
+            {
+                _ = ListSpeakersAsync(CancellationToken.None);
+            }
+        }
+
+        // Get audio clip from NijiVoice API
+        protected override async UniTask<AudioClip> DownloadAudioClipAsync(string text, Dictionary<string, object> parameters, CancellationToken token)
+        {
+            if (token.IsCancellationRequested) { return null; };
+
+            var textToSpeech = text.Replace(" ", "").Replace("\n", "").Trim();
+            if (string.IsNullOrEmpty(textToSpeech) || textToSpeech == "」") return null;
+
+            // Generate audio data on NijiVoice server
+            var url = (string.IsNullOrEmpty(EndpointUrl) ? "https://api.nijivoice.com" : EndpointUrl) + $"/api/platform/v1/voice-actors/{VoiceActorId}/generate-voice";
+            var speed = Speed > 0 ? Speed : VoiceModelSpeeds.FirstOrDefault(v => v.id == VoiceActorId)?.speed ?? 1.0f;
+            var data = new Dictionary<string, string>() {
+                { "script", text },
+                { "speed", speed.ToString() },
+                { "format", audioType == AudioType.MPEG ? "mp3" : "wav" },
+            };
+            var headers = new Dictionary<string, string>() { { "Content-Type", "application/json" }, { "x-api-key", ApiKey } };
+            var generatedVoiceResponse = await client.PostJsonAsync<GeneratedVoiceResponse>(url, data, headers, cancellationToken: token);
+
+#if UNITY_WEBGL && !UNITY_EDITOR
+            return await DownloadAudioClipWebGLAsync(generatedVoiceResponse.generatedvoice.audioFileUrl, token);
+#else
+            return await DownloadAudioClipNativeAsync(generatedVoiceResponse.generatedvoice.audioFileUrl, token);
+#endif
+        }
+
+        protected async UniTask<AudioClip> DownloadAudioClipNativeAsync(string url, CancellationToken token)
+        {
+            using (var www = UnityWebRequestMultimedia.GetAudioClip(url, audioType))
+            {
+                www.timeout = Timeout;
+                www.method = "GET";
+
+                // Send request
+                try
+                {
+                    await www.SendWebRequest().ToUniTask(cancellationToken: token);
+                }
+                catch (Exception ex)
+                {
+                    Debug.LogError($"Error occured while processing NijiVoice text-to-speech: {ex}");
+                    return null;
+                }
+
+                return DownloadHandlerAudioClip.GetContent(www);
+            }
+        }
+
+        protected async UniTask<AudioClip> DownloadAudioClipWebGLAsync(string url, CancellationToken token)
+        {
+            var audioResp = await client.GetAsync(url, cancellationToken: token);
+            return AudioConverter.PCMToAudioClip(audioResp.Data);
+        }
+
+        public async UniTask ListSpeakersAsync(CancellationToken token)
+        {
+            if (printSupportedSpeakers)
+            {
+                Debug.Log("==== Supported speakers ====");
+            }
+
+            VoiceModelSpeeds.Clear();
+            foreach (var s in await GetSpearkersAsync(token))
+            {
+                if (printSupportedSpeakers)
+                {
+                    Debug.Log($"{s.Key}: {s.Value.name} ({s.Value.recommendedVoiceSpeed})");
+                }
+                VoiceModelSpeeds.Add(new VoiceModelSpeed(){ id = s.Key, speed = s.Value.recommendedVoiceSpeed });
+            }
+        }
+
+        private async UniTask<Dictionary<string, VoiceActorData>> GetSpearkersAsync(CancellationToken token)
+        {
+            var speakers = new Dictionary<string, VoiceActorData>();
+
+            var speakerResponse = await client.GetJsonAsync<SpeakersResponse>(
+                (string.IsNullOrEmpty(EndpointUrl) ? "https://api.nijivoice.com" : EndpointUrl) + "/api/platform/v1/voice-actors",
+                headers: new Dictionary<string, string>(){
+                    { "x-api-key", ApiKey }
+                },
+                cancellationToken: token);
+
+            foreach (var va in speakerResponse.voiceActors)
+            {
+                speakers.Add(va.id, va);
+            }
+
+            return speakers;
+        }
+
+        private class SpeakersResponse
+        {
+            public List<VoiceActorData> voiceActors;
+        }
+
+        [Serializable]
+        public class VoiceStyle
+        {
+            public int id;
+            public string style;
+        }
+
+        [Serializable]
+        public class VoiceActorData
+        {
+            public string id;
+            public string name;
+            public string nameReading;
+            public int age;
+            public string gender;
+            public int birthMonth;
+            public int birthDay;
+            public string smallImageUrl;
+            public string mediumImageUrl;
+            public string largeImageUrl;
+            public string sampleVoiceUrl;
+            public string sampleScript;
+            public float recommendedVoiceSpeed;
+            public List<VoiceStyle> voiceStyles;
+        }
+
+        [Serializable]
+        public class VoiceModelSpeed
+        {
+            public string id;
+            public float speed;
+        }
+
+        private class GeneratedVoiceResponse
+        {
+            public GeneratedVoice generatedvoice { get; set; }
+        }
+
+        private class GeneratedVoice
+        {
+            public string audioFileUrl { get; set; }
+            public int duration { get; set; }
+        }
+    }
+}
diff --git a/Scripts/SpeechSynthesizer/NijiVoiceSpeechSynthesizer.cs.meta b/Scripts/SpeechSynthesizer/NijiVoiceSpeechSynthesizer.cs.meta