J-Moshi

Towards a Japanese Full-duplex Dialogue System

Atsumoto Ohashi (大橋厚元), Shinya Iizuka (飯塚慎也), Jingjing Jiang (姜菁菁), Ryuichiro Higashinaka (東中竜一郎)

Graduate School of Informatics, Nagoya University

Abstract: Full-duplex spoken dialogue systems, which can model simultaneous bidirectional features of human conversation such as speech overlaps and backchannels, have attracted significant attention in recent years. However, few such systems exist for Japanese, and knowledge about how to develop them remains scarce. In this paper, we present and release the first full-duplex spoken dialogue model available in Japanese, built upon Moshi,[1] a major full-duplex dialogue model for English. Our model, J-Moshi,[2] is trained in two stages: pre-training on large-scale Japanese spoken dialogue data, followed by fine-tuning on high-quality stereo spoken dialogue data. We further improve the system's performance by incorporating synthetic dialogue data generated by a multi-stream text-to-speech system.

[1] For more details about Moshi, the base model of J-Moshi, please refer to the official technical report.
[2] In the demo videos on this page, the model name is shown as "J-Moshi" for clarity, but the model actually used is J-Moshi-ext, which is trained with additional data augmented by speech synthesis.

Real-time Spoken Dialogue

Samples of real-time spoken dialogue between J-Moshi and users.

Prompted Dialogue Continuation

20-second dialogue audio samples generated by each of the following models from a 10-second human-to-human dialogue audio prompt.

Re-synthesis: The actual 20-second dialogue audio, re-synthesized by Moshi's audio tokenizer Mimi
dGSLM: dGSLM trained on Japanese spoken dialogue data
J-Moshi: Moshi fine-tuned on Japanese spoken dialogue data
J-Moshi-ext: Moshi fine-tuned on Japanese spoken dialogue data and synthetic audio data generated by Multi-stream TTS[3]

In each of the following audio samples, the first 10 seconds, up to the bell, are the prompt audio, and the following 20 seconds are the audio generated by each model.
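For readers who want to inspect the prompt and the generated portion of a sample separately, the split described above can be sketched as follows. This is a minimal illustration, not part of the J-Moshi code: the function name `split_prompt_and_continuation`, the frame representation (a sequence with one entry per time step, such as `(left, right)` tuples), and the toy sample rate are all assumptions, and the sketch assumes the bell does not add extra duration at the boundary.

```python
from typing import Sequence, Tuple


def split_prompt_and_continuation(
    frames: Sequence,
    sample_rate: int,
    prompt_seconds: float = 10.0,
) -> Tuple[Sequence, Sequence]:
    """Split decoded audio frames into (prompt, generated continuation).

    `frames` is any sliceable sequence indexed by time step. The 10-second
    default follows the prompt length stated on this page; everything else
    here is a hypothetical helper for illustration.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")

    # Number of frames that belong to the prompt.
    prompt_len = int(round(prompt_seconds * sample_rate))
    if len(frames) < prompt_len:
        raise ValueError("audio is shorter than the prompt duration")

    # Everything before the boundary is the prompt; the rest is model output.
    return frames[:prompt_len], frames[prompt_len:]


if __name__ == "__main__":
    # Toy stereo signal: 30 seconds at an unrealistically low 4 Hz rate,
    # chosen only to keep the example small.
    toy_rate = 4
    toy_frames = [(i, -i) for i in range(30 * toy_rate)]
    prompt, generated = split_prompt_and_continuation(toy_frames, toy_rate)
    print(len(prompt) / toy_rate, "s prompt,", len(generated) / toy_rate, "s generated")
```

Because the function only relies on slicing along the time axis, the same logic applies to an array whose first axis is time; the real sample rate should be read from the audio file rather than assumed.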

[3] For details on Multi-stream TTS, please refer to Appendix C of the Moshi technical report.

Multi-stream TTS

Stereo (2-channel) dialogue audio samples synthesized from dialogue text using Multi-stream TTS.

Acknowledgments

This research was supported by the JST Moonshot R&D Program, Grant Number JPMJMS2011. The casual dialogue corpus and the consultation dialogue corpus were constructed in joint research with Aisin Corporation. This research also used Nagoya University's supercomputer "Flow". Finally, we would like to thank Kyutai Labs for releasing the Moshi technical report and models.

Contact

For inquiries regarding J-Moshi, please contact the Higashinaka Laboratory (Dialogue System Research Group) at Nagoya University.

This page was adapted from the SoundStorm project page.