Utilize Audio Models in Multi-Modal Chat

4 min read · Jun 23, 2024


Use different audio models to support multimodal chatting and categorize the types of audio models.

· Intro
· This article involves
· Method
· Synthesis
· Conclusion
· Full Code


Multimodal chat is one of the most common scenarios in modern AI applications, and it roughly revolves around the following:

  • Text response to text
  • Text output as image
  • Text output as audio (text to speech: TTS)
  • Audio (speech) output as text (speech to text: STT)
  • Text description generates music (text to music: TTM)
  • Description from image to text
  • Image to image

This article involves

  1. Text output as audio (text to speech: TTS)
  2. Audio (speech) output as text (speech to text: STT)
  3. Text description generates music (text to music: TTM)

We use the first and third methods to create synthesized audio, like telling a story and adding matching background music.

Using the second method, we can extract the speech content and run summarizing queries over it, as in a RAG setup.
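That second flow can be sketched as a small helper: take the transcript an STT model returns and wrap it in a chat prompt that asks for key points. (`build_summary_messages` is a hypothetical helper name of mine, not from the article's code; the message format is the common `{role, content}` shape used by chat-completion APIs.)

```python
def build_summary_messages(transcript: str) -> list:
    """Wrap an STT transcript in a chat prompt asking for key points."""
    return [
        {
            "role": "system",
            "content": "You summarize transcripts into a short list of key points.",
        },
        {
            "role": "user",
            "content": f"Summarize the key points of this transcript:\n\n{transcript}",
        },
    ]
```

The resulting messages list would then be passed to whatever chat model you use (e.g. `client.chat.completions.create(...)` in the OpenAI SDK) to produce the summary.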


On Hugging Face, there are a lot of models. This article isn’t about showcasing those amazing models; instead, it just talks about one particular implementation.

  • Text output as audio (text to speech: TTS)

I chose OpenAI’s TTS-1; you can check the specific documentation here.

def text2speech(text: str, voice_type: str = "alloy") -> str:
    """Convert text to speech, return the full filepath to the speech."""
    client = OpenAI()
    random_filename = create_random_filename(".mp3")
    speech_file_path = f"./tmp/{random_filename}"
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice_type,
        input=text,
    )
    # stream the returned audio bytes straight to disk
    response.stream_to_file(speech_file_path)
    return speech_file_path
  • Audio(speech) output as text (speech to text: STT)

Similarly, the popular Whisper is a good choice; the documentation is here.

def speech2text(base64_speech_audio: bytes) -> str:
    """Convert base64-encoded speech audio to text and return the transcription."""
    audio_file_name = "./tmp/temp.wav"
    # decode the base64 payload and write it to a temporary wav file
    bytes_io = io.BytesIO(base64.b64decode(base64_speech_audio))
    with open(audio_file_name, "wb") as audio_file:
        audio_file.write(bytes_io.read())
    torch_dtype = torch.float32 if DEVICE == "cpu" else torch.float16
    pipe_aud2txt = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",  # any Whisper checkpoint works here
        torch_dtype=torch_dtype,
        device=DEVICE,
    )
    res = pipe_aud2txt(audio_file_name)
    return res.get("text")
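Since speech2text expects base64-encoded bytes, the caller needs to encode the audio file first. A minimal sketch (the name `encode_audio_file` is my assumption, not from the article):

```python
import base64


def encode_audio_file(path: str) -> bytes:
    """Read an audio file from disk and return its base64-encoded bytes,
    matching the input format speech2text() expects."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read())
```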

I’m using this input audio here for testing:

As output, I first extracted the speech content, and then asked the model if it could summarize some key points and list them for me:

  • Text description generates music(text to music: TTM)

My choice is Facebook’s Audiocraft series; here is their homepage. There is also a Google alternative.

def text2music(prompt: str, duration=15) -> str:
    """Convert prompting text to music, return the full filepath to the music."""

    def _write_wav(output, file_initials):
        try:
            for idx, one_wav in enumerate(output):
                # save each sample as "{file_initials}_{idx}.wav",
                # normalizing loudness on the way out
                audio_write(
                    f"{file_initials}_{idx}",
                    one_wav.cpu(),
                    model.sample_rate,
                    strategy="loudness",
                )
            return True
        except Exception as e:
            print("Error while writing the file ", e)
            return None

    model = musicgen.MusicGen.get_pretrained("medium", device=DEVICE)
    model.set_generation_params(duration=duration)
    musicgen_out = model.generate([prompt], progress=True)
    musicgen_out_filename = f"./tmp/{create_random_name()}"
    _write_wav(musicgen_out, musicgen_out_filename)
    return f"{musicgen_out_filename}_0.wav"

These methods may not always be perfect, as new papers are constantly being published. The latest models are usually the ones worth trying out. I also recommend staying subscribed to various communities on LinkedIn to be among the first to know about the newest implementations of models, and of course, Hugging Face is the best tool.


Here’s a nice use case: I asked the bot to tell me a story and, based on the story’s context, generate some background music; then we combined the two. For the final synthesis we didn’t use any additional models; we just used ffmpeg. The text-to-speech and text-to-music steps, however, used exactly the tools described above.

def synthesize_audio(
    speech_text: str, prompt: str, voice_type: str = "alloy", duration=15
) -> str:
    """Generate an audio using the provided speech_text and background music audio generated from the prompt.
    Generate an audio from the speech_text and another background music from the prompt, then combine the two
    audio tracks into a single synthesis.
    """
    text2speech_file_fullpath = text2speech(speech_text, voice_type)
    text2music_file_fullpath = text2music(prompt, duration)
    synthesis_filename = f"./tmp/{create_random_filename('.mp3')}"
    # loop the music under the speech and mix the two streams down
    os.system(
        f"ffmpeg -i {text2speech_file_fullpath} -stream_loop -1 -i {text2music_file_fullpath} -filter_complex amix=inputs=2:duration=first:dropout_transition=2 {synthesis_filename} -y"
    )
    return synthesis_filename

Note that the audio mixing here isn’t done professionally: the volumes of the foreground and background tracks haven’t been adjusted.
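If you do want to balance the two tracks, ffmpeg’s volume filter can attenuate the background music before the amix step. A sketch of how the command could be built (`build_mix_command` and the 0.3 default are my assumptions, not part of the original code):

```python
def build_mix_command(speech_path: str, music_path: str, out_path: str,
                      music_volume: float = 0.3) -> str:
    """Build an ffmpeg command that lowers the music track (input 1)
    before mixing it under the speech track (input 0)."""
    filter_graph = (
        f"[1:a]volume={music_volume}[bg];"
        "[0:a][bg]amix=inputs=2:duration=first:dropout_transition=2"
    )
    return (
        f"ffmpeg -i {speech_path} -stream_loop -1 -i {music_path} "
        f'-filter_complex "{filter_graph}" {out_path} -y'
    )
```

A `music_volume` of 0.3 plays the music at 30% of its original level; tune it to taste before passing the command to `os.system` or `subprocess.run`.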

Here is the video of the case. I asked twice in total: once in Chinese with a joke, and then once in English:

Here is the output; because there’s no professional volume adjustment, the speech sounds unclear:


As of this writing, I can’t feed audio directly into GPT yet; that capability is probably still in beta testing. But once it arrives, the audio-to-text step won’t need a separate model, and I believe Anthropic and other providers will head the same way.

In short, with so many models on Hugging Face, there will always be alternatives to combine.

Full Code