Ezra Sandzer-Bell

OpenAI Announces Sora Text-to-Video, But Where's the Audio?

The internet caught fire in mid-February 2024 with the announcement of OpenAI’s Sora text-to-video model. Opinions on social media were divided down the middle, between people celebrating these new, high-quality AI videos and others who felt that generative AI had gone too far this time.


Have a look at the short overview below for a quick and informative take on Sora. The Code Report offers a funny but critical perspective on what's happening.



Sora is not publicly available yet, but a cohort of influencers have been granted private beta access to the AI tool and are publishing new videos daily on platforms like Twitter-X.


OpenAI has launched several other successful AI apps in the past, including ChatGPT and their text-to-image generator DALL-E 3. Each time they've rolled out one of these new apps, it was preceded by months of hype. So if history repeats itself, we should see Sora released to the public within the next year or so.


"Sora" is sorta... quiet?


Here's the catch. OpenAI hasn’t announced any plans to generate audio alongside Sora's visual content yet. All of their demos have been silent.


Those familiar with the company may be aware of their prior AI music models, Jukebox and MuseNet. It's unclear how a sparkly new text-to-video model would benefit from those legacy music models. They're roughly four to five years old at this point and Jukebox hasn't aged well against newer models like MusicGen.


So that's the golden opportunity that AI music startups caught on to. They've since rushed to promote their services as the perfect audio companion to OpenAI's AI video service. We sat back for a week to take notes and round up all the announcements for this article.


I anticipated this moment back in July 2023 and wrote about an AI film scoring technique using Meta's MusicGen model. My impression was that it could be used to create scores for movie scenes. Sora has helped me realize that AI audio will blow right past the red tape of the movie industry and into the hands of amateur creators using text-to-video tools like Sora.


In this article we'll share some demo reels from Sora, along with examples of tweets and LinkedIn posts from AI music companies thirsting to deliver audio for AI video. Some have already rolled out video-to-music features, while others are still in the aspirational phase. Give it six months and watch what happens.





Text-to-Video demos from OpenAI Sora



Above are some of the first Sora videos that OpenAI released. Their training data has not been published, so we don't know if or how it was licensed. But we know that it took millions of high-quality videos, including real-world footage and fantasy content.


Many of the videos diverge from the physical world in small ways, even when they aim for realism. This makes for mind-bending, surreal journeys like the one below.



The Sora clip above shows a close-up of ants maneuvering through a small tunnel underground. At first, the shot seems like it was recorded by a fiber optic camera. But the ant has only four legs instead of six. It's not real -- it's a fake!


Herein lies the problem. Will videos like this create a body of misinformation that makes it difficult to identify real-world footage? If artificial general intelligence (AGI) systems are trained on fake videos like this, could they arrive at wrong or dangerous conclusions about the world, leading to further disinformation?


These ethical problems are on everyone's mind, including OpenAI's. That's not stopping them from developing the models and pushing technology forward.


You can read a technical report on Sora here, detailing how they combine a diffusion model with a transformer architecture. The approach makes use of OpenAI’s language models to understand text prompts, while introducing new text-to-video prompting techniques.


AI music companies are creating Audio for Sora


Text prompts became a popular feature of AI song generation during 2023 and this trend has continued through 2024. When GPT-4 Vision came out and AI image-to-text captions went mainstream, a few music startups followed in their footsteps. They began implementing image-to-text-to-music workflows that allowed users to upload a picture and retrieve a compatible audio clip.


Within the week following OpenAI's announcement of Sora, several startups began promoting AI audio solutions for AI video. Here are a few of the examples that we spotted.


AI Video-to-Music: SoundGen showcases Sora workflow


SoundGen already has an image-to-music feature that syncs with embedded video. They released the video announcement below explaining that the app will soon generate music and sound effects for video.



We're particularly excited about SoundGen's upcoming inclusion in Audio Design Desk. They have one of the industry's best solutions for multi-track audio-video sync.


In the next few months, they're releasing ADD 2.0 with SoundGen embedded directly in the DAW. This means you'll be able to create custom music and sound effects for any video, whether generated by Sora or otherwise.


Popular AI music generation companies, like Splash Music and Cassette AI, already offer AI video-to-music automation. Each of these services operates in a similar way under the hood - they capture still images (keyframes) from a video, retrieve AI captions describing the scene, and then use those captions as text prompts to generate audio.
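
To make that keyframe, caption and prompt flow more concrete, here is a minimal Python sketch. It is not how Splash or Cassette actually implement their features; the function names, the evenly spaced sampling strategy, and the `caption_frame` callback (a stand-in for whatever image captioning model a service might call) are all assumptions made for illustration.

```python
from typing import Callable, List


def keyframe_times(duration_s: float, count: int) -> List[float]:
    """Return `count` timestamps at the midpoints of equal-length segments."""
    if duration_s <= 0 or count <= 0:
        raise ValueError("duration_s and count must be positive")
    segment = duration_s / count
    return [segment * (i + 0.5) for i in range(count)]


def captions_to_prompt(captions: List[str],
                       style_hint: str = "instrumental soundtrack") -> str:
    """Merge scene captions into one text prompt, dropping repeated captions."""
    unique: List[str] = []
    for caption in captions:
        cleaned = caption.strip().rstrip(".")
        if cleaned and cleaned.lower() not in (u.lower() for u in unique):
            unique.append(cleaned)
    return f"{style_hint} for a video showing: " + "; ".join(unique)


def video_to_music_prompt(duration_s: float,
                          caption_frame: Callable[[float], str],
                          count: int = 4) -> str:
    """Caption a few sampled keyframes and build a text-to-music prompt.

    `caption_frame` is hypothetical: it takes a timestamp in seconds and
    returns a caption for the frame at that time.
    """
    captions = [caption_frame(t) for t in keyframe_times(duration_s, count)]
    return captions_to_prompt(captions)


if __name__ == "__main__":
    # Stand-in captioner so the sketch runs without any external model.
    def fake_captioner(t: float) -> str:
        if t < 30:
            return "Ants crawling through a tunnel."
        return "A close-up of an ant."

    print(video_to_music_prompt(60.0, fake_captioner))
```

The resulting string would then be passed to a text-to-music model. Sampling more keyframes gives the prompt more scene detail, but the prompt still says nothing about motion or timing between frames.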


In the near future, machine learning models like MM-Action may be able to analyze and describe objects in motion, making this video-to-music system even more accurate. Of course, getting music cues to accurately match these visual events will be a unique challenge.


Using Sora to create music videos with AI tools


So what problems will Sora solve for the average musician? The most obvious use case I can think of is music video creation. Music videos are expensive, time-consuming, and creatively taxing. Why not reduce the effort and achieve better quality?


Most of the popular AI music video generators today use frame-by-frame animation techniques to turn a sequence of still images into trippy visuals. A big part of their appeal, aside from the unique aesthetic, is an audio-reactive engine.


Services like Neural Frames, Kaiber and Deforum Studio adapt camera behavior to the loudest audio transients on a targeted instrument, like the kick or snare of a song. This generates a stimulating synchronization between audio and visual mediums, akin to synesthesia, that Sora isn't currently equipped to provide.
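
As a rough illustration of what an audio-reactive engine does, the Python sketch below flags frames where the audio energy suddenly jumps and turns each hit into a camera zoom that decays back to normal. This is a simplified assumption, not the method used by Neural Frames, Kaiber or Deforum Studio; the function names, the energy-ratio threshold, and the `kick` and `decay` parameters are all hypothetical.

```python
from typing import List


def frame_energies(samples: List[float], frame_size: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame of audio."""
    return [
        sum(s * s for s in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def detect_transients(energies: List[float],
                      ratio: float = 4.0,
                      floor: float = 1e-6) -> List[int]:
    """Indices of frames whose energy jumps by `ratio` over the previous frame."""
    hits = []
    for i in range(1, len(energies)):
        previous = max(energies[i - 1], floor)  # avoid comparing against zero
        if energies[i] > floor and energies[i] > ratio * previous:
            hits.append(i)
    return hits


def zoom_curve(n_frames: int,
               transients: List[int],
               kick: float = 0.3,
               decay: float = 0.5) -> List[float]:
    """Per-frame zoom factor: bump on each transient, then ease back to 1.0."""
    hit_set = set(transients)
    zoom = 1.0
    curve = []
    for i in range(n_frames):
        zoom = 1.0 + (zoom - 1.0) * decay  # relax toward the neutral zoom
        if i in hit_set:
            zoom += kick                   # punch in on the drum hit
        curve.append(round(zoom, 4))
    return curve


if __name__ == "__main__":
    # Synthetic signal: silence, loud burst, silence, loud burst.
    audio = [0.0] * 4 + [1.0] * 4 + [0.0] * 4 + [1.0] * 4
    energies = frame_energies(audio, frame_size=4)
    hits = detect_transients(energies)
    print(hits, zoom_curve(len(energies), hits))
```

A real tool would first isolate the targeted instrument (for example, the kick drum) before looking for transients, and would map the hits to whatever camera parameters its renderer exposes.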


That being said, there are plenty of people on Twitter adding music to Sora. Many have tried using the same text prompts from OpenAI's Sora demos to generate songs in a similar style.


Our first example below comes from AmliArt, a popular influencer who reviews AI music apps and works with Splash. She loaded one of OpenAI's Sora demos into their music generator to come up with a fitting soundtrack. Have a listen here.


Creating a music video with Sora and Splash Music

Wondering how she did it? Below is a screenshot of the Splash Music Video dashboard. Users upload a video file (must be one minute or shorter) and click auto-generate to convert it into music. According to their CTO, the song is created by referencing multiple key frames and inferring the appropriate style.


Splash video-to-music button

Once the video has been analyzed and the AI music has been generated, a playback timeline appears. As you navigate across the audio track, the playhead location stays in sync with the video. Mute the video audio to hear just the Splash track, as shown below.


Splash music video timeline

An export button can be found in the upper right corner of Splash Music's video dashboard. The file will download in MP4 format and you can upload it to Twitter, Instagram, YouTube, etc.


A third example of video-to-music can be found here, from Cassette AI. On the day Sora dropped, they posed the question "Thoughts on Sora ???" and one person playfully replied, "The videos are so silent. If only there was a video input music ai..."


Cassette AI weighs in on Sora

Cassette's video length limit is two minutes, rather than the one-minute limit of Splash. We noticed some distortions when we tried downloading the video, but it nevertheless shows that AI music companies are watching this space closely.


Cassette AI's video to music feature

The list of music startups doesn't end here. We saw at least two other companies, SoundVerse and Beatoven, weighing in on Sora and showing how their service could be used with it. Neither of these apps offers video or image-to-text, so the prompts are being entered manually.


AI music generator Beatoven demos their app with Sora

Here's a lengthy post from SoundVerse about their method. They used GPT-4 Vision to analyze Sora video keyframes and then plugged the resulting descriptions into their text-to-music app.


AI music app Soundverse demos their app with Sora

Eleven Labs announces beta AI SFX for Sora


Most people know of Eleven Labs as an AI voice generator app. When Sora came out, they posted a tweet announcing an upcoming AI SFX generator, along with a form to apply to their closed beta located here. Perfect timing!

Former Google engineer Debarghya Das wrote the following tweet to an audience of ~6 million viewers about using Eleven Labs with Sora:


Debarghya Das comments on Sora and music for filmmakers

People aren't taking the "everyone will be filmmakers" seriously enough. I made this 20s trailer in 15mins with an OpenAI Sora clip, David Attenborough's voice on Eleven Labs, and sampling some nature music from Youtube on iMovie.

Responses to Das were mostly negative, ranging from insults about the video or music quality, all the way to personal attacks against him.


This is happening to a lot of influencers. AI audio-for-video has touched a nerve. People don’t want to see art polluted by low-quality artificial intelligence. But that fear might not be warranted. If things go well, we might see a golden era of creativity.


Blaine Brown adds music and SFX to Sora

In the post above, influencer Blaine Brown shares a video he created from several different Sora clips. He combined spoken monologue, sound effects and AI music to create a funny montage.


There's little risk that this kind of entertainment would be mistaken for a deepfake. On the contrary, it has many elements of ordinary human creativity, including the creative process of assembling separate pieces into a unified whole.


Still, there are legitimate concerns regarding deepfakes and misinformation that OpenAI will need to address. It's hard to imagine how they will do that effectively, even if they embed watermarks.


Companies creating video models have a very real, ethical responsibility to help preserve the foundations of our society. This will be one of the most pressing questions of our time. Will AI video be the straw that breaks civilization's back or will it unleash a new era of uninhibited creativity? Maybe, and probably, both.

