We will have to wait to find out more about Sora; the company is only providing it to a select few safety testers.
Sora, OpenAI's impressive new generative video model, can take a short text description and turn it into a detailed, high-definition video clip up to a minute long.
Four sample videos that OpenAI shared with MIT Technology Review ahead of today's announcement show that the San Francisco-based company has pushed the boundaries of text-to-video generation, a hot new line of research that we identified as a 2024 trend to watch.
According to OpenAI scientist Tim Brooks, “we think constructing models that can interpret video, and grasp all these really complicated interactions of our world, is a critical step for all future AI systems.”
The first generative models that could produce video from snippets of text appeared in late 2022. But those early examples, from the startup Runway and from Google and Meta, were jerky and pixelated. Since then, the technology has advanced rapidly. Runway's Gen-2 model, released last year, can produce short clips that come close to matching big-studio animation in quality. But most of these examples are still only a few seconds long.
The demo videos of OpenAI's Sora are crisp and detailed, and OpenAI claims the model can produce videos up to a minute long. A video of a street scene in Tokyo, in which the camera pans down to follow a couple as they walk past a row of stores, shows that Sora has learned how things fit together in three dimensions.
OpenAI also says Sora handles occlusion well. One problem with existing models is that they can fail to keep track of objects when they drop out of view. If a truck passes in front of a street sign, for example, the sign might not reappear afterward.
In a video of an underwater scene rendered in papercraft, Sora has inserted what look like cuts between different pieces of footage, and the model has kept a consistent style across them.
Not everything is flawless. In the Tokyo footage, the cars to the left look smaller than the people walking beside them, and they pop in and out from behind the tree branches. On long-term coherence, "there is undoubtedly some work to be done," says Brooks. For example, if someone goes out of view for a long time, they may not come back; the model seems to forget that they were supposed to be there.
As impressive as these sample videos are, they were no doubt cherry-picked to show Sora at its best. Without more information, it is hard to know how representative they are of the model's typical output.
It may be some time before we find out. OpenAI's announcement of Sora today is a tech tease, and the company says it has no current plans to release the model to the public. Instead, starting today, it is sharing the model with third-party safety testers for the first time.
The company is particularly worried about the potential misuse of fake but photorealistic footage. "We are being cautious about deployment here and making sure we have all our bases covered before we put this in the hands of the general public," says Aditya Ramesh, a scientist at OpenAI who developed the company's text-to-image model DALL-E.
But OpenAI does plan to release a product at some point in the future. In addition to safety testers, the company is sharing the model with a select group of artists and filmmakers to get feedback on how to make Sora as useful as possible to creative professionals. "The secondary objective is to showcase what these models are capable of, to give everyone an idea of what lies ahead," says Ramesh.
To build Sora, the team adapted the technology behind DALL-E 3, the latest version of OpenAI's popular text-to-image model. Like most text-to-image models, DALL-E 3 uses a diffusion model, a type of neural network trained to turn a jumble of random pixels into a picture.
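To make the idea of turning random pixels into a picture concrete, here is a deliberately simplified sketch of the loop a diffusion model runs when generating: start from pure noise, then repeatedly predict and subtract a little of that noise. This is not OpenAI's code; the toy_denoiser function is a hypothetical stand-in for a trained network, and the update rule is far simpler than a real sampler.

```python
# Minimal illustrative sketch of diffusion-style sampling (not OpenAI's code).
import numpy as np

def toy_denoiser(noisy_image, step):
    """Hypothetical stand-in for a trained network that predicts the noise
    present in `noisy_image` at a given diffusion step."""
    # A real model would be a large trained neural network; returning a
    # fraction of the input just lets the loop run end to end.
    return noisy_image * 0.1

def generate(shape=(64, 64, 3), steps=50, seed=0):
    rng = np.random.default_rng(seed)
    image = rng.normal(size=shape)          # start from a jumble of random pixels
    for step in reversed(range(steps)):
        predicted_noise = toy_denoiser(image, step)
        image = image - predicted_noise     # peel away a little noise each step
    return image

sample = generate()
print(sample.shape)  # (64, 64, 3): a toy "generated" image
```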
Sora takes this diffusion approach and applies it to video rather than still images. But the researchers added another technique to the mix: unlike DALL-E and most other generative video models, Sora combines its diffusion model with a type of neural network called a transformer.
Transformers excel at processing long sequences of data, such as words. That has made them the secret ingredient inside large language models like OpenAI's GPT-4 and Google DeepMind's Gemini. But videos are not made of words. Instead, the researchers had to find a way to cut videos into chunks that could be treated as if they were. The approach they came up with was to dice videos up in both space and time. "It is like cutting small cubes out of a stack of all the video frames," says Brooks.
Sora's transformer can then process these chunks of video data in much the same way that the transformer inside a large language model processes words in a block of text. The researchers say this let them train Sora on far more kinds of video than previous text-to-video models, including footage of different resolutions, durations, aspect ratios, and orientations.
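As an illustration of the "cutting small cubes out of a stack of video frames" idea, here is a minimal sketch of how a video array could be diced into spacetime patches and flattened into transformer-style tokens. The patch sizes and the NumPy layout are illustrative assumptions, not details OpenAI has published.

```python
# Illustrative spacetime-patch "tokenization" of a video (assumed sizes, not Sora's).
import numpy as np

def patchify(video, patch_t=4, patch_h=16, patch_w=16):
    """Cut a video of shape (frames, height, width, channels) into spacetime
    cubes and flatten each cube into one vector, i.e. one 'token'."""
    T, H, W, C = video.shape
    # Trim so the video divides evenly into patches (an illustrative choice).
    T, H, W = T - T % patch_t, H - H % patch_h, W - W % patch_w
    video = video[:T, :H, :W]
    patches = (
        video.reshape(T // patch_t, patch_t, H // patch_h, patch_h, W // patch_w, patch_w, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)                # group each cube's values together
             .reshape(-1, patch_t * patch_h * patch_w * C)  # one flat vector per cube
    )
    return patches

video = np.random.rand(16, 128, 128, 3)  # 16 frames of 128x128 RGB
tokens = patchify(video)
print(tokens.shape)                      # (256, 3072): 256 patch "tokens"
```

A transformer would then attend over these patch tokens much as a language model attends over word tokens, which is also why footage of varying resolution, duration, and aspect ratio can in principle be handled: it simply yields a different number of tokens.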
"From a technical viewpoint it appears like a very major leap forward," says Sam Gregory, executive director of Witness, a human rights organization that specializes in the use and misuse of video technology. "But there are two sides to the coin," he adds. The expressive capabilities could give many more people the chance to tell stories in video. And there are real potential avenues for misuse.
OpenAI is well aware of the risks that come with a generative video model. Deepfake images are already being widely abused, and photorealistic video takes that a step further.
Gregory notes that technology like this could be used to misinform people about conflict zones or protests. The range of styles is also intriguing, he says: if you could generate shaky footage that looked as if it had been shot on a phone, it would come across as more authentic.
Generative video went from zero to Sora in barely 18 months, but the technology is still in its infancy. As we enter this new territory, Gregory says, we should expect a mix of fully synthetic content and content made by humans.
The OpenAI team plans to draw on the safety testing it did last year for DALL-E 3. Sora already comes with a filter that runs on all prompts sent to the model, blocking requests for violent, sexual, or hateful images, as well as images of well-known people. Another filter will examine the frames of generated videos and block material that violates OpenAI's safety policies.
OpenAI says it is also adapting a fake-image detector developed for DALL-E 3 for use with Sora. And it will embed industry-standard C2PA tags, metadata that records how an image was made, in all of Sora's output. But none of these measures is foolproof. Fake-image detectors are not always accurate, metadata is easy to remove, and most social media platforms strip it from uploaded images by default.