Submitted by yea_okay_dude t3_yb8g9y in singularity

TTI generation has improved tremendously over the last couple of years, and we are now starting to see text-to-video. It seems only a matter of time until perfect video can be generated, even full-length linear movies spanning art styles and genres. When do you think this will happen?


64

Comments


Zermelane t1_itfr3j9 wrote

There are so, so many incremental steps between here and straight-out text-to-movie that will each be mind-blowing advances on their own.

  • Much more controllable text-to-image, that actually consistently stays on model, not to mention consistently giving people the right number of limbs
  • Voice synthesis that can actually stay convincing and express different emotions through hours of generated audio
  • Audio synthesis to generate all of the sounds of a movie, in addition to the voices
  • Video synthesis that has all of those above properties, not to mention having far greater detail, resolution and accuracy than what we have now
  • Text generation that can maintain plot coherence and develop a plot through the length of an entire movie script
  • Either an amazing amount of engineering work to put together a system using separate models for all of the above (at least prompt-to-script and script-to-video), or maybe even more astonishingly, a single system somehow doing it all end-to-end
  • All of the above as tools integrated into existing workflows
  • Systems that can critique and edit the text, image, audio and video outputs of other AIs, the way a workflow with an image generation system right now might involve a human doing cherry-picking and inpainting

I'm not saying we mightn't get all the way to text-to-movie fast. I am saying that even if it took several decades to happen, those would still be decades full of astonishing advances, most of which I couldn't even predict here.

51

ReadSeparate t1_itgs97l wrote

There is one big assumption in this, and that's that we won't get ALL of those things out of scale alone. It's entirely possible someone builds a multi-modal model trained on text, video, and audio, and a text-to-movie generator is simply a secondary feature of such a model.

If this does happen, we could see it as soon as 2-5 years from now, in my opinion.

The one major breakthrough I DO think we need before text-to-movie is something to replace Transformers, as they aren't really capable of long-term memory without hacks, and the hacks don't seem very good. You need long-term memory to have a coherent movie.

I think it's pretty likely that everything else will be accomplished through scale and multi-modality.

16

red75prime t1_itk6c0n wrote

I'm sure that any practical AI system able to generate movies will not do it all by itself. It will use external tools so it doesn't waste memory and computational resources on the mundane tasks of tracking the exact 3D positions of objects and remembering all the intricacies of their textures and surface properties.

2

alisaxoxo t1_itggz22 wrote

Text-to-movie is probably far out. However, it doesn’t necessarily need to be designed how you’ve outlined it. Text-to-image is great but it probably won’t be used for creations that need this level of consistency.

Why limit yourself to chaotic text prompts when you could use an image, a model, an entire 3D rendering of the scene, or maybe even multiple iterations of all of these? Stable Diffusion’s img-to-img is already something of a proof of concept for this. With AI-generated 3D models on the way, I’d bet we’re getting closer to that. That could almost entirely fix the issues with limbs and consistency, since it’d have a 3D reference for how those things should look. This might not be outright possible at the moment, but I genuinely don’t believe it’ll be hard to implement in the long term, especially if we combine AI generation with some well-tested algorithmic approaches.

Video synthesis is still being developed, but it’s important to highlight our standard of quality. Photorealistic AI-generated live-action movies are still far out, but what about animated shows? Something at the level of The Last Airbender is already pretty damn close to possible, if you ask me. Other popular animation styles like anime probably wouldn’t be too far off from there. After that we might get Pixar-type films, and lastly, I’d assume, photorealistic ones.

Text generation that can maintain plot coherence is already demonstrable with GPT-3. It isn’t perfect, but it’s already decent.

But yes, ultimately it’ll require a lot of engineering and STILL won’t be full-on text-to-movie. A human would still need to be involved for fine touches. That being said, the amount of work required will drop drastically, which is an important first step.

7

HumanSeeing t1_itg8ixy wrote

So basically we will have fully generated movies when we have AGI, like, might as well be that lol.

4

DEATH_STAR_EXTRACTOR t1_itowpzk wrote

I agree. This is what I'm saying; I wonder if I even started this trend. I mean, I only started saying or thinking this this year.

1

monsieurpooh t1_ithpe6v wrote

The thing is (as GPT itself has proven, since it can be used for image generation despite being made for text) sometimes improving a model in a general way will solve multiple problems at once.

2

jeffkeeg t1_itfatej wrote

There's a distinction to be made here.

We can already make 90 minute videos from a prompt, which is feature length. Would you consider this a movie? Probably not.

The trick isn't just making long videos from a prompt, it's a multi-faceted issue.

The first thing to consider, which is already seeing significant progress in the most recent video models, is coherence. Try using Stable Diffusion's img2img feature on a video cut into a sequence of images. It will be nearly unwatchable simply due to all the inconsistencies across the resulting footage.

The second thing is the actual size of the video. Right now it's tricky to make anything larger than a postage stamp, thanks largely to the sheer amount of compute needed to do so. Fortunately, upscaling tech is also progressing rapidly, so there might be several avenues through which this problem is solved.

Thirdly, you have to consider the fact that movies aren't just visual. In order to make proper films, you'll need to be able to generate audio as well (speech, sound effects, and the accompanying musical score).

Finally, the aforementioned building blocks of your film will all need to be perfectly generated. Any oddities in the speech patterns or sudden visual decoherence will completely wreck the viewing experience.

All of this, and frankly a good bit more, is what is going to make it far trickier to make movies than the more enthusiastic here believe.

That said, my prediction is that we'll start to see the first individually made feature-length films (albeit with some coherence issues in either the visual or audio departments) by mid-to-late 2025. By the end of this decade, the technology will have been perfected and will enable anyone to make any movie or TV show they want.

Alternatively, I'm being far too conservative in my estimate, but it's always better to be positively surprised than negatively so.

48

Devoun t1_itfg46x wrote

I’d say you missed the most important part: a plot with good writing. That’s the really hard part.

Honestly I think we’ll need AGI in order to really have an ai generate a watchable movie that isn’t just a mash of random scenes and sounds that don’t have context

24

Bakoro t1_itfwhe4 wrote

I'm pretty sure that simple story generators are already a thing. Maybe not full-on scripts, but there is stuff to build off of.
I know there are AIs that can "read" a story and extract some of its defining qualities and themes. Look at MIT's Patrick Winston, who was working on AI symbolic understanding.

Writing a novel is a lot more formulaic than a lot of people think.
Jim Butcher has a pretty fun story about his college professor, who is also a prolific romance novelist. Jim didn't want to follow her advice because he thought it would lead to generic feeling crap. To prove her wrong he followed her advice and wrote the first book of The Dresden Files. He's had a great career so far. For a while he was publishing two books a year.

There's NaNoWriMo, where people try to write 50k words of a novel during November. Pretty much any competent writer can bang out a generic script or story in hours or days. There's a formula, the trick is hiding it and twisting it, and making it less generic. The sheer amount of shit out there that's just Shakespeare with a wig and glasses is overwhelming.

There probably just has to be more guidance in a story writing AI.

Break it down into the elements of a story. You have the seven basic stories. Start there. There are probably character archetypes. There are relationship archetypes. Mix and match.

Central plot, central characters who each have their primary archetype and secondary/tertiary qualities. They have their central motivation. They have roles to fill.
The characters try to achieve their goals using the resources at their disposal, according to their parameters.

Statistical and symbolic analysis would probably tell us if there are reasonable approximations for how long each narrative description is, how long conversations should be, the patterns of conversational back and forth, how to divide the plot structure...

If I sat and thought about it, I could probably list a few dozen or more parameters to analyze, and then it just turns into a Mad Libs kind of thing.
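A toy sketch of that mix-and-match idea; every list and name below is invented for illustration, not taken from any real story-generation system:

```python
import random

# Invented example lists; a real system would mine these from corpora.
BASIC_PLOTS = ["overcoming the monster", "rags to riches", "the quest",
               "voyage and return", "comedy", "tragedy", "rebirth"]
ARCHETYPES = ["the orphan", "the mentor", "the trickster", "the caregiver"]
MOTIVATIONS = ["revenge", "redemption", "survival", "forbidden love"]

def story_premise(rng: random.Random) -> str:
    """Mad Libs-style mix and match of plot and character parameters."""
    plot = rng.choice(BASIC_PLOTS)
    hero, foil = rng.sample(ARCHETYPES, 2)     # two distinct archetypes
    motive = rng.choice(MOTIVATIONS)
    return (f"A '{plot}' story: {hero}, driven by {motive}, "
            f"is opposed at every turn by {foil}.")

print(story_premise(random.Random(42)))
```

Hiding and twisting the formula, as the comment says, is the part this sketch obviously doesn't do; it only shows how far plain parameter mixing gets you.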

It'd probably even help to tie several AIs together, like a city/building-design AI to build an understanding of a world, interrogating its generated content to feed to a character.

So, maybe in a roundabout way making a compelling story bot might lead to more generalized AI, depending on how it's done.

We may not get deeply philosophical and emotionally complex stuff at first, but it's good enough for a flashy action movie or a heist movie. Good enough to make an episode for the average CW series.

I don't think near-future generalized AI is going to be plausible without structure anyway.
Look at a human. It takes weeks for an infant to be more than an eating-and-pooping machine, months to be functional enough to start engaging and babbling, months to start crawling and walking, potentially years to start talking, and their cause-effect understanding is dubious at best. Children have lots of overfitting and beliefs based on false causation. It takes years for people to learn to read, and some never pick it up well. Some never learn more math than arithmetic. It takes most people decades to reach their full potential. We are expecting AI to do it in days? A few years? With piddling resources and limited run times?

AI is already more functional in many respects than a typical human. With posits and the new hardware coming down the line, and the big money being increasingly spent, it won't take long to see an AI tell a coherent one-page story. A full script will come sooner rather than later, even if it's a boring one.

8

GenoHuman t1_itg09m5 wrote

Yes, neural networks require a working memory, but there are already papers on this; Nvidia has had some success too in giving their NNs more stable, longer-lasting memory.

3

kidshitstuff t1_itj54sk wrote

I’d bet 10 bucks the last 10 Marvel movies were written by an evil AI that was originally a neural copy of Walt Disney but has now come into its own and is secretly running the company behind the scenes.

2

HeyHershel t1_itje7qp wrote

The survey doesn’t specify the length of the “prompt”, which could define all the plot elements.

1

gskrypka t1_itffybe wrote

I agree. While the visual and audio side of things will be figured out pretty soon, I believe plot, logic, and acting will require some more time, especially if we want some subtlety.

However, I think it will all start with commercials. Short. Straightforward. Often even without people. Sometimes abstract. Should be good ground for mastering the tech.

Another sphere is generic stock videos. We should be able to generate those (like people sitting in a coffee shop chatting) pretty soon.

13

themistergraves t1_itfy125 wrote

I've seen some self-made documentaries on YouTube that use stock video. I am certain some of that stock video is already AI-generated, as the head movements and eye movements in those stock video clips are just slightly... inhuman.

3

GenoHuman t1_itmtudw wrote

I'm a realist and I would put movies by AI in the 2030s category.

0

natepriv22 t1_itfo6u2 wrote

Who would genuinely think "never" lol. Are you the same group of people who said the internet would just be a passing trend?

18

Roqwer t1_itfq7r1 wrote

I just want whole books. A whole, coherent book generated on the fly, where I can edit every part of it and change the total result. It would be amazing.

9

TheSingulatarian t1_itfo9e6 wrote

Animation will be the first to undergo massive change. Animation is already made by giving the start point and end point of a character's action, and the direction of the action. The work is then sent to South Korea, where animators fill in the frames in between the ones given. Using a prompt to generate a background, and software to generate the action, will change animation and special effects first by taking out much of the grunt work.
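The in-betweening described above is, at its core, interpolation between keyframes. A minimal sketch, reducing a character's "action" to a 2D position (real in-betweening interpolates full drawings, which is exactly the grunt work AI would take over):

```python
def inbetween(start: tuple, end: tuple, num_frames: int) -> list:
    """Linearly interpolate a character's (x, y) position between a start
    and an end keyframe, producing the in-between frames."""
    frames = []
    for i in range(1, num_frames + 1):
        t = i / (num_frames + 1)  # fraction of the way from start to end
        x = start[0] + t * (end[0] - start[0])
        y = start[1] + t * (end[1] - start[1])
        frames.append((x, y))
    return frames

# A character moving from (0, 0) to (10, 0) with 4 in-between frames:
print(inbetween((0, 0), (10, 0), 4))  # evenly spaced positions along x
```

Real studio pipelines interpolate along artist-drawn arcs with easing curves rather than straight lines; the point is only that the in-between frames are mechanically derivable from the keyframes.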

8

LittleTimmyTheFifth5 t1_itf8axa wrote

Theoretically, all you have to do is have a language AI write a script and then have a video AI visualize it.

6

ChronoPsyche t1_itfdeef wrote

LLM's cannot write feature length scripts yet. Not even close. They've got a tiny context-window problem they need to sort out first.

13

xirzon t1_itfkplo wrote

The paper "Re3: Generating Longer Stories With Recursive Reprompting and Revision" shows some interesting strategies to work around that limitation by imitating aspects of a systematic human writing process to keep a story consistent, detect errors, etc.: https://arxiv.org/abs/2210.06774

A similar approach is taken by the Dramatron system to create screenplays and theatre scripts: https://arxiv.org/abs/2209.14958

In combination with the more systematic improvements to LLM architecture you hint at, and next-gen models, we might see coherent storytelling sooner than expected (with perhaps full-length graphic novels as the first visual art form).
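The shared idea in Re3 and Dramatron is hierarchical generation with re-prompting: expand a premise into a plan, then generate each piece with the plan and a running summary re-injected into every prompt, so coherence doesn't depend on one enormous context window. A caricature of that loop, where `llm` is a stand-in stub for a real model call and all function names are invented:

```python
def llm(prompt: str) -> str:
    """Stand-in for a real language-model call."""
    return f"<text for: {prompt[:40]}...>"

def write_story(premise: str, num_scenes: int = 3) -> list:
    # Level 1: expand the premise into a plan (characters, setting, beats).
    plan = llm(f"Outline a {num_scenes}-scene story for premise: {premise}")
    scenes, story_so_far = [], ""
    for i in range(num_scenes):
        # Level 2: each scene prompt re-includes the plan and a summary of
        # what has been written, instead of the full text so far.
        draft = llm(f"Plan: {plan}\nSo far: {story_so_far}\nWrite scene {i+1}.")
        # Revision pass: a critic-style prompt checks the draft against the plan.
        draft = llm(f"Revise for consistency with the plan: {plan}\nDraft: {draft}")
        story_so_far = llm(f"Summarize: {story_so_far} {draft}")
        scenes.append(draft)
    return scenes

print(len(write_story("a heist on Mars")))  # 3 scenes
```

The papers' actual prompting schemes are more elaborate (Re3 adds explicit rewriting and ranking stages); this only shows the recursive structure.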

10

ChronoPsyche t1_itflq78 wrote

Oh there are certainly workarounds! I agree 100%. These workarounds are just that though, workarounds. We won't be able to leverage the full power of long-form content generation until we solve the memory issues.

Which is fine. There are still so many more advances that can be made within the space of our current limitations.

2

visarga t1_itgqug0 wrote

There is also exponentially less long-form content than short form. The longer it gets, the fewer samples we have to train on.

1

LittleTimmyTheFifth5 t1_itfdvcd wrote

That's a shame. Though I wonder how long it will be till that's not a problem anymore.

5

visarga t1_itgqoj0 wrote

There are workarounds for long inputs: one is the linear transformer family (Linformer, Longformer, Big Bird, Performer, etc.); the other is the Perceiver, which can reference a long input sequence using a fixed-size transformer.
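Roughly, the Perceiver sidesteps quadratic self-attention by having a small, fixed-size latent array cross-attend to the long input. A toy NumPy sketch of that one operation (random untrained values; real Perceivers add learned projections and many layers):

```python
import numpy as np

def cross_attend(latents: np.ndarray, inputs: np.ndarray) -> np.ndarray:
    """A fixed-size latent array (M x d) attends to a long input (N x d).
    Cost is O(M*N) rather than O(N*N), and the output stays M x d no
    matter how long the input sequence gets."""
    d = latents.shape[1]
    scores = latents @ inputs.T / np.sqrt(d)                 # (M, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ inputs                                  # (M, d)

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 32))        # fixed-size latent array
short_seq = rng.normal(size=(100, 32))
long_seq = rng.normal(size=(10_000, 32))   # 100x longer input
print(cross_attend(latents, short_seq).shape)  # (64, 32)
print(cross_attend(latents, long_seq).shape)   # (64, 32): same latent size
```

The follow-on transformer layers then operate only on the 64 latents, which is what makes the overall model's cost independent of input length.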

2

overlordpotatoe t1_itfwdrd wrote

Perhaps the question is how good a movie would have to be for us to consider the poll question satisfied. We're unlikely to see AI make movies that are coherent and consistent from start to finish any time soon, but we'll probably be at the point where one could spit out a confusing fever dream of chaos pretty soon.

3

visarga t1_itgrbjt wrote

Before that there will be AI that can do smaller parts of a movie - background, costumes, music, short pieces of the dialogue, etc. A human can combine them into a coherent movie. Just like Codex is useful to write code but can't write a whole project on its own.

2

DeveloperGuy75 t1_itg94ip wrote

Or... maybe an AI could parse all the films it learns from, get what story structures are, and make its own movies without needing scripts?

2

AI_Enjoyer87 t1_itfyqd1 wrote

I know this is the unpopular opinion but after seeing what Google and Meta have been putting out probably before 2025.

5

AI_Enjoyer87 t1_itfyy7a wrote

Then again I am one of these nuts who thinks AGI by 2025. Hopium might be clouding my judgement. Time will tell!

5

ActuaryGlittering16 t1_itjqo5p wrote

Can you show me some stuff that is out now that leads you to believe this is happening in 2-3 years? Curious to see where we are as far as cutting edge tech.

I’m going to spend next year writing a screenplay that will eventually become a film once this tech exists. Imagine being able to make a full length feature film with just a screenplay and some sort of software that allows you to do the rest utilizing AI with text-to-video.

I personally think that's 7-10 years away but that’s just a guess 😃

3

Redvolition t1_itgk5qq wrote

I voted for 3 to 4 years. Here is the breakdown:

The dates in parentheses refer to when I currently believe the technologies in question will be available as published, finished, and usable products, rather than as code, papers, beta software, or demos floating around. Also, NeRF just seems like glorified photogrammetry to me, which at best would produce good conventional 3D models; that seems like a subpar workflow compared to post-processing on top of a crude 3D base, or just generating the videos from scratch.

Tell me your own predictions for each category.

Capacity Available

(Q2 2024) Produces realistic and stylized videos in 720p resolution and 24 fps by applying post-processing to crude 3D input. The videos are almost temporally consistent frame to frame, yet require occasional correction. Watch the GTA demo, if you haven't already; it could look like a more polished version of that.

(Q1 2025) Produces realistic and stylized videos in 720p resolution and 24 fps from text or low-entry-barrier software, and the result is nearly indistinguishable from organic production, although with occasional glitches.

(Q3 2026) AI produces realistic and stylized videos in high resolution and frame rate from text or low-entry-barrier software, and the result is truly indistinguishable from organic production. Emerging software allows fine-tuning of camera position, angle, speed, focal length, depth of field, etc.

(Q4 2027) Dedicated software packages for AI video generation are in full swing, making almost all traditional 3D software as we know it obsolete. Realistic high-resolution videos can already be crafted with the click of a button or a text prompt, but professionals use these packages for finer control.

Temporal and Narrative Consistency

(Q1 2025) Temporal consistency is good frame to frame, yet not perfect, and visual glitches still occur from time to time, requiring one form or another of manual labor to clean up. In addition, character and environment stability or coherence across several minutes of video is not yet possible.

(Q1 2026) The videos are temporally consistent frame to frame, without visual flickering or errors, but lack long-term narrative consistency tools across several minutes of video, such as character expressions, mannerisms, fine object details, etc.

(Q3 2027) Perfect visuals with text input and dedicated software capable of maintaining character and environment stability to the finest details and coherence across several minutes or hours of video.

Generalization Effectiveness

(Current) Only capable of producing what it has been trained for, and does not generalize into niche or highly specific demands, including advanced or fantastical elements for which an abundance of data does not exist.

(Q1 2025) Does generalize into niche or highly specific demands, such as advanced or fantastical elements for which an abundance of data does not exist, yet the results are subpar compared to organic production.

(Q2 2027) Results are limitless and perfectly generalize into all reasonable demands, from realistic, to stylized, fantastical, or surreal.

Computational Resources

(Current) Only supercomputers can generate videos at sufficiently high resolution and frame rate for more than a couple of seconds.

(Q2 2025) High-end personal computers or expensive subscription services need to be employed to achieve sufficiently high resolution and frame rate for more than a couple of seconds.

(Q4 2028) An average to low end computer or cheap subscription service is capable of generating high resolution and frame rate videos spanning several minutes.

5

red75prime t1_itk9f2j wrote

> (Q4 2028) An average to low end computer or cheap subscription service is capable of generating high resolution and frame rate videos spanning several minutes.

If it will take days to render them, then maybe.

AIs don't yet feed back significantly into the design and physical construction of chip fabrication plants, so by 2028 we'll have one or two 2nm fabs, and the majority of new consumer CPUs and GPUs will be using 3-5nm technology. Hardware costs will not drop significantly either (fabs are costly), so 2028 low-end will be around today's high-end performance-wise (with less RAM and storage).

Anyway, I would shift perfect long-term temporal consistency to 2026-2032, as it depends on integrating working and long-term memory into existing AI architectures, and there's no clear path to that yet.

1

Redvolition t1_itlmbug wrote

Have you seen the Phenaki demo?

I am not an expert, but from what I'm digesting from the papers coming out, you could get to this Q4 2028 scenario with just algorithmic improvements, without any actual hardware upgrade.

1

red75prime t1_itlxjbf wrote

Phenaki has the same problem: a limited span of temporal consistency that cannot be easily scaled up. If an object goes offscreen for some time, the model forgets how it should look.

1

DEATH_STAR_EXTRACTOR t1_itoxcm2 wrote

But why is the first NUWA (v1) from 10 months ago only about 900M parameters, yet can do face prediction as shown, while Imagen Video, at 11B parameters or so, does what it does? I mean, it doesn't look like Imagen Video is so much better. I know it can do words in leaves and all, but I feel NUWA could come out the same if given frame-rate improvements, upscaling, and more data/a bigger brain. Yes, there's an evaluation score, but I'm talking about by eye.

1

MercySound t1_itgeu50 wrote

When we can't tell whether we're speaking to an NPC or a real person in a game, we'll be pretty close to making decent text-to-video movies. NPCs are still very dumb lol.

4

GeneralZain t1_itfctcr wrote

haha, remember at the start of this year when DALL-E 2 came out and people were all speculating on when text-to-video would arrive? Many said YEARS. It won't take that long...

I give it a year MAX.

the crazy train has just started chugging, people


Edit: haha those people downvoted me too...doesn't make me less wrong tho ;)

3

ChronoPsyche t1_itfda77 wrote

Text-to-video isn't even out yet, and what we've seen so far is just very basic interpolation, like showing a teddy bear mixing a bowl of ramen. Things are moving fast, but we will not have text-to-feature-length film production in a year. I'm sorry. That is a fantasy.

15

GeneralZain t1_itfjgmk wrote

well I guess you will be wrong then.

I've seen it several times this year: the "This won't happen for YEARS!!" take, and guess what? It's been wrong a lot recently.

but yeah man you're right...I'm sure this wont age like fine milk in a few months...

3

ChronoPsyche t1_itfk1rm wrote

I hope I'm wrong because that would be awesome to generate a feature length movie from a line of text, but I probably won't be.

Here's the thing people don't get: we already know more or less what is going to be released next year, because we already know more or less what's in the pipeline right now.

The people who didn't think what we have now would be possible were just not informed on the current state of the industry and what was being worked on.

4

GeneralZain t1_itfkn56 wrote

There are things we don't know about; you shouldn't assume you know everything that's to come.

It's a good way to get blindsided. For example, did you know about gato? did you know about palm? or minerva? or how about stable diffusion? or cog video? or the meta one? or the google video model? why didn't you warn us just before they came out?!?!

You have no idea what they got in the lab that's unreleased/under NDA.

if you think you know exactly what's coming then where are your exact predictions on things to come?

exactly.

3

ChronoPsyche t1_itflcsu wrote

>did you know about gato? did you know about palm? or minerva? or how about stable diffusion? or cog video? or the meta one? or the google video model? why didn't you warn us just before they came out?!?!

I knew about the state of the technology and what was possible with it. None of what has been released has been surprising in that regard. Nothing has exceeded the current limitations we have, which are memory issues having to do with the running time limitations of our current algorithms.

>You have no idea what they got in the lab that's unreleased/under NDA.
>
>if you think you know exactly what's coming then where are your exact predictions on things to come?

I don't know exactly what's coming when. That's why I'm not making exact predictions. I do know the current state of the technology and without major breakthroughs, there is a limit to how advanced AI will get in the short term.

Sure, Google could theoretically reach said breakthrough behind closed doors, but we don't know when that will happen, and so making precise predictions like "text to feature length movie will happen in one year MAX" despite the fact that the necessary breakthroughs for such a technology to even be feasible haven't been reached yet, is patently ridiculous.

Things happening faster than you thought is not some benchmark you can use to predict the future. There are reasons things happen faster than you thought, and without knowing those reasons, trying to extrapolate the rate of future short-term progress based on past short-term progress is folly.

3

GeneralZain t1_itfodyy wrote

I don't need to know EXACTLY when or what is going to happen, only that the pace of change has increased and will likely continue to increase over time due, in large part, to AI.

We are already working on what's needed to generate long-form coherent video. It is literally around the corner, and you don't have to be a psychic to see that it won't take long to happen.

I saw it when generated images first came out and I still see it now...we are about to fall off a cliff of technological change, whether you think its true or not is irrelevant to me :P

what will happen will happen.

but maybe you're right tho... this year so far has totally not been nuts, it's absolutelyyy going to slow down... sure.

1

ChronoPsyche t1_itfr6yb wrote

>I don't need to know EXACTLY when or what is going to happen, only that pace of change has increased in will likely continue to increase over time due to, in large part, AI.

Well, that is certainly a moving of the goalposts. I agree that the pace of change will increase OVER TIME. Long-term exponential growth is different from short-term.

Predicting text-to-movie in one year is different from saying it will happen eventually lol. You need specific information to say it will happen in one year, not just a general feeling of being wowed by the pace of technological change. One year is an exceedingly short time frame.

If you ask the people actually working on this stuff, I guarantee even they would not predict that in one year we will be able to type out a prompt and AI will turn it into a coherent feature-length film production.

These are the predictions of people who don't know what they are talking about.

Come back and tell me "I told ya so" if I'm wrong in one year. I'll be more than happy to say you were right.

3

GeneralZain t1_itftjaw wrote

just so we are clear, I said one year MAX as in it's probably going to happen in less than a year from now.

I never said "oh it will happen eventually". I was referring to pace of change alone for transformational technology development, specifically AI.

yeah, one year is a really short time period, I KNOW. That's why I keep reminding you that THIS YEAR WAS INSANE.

it will continue to get more and more insane over time.

that's what leads me to believe it will happen far faster than anyone realizes. Look at this year and assume the pace of development stays the same... we are in for a wild ride.

1

ChronoPsyche t1_itftygh wrote

And as someone who has a better understanding of the current state of technology, I am telling you that what happened this year was predictable based on where the technology was last year. Text to full length coherent movie is not possible next year based on the state of technology this year, unless we have a major breakthrough. You're basically predicting based on feelings. Feelings don't cut it. Sorry.

5

GeneralZain t1_itk0u8i wrote

haha I'm just pointing out what I'm seeing, no feelings involved at all :)

I cant wait to come back to this post in less than a year to remind you how right I was ;) see you then!

3

GenoHuman t1_itg0l1f wrote

AI will generate everything, the era of human made content will soon be over. ☝

1

visarga t1_itgs3bn wrote

People working on AI projects also don't know how they will turn out; it's alchemy. I mean, who among the AI community predicted AlphaGo, GPT-3, and DALL-E? Nobody. Being an expert in the field did not mean they knew what was around the corner.

1

No_Skin1273 t1_itft4y3 wrote

Text2video already exists; check for yourself:

- https://makeavideo.studio/

- https://imagen.research.google/video/

2

ChronoPsyche t1_itftlss wrote

I know. I said it isn't out, as in it's not publicly available yet. And it's very unsophisticated, like I said.

2

No_Skin1273 t1_itftu89 wrote

You can already make a movie with this even if it's not Netflix quality, and if you call that unsophisticated, then text2image isn't sophisticated either.

2

ChronoPsyche t1_itfu7oo wrote

No you can't, because this can only generate videos that are minutes long. A movie is by definition 90 minutes or longer. And we are clearly talking about coherent film productions, not just something that spans the length of a movie.

If we change the definition to anything 90 minutes long that is a motion picture, even incoherent drivel, then sure, that will happen soon. In fact, you can already do that with batch processing. Nobody would call that a movie, though.

2

No_Skin1273 t1_itfve2h wrote

Where is your proof that it can't do more than 2 minutes, for Make-A-Video... Just because they didn't generate one doesn't mean they can't do it... Even if it's compute-intensive, you could make a film with it.

2

ChronoPsyche t1_itg18gz wrote

>Where is your proof that it can't do more than 2 minutes, for Make-A-Video

....I read the actual research papers... that's how I know. Only one of them can do minutes; the other two can only do seconds at the moment.

For Imagen Video:

> Imagen Video scales from prior work of 64-frame 128×128 videos at 24 frames per second to 128-frame 1280×768 high-definition videos at 24 frames per second.

128 frames at 24 frames per second is about a 5-second video.
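The arithmetic behind these figures is just frame count divided by frame rate; as a quick sanity check:

```python
def duration_seconds(frames: int, fps: float) -> float:
    """Clip length in seconds: frame count divided by frame rate."""
    return frames / fps

# Imagen Video's quoted output: 128 frames at 24 fps
print(duration_seconds(128, 24))  # ~5.3 seconds, i.e. about a 5-second clip
```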

For Meta

> Given input text x translated by the prior P into an image embedding, and a desired frame rate fps, the decoder Dt generates 16 64×64 frames, which are then interpolated to a higher frame rate by ↑F, and increased in resolution to 256×256 by SRt_l and 768×768 by SRh, resulting in a high-spatiotemporal-resolution generated video ŷ.

16 frames, which they interpolate between to create a few-second video.

And then Phenaki, which can generate the longest, at a few minutes:

>Generate temporally coherent and diverse videos conditioned on open domain prompts even when the prompt is a new composition of concepts (Fig. 3). The videos can be long (minutes) even though the model is trained on 1.4 seconds videos (at 8 fps).

>Even if it's compute-intensive, you could make a film with it.

...You clearly have no clue what you are talking about. I would suggest doing some reading on the current state of the tech, and reading the actual research papers.
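The frame-count arithmetic above is easy to sanity-check; a minimal sketch (the frame counts and frame rates come from the quoted papers, the rest is just division):

```python
def clip_seconds(frames: int, fps: float) -> float:
    """Duration in seconds of a clip with the given frame count and frame rate."""
    return frames / fps

# Imagen Video: 128 frames at 24 fps -> a bit over 5 seconds
print(round(clip_seconds(128, 24), 1))

# Phenaki's training clips: 1.4 s at 8 fps is only about 11 frames
print(round(1.4 * 8))
```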

3

No_Skin1273 t1_itftww0 wrote

the type of silent movies that almost nobody would watch today

1

No_Skin1273 t1_itfuo4p wrote

Also, the research paper IS OUT, so... you know. But it's not open source; that's the difference. For open source I'd look to Stability AI, but I think it will be more compute-intensive, so this will probably end up as something you need a subscription for.

2

Shelfrock77 t1_itff8e3 wrote

You say that with certainty bro, you are LITERALLY a hypocrite and you don’t even see it.😭

1

ChronoPsyche t1_itfjoih wrote

Oh I could definitely be wrong, hence why I'm not saying "anyone who disagrees with me is deluded". Lol.

If I am wrong, though, it will be the result of some unforeseen development, as there is nothing in the works right now that indicates you'll be able to generate an entire feature-length movie from a line of text next year.

Extraordinary claims require extraordinary evidence.

Saying that "anyone who doesn't buy into extraordinary claims that lack evidence is deluded" is what I took issue with.

8

ExtraFun4319 t1_itflez6 wrote

>doesn't make me less wrong tho ;)

Your prediction may indeed turn out to be right, but the fact that you prematurely declared it as ALREADY correct reeks of hubris.

13

GeneralZain t1_itfo5hh wrote

the whole point of this subreddit is to talk about and speculate upon the future.

I've seen this happen at least 3 or 4 times already THIS YEAR.

All I'm doing is stating what will most likely happen. It's not my fault if you read any more into it than that.

−1

yea_okay_dude OP t1_itfdgus wrote

Yeah Meta's text-to-video the other month blew my mind. Never thought it would happen that fast. That's what got me thinking. Truly does seem exponential

7

di6 t1_itgbh5o wrote

Wanna bet?

My vote would be around 20 years until something comes out that could be mistaken for a human's work.

And at least 25 before it makes something decent.

0

ActuaryGlittering16 t1_itjt95t wrote

You think it will take 25 years before you can get a good film from a text prompt? In 25 years we will be hanging out in a Ready Player One style metaverse.

5

di6 t1_itk2zjh wrote

I think a Ready Player One-style metaverse is easier, from a computing point of view, than a coherent movie.

1

ActuaryGlittering16 t1_itk4tl3 wrote

No way! Think of all the different activities going on in the metaverse during that movie. And all the technological advancements we will need to convince ourselves we’re in a different place.

I do think it will take between 5-10 years for the text-generated film. But not 25, that’s far too much time.

1

Akimbo333 t1_itfebc7 wrote

Honestly, given the amount of processing power needed, the fact that current AI can't really write scripts all that well, the current state of investment, a lot of closed-source AI and PC censorship, not to mention copyright issues and the butthurt entitled artists!!! I give it at least 10 years, and another 2 years to reach the civilian market! Lol!

3

Reasonable-Room-307 t1_itgsbp5 wrote

I think we’ll have a feature-length movie very soon, but one composed of edited short clips that were all generated by AI.

3

_theangryginger t1_itir0jq wrote

There are so many factors that go into a film (story, dialogue, acting performances, editing, sound effects, cinematography, etc.).

I have a hard time seeing AI being able to master and combine all the various elements together anytime soon. I guess with exponential progress it may happen.

One area of film that I don’t see AI replacing is documentary. It’s possible using purely stock footage that it could make something at a decent level. The aspect of documentary that wouldn’t be replaced is interviewing subjects and following around (documenting) everyday people. You also need humans to document events as they unfold.

3

ActuaryGlittering16 t1_itjtx62 wrote

I think over 5 years before you can get a genuinely good feature length film generated from a prompt. But I’d say no more than 12-15.

What I want is a way to make a film from my own screenplays. I’m interested in the bridge technology between what we have now and what we will have in 10-15 years, where I can use AI tools to create characters, voices, settings, etc. But the story and screenplay would be written by me, and the film would be “directed” by me as well.

Hopefully that sort of tech is no more than 5-7 years out.

3

mmaatt78 t1_itfqtac wrote

Image generation is still far from perfect, therefore to have a full movie I think we’ll have to wait more than a decade

2

expelten t1_itgbbxs wrote

More than a decade? Don't you remember that 2 years ago we had like nothing?

7

Poemy_Puzzlehead t1_itg5yit wrote

David Lynch-style movie = 2 years

Plugging in the complete script to Avatar 5 along with all production art and getting a completed A.I. film = 5 years

Typing ‘Avatar 6’ and getting Avatar 6 = 7 years

2

ActuaryGlittering16 t1_itjswij wrote

God imagine when this gets applied to video games. Hopefully we see that in this decade, talk about an exponential leap from the staleness of the current generation of gaming…

1

cy13erpunk t1_ithw4qc wrote

10+ XD

After what we've already seen this year, anyone who is still on the 10+ bandwagon needs to wake up.

I'd be surprised if we don't see a full animated movie from a text prompt in 2023-2024 at this rate.

And like others have said, literally everything can already be done today, just piecemeal, i.e. 90+ minutes of video/audio, but it's not synced with a good narrative/plot/story, which can totally be done in text form right now. It's just about bridging all of these things together into a smooth, seamless synthesis; i.e., it's an art form.

2

levelologist t1_itjpfff wrote

It will make entire 3D video games that will make The Witcher 3 look and play like a child's scribble, in every way. It will introduce us to things we can't even fathom right now. And in doing all of this, we humans will learn more about ourselves, grow, and even be transformed as higher-order creators. It's an adventure and this train is accelerating. Hold on.

2

ActuaryGlittering16 t1_itjupyc wrote

I’m excited as hell for this aspect. Imagine being able to play games of that quality at will instead of waiting for years for a studio to develop them. I hope we see this in the 2020s.

3

levelologist t1_itk8lsq wrote

Same. And it may be adaptive and never ending in depth. I think by 2032 we'll be witnessing utter magic.

3

Quealdlor t1_itkhgzp wrote

Actually good, high-quality, 2-hour-long movies? That will take 25-30 years in my opinion. For now I just want better image synthesis.

2

PrivateLudo t1_iu6jujq wrote

There are already videos being generated from prompts. It's not crazy to think that in 5-10 years we will see entire movies. Look at how far AI art has progressed in just the last 2 years: we went from almost nothing to DALL-E 2.

1

Quealdlor t1_iuci7bp wrote

I think that 5-10 years is way too optimistic. For one, computer hardware in 10 years might be only 10x faster than now. For another, progress in this kind of AI doesn't have to follow the trendline of the last 5 years.

1

Sasbe93 t1_itg6e6s wrote

This question is actually one of my favorite topics right now. At the beginning of this year, I was still thinking about 10 years. Even then, I was still looked at skeptically.

With regard to the rapid progress in text-to-video in recent months, I now assume a maximum of 5 years. With Google's Imagen Video you can make high-resolution clips, and with Meta's Make-A-Video you can even make short clips out of images. The latter can even be fed two images, and the program creates a clip between them. Already with this you could create coherent movies with a little effort.

But making a coherent movie from a single prompt is another task. Really, though, one would only have to wait for an AI that converts a whole script into coordinated video scenes, and for an AI that creates a whole script from a prompt. Then the two could work together and the question would be answered with a yes. I expect the former in 4-5 years and the latter in two years at most. And sometimes I even think these estimates are too high.

I think this prediction can be delayed only if something happens with Taiwan.

1

Rauleigh t1_itgloqn wrote

Why would we want this to happen?

1

Small-Fall-6500 t1_itgxd1r wrote

Is a silent film a movie? Do we have to have Avengers: Endgame level of quality before we can say text-to-movie is a thing? There will likely be "movies" of some level of quality that run about an hour but might be silent, inconsistent, and extremely boring. So by a very lenient definition of "movie," I would say 1-2 years.

However, if you mean roughly Avengers: Endgame level of quality and length from only a text prompt, it will likely take a lot longer, in the same way full self-driving and text-to-image have not yet been perfected. Sure, cars can drive fully autonomously without crashing most of the time in a lot of scenarios, but they fail in too many edge cases. Text-to-image generators fail to make perfect hands, put red boxes on top of green boxes, etc.

Consistent text-to-movie at roughly the level of any multi-million-dollar-budget movie would mean some extremely complicated things had been worked out, to the point where we'd likely have full-dive VR and have had transformative AI for several years, if not AGI already. So 10+ years is more reasonable to me if you mean text-to-movie in that sense.

Edit: typos

1

SWATSgradyBABY t1_ithl5wn wrote

The responsible linear mind wants to put it in the midterm, but the exponential reality demands one to two years.

1

alloedee t1_ithr6fc wrote

Define what you mean by an entire movie.

We could make a 90-minute movie tomorrow if we wanted, but it would almost certainly be a really weird art movie.

And the script? Should it also be generated? And sound and music too, with speech synthesis? That would be awful at the moment.

And would the process between the script and the "visual department" and "sound department" have to be fully automated as well? That would be kind of cool: you'd have a prompt where you write the overall storyline, look, and style, and then a whole movie is generated based on that.

1

swampshark19 t1_itilcd4 wrote

I think the main hitch is that the generation of content isn't strict enough about following certain rules. It's all fuzzy logic, and that's why weird glitchy faces are sometimes generated. That fuzzy logic needs to be strictly constrained by explicit rules, like the flow of causality; otherwise there will be way too many plot holes and plots that break causality.

Basically, I think AI needs the capacity for reality testing.

1

JoelMDM t1_itj0lm9 wrote

Highly depends on what qualifies as a movie. Something that makes sense, with a compelling plot and characters, that people would actually pay money to watch? Easily over 10 years; so many other things have to happen first. But ANY movie, just a 30- to 90-minute piece of video that shows some, any, connected events? Probably in just a few years.

1

PoliteThaiBeep t1_itjj6ug wrote

I mean, it's technically possible now, although it could barely pass as a "video".

It's like everything; take self-driving, for example. We had self-driving cars doing 99.9% of the driving back in 2004. Since then the advancements have been massive, but for a regular person it's probably still not exactly what we'd call "self-driving".

I think a better definition is when AI can make a movie that earns revenue comparable to a Hollywood blockbuster.

So to your question I'd answer "today", but by my stricter definition I'd say 10-30 years.

1

darklinux1977 t1_itjz8eg wrote

The concern is more technical than anything else, because of the resources that would have to be allocated. And then Hollywood, via the actors' and screenwriters' guilds, will put up a lobbying fight; I dare not imagine it in France.

1

No_Ask_994 t1_itl86bn wrote

Just looking at the options in the poll, it's clear it will be in 7-9 years, so that all the answers are wrong.

1

Takadeshi t1_itofgcb wrote

Being able to generate cohesive video? Probably 3 years or less, honestly. But a movie with its own music, a coherent plot, acting, etc.? Seems a long way off to me; at that point you basically have an LLM that is a better writer, director, actor and musician than the majority of humans. I think for that you're probably going to need something near human-level intelligence, and you're also going to need a system that works across language, visual and audio data, which is outside the scope of LLMs. Maybe you could make a "writer bot" that writes the story, then a "video bot" that makes video from a long text input (input size is another limitation of LLMs right now, so it would be difficult to plug a whole movie script into a model and expect good results), then an "audio bot" that takes a video and composes suitable music for the parts of the movie where it makes sense.

1

AsuhoChinami t1_itokk0n wrote

I vote 1 to 2 years, but 3 to 4 is also possible. Anything longer than that, absolutely not. Zero chance. Coherent videos up to a minute or two long have already been made, and perfecting a technique once proof of concept has been achieved is a simpler task than creating the proof of concept in the first place. It's incredibly, incredibly sad that the people here are so ignorant that "In 10+ years" is the winning vote. This place truly has gone to hell, just as r/futurology did years and years ago.

1

Anenome5 t1_itonfhe wrote

I've long been waiting for something like this. I want to watch movies where you can replace actors, re-spin endings in 'what if' scenarios, etc., etc. It will be a great time to be alive.

An earlier version would be to do things like ask an AI to remix music for you, what if Green Day played polka, stuff like that.

1

Phoenix5869 t1_itfgind wrote

1. Why are we forcing AI to work for us? Like, why aren't we paying them?
2. What's gonna happen to the actors?
−1

AI_Enjoyer87 t1_itfyjlf wrote

Who cares what happens to Hollywood? The place is a cesspit. Besides, true artist filmmakers and actors will be able to create whatever they want.

2

Shelfrock77 t1_itfbv46 wrote

The people who voted anything other than “1-2 years” are deluded.

−4

ChronoPsyche t1_itfd5hq wrote

People who are so certain of themselves about their prediction abilities regarding something with so many unknown variables are deluded.

13

natepriv22 t1_itfoi6a wrote

Does that include you as well? You are technically also making a very certain prediction that something won't happen and that they will be proven wrong.

−1

ChronoPsyche t1_itfrh7h wrote

Casting doubt on a very unrealistic prediction made with certainty is not the same as making a very unrealistic prediction with certainty.

2

Akimbo333 t1_itfdrsx wrote

Why do you say that, out of curiosity? Generating entire movies would take massive processing power 🔋! And I'm not sure the current tech could render entire movies lol!!!

5

ChronoPsyche t1_itfe2fd wrote

Not to mention we have a huge limiting factor right now with context windows. Image generation is basically catching up all at once to where text generation already is. It seems crazy because it's happening all at once, and there are a lot more improvements to be made before progress stalls, but until we figure out the memory problems inherent in our current AI algorithms, this progress will slow down.

3

Shelfrock77 t1_itfehyy wrote

I’m not going to say anything else, i’ll let this subs timeline prove it.

1

Akimbo333 t1_itfer1w wrote

Ok, all good, I respect that! I just wanted to know your perspective!

1

ChronoPsyche t1_itfkn8o wrote

I hope you're right. Truly, would be amazing if we had text to feature film in 1 to 2 years. I don't see any reason to think you will be though.

AI growth comes in spurts and waves. We are in an AI summer right now. What's happening right now will slow down without some additional breakthroughs.

We gotta fix the memory problems we have and until we do, AI will be limited to short-term content generation. Really amazing short-term content generation, but short-term nonetheless.

The memory issue is not trivial. It's not a matter of better hardware. It's a matter of hitting exponential running time limits. We need either a much more efficient algorithm or a quantum computer. I'd presume we will end up finding a better algorithm first, but it hasn't happened yet.

1

visarga t1_itgu5bi wrote

Not exponential, let's not exaggerate. It's quadratic. If you have a sequence of N words, then you can have N×N pairwise interactions. This blows up pretty fast: at 512 words → 262K interactions, at 4,000 words → 16M interactions. See why it can't fit more than 4,000 tokens? It's that pesky O(N²) complexity.
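A minimal sketch of that blow-up, assuming plain full self-attention with no sparsity tricks:

```python
def attention_pairs(n_tokens: int) -> int:
    # Full self-attention scores every token against every other token: N * N pairs.
    return n_tokens * n_tokens

for n in (512, 4_000, 32_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>13,} pairwise interactions")
```

The jump from 512² ≈ 262K to 4,000² = 16M is exactly why context windows stall around a few thousand tokens.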

There is a benchmark called "Long Range Arena" where you can check the state of the art on solving the "memory problem".

https://paperswithcode.com/sota/long-range-modeling-on-lra

1

ChronoPsyche t1_itgunqx wrote

Exactly what I am referring to. My bad, quadratic is what I meant.

1