
First Text-to-Speech, Then Text-to-Image, Now Text-to-3D-Animation

After seeing what I’ve just seen, you can consider (what I laughingly call) my mind to be well and truly blown. My poor old noggin is now full of ideas, each one triggering a cascade of considerations. Some of these meandering musings may even be germane to what I’m about to reveal. As usual, of course, we will all have to weed through my rambling waffling and make our own decisions as to what is relevant… or not… as the case might be.

When I was a kid, the height of sophistication with respect to children’s television entertainment was The Flower Pot Men. This featured two little men called Bill and Ben who were made from flowerpots. Each lived in a big flowerpot at the bottom of an English suburban garden. Between their flowerpot homes was a third character called Little Weed. All three were puppets. Even though you could see the strings, I still thought they were real, living creatures. This show was presented in glorious black-and-white. If you are English and in a nostalgic frame of mind, you can pause, peruse, and ponder the very first episode, Seeds, on YouTube.

As an aside, I bet the creators of this program in 1952 would never have thought in a thousand years that it would still be available for anyone in the world (apart from people’s paradises like China, Russia, and North Korea, of course) to watch 72 years in their future on devices like smartphones and personal computers connected to the globe-spanning internet.

Later, circa the early 1960s, I used to love cartoons like Popeye the Sailor, The Flintstones, The Bugs Bunny Show, Top Cat, The Yogi Bear Show, Deputy Dawg, and The Jetsons, to name but a few. I don’t know if these all started in black-and-white or if they were in color. This is because anyone we knew who had a TV at all had only a black-and-white model.

The thing about these cartoons was that they were all painstakingly hand-created on a frame-by-frame basis. This took lots of people, with some drawing the backgrounds, others creating the outlines of the characters, and still others filling/shading (or coloring) those outlines. I’m a little fluffy about the details, but I think it was sometime in the 1970s that digital computers started to be used to “fill in the gaps.” By this I mean that an animator could draw a character at the beginning of a motion, like jumping in the air, and again at the end of the motion, and then a computer could be used to automatically interpolate and generate the intermediate frames.
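Just for giggles, here’s the nub of that “in-betweening” (or “tweening”) idea expressed as a wee Python sketch. I hasten to add that this is only a conceptual illustration of my own devising—the pose values are made up, and real animation tools use splines and easing curves rather than straight lines—but it shows how a computer can conjure up the intermediate frames between two key poses.

```python
def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between values a and b for t in [0, 1]."""
    return a + (b - a) * t

def tween(start_pose: dict, end_pose: dict, num_frames: int) -> list:
    """Generate the in-between poses separating two keyframes."""
    frames = []
    for i in range(1, num_frames + 1):
        t = i / (num_frames + 1)  # fraction of the way from start pose to end pose
        frames.append({joint: lerp(start_pose[joint], end_pose[joint], t)
                       for joint in start_pose})
    return frames

# Illustrative key poses for a character jumping into the air
crouch = {"height": 0.0, "knee_angle": 90.0}   # start of the jump
apex   = {"height": 1.2, "knee_angle": 170.0}  # top of the jump
print(tween(crouch, apex, num_frames=10))      # ten computer-generated in-betweens
```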

I still like 2D cartoons as an art form—I’m not sure if they are easier for young kids to understand than their 3D counterparts—but I have to say that I really love 3D animations. The first fully 3D animated series (released direct to video) was VeggieTales, which came out in 1993. I just watched the first episode, Where’s God When I’m S-Scared?, on YouTube.

Meanwhile, the first 3D animated feature film was Toy Story, which took the public consciousness by storm in 1995. There was a lot of behind-the-scenes wrangling about this film, including production shutdowns and a complete transformation of many of the characters. Most of what I vaguely remember about this I learned in the Steve Jobs biography by Walter Isaacson.

I know that computers were so limited in memory and raw computational power at that time (which is only around 30 years ago as I pen these words) that the animators had 100+ computers running 24 hours a day. Each frame could take anywhere from 45 minutes to 30 hours to render depending on how complex it was. As a result, Pixar was able to render less than 30 seconds of film per day. Furthermore, they didn’t have the computational capability or time to generate shadows (did you even realize that there are no shadows in Toy Story 1?). Ultimately, Toy Story required 800,000 machine hours and 114,240 frames of animation in total. These were divided across 1,561 shots that totaled over 77 minutes of finished film. 

Now, of course, we are used to seeing 3D graphics—with shadows—being rendered on the fly for such applications as computer games, virtual reality (VR), and mixed reality (MR) (see Are You Ready for Mixed Reality?).

And, of course, artificial intelligence (AI) is now making its presence felt all over the place. For example, when I attended Intel Architecture Day 2021 (see Will Intel’s New Architectural Advances Define the Next Decade of Computing?), out of the myriad things that boggled my brain, one that really stuck out was that—instead of rendering computer games at 4K resolution—they had a graphics chip that could render at 1080p and then use on-chip AI to upscale to 4K on a frame-by-frame basis… in real-time!!! As you can see in this video, the results are astonishingly good.
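Since we’re talking techy stuff, the shape of this trick is easy to sketch in code. Please note that this is just a toy illustration of the pipeline, not Intel’s implementation; where the real thing runs a trained super-resolution neural network on dedicated silicon, my stand-in below simply performs a classical Lanczos resize using the Pillow library, and the function names are my own invention.

```python
from PIL import Image

def render_frame_1080p() -> Image.Image:
    """Stand-in for the game engine rendering a 1920x1080 frame."""
    return Image.new("RGB", (1920, 1080), color=(30, 30, 60))

def upscale_to_4k(frame: Image.Image) -> Image.Image:
    """Placeholder for the AI upscaler; a learned model would go here."""
    return frame.resize((3840, 2160), resample=Image.LANCZOS)

frame_1080p = render_frame_1080p()     # cheap: one quarter of the pixels of native 4K
frame_4k = upscale_to_4k(frame_1080p)  # per-frame upscale to 3840x2160
print(frame_4k.size)                   # (3840, 2160)
```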

One of the first in a growing suite of “text-to-xxx” applications was text-to-speech. Noriko Umeda et al. developed the first general English text-to-speech system in 1968 at the Electrotechnical Laboratory in Japan. On the one hand, this was amazing; on the other hand, it was only a hint of a sniff of a whiff of what was to come. Consider today’s Generative Voice AI offering from the guys and gals at ElevenLabs, for example. Bounce over to their website and play a few samples. Personally, I wouldn’t be able to tell whether this was a person or a program doing the talking.
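I’m certainly not going to try to reproduce ElevenLabs’ neural wizardry here, but if you fancy a quick (and far more robotic) taste of programmatic text-to-speech, the open-source pyttsx3 Python library, which simply wraps whatever voices your operating system already provides, requires only a handful of lines. Think of this as a minimal sketch rather than anything approaching the state of the art.

```python
import pyttsx3

engine = pyttsx3.init()            # SAPI5 on Windows, NSSpeechSynthesizer on macOS, eSpeak on Linux
engine.setProperty("rate", 160)    # speaking rate in words per minute
engine.say("Was it Bill, or was it Ben?")  # any text you like
engine.runAndWait()                # block until the speech has finished
```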

A more recent development is text-to-image, such as the Generative AI Stable Diffusion model. Almost unbelievable, at least to me, is the fact that, as I wrote in Generative AI Is Coming to the Edge, it’s now possible to get your own personal Stable Diffusion running on a USB-based “stick” equipped with 16GB of memory and an Ara-2 AI chip from the chaps and chapesses at Kinara. I’m hoping to lay my hands on one of these bodacious beauties to help me create pencil sketch illustrations for the Life of Clive book I’m currently writing to tell the tale of my formative years.
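If you’d like to play along at home while waiting for one of those sticks, here’s a hedged sketch of running Stable Diffusion locally by means of Hugging Face’s diffusers library. The model ID, prompt, and file name below are purely illustrative assumptions on my part, and this runs on a regular GPU rather than Kinara’s Ara-2.

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the model weights (a few GB) and move the pipeline to the GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative model ID
    torch_dtype=torch.float16,
).to("cuda")  # use "cpu" if you have no GPU and lots of patience

prompt = "pencil sketch of a small boy building a crystal radio set, 1960s England"
image = pipe(prompt).images[0]          # generate a single image from the text prompt
image.save("life_of_clive_sketch.png")
```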

And so, finally, we come to the crux of this column. 3D animation content consumption has increased dramatically over the past few years. In fact, the 3D animation market is expected to grow at a CAGR of nearly 12%, reaching more than $62B by 2032. Furthermore, the way in which 3D animation content is created is evolving rapidly thanks to new technologies that allow anyone, on any device, to generate animations from video or from simple text in the form of text-to-3D-animation.

This is the point where I’d like to introduce you to a company called DeepMotion. I was just chatting with Kevin He, who is the Founder and CEO. Prior to DeepMotion, Kevin served as CTO of Disney’s mobile game studio, Technical Director at ROBLOX, and Senior Engine Developer on World of Warcraft at Blizzard, so he knows a thing or two.

The tagline on DeepMotion’s website is “Bringing Digital Humans to Life With AI.” Their first offering was Animate 3D, which uses AI to create 3D animations from video. All I can say is that you must see this to believe it, so it’s fortunate that I’m in a position to show you a video.

To be honest, if this were all DeepMotion had to offer, I’d still say it’s more than enough. I’m gasping in astonishment and squealing in delight, but there’s more. The folks at DeepMotion have recently announced their text-to-3D-animation offering in the form of SayMotion. Yes, of course there’s a video.

This really is rather amazing. You select a character, type in a text prompt, and “Bob’s your uncle” (or aunt, depending on your family dynamic). I’m speechless, which isn’t something I expect to say often (no pun intended), so I’ll turn things over to you. Do you have any thoughts you’d care to share on any of this?

5 thoughts on “First Text-to-Speech, Then Text-to-Image, Now Text-to-3D-Animation”

  1. And “text-to-article” is what, maybe a couple of weeks or months away? 🙄

    AI engines are cropping up like mushrooms after the rain.
    I’ve just seen groq do this:
    [Groq Labs: Project Know-It-All](https://www.youtube.com/watch?v=QE-JoCg98iU)

