With a gentle tug, the rose comes to life! Google proposes Generative Image Dynamics to bring everything in a still image to life

Original source: Xinzhiyuan

Image source: Generated by Unbounded AI

Look, with a gentle pull, the rose starts to move.

Drag the leaves to the left, and the pine tree sways in the same direction.

Pictures of all kinds of objects can be pulled to life in an instant.

This is the latest research from the Google team, which turns your hand into a "magic finger" that can set anything in a picture in motion with a single touch.

In this paper, Google proposes "Generative Image Dynamics", which models an image-space prior on scene motion and trains a model to predict a "neural stochastic motion texture".

Once trained, the model can even turn a single image into a seamlessly looping video, or into an interactive animation driven by the user.

In the future, artists’ imaginations will no longer be limited by traditional frameworks, and everything will be possible in this dynamic image space.

Everything in the picture comes alive

The movement of everything in the world is multi-modal.

Clothes hanging in the yard sway back and forth in the wind.

Big red lanterns hanging along the street swing in the air.

A kitten sleeps next to the curtain, its belly rising and falling lazily with each breath.

These motions are not fully predictable, yet they follow characteristic patterns: candles flicker in a certain way, trees sway in the wind, leaves rustle...

Looking at a single photo, we can often imagine how the scene was moving at the moment it was taken.

Given the current development of generative models, especially diffusion models, it is possible to model highly rich and complex distributions.

This makes many previously impossible applications feasible, such as generating highly realistic images from text. Beyond images, diffusion models can also be used to model video.

Building on this, the Google team modeled a generative prior over scene motion in image space, that is, over the motion of all pixels in a single image.

The model is trained based on motion trajectories automatically extracted from a large number of real video sequences.

Conditioned on the input image, the trained model predicts a "neural stochastic motion texture": a set of basic motion coefficients that describe the future trajectory of each pixel.

Google researchers limited the scope of their study to real-world scenes with natural swing dynamics, such as trees and flowers blowing in the wind, so they chose Fourier series as the basis function.

Then, a diffusion model is used to predict "neural stochastic motion textures." The model only generates coefficients for one frequency at a time, but coordinates these predictions across different frequency bands.

The resulting frequency-space texture can be converted into dense, long-range pixel motion trajectories, which can be used to synthesize future frames and turn a static image into a realistic animation.
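
To make this concrete, here is a minimal numpy sketch (not code from the paper) of the Fourier-series idea: given K complex coefficients per axis for one pixel, its displacement over T future frames is just a sum of harmonics. The coefficient layout and the choice of frequencies f_k = k/T are illustrative assumptions.

```python
import numpy as np

def pixel_trajectory(coeffs_x, coeffs_y, T):
    """Reconstruct one pixel's displacement over T frames from K complex
    Fourier coefficients per axis (assumed frequencies f_k = k / T)."""
    K = len(coeffs_x)
    t = np.arange(T)                         # future time steps 0 .. T-1
    dx = np.zeros(T)
    dy = np.zeros(T)
    for k in range(K):
        phase = np.exp(2j * np.pi * k * t / T)
        dx += np.real(coeffs_x[k] * phase)   # sum of harmonics along x
        dy += np.real(coeffs_y[k] * phase)   # sum of harmonics along y
    return dx, dy                            # displacement of this pixel at each frame

# Toy example: one dominant low-frequency oscillation plus a weaker second harmonic.
cx = np.array([0.0 + 0.0j, 3.0 + 1.0j, 0.5 - 0.2j])
cy = np.array([0.0 + 0.0j, 1.5 - 0.5j, 0.2 + 0.1j])
dx, dy = pixel_trajectory(cx, cy, T=150)
print(dx[:5], dy[:5])
```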

Next, let’s take a look at how it is implemented.

Technology Introduction

Given a single input image I0, the researchers’ goal is to generate a video of length T that shows dynamic scenes such as trees, flowers, or a candle flame swaying in the breeze.

The system the researchers built consists of two modules: a motion prediction module and an image-based rendering module.

First, the researchers use a latent diffusion model to predict, from the input image I0, a neural stochastic motion texture S: a frequency-domain representation of the motion trajectory of every pixel in the input image.

In the second step, the inverse discrete Fourier transform is used to convert the predicted random motion texture into a series of motion displacement fields.

These motion displacement fields will be used to determine the position of each input pixel at each future time step.
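
The sketch below shows one way this conversion could look in numpy, assuming the texture is stored as complex coefficients of shape (H, W, K) per axis that occupy the K lowest bins of a length-T real spectrum; a single inverse real FFT then yields all T displacement fields at once. The shapes and bin convention are assumptions for illustration, not the paper's exact layout.

```python
import numpy as np

def spectrum_to_displacements(Sx, Sy, T):
    """Convert a stochastic motion texture (complex coeffs, shape (H, W, K))
    into a stack of 2D displacement fields of shape (T, H, W, 2)."""
    H, W, K = Sx.shape
    # Place the K predicted coefficients into the lowest bins of a one-sided
    # spectrum of length T // 2 + 1 (all higher bins stay zero).
    full_x = np.zeros((H, W, T // 2 + 1), dtype=complex)
    full_y = np.zeros((H, W, T // 2 + 1), dtype=complex)
    full_x[..., :K] = Sx
    full_y[..., :K] = Sy
    dx = np.fft.irfft(full_x, n=T, axis=-1)   # (H, W, T) per-pixel x displacement
    dy = np.fft.irfft(full_y, n=T, axis=-1)   # (H, W, T) per-pixel y displacement
    D = np.stack([dx, dy], axis=-1)           # (H, W, T, 2)
    return np.moveaxis(D, 2, 0)               # (T, H, W, 2): one field per frame

# Example: random coefficients for a tiny 4x6 image with K = 3 frequencies.
rng = np.random.default_rng(0)
Sx = rng.normal(size=(4, 6, 3)) + 1j * rng.normal(size=(4, 6, 3))
Sy = rng.normal(size=(4, 6, 3)) + 1j * rng.normal(size=(4, 6, 3))
D = spectrum_to_displacements(Sx, Sy, T=30)
print(D.shape)   # (30, 4, 6, 2)
```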

With these predicted motion fields, the rendering module uses image-based rendering techniques to sample encoded features from the input RGB image and decode them into output frames through an image synthesis network.

Neural Random Motion Texture

motion texture

In previous research, a motion texture is defined as a sequence of time-varying 2D displacement maps D = {Dt | t = 1, ..., T}, where the 2D displacement vector Dt(p) at each pixel coordinate p of the input image I0 specifies where that pixel will be at future time t.

To generate a future frame at time t, the corresponding displacement map Dt can be used to splat the pixels of I0 to their displaced positions p + Dt(p), resulting in a forward-warped image Ît.
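
A bare-bones sketch of such forward warping is shown below, using nearest-pixel splatting; it is only meant to illustrate the operation (and why holes appear), not the paper's actual renderer, which is discussed later.

```python
import numpy as np

def forward_warp(image, disp):
    """Splat each source pixel of `image` (H, W, 3) to its displaced location
    given `disp` (H, W, 2) holding (dx, dy) per pixel. Nearest-pixel version;
    colliding pixels simply overwrite each other."""
    H, W, _ = image.shape
    out = np.zeros_like(image)
    hit = np.zeros((H, W), dtype=bool)            # which output pixels received a value
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.rint(xs + disp[..., 0]).astype(int)   # target column of each source pixel
    yt = np.rint(ys + disp[..., 1]).astype(int)   # target row of each source pixel
    valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
    out[yt[valid], xt[valid]] = image[ys[valid], xs[valid]]
    hit[yt[valid], xt[valid]] = True
    return out, hit                               # `~hit` marks holes left by the warp

# Example: shift a toy image 2 pixels to the right everywhere.
img = np.random.rand(8, 8, 3)
disp = np.zeros((8, 8, 2)); disp[..., 0] = 2.0
warped, hit = forward_warp(img, disp)
print(hit.sum(), "of", hit.size, "output pixels filled")  # the leftmost columns are holes
```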

Random motion texture

As has been demonstrated previously in computer graphics research, many natural motions, especially oscillatory motions, can be described as the superposition of a small set of harmonic oscillators, represented by different frequencies, amplitudes, and phases.

One way to introduce randomness in motion is to integrate noise fields. But as previous research has shown, adding random noise directly to the spatial and temporal domains of predicted motion fields often results in unrealistic or unstable animations.

Furthermore, using the temporal-domain motion texture defined above means that T 2D displacement fields must be predicted to generate a video of T frames. To avoid predicting such a large output representation, many previous animation methods either generate video frames autoregressively or predict each future output frame independently through an additional temporal embedding.

However, neither strategy ensures that the generated video frames are temporally consistent over the long term, and both may produce videos that drift or diverge over time.

To solve these problems, the researchers represent the per-pixel motion texture of the input scene in the frequency domain (i.e., the complete motion trajectories of all pixels) and formulate motion prediction as a multi-modal image-to-image translation task.

The researchers used a latent diffusion model (LDM) to generate a stochastic motion texture consisting of a 2D motion spectrum with 4K channels, where K ≪ T is the number of modeled frequencies; at each frequency, four scalars represent the complex Fourier coefficients in the x and y dimensions.

The image below shows these neural random motion textures.

So how should the K output frequencies be chosen? Previous research on real-time animation has shown that most natural oscillatory motion consists mainly of low-frequency components.

To test this hypothesis, the researchers calculated the average power spectrum of motion extracted from 1,000 randomly sampled 5-second real video clips. As shown in the left picture below, the power is mainly concentrated on the low-frequency components.

The power spectrum of the motion falls off roughly exponentially with increasing frequency, which shows that most natural oscillatory motion can indeed be well represented by low-frequency terms.

In practice, the researchers found that the first K = 16 Fourier coefficients were sufficient to faithfully reproduce original natural movements in a range of real videos and scenes.
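
The kind of check this implies can be sketched as follows: take per-pixel displacement signals (here replaced by a synthetic stand-in, since the real ones come from tracked video) and measure how much spectral power the first K bins capture. The synthetic data and the 150-frame window are illustrative assumptions.

```python
import numpy as np

def low_freq_energy_fraction(trajectories, K):
    """trajectories: (N, T) array of per-pixel 1D displacement signals.
    Returns the fraction of total spectral power in the first K non-DC bins."""
    spec = np.fft.rfft(trajectories, axis=-1)
    power = np.abs(spec) ** 2
    power[:, 0] = 0.0                      # ignore the DC (static offset) term
    return power[:, 1:K + 1].sum() / power.sum()

# Synthetic stand-in for trajectories from 5 s clips at 30 fps (T = 150):
rng = np.random.default_rng(1)
T, N = 150, 2000
t = np.arange(T)
# Low-frequency swaying plus a little broadband noise.
traj = (np.sin(2 * np.pi * rng.uniform(0.5, 3, (N, 1)) * t / T
               + rng.uniform(0, 2 * np.pi, (N, 1)))
        * rng.uniform(0.5, 4, (N, 1))
        + 0.05 * rng.normal(size=(N, T)))
print(f"power captured by the first 16 bins: {low_freq_energy_fraction(traj, 16):.3f}")
```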

Using a diffusion model to predict motion

The researchers chose the latent diffusion model (LDM) as the core of their motion prediction module because LDMs are more computationally efficient than pixel-space diffusion models while maintaining generation quality.

A standard LDM mainly includes two modules:

  1. A variational autoencoder (VAE) compresses the input image into the latent space through the encoder z = E(I), and then reconstructs the input from the latent features through the decoder I = D(z).

  2. A diffusion model based on U-Net, which learns to iteratively denoise latent features starting from Gaussian random noise.

Rather than being applied to input images, the researchers' training is applied to stochastic motion textures extracted from real video sequences, which are encoded and then diffused for n steps under a predefined variance schedule to produce noisy latent variables zn.
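
A minimal sketch of that forward-diffusion step, using the standard DDPM formulation with a linear variance schedule; the latent below is a random stand-in for a VAE-encoded motion texture, and the schedule parameters are assumptions rather than the paper's settings.

```python
import numpy as np

def make_linear_schedule(N, beta_start=1e-4, beta_end=0.02):
    """Predefined variance schedule; returns cumulative alpha-bar values."""
    betas = np.linspace(beta_start, beta_end, N)
    return np.cumprod(1.0 - betas)

def q_sample(z0, n, alpha_bar, rng):
    """Diffuse clean latents z0 for n steps:
    z_n = sqrt(abar_n) * z0 + sqrt(1 - abar_n) * eps."""
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar[n]) * z0 + np.sqrt(1.0 - alpha_bar[n]) * eps, eps

# Stand-in "latent" of an encoded motion texture (the real model uses a VAE encoder).
rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 32, 32))            # latent features of one motion texture
alpha_bar = make_linear_schedule(N=1000)
zn, eps = q_sample(z0, n=500, alpha_bar=alpha_bar, rng=rng)  # noisy latent for the U-Net
print(zn.shape)
```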

Frequency-adaptive normalization

The researchers observed a problem: stochastic motion textures have particular distribution properties across frequencies. As shown in the left panel of the figure above, the amplitudes of the motion texture coefficients range from 0 to 100 and decay roughly exponentially with increasing frequency.

Since diffusion models require output values between 0 and 1 for stable training and denoising, researchers must normalize the S coefficients extracted from real videos before training with them.

If the researchers scale the magnitude of the S coefficients to [0,1] based on the image width and height, then almost all the coefficients will be close to zero at higher frequencies, as shown in the figure above (right).

A model trained on such data can produce inaccurate motion: when the normalized S coefficients are very close to zero, even small prediction errors lead to large relative errors after denormalization during inference.

To solve this problem, the researchers employed a simple but effective frequency-adaptive normalization technique: the Fourier coefficients at each frequency are normalized independently, based on statistics computed from the training set.
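
In the spirit of that description, per-frequency normalization could look like the sketch below, where each frequency band is scaled by a high percentile of its coefficient magnitudes over the training set. The specific statistic (95th percentile) is an assumption for illustration, not necessarily the paper's exact choice.

```python
import numpy as np

def fit_freq_scales(coeff_batch, percentile=95):
    """coeff_batch: (N, K, C) real Fourier coefficients from training videos.
    Returns one scale per frequency so each band is normalized independently."""
    mags = np.abs(coeff_batch)                                      # (N, K, C)
    per_freq = mags.transpose(1, 0, 2).reshape(mags.shape[1], -1)   # (K, N*C)
    scales = np.percentile(per_freq, percentile, axis=1)            # one value per frequency
    return np.maximum(scales, 1e-8)                                 # avoid division by zero

def normalize(coeffs, scales):
    """coeffs: (K, C). Divide each frequency band by its own scale."""
    return coeffs / scales[:, None]

def denormalize(coeffs_norm, scales):
    return coeffs_norm * scales[:, None]

# Toy training set whose coefficient magnitudes decay with frequency.
rng = np.random.default_rng(2)
K, C, N = 16, 4, 500
decay = np.exp(-np.arange(K))[None, :, None]
train = rng.normal(size=(N, K, C)) * 50 * decay
scales = fit_freq_scales(train)
x_norm = normalize(train[0], scales)
print(np.abs(x_norm).max(axis=1))   # each band now has comparable magnitude (~O(1))
```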

Frequency-coordinated denoising

A straightforward way to predict a stochastic motion texture S with K frequency bands is to output a tensor with 4K channels from a standard diffusion U-Net.

However, training a model to produce such a large number of channels often produces overly smooth and inaccurate outputs.

Another approach is to predict the motion spectrum at each frequency independently by injecting additional frequency embeddings into the LDM, but this leads to uncorrelated predictions across frequencies and thus unrealistic motion.

Therefore, the researchers proposed the frequency-coordinated denoising strategy shown in the figure below. Specifically, given an input image I0, an LDM is first trained to predict a motion texture map with four channels for each individual frequency, with additional frequency embeddings injected into the network alongside the diffusion time-step embeddings.
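
The sketch below illustrates one common way such conditioning can be built: a sinusoidal embedding of the frequency band index is combined with the diffusion time-step embedding and fed to the denoiser, which is run once per band. The embedding scheme and the way the two vectors are summed are conventional assumptions, not details taken from the paper.

```python
import numpy as np

def sinusoidal_embedding(index, dim):
    """Standard sinusoidal embedding of a scalar index (time step or frequency band)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = index * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def conditioning_vector(diffusion_step, freq_band, dim=128):
    """Combined conditioning: the denoiser sees which noise level it is at AND
    which of the K frequency bands it is currently predicting."""
    return sinusoidal_embedding(diffusion_step, dim) + sinusoidal_embedding(freq_band, dim)

# The same network is run K times per image, once per frequency band:
K = 16
for k in range(K):
    cond = conditioning_vector(diffusion_step=500, freq_band=k)
    # eps_hat = unet(z_noisy, cond)   # placeholder: the actual U-Net is not shown here
print(cond.shape)   # (128,)
```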

Image-based rendering

The researchers further describe how to render a frame Ît at a future time t using the stochastic motion texture S predicted for a given input image I0. First, an inverse temporal FFT (fast Fourier transform) is applied to S to compute the motion trajectory field of each pixel p.

These motion trajectory fields determine the position of each input pixel at every future time step. To generate the future frame It, the researchers employ a deep image-based rendering technique and perform forward warping (splatting) of the encoded I0 using the predicted motion field, as shown in the figure below.

Since forward warping can leave holes in the image, and multiple source pixels can map to the same output 2D location, the researchers adopted the feature pyramid softmax splatting strategy previously proposed in frame interpolation research.
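
Below is a minimal single-scale sketch of softmax splatting (the paper applies the idea to a feature pyramid): each source pixel contributes to its target location with weight exp(w), and the accumulated contributions are normalized, so colliding pixels blend instead of overwriting one another. The per-pixel weight used here is an arbitrary placeholder.

```python
import numpy as np

def softmax_splat(feat, disp, weight):
    """feat: (H, W, C) source features; disp: (H, W, 2) flow; weight: (H, W) score
    (e.g. related to depth or importance). Returns the softmax-weighted forward warp."""
    H, W, C = feat.shape
    num = np.zeros((H, W, C))
    den = np.zeros((H, W))
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.rint(xs + disp[..., 0]).astype(int)
    yt = np.rint(ys + disp[..., 1]).astype(int)
    valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
    w = np.exp(weight - weight.max())             # stabilized exp of the importance score
    # Accumulate weighted features and weights at each target pixel.
    np.add.at(num, (yt[valid], xt[valid]), w[valid, None] * feat[valid])
    np.add.at(den, (yt[valid], xt[valid]), w[valid])
    out = num / np.maximum(den, 1e-8)[..., None]
    return out, den > 0                           # warped features and a coverage mask

feat = np.random.rand(8, 8, 16)                   # pretend these are encoded features of I0
disp = np.zeros((8, 8, 2)); disp[..., 0] = 1.5
weight = np.random.rand(8, 8)
warped, covered = softmax_splat(feat, disp, weight)
print(warped.shape, covered.mean())
```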

The feature extractor and synthesis network are trained jointly on start and target frames randomly sampled from real videos: the encoded features of I0 are warped using the flow field estimated from I0 to It, and the predicted Ît is supervised with a VGG perceptual loss.

As shown above, our motion-aware feature warping produces a frame without holes or artifacts compared to direct average warping and baseline depth warping methods.

Further extended applications

The researchers further demonstrated applications that add dynamic effects to a single static image using the proposed motion representation and animation pipeline.

Image to video

The system animates a single static image by first predicting a neural stochastic motion texture from the input image and then applying the image-based rendering module to the motion displacement fields derived from that texture.

Because scene motion is modeled explicitly, slow-motion videos can be generated by linearly interpolating the motion displacement fields, and the animated motion can be amplified (or attenuated) by adjusting the amplitude of the predicted stochastic motion texture coefficients.
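
Both knobs are simple to express; the sketch below (with array shapes matching the earlier sketches, an assumption rather than the paper's interface) interpolates between consecutive displacement fields for slow motion and scales the spectral coefficients to amplify or dampen the motion.

```python
import numpy as np

def interpolate_fields(D, factor):
    """D: (T, H, W, 2) displacement fields. Linearly interpolate `factor`-times
    more time steps between consecutive fields to play the motion in slow motion."""
    T = D.shape[0]
    ts = np.linspace(0, T - 1, (T - 1) * factor + 1)
    lo = np.floor(ts).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (ts - lo)[:, None, None, None]
    return (1 - frac) * D[lo] + frac * D[hi]

def scale_motion(Sx, Sy, amplitude):
    """Amplify (>1) or dampen (<1) the animation by scaling the predicted
    stochastic motion texture coefficients before the inverse FFT."""
    return Sx * amplitude, Sy * amplitude

D = np.random.randn(30, 4, 6, 2)
D_slow = interpolate_fields(D, factor=4)
print(D_slow.shape)   # (117, 4, 6, 2): 4x slower playback
```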

Seamless Loop

Sometimes it is useful to generate videos with seamless looping motion, meaning there is no discontinuity in appearance or motion between the beginning and end of the video.

Unfortunately, it is difficult to find a large training set of seamlessly looping videos. The researchers therefore devised a method that uses their motion diffusion model, trained on regular non-looping video clips, to produce seamlessly looping videos.

Inspired by recent research on guided image editing, their approach is a motion self-guidance technique that uses explicit loop constraints to guide the motion denoising sampling process.

Specifically, during each iterative denoising step in the inference phase, the researchers add an extra motion guidance signal alongside the standard classifier-free guidance, forcing each pixel's position and velocity at the start and end frames to be as similar as possible.
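
Conceptually, the loop constraint can be written as a penalty on the decoded motion, as in the sketch below: it compares each pixel's position and velocity at the start and end of the clip, and its gradient could be used to steer each denoising step. The exact loss form and how it is weighted against classifier-free guidance are assumptions; the sampler itself is not shown.

```python
import numpy as np

def loop_guidance_loss(D):
    """D: (T, H, W, 2) displacement fields decoded from the current denoised estimate.
    Penalize differences in per-pixel position and velocity between the first and
    last frames so the clip can be played as a seamless loop."""
    pos_gap = D[-1] - D[0]          # end-vs-start position mismatch
    vel_start = D[1] - D[0]         # velocity near the start
    vel_end = D[-1] - D[-2]         # velocity near the end
    vel_gap = vel_end - vel_start
    return np.mean(pos_gap ** 2) + np.mean(vel_gap ** 2)

D = np.random.randn(30, 4, 6, 2)
print(loop_guidance_loss(D))   # the sampler would step against the gradient of this value
```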

Generate interactive animations from a single image

The image-space motion spectrum observed in a video of an oscillating object approximates the object's physical vibration modal basis.

Modal shapes capture the oscillatory dynamics of an object at different frequencies, so an image-space projection of the object's vibration modes can be used to model its response to a user-defined force such as a poke or a pull.

Therefore, the researchers used a previously studied modal analysis technique, which assumes that the motion of an object can be explained by the superposition of a set of resonators.

This allows the researchers to write the image-space 2D motion displacement field of the object's physical response as a sum of the Fourier spectrum coefficients weighted by the complex modal coordinates at each simulation time step t.
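
A rough sketch of that modal-superposition idea: treat each frequency band as a damped harmonic oscillator driven by the user's force, integrate its complex modal coordinate over time, and weight the image-space spectral coefficients by those coordinates. The oscillator parameters, the way the force is injected, and the semi-implicit Euler integration are all illustrative assumptions.

```python
import numpy as np

def simulate_modal_response(Sx, Sy, freqs, force, steps, dt=1.0 / 30, damping=0.1):
    """Sx, Sy: (H, W, K) complex spectral coefficients; freqs: (K,) modal frequencies in Hz;
    force: (fx, fy) impulse applied at t = 0. Returns (steps, H, W, 2) displacement fields."""
    K = len(freqs)
    q = np.zeros(K, dtype=complex)        # complex modal coordinates
    qdot = np.zeros(K, dtype=complex)
    qdot += force[0] + 1j * force[1]      # crude impulse: the same push into every mode
    omega = 2 * np.pi * np.asarray(freqs)
    out = []
    for _ in range(steps):
        # Damped harmonic oscillator, semi-implicit (symplectic) Euler:
        # q'' = -omega^2 q - 2 * damping * omega * q'
        qddot = -(omega ** 2) * q - 2 * damping * omega * qdot
        qdot = qdot + dt * qddot
        q = q + dt * qdot
        dx = np.real(Sx * q[None, None, :]).sum(axis=-1)   # weight spectra by modal coords
        dy = np.real(Sy * q[None, None, :]).sum(axis=-1)
        out.append(np.stack([dx, dy], axis=-1))
    return np.stack(out)

rng = np.random.default_rng(3)
Sx = rng.normal(size=(4, 6, 8)) + 1j * rng.normal(size=(4, 6, 8))
Sy = rng.normal(size=(4, 6, 8)) + 1j * rng.normal(size=(4, 6, 8))
D = simulate_modal_response(Sx, Sy, freqs=np.linspace(0.5, 4, 8), force=(1.0, 0.5), steps=60)
print(D.shape)   # (60, 4, 6, 2)
```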

Experimental Evaluation

The research team quantitatively compared the new method with baseline methods on a test set of unseen video clips.

It was found that Google's approach significantly outperformed previous single-image animation baselines in both image and video synthesis quality.

Specifically, Google’s FVD and DT-FVD distances are much lower, indicating that the videos generated by this method are more realistic and temporally coherent.

Furthermore, Figure 6 shows the sliding window FID and sliding window DT-FVD distances of videos generated by different methods.

Because Google employs a global stochastic motion texture representation, its approach generates videos that are more consistent over time and do not drift or degrade over time.

In addition, the Google team qualitatively compared the videos generated by its method and by the baselines in two ways.

First, X-t spatiotemporal slices of the generated videos are shown, as in Figure 7.

The dynamics of the Google-generated videos are more similar to the motion patterns observed in the corresponding real reference videos (second column). Baselines such as random I2V and MCVD cannot realistically simulate appearance and movement over time.

Second, the predicted frames Ît and their corresponding motion displacement fields at time t = 128 are visualized, giving a qualitative comparison of the quality of the generated frames and motion across methods.

Google's method produces frames with fewer artifacts and distortions than the other methods, and its 2D motion fields are most similar to the reference displacement fields estimated from the corresponding real videos.

Ablation study: Table 2 shows that all simpler or alternative configurations perform worse than the full model.

About the author

Zhengqi Li

Zhengqi Li is a research scientist at Google Research. His research interests include 3D/4D computer vision, image-based rendering, and computational photography, especially for in-the-wild images and video. He received his PhD in Computer Science from Cornell University, where he was advised by Noah Snavely.

He is the recipient of a CVPR 2019 Best Paper Honorable Mention, the 2020 Google PhD Fellowship, the 2020 Adobe Research Fellowship, the 2021 Baidu Global Artificial Intelligence Top 100 China Rising Star Award, and a CVPR 2023 Best Paper honorable mention.

