With a gentle tug, the rose comes to life! Google proposes Generative Image Dynamics, and from now on everything in a picture can move
Original source: Xinzhiyuan
Look, with a gentle pull, the rose starts to move.
It has finally been done: by interacting with a single image, you can even generate an endlessly looping video.
In the future, artists’ imaginations will no longer be limited by traditional frameworks, and everything will be possible in this dynamic image space.
Everything in the picture comes alive
The movement of everything in the world is multi-modal.
The clothes hanging in the yard sway back and forth in the wind.
Picking up a photo, one can imagine how the scene was moving at the moment it was taken.
Given the current development of generative models, especially diffusion models, it is possible to model highly rich and complex distributions.
This makes possible many applications that were previously out of reach, such as generating arbitrarily realistic images from text. Beyond images, diffusion models can also be used to model video.
The model is trained based on motion trajectories automatically extracted from a large number of real video sequences.
Conditioned on the input image, the trained model predicts a "neural stochastic motion texture": a set of basic motion coefficients that describe the future trajectory of each pixel.
This neural stochastic motion texture is predicted by a diffusion model that generates coefficients for one frequency at a time while coordinating the predictions across different frequency bands.
The resulting frequency-space texture can then be converted into dense, long-range pixel motion trajectories, which are used to synthesize future frames and turn a static image into a realistic animation.
Technology Introduction
Starting from a single image, the system the researchers built consists of two modules: a motion prediction module and an image-based rendering module.
First, a latent diffusion model takes the input image and predicts its neural stochastic motion texture.
In the second step, an inverse discrete Fourier transform converts the predicted stochastic motion texture into a sequence of motion displacement fields.
These displacement fields determine where each input pixel ends up at each future time step.
With these predicted motion fields, the rendering module uses image-based rendering techniques: it warps encoded features from the input RGB image and decodes the warped features into output frames through an image synthesis network.
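Put together, the pipeline can be summarized in a short sketch. The function names below (predict_motion_spectrum, render_frame) and the exact tensor shapes are illustrative stand-ins, not the authors' code:

```python
import numpy as np

def animate(image, predict_motion_spectrum, render_frame, num_frames=150):
    """Two-module pipeline sketch: motion prediction followed by image-based rendering.

    `predict_motion_spectrum` stands in for the latent diffusion model: it maps an
    RGB image (H, W, 3) to per-pixel complex Fourier coefficients of shape
    (K, H, W, 2), i.e. one complex number per modeled frequency for x and y motion.
    `render_frame` stands in for the rendering module: it warps encoded features of
    `image` by a displacement field (H, W, 2) and decodes an output frame.
    """
    spectrum = predict_motion_spectrum(image)                # (K, H, W, 2), complex

    # An inverse FFT along the frequency axis turns the spectrum into per-pixel
    # displacement trajectories for all future time steps; the K modeled
    # frequencies are zero-padded up to the full FFT length.
    K, H, W, _ = spectrum.shape
    padded = np.zeros((num_frames, H, W, 2), dtype=complex)
    padded[:K] = spectrum
    trajectories = np.fft.ifft(padded, axis=0).real          # (T, H, W, 2)

    return [render_frame(image, trajectories[t]) for t in range(num_frames)]
```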
Neural Stochastic Motion Texture
Motion textures
In previous research, the motion texture defined a series of time-varying 2D displacement maps.
To generate a future frame at time t, the corresponding displacement map is used to warp each pixel of the input image from its starting position to its position at time t.
As has been demonstrated previously in computer graphics research, many natural motions, especially oscillatory motions, can be described as the superposition of a small set of harmonic oscillators, represented by different frequencies, amplitudes, and phases.
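As a toy illustration, a single pixel's horizontal displacement over time can be built from a handful of such oscillators (the frequencies, amplitudes, and phases below are arbitrary):

```python
import numpy as np

# Toy example: one pixel's horizontal displacement over 5 seconds at 30 fps,
# built from three harmonic components (frequency in Hz, amplitude in pixels, phase).
t = np.arange(0, 5, 1 / 30)
components = [(0.5, 3.0, 0.0), (1.2, 1.0, np.pi / 3), (2.0, 0.4, np.pi / 2)]
displacement = sum(a * np.cos(2 * np.pi * f * t + phi) for f, a, phi in components)
print(displacement.shape)  # (150,): one x-offset per future frame
```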
One way to introduce randomness in motion is to integrate noise fields. But as previous research has shown, adding random noise directly to the spatial and temporal domains of predicted motion fields often results in unrealistic or unstable animations.
Furthermore, using the motion texture in the temporal domain defined above means that T 2D displacement fields need to be predicted to generate a video segment containing T frames. To avoid predicting such a large output representation, many previous animation methods either generate video frames autoregressively or independently predict each future output frame through additional temporal embedding.
However, neither strategy ensures that the generated video frames are temporally consistent over the long term, and both may produce videos that drift or diverge over time.
To solve these problems, the researchers represent the per-pixel motion texture of the input scene (that is, the complete motion trajectories of all pixels) in the frequency domain and formulate motion prediction as a multi-modal image-to-image translation task.
The researchers used a latent diffusion model (LDM) to generate a stochastic motion texture consisting of a 2D motion spectrum with 4K channels, where K << T is the number of modeled frequencies; at each frequency, four scalars represent the complex Fourier coefficients of the x and y motion.
The image below shows examples of these neural stochastic motion textures.
To test the hypothesis that natural motion is dominated by low-frequency components, the researchers computed the average power spectrum of motion extracted from 1,000 randomly sampled 5-second real video clips. As shown in the figure below (left), the power is indeed concentrated in the low-frequency components.
In practice, the researchers found that the first K = 16 Fourier coefficients were sufficient to faithfully reproduce original natural movements in a range of real videos and scenes.
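The frequency-space representation itself is simple to compute once per-pixel trajectories have been extracted from a real clip; a minimal sketch (the dense tracking step that produces the trajectories is assumed, not shown):

```python
import numpy as np

def motion_spectrum(trajectories, num_freqs=16):
    """Convert per-pixel motion trajectories into a truncated Fourier representation.

    trajectories: (T, H, W, 2) x/y displacements relative to the first frame,
                  e.g. obtained by dense tracking on a real video clip.
    Returns complex coefficients of shape (num_freqs, H, W, 2): the first K terms
    of the temporal FFT (the article reports K = 16 is enough in practice).
    """
    coeffs = np.fft.fft(trajectories, axis=0)   # full spectrum over time
    return coeffs[:num_freqs]                   # keep the low frequencies only

def average_power_spectrum(clips):
    """Average |FFT|^2 over a collection of equal-length clips, as in the
    power-spectrum analysis described above."""
    powers = [np.abs(np.fft.fft(c, axis=0)) ** 2 for c in clips]
    return np.mean([p.mean(axis=(1, 2, 3)) for p in powers], axis=0)  # shape (T,)
```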
Using a diffusion model to predict motion
The researchers chose a latent diffusion model (LDM) as the core of their motion prediction module, because an LDM is more computationally efficient than a pixel-space diffusion model while maintaining generation quality.
A standard LDM mainly includes two modules:
A variational autoencoder (VAE) compresses the input image into the latent space through the encoder z = E(I), and then reconstructs the input from the latent features through the decoder I = D(z).
A diffusion model based on U-Net, which learns to iteratively denoise latent features starting from Gaussian random noise.
The diffusion training is applied not to input images but to stochastic motion textures extracted from real video sequences: each texture is encoded into the latent space and then diffused for n steps under a predefined variance schedule to produce a noisy latent z_n.
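A hedged sketch of this training step, with the VAE encoder and the conditional U-Net as placeholder callables (how I0 is actually injected as conditioning is not specified here):

```python
import torch
import torch.nn.functional as F

def ldm_training_step(motion_texture, image, encoder, unet, alphas_cumprod):
    """One diffusion training step on a stochastic motion texture (not on the image).

    motion_texture: (B, 4K, H, W) real/imag x/y Fourier coefficients from real videos.
    image:          (B, 3, H, W) conditioning input frames I0.
    encoder:        VAE encoder z = E(S); unet: noise predictor; both are placeholders.
    alphas_cumprod: (N,) cumulative products of the predefined variance schedule.
    """
    z0 = encoder(motion_texture)                               # latent of the texture
    n = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    a = alphas_cumprod[n].view(-1, 1, 1, 1)
    eps = torch.randn_like(z0)
    zn = a.sqrt() * z0 + (1 - a).sqrt() * eps                  # diffuse n steps
    eps_pred = unet(zn, n, cond=image)                         # conditioned on I0
    return F.mse_loss(eps_pred, eps)
```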
Frequency adaptive normalization
The researchers observed a problem: stochastic motion textures have a particular distribution across frequencies. As shown in the left panel of the figure above, their amplitudes range from 0 to 100 and decay roughly exponentially as frequency increases.
Since diffusion models require output values between 0 and 1 for stable training and denoising, researchers must normalize the S coefficients extracted from real videos before training with them.
If the researchers scale the magnitude of the S coefficients to [0,1] based on the image width and height, then almost all the coefficients will be close to zero at higher frequencies, as shown in the figure above (right).
Models trained on such data tend to produce inaccurate motion, because when the normalized S coefficients have magnitudes very close to zero, even small prediction errors at inference time lead to large relative errors after denormalization.
To solve this problem, the researchers employed a simple but effective frequency-adaptive normalization technique. Specifically, the researchers first normalized the Fourier coefficients at each frequency independently based on statistics calculated from the training set.
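A minimal sketch of such per-frequency normalization; the specific statistic (a robust percentile) is an assumption made for illustration, since only "statistics calculated from the training set" is stated:

```python
import numpy as np

def fit_frequency_scales(training_coeffs, q=95):
    """One scale per frequency, estimated from training-set statistics.

    training_coeffs: (N, K, H, W, 4) real-valued Fourier coefficients gathered
    from N real clips. Dividing each frequency band by its own scale keeps the
    coefficient magnitudes in a comparable range instead of decaying toward zero.
    """
    K = training_coeffs.shape[1]
    mags = np.abs(training_coeffs).transpose(1, 0, 2, 3, 4).reshape(K, -1)
    return np.percentile(mags, q, axis=1) + 1e-8      # shape (K,)

def normalize(coeffs, scales):
    """coeffs: (K, H, W, 4); divide each frequency band by its own scale."""
    return coeffs / scales[:, None, None, None]

def denormalize(coeffs, scales):
    return coeffs * scales[:, None, None, None]
```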
Frequency-coordinated denoising
A straightforward way to predict a stochastic motion texture S with K frequency bands is to have a standard diffusion U-Net output a tensor with 4K channels.
However, training a model to produce such a large number of channels often yields overly smooth and inaccurate outputs.
Another approach is to predict the motion spectrum at each frequency independently by injecting an additional frequency embedding into the LDM, but this leads to uncorrelated predictions across frequencies and hence unrealistic motion.
Therefore, the researchers propose the frequency-coordinated denoising strategy shown in the figure below. Specifically, given an input image I0, an LDM is first trained to predict a four-channel stochastic motion texture map for each individual frequency, with an additional frequency embedding injected into the network alongside the time-step embedding; the per-frequency predictions are then coordinated across frequency bands during denoising, as sketched below.
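Here is a schematic picture of what such coordinated sampling could look like: all K per-frequency latents are denoised together, each with its own frequency embedding, and an assumed attention step along the frequency axis keeps the per-band predictions consistent. The module names, the attention step, and the update rule are assumptions, not the authors' implementation:

```python
import torch

@torch.no_grad()
def coordinated_denoise(unet, freq_attn, image, num_freqs=16,
                        latent_shape=(4, 32, 32), num_steps=50):
    """Schematic frequency-coordinated sampling (not the authors' exact code).

    unet:      predicts noise for the 4-channel latent of ONE frequency band,
               given a diffusion-step index and a frequency-embedding index.
    freq_attn: an assumed module that lets the K per-band predictions attend to
               each other along the frequency axis so they stay consistent.
    """
    K = num_freqs
    z = torch.randn(K, *latent_shape)                    # one latent per frequency band
    freq_idx = torch.arange(K)                           # frequency embedding indices
    for step in reversed(range(num_steps)):
        step_idx = torch.full((K,), step)
        eps = unet(z, step_idx, freq_idx, cond=image)    # per-band noise predictions
        eps = freq_attn(eps)                             # couple predictions across bands
        z = z - eps / num_steps                          # placeholder for the real
                                                         # DDPM/DDIM update rule
    return z                                             # decode with the VAE afterwards
```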
The researchers further describe how to render a frame Î_t at a future time t from the stochastic motion texture S predicted for a given input image I0. First, an inverse temporal FFT (fast Fourier transform) is applied to each pixel's predicted spectrum coefficients S(p) to recover its motion trajectory, yielding a motion displacement field for every future time step.
The feature extractor and synthesis network are trained jointly on start and target frames randomly sampled from real videos: the encoded features of I0 are warped using the flow field estimated from I0 to I_t, and the predicted frame Î_t is supervised with a VGG perceptual loss.
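A sketch of that rendering and training step in PyTorch. The warp below uses backward grid sampling as a crude stand-in for however the features are actually transported along the motion field, and the encoder, decoder, and VGG-loss callables are placeholders:

```python
import torch
import torch.nn.functional as F

def warp_features(features, flow):
    """Crude warp of encoded features by a displacement field.

    features: (B, C, H, W) encoded features of I0.
    flow:     (B, 2, H, W) per-pixel (x, y) displacement from I0 toward time t.
    A faithful renderer would forward-splat features to their destinations; here
    we approximate that by sampling features at p - flow(p), which is only valid
    for smooth motion and is used purely for illustration.
    """
    B, _, H, W = features.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base - flow                                        # sample source locations
    gx = 2 * coords[:, 0] / (W - 1) - 1                         # normalize to [-1, 1]
    gy = 2 * coords[:, 1] / (H - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                        # (B, H, W, 2)
    return F.grid_sample(features, grid, align_corners=True)

def rendering_training_step(encoder, decoder, vgg_loss, I0, It, flow_0_to_t):
    """Jointly train the feature extractor and synthesis network on one
    (start, target) pair sampled from a real video, with VGG perceptual supervision."""
    feats = encoder(I0)                           # encode the input frame
    warped = warp_features(feats, flow_0_to_t)    # move features along the flow field
    pred = decoder(warped)                        # synthesize the target frame
    return vgg_loss(pred, It)
```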
Further extended applications
The researchers further demonstrated the application of adding dynamic effects to a single static image using the motion representation and animation process proposed by the researchers.
Image to video
The system animates a single static image by first predicting a neural stochastic motion texture from the input image, and then applying the image-based rendering module to the motion displacement fields derived from that texture.
Because scene motion is modeled explicitly, the system can generate slow-motion videos by linearly interpolating the motion displacement fields, and can magnify (or attenuate) the animated motion by scaling the amplitude of the predicted stochastic motion texture coefficients.
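Both controls amount to simple operations on the motion representation; an illustrative sketch (function names are ours):

```python
import numpy as np

def magnify_motion(spectrum, gain=2.0):
    """Scale the predicted Fourier coefficients to amplify (gain > 1) or
    attenuate (gain < 1) the animated motion."""
    return spectrum * gain

def slow_motion_fields(displacement_fields, factor=2):
    """Linearly interpolate between consecutive displacement fields (T, H, W, 2)
    to produce a factor-x slow-motion sequence."""
    T = displacement_fields.shape[0]
    t_out = np.linspace(0, T - 1, (T - 1) * factor + 1)
    lo = np.floor(t_out).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (t_out - lo)[:, None, None, None]
    return (1 - w) * displacement_fields[lo] + w * displacement_fields[hi]
```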
Seamless Loop
Sometimes it is useful to generate videos with seamless looping motion, meaning there is no discontinuity in appearance or motion between the beginning and end of the video.
Unfortunately, it is difficult to find a large training set of seamlessly looping videos. Therefore, the researchers devised a method using the researchers' motion diffusion model, which was trained on regular non-looping video clips, to produce seamlessly looping videos.
Inspired by recent research on guided image editing, the researchers' approach is a motion self-guided technique that uses explicit loop constraints to guide the motion denoising sampling process.
Specifically, at each iterative denoising step during inference, an additional motion guidance signal is added alongside the standard classifier-free guidance, encouraging each pixel's position and velocity at the start and end frames to be as similar as possible.
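In spirit, the loop constraint can be folded into sampling as an extra guidance gradient at every denoising step. The sketch below is schematic only; the decode step, weights, and update rule are assumptions rather than the published method:

```python
import torch

def loop_guidance(trajectories, weight=1.0):
    """Penalty encouraging each pixel's position and velocity to match at the
    start and end of the clip. trajectories: (T, H, W, 2) displacement fields."""
    pos_gap = trajectories[-1] - trajectories[0]
    vel_start = trajectories[1] - trajectories[0]
    vel_end = trajectories[-1] - trajectories[-2]
    return weight * (pos_gap.pow(2).mean() + (vel_end - vel_start).pow(2).mean())

def guided_update(z, eps_cfg, decode_to_trajectories, guidance_scale=0.1):
    """One schematic denoising update: the standard classifier-free-guided noise
    `eps_cfg` plus the gradient of the loop penalty w.r.t. the current latent `z`.
    `decode_to_trajectories` is an assumed differentiable map from latent to motion."""
    z = z.detach().requires_grad_(True)
    loss = loop_guidance(decode_to_trajectories(z))
    grad = torch.autograd.grad(loss, z)[0]
    return z.detach() - eps_cfg - guidance_scale * grad   # placeholder update rule
```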
Generate interactive animations from a single image
The image-space motion spectrum of an observed video of an oscillating object approximates a basis for the object's physical vibration modes.
Mode shapes capture the oscillatory dynamics of an object at different frequencies, so an image-space projection of the object's vibration modes can be used to model its response to a user-defined force such as a poke or pull.
Therefore, the researchers used a previously studied modal analysis technique, which assumes that the motion of an object can be explained by the superposition of a set of resonators.
This allows the image-space 2D motion displacement field of the object's physical response at each simulation time step t to be written as a weighted sum of the Fourier spectrum coefficients, weighted by complex modal coordinates that evolve with time t.
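In code form, that weighted sum might look like the following, where the simulation of the modal coordinates under the user's force is left abstract, since it is not detailed here:

```python
import numpy as np

def modal_response(spectrum, modal_coords):
    """Image-space displacement of the object's response at each simulation step.

    spectrum:     (K, H, W, 2) complex image-space modal/Fourier coefficients.
    modal_coords: (T, K) complex modal coordinates q_j(t), e.g. the states of K
                  damped oscillators driven by a user-applied poke or pull.
    Returns real displacement fields of shape (T, H, W, 2).
    """
    # Weighted sum over frequencies: F_t(p) = sum_j S_j(p) * q_j(t)
    fields = np.einsum("khwc,tk->thwc", spectrum, modal_coords)
    return fields.real
```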
Experimental Evaluation
The research team quantitatively compared their method with baseline methods on a test set of unseen video clips.
It was found that Google's approach significantly outperformed previous single-image animation baselines in both image and video synthesis quality.
Specifically, Google’s FVD and DT-FVD distances are much lower, indicating that the videos generated by this method are more realistic and temporally coherent.
Because Google employs a global stochastic motion texture representation, its approach generates videos that are more consistent over time and do not drift or degrade over time.
First, spatiotemporal X-t slices of the generated videos are shown in Figure 7.
The dynamics of the Google-generated videos are much closer to the motion patterns observed in the corresponding real reference videos (second column), while baselines such as Stochastic I2V and MCVD fail to simulate appearance and motion realistically over time.
Google's method produces frames with fewer artifacts and distortions than the other methods, and its 2D motion fields are the closest to the reference displacement fields estimated from the corresponding real videos.
About the Author
Zhengqi Li
He has received the CVPR 2019 Best Paper Honorable Mention, a 2020 Google PhD Fellowship, a 2020 Adobe Research Fellowship, the 2021 Baidu Global AI Chinese Rising Star Top 100 award, and a CVPR 2023 Best Paper Honorable Mention.