Has the tide of the AI "war of a hundred models" turned? 360 and Meitu make successive moves as large vision models stage a "clash of the titans"

Original source: China Times

Image source: Generated by Unbounded AI

As the wave of AI large-model development and application continues to build, the reporter has noticed that players on the track are beginning to shift their focus from large language models to large vision models. Recently, Adobe, Meta, 360, Meitu and a number of other leading Internet companies at home and abroad have announced large-model results, adding fuel to an already red-hot AI market.

"The application of artificial intelligence in the field of video is getting more and more attention." Wu Gaobin, vice chairman of the China Communications Industry Association's Integration of Industrialization and Industrialization Committee, told the "China Times" reporter that the release of these large-scale AI models has brought new competition to enterprises. motivation. Competition among enterprises will promote technological innovation and progress, and will also bring better products and services. Competition will also promote cooperation and resource sharing among enterprises, so as to better meet market demands.

Large vision models at home and abroad in a "clash of the titans"

After a stream of large language models and multimodal large models, "large vision models" have become the next hotly contested battleground. A few days ago, Meitu released MiracleVision, a large AI vision model, along with seven products including the AI visual creation tool WHEE, the AI digital human creation tool DreamAvatar, and the Meitu AI assistant RoboNeo.

According to the announcement, MiracleVision has strong visual expressiveness and creativity, with its technical evolution driven by visual creation scenarios such as painting, design, film and television, photography, games, 3D, and animation. Unlike other large models on the market, it is particularly strong in areas such as Asian portrait photography, Chinese-style and fashion imagery, and commercial design.

Wu Xinhong, founder, chairman and CEO of Meitu, told the China Times: "The core advantage of Meitu's large model is that it understands aesthetics. Its consumer user base is large enough and the cost of customer acquisition is low. Meitu currently has 243 million monthly active users and 7.19 million VIP members worldwide, which lets us validate a product's success in a short time. Unlike other vendors, Meitu's large model focuses on aesthetics (image quality, design and so on). If we have to compete in the future, we will compete on aesthetics."

Coincidentally, 360 also officially released the "360 Smart Brain-Vision Large Model" a few days ago. Zhou Hongyi, founder of 360, said that the large language model is the foundation for building a large vision model, and that the core of enhancing multimodal capability lies in the cognition, reasoning and decision-making capabilities of the large language model. At the same time, the large vision model is an important capability component of "360 Smart Brain", which in the future will be able to understand pictures, videos and sounds.

Overseas companies have also begun laying out vision models. A few days ago, social media giant Meta announced that it would open to researchers some components of a "human-like" artificial intelligence model called I-JEPA, which can analyze and complete unfinished images more accurately than existing models, rather than merely making inferences from nearby pixels as other generative AI models do.

Yann LeCun, Meta's chief AI scientist, has publicly argued that today's autoregressive GPT models lack planning and reasoning abilities and that GPT-style systems may be abandoned in the future, offering what he considers the correct answer: the world model. I-JEPA is said to be the first AI model based on key components of that vision, able to analyze and complete unfinished images more accurately than existing models.

In addition, Meta has also released the speech-generation AI model Voicebox, which supports generating speech from text, can match an audio style from a sample as short as two seconds, and can translate text into another language and read the translated content aloud in the speaker's original voice given only a short voice sample. Six languages are currently supported: English, French, German, Spanish, Polish, and Portuguese.

As early as April this year, Adobe integrated its Adobe Firefly capabilities (a ChatGPT-style generative product) into its audio and video product matrix, including Premiere Pro, After Effects, Audition and Remix, giving users one-click content generation, editing, color grading, music replacement and other functions.

From "Language Model" to "Vision Model"

The "China Artificial Intelligence Large-scale Model Map Research Report" shows that in terms of the number and distribution of large-scale models released around the world, China and the United States are significantly ahead, accounting for more than 80% of the global total. At the same time, more and more R&D teams in Europe, Russia, Israel, etc. are also investing in the development of large models. But it is worth noting that there are still few large models in the fields of computer vision and other fields in my country.

Asked why, Yan Shuicheng, visiting chief scientist at the Beijing Zhiyuan Artificial Intelligence Research Institute, told the China Times: "The main reason the development of vision models lags slightly is that large vision models consume far more computing power than text models, so we are also looking forward to faster progress in chips, possibly even integrating other non-GPU chips. The models you see now are generally trained at the thousand-GPU-card level, but some teams may train at the ten-thousand-card level next year."

Huang Tiejun, president of the Beijing Zhiyuan Artificial Intelligence Research Institute, believes the visual field is the focus of the next wave of large models. He pointed out that the thinking and basic approach behind large vision models and large language models are the same, but the input data becomes images and videos, and the trained model acquires a certain general visual capability. One use is as the foundation for AIGC (AI-generated content), which can generate images and artwork. "There is also a more basic ability: after seeing the world, you must first be able to recognize everything in it."

Many institutions are also optimistic about the development of large vision models. A research report from CICC Research says that computer vision is expected to achieve greater automation, higher precision and lower power consumption, further enriching the content ecology of the metaverse and lowering its barriers to entry. Advances in computer vision have driven the rapid maturation of 3D reconstruction and motion-capture technology, with progress accumulating steadily in each field. Looking ahead, computer vision is expected to deliver better visual effects on mobile devices, be applied across a large number of downstream industries, and gradually move toward the long-term vision of connecting the physical world and the digital world.

CITIC Securities Research also noted that in the design field, large models are moving digital design toward intelligent design; industrial design software combined with GPT and similar technologies can be applied to scenarios such as design planning, layout optimization, plug-in assistants and sketch generation. Under the broad trend of AI upgrading, a new round of productivity revolution is underway.
