
A stunning debut, then a reversal within a day? Google's 6-minute "Gemini" demo video revealed to have been edited


After Bard's botched debut at the beginning of the year, Google launched its large model Gemini on December 7, Beijing time, along with a series of dazzling demonstration videos. Can Gemini stand up to GPT-4 this time?
The most striking of these demonstrations is a video in which, as a tester draws, performs magic tricks, and carries out other actions, Gemini comments and interacts with the tester in real time. Judging from the video alone, Gemini's understanding appears to reach a human level.
"Judging from the demonstration alone, Gemini's video-understanding ability is undoubtedly state of the art at present," an algorithm engineer at a Beijing large-model company told a Beijing News Shell Finance reporter. "This ability comes from Gemini natively incorporating a large amount of video data during training and supporting video understanding at the architecture level."
However, just one day after the release, many users found in their own tests that Gemini's video comprehension was not as fluid as in the demonstration. Google quickly published a blog post explaining the multimodal interaction process behind the demo video, all but acknowledging that the effect was achieved with static images and multiple prompts. Some netizens also noticed an important disclaimer in the demonstration video: latency was reduced for the demo, and Gemini's outputs were shortened for brevity.
Nevertheless, in the eyes of many professionals, Google has finally launched a large model that can compete with OpenAI's. As an established artificial intelligence player, Google has deep reserves, and Gemini will become a strong rival to GPT.
Where was it edited? How does the demonstration video differ from reality?
"Have you seen the demo video of Google's latest large model? The multimodal switching is a qualitative leap, especially the part with the game map; people may not even be able to react that fast." On December 7, Mr. Liu, a website developer, sent the demonstration video to a Shell Finance reporter.
In this demonstration video, which excited many practitioners, the tester takes out a piece of paper and Gemini immediately responds, "You took out a piece of paper." As the tester draws a curve and colors it in, Gemini keeps "understanding" and narrating along with the tester's actions: "You are drawing a curve. It looks like a bird. It's a duck, but blue ducks aren't common; most ducks are brown. The Chinese word for duck is pronounced 'yazi', and Chinese has four tones." When the tester places a blue rubber duck on a world map, Gemini notices at once: "The duck has been placed in the middle of the ocean; there aren't many ducks out there."
Afterwards, the tester began to interact with Gemini through gestures. When the tester made the scissors and paper gestures, Gemini "answered": "You're playing rock, paper, scissors." Gemini then even guessed the eagle and dog shapes the tester mimicked with their hands.
However, a Shell Finance reporter found many traces of editing in the video. In the rock-paper-scissors segment, for example, the tester's rock gesture was clearly cut out. Google has since published a blog post offering "Q&A and clarification": when shown a picture of the "paper" gesture, Gemini answers, "I see a right hand with the palm open and the five fingers spread"; when shown a picture of the "rock" gesture, Gemini answers, "a person knocking on a door"; when shown the "scissors" gesture, Gemini answers, "I see a hand with the index and middle fingers extended." Only when the three pictures are presented together with the question "What do you think I'm doing?" does Gemini answer, "You're playing rock, paper, scissors."
So although Gemini's answers are genuine, actual use may not be as smooth as the demonstration video suggests.
Source: "Gemini" demonstration video released by Google.
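The interaction Google describes in its clarification can be sketched in a few lines: instead of watching live video, the model receives a handful of still frames bundled with a single, carefully worded text prompt. The sketch below is purely illustrative; the function and the payload shape are assumptions for explanation, not Gemini's actual API.

```python
# Illustrative sketch (not Gemini's real API): the demo effect is produced
# by packing several still frames and one combined question into a single
# multimodal request, rather than streaming live video to the model.

def build_multimodal_prompt(frames, question):
    """Assemble an ordered list of image parts followed by one text part."""
    parts = [{"type": "image", "data": f} for f in frames]
    parts.append({"type": "text", "data": question})
    return {"role": "user", "parts": parts}

# Three separate gesture photos, shown together with one guiding question,
# as in Google's rock-paper-scissors clarification.
prompt = build_multimodal_prompt(
    ["rock.png", "paper.png", "scissors.png"],
    "What do you think I'm doing? Hint: it's a game.",
)
```

Shown one frame at a time, the model only describes each hand; only the combined request with the hint yields "rock, paper, scissors", which is why the live video looks far more fluent than single-image queries.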
How was this multimodal ability achieved?
Through this demonstration, many industry insiders acknowledged that Google has indeed taken a step toward catching up with OpenAI. Before ChatGPT appeared, Google had long held a leading position in artificial intelligence, but ChatGPT's success put it under enormous pressure. In February this year Google launched Bard as a benchmark against ChatGPT, but after that botched debut it had lacked a model excellent enough to boost morale.
With Gemini, Google has at least shown distinctive strengths in multimodal understanding. "Gemini is a natively multimodal large model, meaning it was multimodal from training onward. Google already has a strong ecosystem in search, long video, online documents, and more. It also holds plenty of GPUs, with several times OpenAI's computing power. Now it is going all in to catch up with OpenAI," a large-model practitioner who studied automation at Tsinghua University told Shell Finance reporters.
Specifically, Gemini comes in three versions: Gemini Ultra, the largest and most capable; Gemini Pro, suited to a broad range of tasks; and Gemini Nano, intended for specific tasks and mobile devices.
Beyond multimodality, Gemini also performs well in areas such as text comprehension and coding. On the MMLU multitask language understanding benchmark, Gemini Ultra not only surpassed GPT-4 but even outscored human experts. A Shell Finance reporter visited Google DeepMind's official website and found the slogan "Gemini: our most capable model" displayed on the homepage.
At present, users can try Gemini Pro through Google's Bard portal, though Shell Finance reporters found the capability is only available in some regions. In tests by overseas netizens, users can feed Gemini both images and text. Judging from the results, Gemini Pro and the similarly multimodal GPT-4V each have their strengths across many questions, and Gemini Pro is not overwhelmed by GPT-4V.
"Based on my observation, Gemini's text ability is still slightly behind GPT-4, but Google's technical strength remains in the first tier," said the aforementioned large-model algorithm engineer.
He told a Shell Finance reporter that, technically, giving a large model the "multimodal ability" to understand images, video, and sound can be seen as extending the image-understanding module of a model like LLaVA (a multimodal pre-trained model) to video and speech, and adding extra video and audio data during training. "What this really proves is that Gemini has, for the first time, built video and speech understanding into a large model, verifying that both are feasible in a large model."
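The extension the engineer describes can be sketched in miniature: a LLaVA-style model turns each image into a fixed number of "vision tokens" and splices them into the language model's token sequence next to the text; handling video then amounts to sampling frames and encoding each one the same way. Everything below is a toy illustration under that assumption; the token format, frame count, and stride are invented for clarity and are not Gemini's real architecture.

```python
# Toy sketch of extending an image-understanding module to video:
# sample frames, encode each frame into a fixed number of vision tokens,
# then append the text prompt's tokens, yielding one sequence for the LLM.

def encode_image(frame):
    """Stand-in for a vision encoder; emits 4 placeholder tokens per frame."""
    return [f"<img:{frame}:{i}>" for i in range(4)]

def build_sequence(video_frames, text_prompt, stride=2):
    """Temporally subsample the video, encode each kept frame, add the text."""
    tokens = []
    for frame in video_frames[::stride]:  # keep every `stride`-th frame
        tokens.extend(encode_image(frame))
    tokens.extend(text_prompt.split())    # naive whitespace "tokenizer"
    return tokens

seq = build_sequence(["f0", "f1", "f2", "f3"], "what is happening ?")
```

In this framing, "adding video data during training" means training on such interleaved frame-and-text sequences, so the language model learns to attend across the vision tokens of multiple frames rather than a single image.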
"Overall, the release of Google's large model meets expectations; every technical point in Gemini had already been validated in academia, and the corresponding papers can be found. Going forward, the personal assistant will be a very attractive scenario. Compared with large language models, multimodal large models can act as assistants that can listen, see, speak, and draw, much more like a human," the algorithm engineer told a Shell Finance reporter.
Beijing News Shell Finance reporter Luo Yidan