A recent study by Chinese researchers has shown that multimodal large language models (LLMs) can spontaneously develop conceptual representations of objects similar to those of humans.
The researchers analyzed nearly 4.7 million responses generated by the AIs about 1,854 everyday objects such as dogs, chairs, apples, and cars. The models studied included ChatGPT-3.5, which works with text only, and Gemini Pro Vision, a multimodal model capable of processing both images and text.
The results revealed that these AIs organized the objects along 66 conceptual dimensions, going well beyond simple classic categories like "food" or "furniture." These dimensions capture subtle attributes such as texture, emotional relevance, or suitability for children. In other words, the models appear to build a sophisticated mental map in which objects are not mechanically sorted, but classified according to complex criteria that resemble the way our brain organizes the world around us.
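How does one get from millions of raw responses to a handful of interpretable dimensions? The article does not describe the pipeline, but a common approach is to estimate an object-by-object similarity matrix from the model's judgments and factorize it into a small set of non-negative, sparse dimensions. The sketch below illustrates that general idea with synthetic data; the factorization method (NMF), the matrix, and all sizes except the reported 66 dimensions are illustrative assumptions, not the study's actual code.

```python
# Hedged sketch: recovering interpretable "conceptual dimensions" from object
# similarity data. The study's exact method is not given here; this only
# illustrates the general idea with non-negative matrix factorization (NMF).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

n_objects = 300   # synthetic subset for speed (the study probed 1,854 objects)
n_dims = 66       # number of conceptual dimensions reported in the study

# Stand-in for a symmetric, non-negative object-by-object similarity matrix.
# In practice this would be estimated from millions of model responses
# (e.g. pairwise or triplet similarity judgments), not random numbers.
similarity = rng.random((n_objects, n_objects))
similarity = (similarity + similarity.T) / 2

# Factorize similarity ≈ W @ H, so each object gets a non-negative loading on
# each latent dimension. Sparse, non-negative loadings tend to be interpretable
# (e.g. "texture", "is it food?", "child-related").
model = NMF(n_components=n_dims, init="nndsvda", max_iter=300, random_state=0)
W = model.fit_transform(similarity)   # shape: (n_objects, 66)

# Inspect which objects load most strongly on a given dimension.
dim = 0
top_objects = np.argsort(W[:, dim])[::-1][:10]
print("Top object indices on dimension", dim, ":", top_objects)
```

In analyses of this kind, a dimension is given a human-readable label ("texture", "is it food?", "child-related") by inspecting which objects load most heavily on it.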
The study also compared how the AI models and the human brain respond to the same objects. The results showed that activity in certain brain regions corresponds to how the AIs "think about" objects. This convergence is even more pronounced in multimodal models, which combine visual and semantic processing and thus mimic the way humans combine their senses to understand their environment.
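A standard way to quantify this kind of convergence is representational similarity analysis (RSA): compute pairwise dissimilarities between the same objects in the model's space and in the brain's activity patterns, then correlate the two. Whether the study used exactly this procedure is an assumption here; the sketch below uses synthetic data and purely illustrative shapes.

```python
# Hedged sketch: representational similarity analysis (RSA), a common way to
# compare model representations with brain activity. All data below are
# synthetic stand-ins; names and shapes are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_objects = 100   # subset of objects, for illustration only

# Stand-ins for per-object representations.
model_embedding = rng.random((n_objects, 66))    # AI conceptual dimensions
brain_patterns = rng.random((n_objects, 500))    # voxel responses in one region

# Representational dissimilarity matrices (RDMs): pairwise distances between
# objects within each representational space, flattened to vectors.
model_rdm = pdist(model_embedding, metric="correlation")
brain_rdm = pdist(brain_patterns, metric="correlation")

# The RSA score is the rank correlation between the two RDMs: a high value
# means the model and the brain region place the same object pairs close
# together (or far apart).
rho, p = spearmanr(model_rdm, brain_rdm)
print(f"RSA (Spearman rho) = {rho:.3f}, p = {p:.3g}")
```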
However, it is important to note that these AIs have no sensory or emotional experience. Their "understanding" comes from statistical processing of data: they identify and reproduce complex patterns without feeling what they describe. That is the crucial distinction between sophisticated recognition and genuine conscious cognition.
Nevertheless, this study invites us to rethink the limits of what current AIs can do. If these models can spontaneously generate complex conceptual representations, the boundary between imitating intelligence and possessing a form of functional intelligence may be less clear-cut than we thought.
Beyond the philosophical debates, this advance has concrete implications for robotics, education, and human-machine collaboration. An AI that represents objects and concepts the way we do could interact more naturally, anticipate our needs, and adapt better to novel situations.
In summary, large language models like ChatGPT are much more than simple language imitators. They may possess a form of world representation close to human cognition, built from vast amounts of data and capable of integrating complex information. For now, however, these machines remain sophisticated mirrors, reflecting our way of organizing knowledge without experiencing it directly. They do not feel, live, or think as we do, but they may one day come closer, paving the way for ever more intelligent and intuitive AIs.