#CodeX: AI & Humans → Bridging Worlds Beyond Text / by Ajit Minhas

In the ever-evolving landscape of Artificial Intelligence (AI), the journey toward comprehensive understanding involves a dynamic interplay between language models and sensory inputs. Aiming for fairness, inclusivity, and depth, the AI community is exploring multifaceted approaches that span vast datasets and innovative algorithmic designs.

Training AI systems and LLMs requires a multi-faceted approach: carefully curating and pre-processing training data to mitigate biases and promote fairness and inclusivity, and designing algorithms and training techniques such as data augmentation, diversified data sources, and fairness-aware objectives. Ongoing monitoring, evaluation, and feedback on model outputs are equally crucial for continuous improvement and for building more inclusive and equitable AI systems.
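
To make the monitoring-and-evaluation step concrete, here is a minimal sketch in Python of one way such a check might look: computing per-group accuracy on a labeled evaluation set and flagging the largest gap between groups. The data format, group labels, and threshold are illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict

def per_group_accuracy(examples, predict):
    """Compute accuracy per demographic group and the largest gap between groups.

    `examples` is a list of (text, label, group) tuples and `predict` is any
    callable mapping text -> label; both are placeholders for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, label, group in examples:
        total[group] += 1
        if predict(text) == label:
            correct[group] += 1
    accuracy = {g: correct[g] / total[g] for g in total}
    gap = max(accuracy.values()) - min(accuracy.values())
    return accuracy, gap

# Hypothetical usage: flag the model build if the accuracy gap exceeds a threshold.
# accuracy, gap = per_group_accuracy(eval_set, model.predict)
# assert gap < 0.05, f"Fairness regression: per-group accuracy gap of {gap:.2%}"
```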

Language Models vs. Sensory Models

Large Language Models (LLMs) like OpenAI's GPT-3 are trained on enormous datasets of text (language) drawn from books, articles, websites, and other sources. While text data provides valuable insight into human interactions, culture, and knowledge domains, it may not capture the full spectrum of human experience and understanding obtained through sensory input.

Learning through Sensory Input

Human beings, as well as many other organisms, acquire a significant portion of their knowledge and understanding of the world through sensory input: information obtained through sight, hearing, touch, taste, and smell. For example, we learn about the physical properties of objects, spatial relationships, cause and effect, and social dynamics by directly interacting with the environment.


Bridging Worlds: AI & Humans

According to widely cited studies of interpersonal communication, words (language) account for only a fraction, about 7%, of how we convey meaning. The same research suggests that the remaining roughly 93% is nonverbal, carried by body language, facial expressions, and tone of voice.


Limitations of Language Models

Despite the vast amount of training data available today, machines (LLMs) will never reach human-level AI without learning from high-bandwidth sensory inputs, such as vision or touch.

Sensory inputs such as vision and touch carry far more data (bandwidth) than language or text:

  • The data bandwidth of visual perception is roughly 16 million times higher than the data bandwidth of written (or spoken) language.

  • In a mere 4 years, a child has seen 50 times more data than the biggest LLMs trained on all the text publicly available on the internet (a rough back-of-the-envelope calculation follows this list).

  • Most of human knowledge (and almost all of animal knowledge) comes from our sensory experience of the physical world.
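
The figures above can be sanity-checked with a rough back-of-the-envelope calculation. The constants below (roughly 20 MB/s of optic-nerve bandwidth, about 16,000 waking hours by age four, and about 2×10^13 bytes of public text behind the largest LLM training runs) are commonly cited estimates rather than exact measurements, so treat the result as illustrative only.

```python
# Back-of-the-envelope estimate; all constants are rough, commonly cited figures,
# not measurements (and not taken from this article).
OPTIC_NERVE_BYTES_PER_SEC = 20e6   # ~20 MB/s of visual data reaching the brain
WAKING_HOURS_BY_AGE_FOUR = 16_000  # approximate total waking time of a 4-year-old
LLM_TRAINING_BYTES = 2e13          # ~10^13 tokens of public text at ~2 bytes/token

visual_bytes = OPTIC_NERVE_BYTES_PER_SEC * WAKING_HOURS_BY_AGE_FOUR * 3600
ratio = visual_bytes / LLM_TRAINING_BYTES

print(f"Visual data absorbed by age four: ~{visual_bytes:.1e} bytes")  # ~1.2e15 bytes
print(f"Child-to-LLM data ratio: ~{ratio:.0f}x")                       # roughly 50-60x
```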

A Vision for the Future

Looking ahead, integrating sensory modalities into AI architectures unveils the potential for machines to perceive, understand, and empathize with human realities more profoundly. To bridge the gap, researchers are pioneering multimodal learning approaches that combine text, images, audio, and other modalities to construct holistic representations of human experiences.
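
As a simplified illustration of what combining modalities can look like in practice, the sketch below (PyTorch, with made-up feature dimensions and a deliberately plain late-fusion design) projects text and image features into a shared space and concatenates them into one joint representation. Production multimodal systems use far richer encoders and fusion schemes; this is only meant to make the idea tangible.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Toy late-fusion model: separate projections per modality, then a joint head.

    The dimensions and the two-modality setup are illustrative assumptions,
    not a reference architecture.
    """

    def __init__(self, text_dim=768, image_dim=1024, hidden_dim=256, num_classes=10):
        super().__init__()
        # Project each modality's features into a shared hidden space.
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        # Fuse by concatenation and classify from the joint representation.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_features, image_features):
        fused = torch.cat(
            [self.text_proj(text_features), self.image_proj(image_features)], dim=-1
        )
        return self.classifier(fused)

# Hypothetical usage with random stand-ins for encoder outputs.
model = LateFusionModel()
text_features = torch.randn(4, 768)    # e.g. pooled outputs of a text encoder
image_features = torch.randn(4, 1024)  # e.g. pooled outputs of a vision encoder
logits = model(text_features, image_features)
print(logits.shape)  # torch.Size([4, 10])
```

Swapping concatenation for cross-attention, or adding audio and touch encoders, follows the same pattern: each modality contributes its own representation, and the model learns how to combine them.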

Recent advancements in computer vision and natural language processing are enabling AI systems to analyze and interpret multimodal data more effectively, mirroring human comprehension with greater fidelity. Acknowledging the significance of sensory input and embracing multimodal learning not only enhances AI capabilities but also fosters more nuanced models of human cognition and interaction.

As AI transitions from a utilitarian tool to an empathetic collaborator, the synergy between sensory immersion and language expression lays the groundwork for deeper connections and enriched experiences.

The evolution of AI from data-driven machines to empathetic AI companions heralds transformative possibilities for human-machine interaction across diverse spectrums.