The voice is our primary means of communication, and telephony has enabled us to connect using our voices for over a century. The phone call as we know it has evolved from analogue to digital, from fixed to mobile, and from low speech quality to natural speech quality. One major advancement, however, was still lacking: the live transmission of fully authentic, immersive sound.
The introduction of the IVAS (Immersive Voice and Audio Services) codec, standardized by 3GPP in Release 18 in June this year, represents a major advancement in audio technology. Unlike traditional monophonic voice calls, IVAS enables the transmission of immersive, three-dimensional audio, offering a richer, more lifelike communication experience. This is made possible by new audio formats optimized for conversational spatial audio. One example is the Metadata-Assisted Spatial Audio format, MASA, which uses only two audio channels plus metadata that describes the spatial properties of the sound. Spatial audio calls allow users to experience sound as though it were happening in real life, complete with features like head tracking.
Before 3D calling came to mobile phones, the last major innovation in voice calling was the EVS codec, introduced in 2014 and recognized by consumers as HD Voice+. While it significantly enhanced call quality, it offered only a monophonic listening experience, like all codecs before it.
A major part of the challenge is noise reduction, which is crucial for speech clarity in settings such as concerts or outdoors in nature. Many traditional noise reduction methods filter out only continuous sounds, such as air conditioning hums or traffic noise, and often leave other background noise untouched. Immersive audio technology addresses this by adjusting how much background noise is reduced according to the surrounding environment, while also letting users adjust the level of noise reduction manually.
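The idea of environment-dependent noise reduction with a manual override can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the decibel thresholds, and the rule that a louder background calls for deeper suppression are hypothetical and are not taken from IVAS or from any product.

```python
import math

def frame_level_db(frame):
    """Root-mean-square level of one audio frame, in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-10))

def suppression_depth_db(noise_floor_db, user_override_db=None,
                         quiet_db=-60.0, loud_db=-20.0, max_depth_db=18.0):
    """Choose how many dB of background noise to remove.

    Hypothetical rule: no suppression in a quiet room, full suppression in a
    loud one, linear in between. A user setting, if given, takes priority.
    """
    if user_override_db is not None:
        return min(max(user_override_db, 0.0), max_depth_db)
    position = (noise_floor_db - quiet_db) / (loud_db - quiet_db)
    position = min(max(position, 0.0), 1.0)
    return position * max_depth_db

def noise_gain(depth_db):
    """Linear gain applied to the part of the signal classified as noise."""
    return 10.0 ** (-depth_db / 20.0)

# Automatic mode in a moderately noisy place, then a manual user setting.
automatic = suppression_depth_db(-40.0)
manual = suppression_depth_db(-40.0, user_override_db=6.0)
```

In this sketch the automatic mode picks a depth halfway between the two extremes for a background level of -40 dB, while the manual setting replaces that choice outright, clamped to the allowed range.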
Immersive audio setups with multiple microphones and loudspeakers also face a major obstacle: acoustic echo. This happens when microphones pick up sound from nearby loudspeakers, causing unwanted feedback. To solve this, a machine-learning-based spatial acoustic echo cancellation (AEC) solution was created. It uses a reference signal to remove the loudspeaker sound from the microphone input, improving audio quality, especially for spatial audio in real-time voice applications.
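The principle of cancelling echo with a reference signal can be shown with a classical adaptive filter. The sketch below uses a normalized least-mean-squares (NLMS) filter, a textbook technique swapped in here for illustration; it is not the machine-learning-based spatial solution described above. The echo path, signal lengths, and parameter values are invented for the demonstration, and the microphone signal contains only echo, with no near-end speech.

```python
import random

def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal."""
    weights = [0.0] * taps     # running estimate of the echo path
    history = [0.0] * taps     # most recent reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic, ref):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = mic_sample - echo_estimate          # what is left after cancellation
        energy = sum(h * h for h in history) + eps  # normalizes the step size
        step = mu * error / energy
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual

def power(signal):
    """Mean squared value of a signal."""
    return sum(s * s for s in signal) / len(signal)

# Synthetic demonstration: the loudspeaker plays random noise, and the
# microphone hears it through a short, made-up echo path.
random.seed(0)
ref = [random.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.0, 0.6, 0.3, -0.1]   # hypothetical room response
mic = [
    sum(h * ref[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(ref))
]

cleaned = nlms_echo_cancel(mic, ref)
tail_in = power(mic[-500:])       # echo power before cancellation
tail_out = power(cleaned[-500:])  # residual echo power after the filter adapts
```

Once the filter has adapted, the residual power in the last 500 samples is a small fraction of the original echo power. A real system also has to cope with near-end speech, noise, and changing room acoustics, which this sketch leaves out.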
To bring spatial audio to mobile phone calling, and not only to Over-the-Top (OTT) services, the 3rd Generation Partnership Project (3GPP) recently adopted a new voice codec standard. The IVAS codec includes a built-in renderer that supports head-tracked binaural audio and multi-loudspeaker playback using the MASA format.
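Head tracking keeps a sound anchored in the scene while the listener turns: the renderer subtracts the head rotation from the direction of the source before producing the left and right ear signals. The sketch below illustrates only that compensation step. The angle convention and function names are assumptions, and a constant-power stereo pan stands in for the head-related filtering that an actual binaural renderer would apply; it is not the IVAS renderer.

```python
import math

def wrap_degrees(angle):
    """Wrap an angle to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def head_relative_azimuth(source_azimuth, head_yaw):
    """Direction of the source relative to the listener's nose.

    Assumed convention: 0 is straight ahead, positive angles are to the left.
    """
    return wrap_degrees(source_azimuth - head_yaw)

def crude_stereo_gains(relative_azimuth):
    """Constant-power pan used as a stand-in for true binaural rendering."""
    lateral = math.sin(math.radians(relative_azimuth))  # +1 is fully left
    pan = (lateral + 1.0) * math.pi / 4.0               # 0 .. pi/2
    return math.sin(pan), math.cos(pan)                 # (left gain, right gain)

# A talker sits 30 degrees to the listener's left. When the listener turns
# 30 degrees to the left to face the talker, the voice moves to the centre.
before_turn = crude_stereo_gains(head_relative_azimuth(30.0, 0.0))
after_turn = crude_stereo_gains(head_relative_azimuth(30.0, 30.0))
```

Before the turn the left gain is larger than the right; after the turn the two gains are equal, so the talker is heard straight ahead while staying fixed in the scene.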
The IVAS codec also addresses spatial communication challenges such as noise reduction and acoustic echo cancellation.
The power of 3D live audio revolutionizes the audio experience for consumers, enterprises, and industries. For consumers, it deepens engagement in interactions with friends and family by sharing local sounds, whether live-streamed or recorded, and offers full immersion in synchronized metaverse experiences. For enterprises, 3D audio voice calling unlocks new capabilities, from enhanced customer experience through directional audio to transforming team collaboration and decision-making. In industrial settings, audio analytics can drive automated processes, streamlining operations and boosting efficiency.