Tuochao Chen – UW News

AI headphones translate multiple speakers at once, cloning their voices in 3D sound (May 9, 2025)

Tuochao Chen, a University of Washington doctoral student, recently toured a museum in Mexico. Chen doesn’t speak Spanish, so he ran a translation app on his phone and pointed the microphone at the tour guide. But even in a museum’s relative quiet, the surrounding noise was too much. The resulting text was useless.

Various technologies promising fluent translation have emerged lately, but none of them solved Chen’s problem of public spaces. Some systems, for instance, function only with an isolated speaker, and they deliver the translation only after the speaker finishes.

Now, Chen and a team of UW researchers have designed a headphone system that translates multiple speakers at once, while preserving the direction and qualities of people’s voices. The team built the system, called Spatial Speech Translation, with off-the-shelf noise-cancelling headphones fitted with microphones. The team’s algorithms separate out the different speakers in a space and follow them as they move, translate their speech and play it back with a 2-4 second delay.

The team presented the research Apr. 30 at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan. The code for the proof-of-concept device is available for others to build on. “Other translation tech is built on the assumption that only one person is speaking,” said senior author , a UW professor in the Paul G. Allen School of Computer Science & Engineering. “But in the real world, you can’t have just one robotic voice talking for multiple people in a room. For the first time, we’ve preserved the sound of each person’s voice and the direction it’s coming from.”


The system introduces three innovations. First, when turned on, it immediately detects how many speakers are in an indoor or outdoor space.

“Our algorithms work a little like radar,” said lead author Chen, a UW doctoral student in the Allen School. “So it’s scanning the space in 360 degrees and constantly determining and updating whether there’s one person or six or seven.”
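Chen’s radar analogy can be made concrete with a toy sketch: given a stream of noisy direction-of-arrival estimates (one angle per audio frame), counting speakers reduces to clustering the angles on a 360-degree circle. The 20-degree grouping threshold and the simple clustering rule below are illustrative assumptions, not the team’s published method.

```python
def count_speakers(doa_estimates_deg, gap_deg=20.0):
    """Cluster per-frame direction-of-arrival estimates (degrees on a
    0-360 circle) into distinct speakers. An estimate joins a cluster
    when its circular distance to the cluster mean is under gap_deg."""
    clusters = []  # each cluster is a list of angles from one speaker
    for angle in doa_estimates_deg:
        placed = False
        for cluster in clusters:
            mean = sum(cluster) / len(cluster)
            # distance measured around the circle, not along the line
            dist = min(abs(angle - mean), 360 - abs(angle - mean))
            if dist < gap_deg:
                cluster.append(angle)
                placed = True
                break
        if not placed:
            clusters.append([angle])
    return len(clusters)

# Jittered estimates from three directions yield a count of three.
frames = [10, 12, 9, 130, 128, 131, 250, 252, 249, 11, 129]
print(count_speakers(frames))  # prints 3
```

A real system would update this count continuously as frames arrive, which is what “constantly determining and updating” suggests.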

The system then translates the speech and maintains the expressive qualities and volume of each speaker’s voice while running on a mobile device with an Apple M2 chip, such as a laptop or the Apple Vision Pro. (The team avoided using cloud computing because of the privacy concerns with voice cloning.) Finally, when speakers move their heads, the system continues to track the direction and qualities of their voices as they change.

The system functioned well when tested in 10 indoor and outdoor settings. And in a 29-participant test, users preferred the system over models that didn’t track speakers through space.

In a separate user test, most participants preferred a delay of 3-4 seconds, since the system made more errors when translating with a delay of 1-2 seconds. The team is working to reduce the translation delay in future iterations. The system currently works only on commonplace speech, not specialized language such as technical jargon. For this paper, the team worked with Spanish, German and French — but previous work on translation models has shown they can be trained to translate around 100 languages.

“This is a step toward breaking down the language barriers between cultures,” Chen said. “So if I’m walking down the street in Mexico, even though I don’t speak Spanish, I can translate all the people’s voices and know who said what.”

, a research intern at HydroX AI and a UW undergraduate in the Allen School while completing this research, and , a UW doctoral student in the Allen School, are also co-authors on this paper. This research was funded by a Moore Inventor Fellow award and a .

For more information, contact the researchers at babelfish@cs.washington.edu.

AI headphones let wearer listen to a single person in a crowd, by looking at them just once (May 23, 2024)

Noise-canceling headphones have gotten very good at creating an auditory blank slate. But allowing certain sounds from a wearer’s environment through the erasure still challenges researchers. The latest edition of Apple’s AirPods Pro, for instance, automatically adjusts sound levels for wearers — sensing when they’re in conversation — but the user has little control over whom to listen to or when this happens.

A University of Washington team has developed an artificial intelligence system that lets a user wearing headphones look at a person speaking for three to five seconds to “enroll” them. The system, called “Target Speech Hearing,” then cancels all other sounds in the environment and plays just the enrolled speaker’s voice in real time even as the listener moves around in noisy places and no longer faces the speaker.

The team presented its findings May 14 in Honolulu at the ACM CHI Conference on Human Factors in Computing Systems. The code for the proof-of-concept device is available for others to build on. The system is not commercially available.

“We tend to think of AI now as web-based chatbots that answer questions,” said senior author , a UW professor in the Paul G. Allen School of Computer Science & Engineering. “But in this project, we develop AI to modify the auditory perception of anyone wearing headphones, given their preferences. With our devices you can now hear a single speaker clearly even if you are in a noisy environment with lots of other people talking.”

To use the system, a person wearing off-the-shelf headphones fitted with microphones taps a button while directing their head at someone talking. The sound waves from that speaker’s voice then should reach the microphones on both sides of the headset simultaneously; there’s a 16-degree margin of error. The headphones send that signal to an on-board embedded computer, where the team’s machine learning software learns the desired speaker’s vocal patterns. The system latches onto that speaker’s voice and continues to play it back to the listener, even as the pair moves around. The system’s ability to focus on the enrolled voice improves as the speaker keeps talking, giving the system more training data.
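The head-facing gesture works because of simple geometry: a source straight ahead is equidistant from the two earpieces, so its sound reaches both microphones at essentially the same instant. A rough sketch of that far-field timing model follows; the 18 cm microphone separation is an assumed value for illustration, not a spec from the paper.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in air at room temperature
MIC_SEPARATION = 0.18    # m, assumed spacing between the two headset mics

def arrival_time_difference(angle_deg):
    """Time difference (seconds) between the two microphones for a
    far-field source at angle_deg off the wearer's facing direction,
    using the standard model: dt = d * sin(theta) / c."""
    return MIC_SEPARATION * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND

# Facing the speaker: the difference is zero. At the 16-degree edge of
# the stated margin, it is still only on the order of 100 microseconds.
print(f"{arrival_time_difference(0) * 1e6:.0f} microseconds")
print(f"{arrival_time_difference(16) * 1e6:.0f} microseconds")
```

A near-zero time difference is therefore a usable signature for “the wearer is looking at this speaker,” which is what the enrollment tap relies on.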


The team tested its system on 21 subjects, who rated the clarity of the enrolled speaker’s voice nearly twice as high as the unfiltered audio on average.

This work builds on the team’s previous “semantic hearing” research, which allowed users to select specific sound classes — such as birds or voices — that they wanted to hear and canceled other sounds in the environment.

Currently the TSH system can enroll only one speaker at a time, and it’s only able to enroll a speaker when there is not another loud voice coming from the same direction as the target speaker’s voice. If a user isn’t happy with the sound quality, they can run another enrollment on the speaker to improve the clarity.

The team is working to expand the system to earbuds and hearing aids in the future.

Additional co-authors on the paper were , and , UW doctoral students in the Allen School, and , director of research at AssemblyAI. This research was funded by a Moore Inventor Fellow award, a and a .

For more information, contact tsh@cs.washington.edu.

UW team’s shape-changing smart speaker lets users mute different areas of a room (Sept. 21, 2023)
Four people have separate conversations in a meeting room.
A team led by researchers at the University of Washington has developed a shape-changing smart speaker, which uses self-deploying microphones to divide rooms into speech zones and track the positions of individual speakers. Here UW doctoral students Tuochao Chen (foreground), Mengyi Shan, Malek Itani, and Bandhav Veluri — all in the Paul G. Allen School of Computer Science & Engineering — demonstrate the system in a meeting room. Photo: April Hong/University of Washington

In virtual meetings, it’s easy to keep people from talking over each other. Someone just hits mute. But for the most part, this ability doesn’t translate easily to recording in-person gatherings. In a bustling cafe, there are no buttons to silence the table beside you.

The ability to locate and control sound — isolating one person talking from a specific location in a crowded room, for instance — has long challenged researchers, especially without visual cues from cameras.

A team led by researchers at the University of Washington has developed a shape-changing smart speaker, which uses self-deploying microphones to divide rooms into speech zones and track the positions of individual speakers. With the help of the team’s deep-learning algorithms, the system lets users mute certain areas or separate simultaneous conversations, even if two adjacent people have similar voices. Like a fleet of Roombas, each about an inch in diameter, the microphones automatically deploy from, and then return to, a charging station. This allows the system to be moved between environments and set up automatically. In a conference room meeting, for instance, such a system might be deployed instead of a central microphone, allowing better control of in-room audio.

The team published its findings Sept. 21 in Nature Communications.

“If I close my eyes and there are 10 people talking in a room, I have no idea who’s saying what and where they are in the room exactly. That’s extremely hard for the human brain to process. Until now, it’s also been difficult for technology,” said co-lead author , a UW doctoral student in the Paul G. Allen School of Computer Science & Engineering. “For the first time, using what we’re calling a robotic ‘acoustic swarm,’ we’re able to track the positions of multiple people talking in a room and separate their speech.”

Previous research on robot swarms has required using overhead or on-device cameras, projectors or special surfaces. The UW team’s system is the first to accurately distribute a robot swarm using only sound.

The team’s prototype consists of seven small robots that spread themselves across tables of various sizes. As they move from their charger, each robot emits a high-frequency sound, like a bat navigating, using this frequency and other sensors to avoid obstacles and move around without falling off the table. The automatic deployment allows the robots to place themselves for maximum accuracy, permitting greater sound control than if a person set them. The robots disperse as far from each other as possible, since greater distances make differentiating and locating people speaking easier. Today’s consumer smart speakers have multiple microphones, but clustered on the same device, they’re too close to allow for this system’s mute and active zones.

A small robot sits on a table beside a coffee cup.
The tiny individual microphones are able to navigate around clutter and place themselves using only sound. Photo: April Hong/University of Washington

“If I have one microphone a foot away from me, and another microphone two feet away, my voice will arrive at the microphone that’s a foot away first. If someone else is closer to the microphone that’s two feet away, their voice will arrive there first,” said co-lead author , a UW doctoral student in the Allen School. “We developed neural networks that use these time-delayed signals to separate what each person is saying and track their positions in a space. So you can have four people having two conversations and isolate any of the four voices and locate each of the voices in a room.”
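Itani’s example can be put in numbers. At roughly 343 m/s in air, the extra foot of travel adds just under a millisecond of delay, which is exactly the kind of sub-millisecond cue the team’s neural networks exploit. A quick back-of-the-envelope calculation:

```python
SPEED_OF_SOUND = 343.0  # m/s in air
FOOT = 0.3048           # meters per foot

def delay_between_mics(dist1_ft, dist2_ft):
    """Difference in arrival time (milliseconds) at two microphones that
    sit dist1_ft and dist2_ft from the same talker. Positive means the
    sound reaches the first (closer) microphone earlier."""
    return (dist2_ft - dist1_ft) * FOOT / SPEED_OF_SOUND * 1000.0

# Itani's case: a voice 1 foot from one mic and 2 feet from the other.
print(f"{delay_between_mics(1, 2):.2f} ms")  # ~0.89 ms
```

This is only the timing arithmetic; separating overlapping voices from such delays is the hard part the team’s deep-learning models handle.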

The team tested the robots in offices, living rooms and kitchens with groups of three to five people speaking. Across all these environments, the system could discern different voices within 1.6 feet (50 centimeters) of each other 90% of the time, without prior information about the number of speakers. The system was able to process three seconds of audio in 1.82 seconds on average — fast enough for live streaming, though a bit too long for real-time communications such as video calls.

As the technology progresses, researchers say, acoustic swarms might be deployed in smart homes to better differentiate people talking with smart speakers. That could potentially allow only people sitting on a couch, in an “active zone,” to vocally control a TV, for example.

The seven robotic microphones sit in their charging station
To charge, the microphones automatically return to their charging station. Photo: April Hong/91̽

Researchers plan to eventually make microphone robots that can move around rooms, instead of being limited to tables. The team is also investigating whether the speakers can emit sounds that allow for real-world mute and active zones, so people in different parts of a room can hear different audio. The current study is another step toward science fiction technologies, such as the “cone of silence” in “Get Smart” and “Dune,” the authors write.


Of course, any technology that evokes comparison to fictional spy tools will raise questions of privacy. Researchers acknowledge the potential for misuse, so they have included guards against this: The microphones navigate with sound, not an onboard camera like other similar systems. The robots are easily visible and their lights blink when they’re active. Instead of processing the audio in the cloud, as most smart speakers do, the acoustic swarms process all the audio locally, as a privacy constraint. And even though some people’s first thoughts may be about surveillance, the system can be used for the opposite, the team says.

“It has the potential to actually benefit privacy, beyond what current smart speakers allow,” Itani said. “I can say, ‘Don’t record anything around my desk,’ and our system will create a bubble 3 feet around me. Nothing in this bubble would be recorded. Or if two groups are speaking beside each other and one group is having a private conversation, while the other group is recording, one conversation can be in a mute zone, and it will remain private.”

, formerly a principal research manager at Microsoft, is a co-author on this paper, and , a professor in the Allen School, is a senior author. The research was funded by a Moore Inventor Fellow award.

For more information, contact acousticswarm@cs.washington.edu.

With a new app, smart devices can have GPS underwater (July 24, 2023)
A diver uses underwater GPS on a smartwatch.
A team at the University of Washington has developed the first underwater 3D-positioning app for smart devices, such as the smartwatch pictured here. Photo: University of Washington

Even for scuba and snorkeling enthusiasts, the plunge into open water can be dislocating. Divers frequently swim with limited visibility, which can become a safety hazard for teams trying to find each other in an emergency. Yet even though many dive with smartwatches designed to go to depths of over 100 feet, accurately locating mobile devices underwater has confounded researchers.

Now, a team at the University of Washington has developed the first underwater 3D-positioning app for smart devices. When at least three divers are within about 98 feet (30 meters) of each other, their devices’ existing speakers and microphones contact each other, and the app tracks each user’s location relative to the leader. This range can extend with more divers, if each is within 98 feet of another diver. The team will present its research in September at a conference in New York City.

“Mobile devices today can work nearly anywhere on Earth. You can be in a forest or on a plane and still get internet connectivity,” said lead author , a UW doctoral student in the Paul G. Allen School of Computer Science & Engineering. “But the one place where we still hadn’t made mobile devices work was underwater. It’s kind of the final frontier.”

Above water, GPS relies on a vast satellite network to locate mobile devices with radio signals. Underwater, these signals quickly fade. Sound, though, travels faster and farther in water than it does in air. Previous underwater positioning systems have relied on strategically placed buoys, but these systems are expensive and cumbersome to deploy, leading many divers to do without.

A smartwatch running the underwater GPS app.
The underwater GPS app runs on a smartwatch. Photo: University of Washington

The UW team found that such buoys aren’t necessary. With the app, if the dive leader has at least one other diver visible, the group’s devices can send acoustic signals to each other through their microphones and speakers and use the timestamps to estimate each diver’s distance. Based on these distances, the app can estimate the group’s formation and each diver’s location. If a device also tracks depth, as sport monitors like the Apple Watch Ultra or the Garmin Descent do, the system can locate divers in 3D.
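The timestamp exchange described above is a form of two-way acoustic ranging: one device chirps, the other replies after a known delay, and the first measures when the reply arrives, so the separation falls out of the travel time. The sketch below is a generic illustration, not the app’s actual protocol, and assumes a typical speed of sound in water of about 1,480 m/s.

```python
SPEED_OF_SOUND_WATER = 1480.0  # m/s, a typical value for fresh water

def acoustic_range_m(t_sent, t_received, reply_delay):
    """Estimate distance (m) between two devices from a two-way acoustic
    exchange: device A chirps at t_sent, device B waits a known
    reply_delay and chirps back, A hears the reply at t_received.
    All times in seconds."""
    one_way_time = (t_received - t_sent - reply_delay) / 2.0
    return SPEED_OF_SOUND_WATER * one_way_time

# Two devices 10 m apart: sound takes ~6.76 ms each way, and the
# responder waits 50 ms before replying.
t_round_trip = 2 * (10.0 / SPEED_OF_SOUND_WATER) + 0.050
print(f"{acoustic_range_m(0.0, t_round_trip, 0.050):.1f} m")  # 10.0 m
```

Because both devices only need a speaker, a microphone and a clock, this kind of ranging works on unmodified consumer hardware.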

The app needs at least three devices in its network to function, and its accuracy improves as more devices are added. When tested with four to five devices in local lakes and a pool, the app estimated locations with an average error of about 5 feet (1.6 meters) — close enough for divers to see each other in most environments. To get actual GPS coordinates, instead of tracking locations relative to the dive leader, the leader needs to be wirelessly connected to a surface device on a boat with GPS capabilities.
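Turning those pairwise distances into positions is classically done by trilateration, which also explains why at least three devices are needed. The press release doesn’t spell out the app’s estimator, but a minimal 2D version, which linearizes the three circle equations into a small linear system, can be sketched as follows (the anchor layout and solving approach are illustrative assumptions):

```python
def trilaterate_2d(anchors, dists):
    """Solve for (x, y) given three known anchor positions and measured
    distances, by subtracting pairs of circle equations to get a linear
    2x2 system."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = dists
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21  # zero only if anchors are collinear
    x = (b1 * a22 - b2 * a12) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

# Three divers at known relative spots locate a fourth from distances.
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
target = (5.0, 3.0)
dists = [((target[0] - ax) ** 2 + (target[1] - ay) ** 2) ** 0.5
         for ax, ay in anchors]
x, y = trilaterate_2d(anchors, dists)
print(round(x, 2), round(y, 2))  # prints 5.0 3.0
```

With noisy real-world distances, a least-squares fit over more than three devices improves the estimate, consistent with the app’s accuracy improving as devices are added.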

For more information and to see the app’s open-source code, visit the .

The study builds on AquaApp, a previous UW system that allows divers to send messages to each other underwater.

“This and AquaApp can be used together,” said author , a UW doctoral student in the Allen School. “For example, if the dive leader finds someone going the wrong way, the leader can send an alert: ‘Hey, you’re going out of range. You need to come back.’ Or if a diver is running out of gas, an SOS can let the team find the person quickly even in murky water.”

, a professor in the Allen School, is a senior author on this paper. This research was funded by grants from the Gordon and Betty Moore Foundation and National Science Foundation.

For more information, contact underwaterGPS@cs.washington.edu.
