How To Translate Sight Into Sound? It's All In The Vibrations

Aug 6, 2014
Originally published on August 6, 2014 8:47 am

Melissa Block talks to Abe Davis, a graduate student at the Massachusetts Institute of Technology. Davis helped author a paper on a visual system to detect sound, which can recover intelligible speech from the vibrations of a potato chip bag photographed through soundproof glass.

MELISSA BLOCK: What if you could extract audio from silent video? How would you do it, and what would that sound like? Well, those questions have been answered by researchers at MIT and elsewhere. It's all about tiny vibrations - so small you can't see them with your eyes. But there is movement, and that movement can be converted into sound. The scientists described the process as turning visible objects, such as a house plant or a bag of potato chips, into visual microphones. The first author on the paper is MIT graduate student Abe Davis, and he joins me now to explain how this works. Abe, welcome to the program.

ABE DAVIS: Hi, thanks for having me.

BLOCK: And let me see if I've gotten this right. The idea is that sound waves will cause an object to vibrate. You are capturing that vibration on video and then - this is the tricky part - you're turning that vibration back into sound. Am I close?

DAVIS: Yeah, that's pretty accurate.

BLOCK: OK, well let's take a listen to what this sounds like. You played a song to a house plant, and the song sounded like this.


BLOCK: And Abe, what are we hearing there?

DAVIS: Well, that's the sound that we played out of the speaker, and, you know, it created these fluctuations in air pressure. And when those fluctuations in air pressure hit the object, they move the object a very, very, very small amount. And usually we can't see that, but it turns out that it does create these very, very, very minuscule changes in video. And if you look at the video locally - if you just look at some - one part of the plant in the image that you see, then you can't really get the sound from that one part. But if you start to combine all these tiny noisy signals from all across the surface of an object, then you can start to filter out some of that noise and you can actually recover the sound that produced that motion.
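(Editor's illustration: the interview does not spell out the researchers' algorithm, so the following is only a toy sketch of the one idea Davis describes here, that combining many noisy local measurements can expose a shared signal that no single measurement shows. Everything in it is an assumption made for illustration: the pure 440 Hz tone, the Gaussian noise model, the count of 400 regions, and the helper names `make_tone`, `noisy_copies`, `average_signals` and `correlation`. It uses plain averaging, which may differ from what the paper actually does.)

```python
import math
import random


def make_tone(n_samples, freq_hz, sample_rate):
    """A pure tone standing in for the sound that shakes the object."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def noisy_copies(signal, n_regions, noise_std, rng):
    """One noisy measurement of the same motion per image region.

    Hypothetical model: every region sees the same tiny motion plus
    its own independent Gaussian noise.
    """
    return [[s + rng.gauss(0.0, noise_std) for s in signal]
            for _ in range(n_regions)]


def average_signals(signals):
    """Average sample by sample across regions; independent noise tends to cancel."""
    n = len(signals)
    return [sum(column) / n for column in zip(*signals)]


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the run is repeatable
tone = make_tone(n_samples=2000, freq_hz=440, sample_rate=8000)
regions = noisy_copies(tone, n_regions=400, noise_std=5.0, rng=rng)

# How closely does the recovered signal follow the original tone?
single = correlation(tone, regions[0])
combined = correlation(tone, average_signals(regions))
print(f"one region: {single:.2f}, all regions averaged: {combined:.2f}")
```

In this toy setup a single region is dominated by noise, while the average over all regions tracks the original tone much more closely, which is the effect Davis describes in words above.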

BLOCK: So when you filtered out the noise - you had a camera on the plant, you took the image of those vibrations, you captured it in a computer, and somehow with this algorithm you were able to generate sound. And here's what it sounded like.


BLOCK: So deep in there, Abe, we're hearing "Mary Had A Little Lamb." I'm not sure how it happened, but we are.

DAVIS: Yeah. That's kind of what the plant heard - or really, more accurately, what the plant felt. All sound creates these vibrations when it comes into contact with an object.

BLOCK: Well, Abe, let's listen to another example. You tried out human speech on a bag of potato chips. So here's what went in, in terms of the sound.


UNIDENTIFIED MAN: Mary had a little lamb whose fleece was white as snow.

BLOCK: And here is what you extracted.


UNIDENTIFIED MAN: Mary had a little lamb whose fleece was white as snow.

BLOCK: Were you surprised when that was the result, Abe, when you first heard that?

DAVIS: Yes. Well, sort of - I mean, that wasn't the first experiment that we did where we recovered human speech. I do remember that the first time that we recovered really clear speech, I had to keep double checking to make sure that I hadn't, you know, mixed up my signals or something.

BLOCK: Abe, what are you thinking about when you think about practical uses for what you've done here? What do you - where does it take you?

DAVIS: Most people - when they hear about this work, their mind sort of immediately goes to espionage and spying.

BLOCK: The idea there would be you could take video that has no sound and somehow extract what the conversation was in that video?

DAVIS: Yeah. I mean, in some situations I think that you could do that. When people hear about what we do here, it's easy to imagine that it would just kind of work in any arbitrary situation, and that's not exactly the case. I mean, what we do is limited. I think there are a lot of things that we could potentially do with it, but a lot of that is stuff that's going to hopefully be fleshed out in future work.

BLOCK: Well, Abe Davis, thanks so much for talking to us about your visual microphone.

DAVIS: Thank you.

BLOCK: Abe Davis is one of the MIT scientists who, along with researchers at Microsoft and Adobe, figured out how to recover audio from silent video. Transcript provided by NPR, Copyright NPR.