Listen on Spotify, Apple, Amazon, and Podcast Addict | Watch on YouTube.
Today, Madrona Managing Director Karan Mahandru talks with Scott Stephenson, Co-Founder and CEO of Deepgram, a foundational AI company building a voice AI platform that provides APIs for speech-to-text and text-to-speech. From medical transcription to autonomous agents, Deepgram is the go-to for developers of voice AI experiences, and they’re already working with over 500 companies, including NASA, Spotify, and Twilio.
Today, Scott and Karan dive into the realities of building a foundational AI company, meaning one that builds its models and modalities from scratch. They discuss the challenges of moving from prototype to production, how startups need to outfox the hyperscalers while also partnering with them, and, of course, how Scott went from being a particle physicist working on detecting dark matter to building large language models for speech recognition. This is a must-listen for anyone building in AI.
This transcript was automatically generated and edited for clarity.
Karan: To kick us off today, Scott, why don’t you share a little bit about your incredible background as a particle physicist turned voice recognition start-up CEO? Not your traditional journey into AI.
Scott: I built deep underground dark matter detectors. We were working about two miles underground. Imagine a James Bond lair: yellow railings, cranes, workers milling about in the background with hard hats on, building stuff. That’s exactly what was happening. We spent a few years building the quietest place in the universe from a radioactivity perspective. The purpose of the experiment was to detect dark matter with a terrestrial detector, and to do that you had to build supersensitive detectors that analyzed hundreds of analog waveforms in real time, trying to pick the signal out of the noise. So we had this experience building with FPGAs, training models on GPUs, doing signal processing, and using neural networks to understand what’s inside waveforms. While we were down there, we also noticed that what we were doing was pretty insane. Who gets to do this? Who gets to work two miles underground on all this stuff?
We thought, man, there should be some documentary crew or somebody down here. There wasn’t. But we were like, “Well, we could be our own crew. Let’s build a little device to record audio all day, every day, to make a backup copy of what we’re doing.” Then we started to realize, wait a minute, you could put these two things together: the types of models we were building to analyze waveforms could be used for audio as well. You’d be able to search through the hundreds or thousands of hours we recorded, find the interesting moments, and throw away all the dull ones. We looked for an API or a service that would provide that, back in 2015, and it didn’t exist. Once we had looked around long enough, we said, “Hey, we should be the ones to build this.” And so nine years ago, we started Deepgram.
Karan: That’s an amazing story. Every time I hear it, it is fascinating, just given what you were doing. It is also a reminder that some of the best founders and companies are born out of frustration and extreme pain that the founders have faced themselves, so that’s awesome. At that point, you started Deepgram. Maybe talk about some of the going-in theses you had. It feels like you were intentional about building a developer-first, developer-centric approach to solving this problem. Walk us and the listeners through your thinking when you initially started Deepgram.
Scott: This definitely files under the “build what you know” or “be your own customer” way of thinking. We were developers, and we were looking for an API we could use to analyze our audio. We realized there wasn’t a good way to do that, so we decided to build it for ourselves. As long as we were scratching our own itch, we’d know at least one customer who would be interested, and we suspected there were a lot more out there. That turned into an interesting journey, because when we first started, we thought, “Hey, we’ll speak solely to the individual developer, and that will let us build a product with a lot of users and a big company.” That was nine years ago, and we quickly realized that, hang on a second, there’s a whole lot of education that has to happen around AI first for folks to build up enough demand to support a venture-backed company in that area.
There were plenty of other buyers who were already there and had tons of pain. This was in the call center space: recorded phone calls all across the world. Anytime you hear that a call may be monitored or recorded for later analysis, that’s the market, and it was already big. We focused on B2B as a company, but we always kept the developer mindset, and, just reading the winds and the tides, we believed that in the coming years these things were going to combine.
The developer is going to get more and more power in the organization and more say over which product gets built. If you build with B2B in mind, but also with that individual developer in mind, and you meld the two together, that’s what creates a great product, along with building the foundational models that supply it. We had some initial theses around that; I still believe them today, and they turned out to be true. Another was that end-to-end deep learning would be the thing that solved the underlying model problem. So you take the go-to-market and product packaging, and the foundational deep learning models solving the underlying deep-tech problem, meld all of those together, put the blinders on, and chase only that. That’s what we focused on.
Karan: Having worked with you for a couple of years, I know there’s a lot of foundational tech that you and the team have built, and it was probably not always up and to the right from the first day you started building Deepgram. You already mentioned one of the challenges you faced: how do you convince customers to buy their foundational tech from a venture-backed, very early-stage startup? Can you talk a little bit about some of the other challenges you faced as you were closing your first 20, 30, 50 customers?
Scott: That decision to press pause on the PLG side, going directly to individuals, and to focus only on a sales-gated “contact us” form was a really big one for us. We learned that by basically getting punched in the face over and over. So that’s one, but I already talked about that. There’s another: the first product we offered as a company was not speech-to-text, which is what a lot of people know Deepgram for today, with our real-time and batch-mode speech-to-text in over 30 languages. Our first product was a search product, much like what we were doing in physics, where you’re trying to find the individual events that matter to you.
What you would do is find individual words, phrases, or ideas occurring in the audio and surface those. We learned very early on that there was too big a gap for buyers. The way I would describe that product today is essentially a vector embedding model plus fuzzy search; everybody knows these now, and there are great companies built on exactly that. But we had to decide, “Well, are we going to be the fuzzy-search vector-embeddings database company in 2016?” There was going to be no demand for it for a long time. What did people have demand for? Speech-to-text. Early engineers and researchers at Deepgram were hesitant: “Hey, isn’t speech-to-text boring?” Because they knew about the fancy stuff that was coming. They knew about the embeddings. They knew about the speech-to-speech models. They knew about all this other stuff.
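For readers who want to see the shape of that first product, here is a minimal sketch of the embedding-plus-fuzzy-search pattern Scott describes: embed each transcript segment, embed the query, and rank segments by cosine similarity. The toy embed() below is a hash-based bag-of-words stand-in, not a real model; a production system would use a learned text or acoustic embedding here.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Toy stand-in embedding: hash words into a fixed-size bag-of-words vector.
    # A real system would call a learned embedding model instead.
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            vecs[i, hash(word) % 256] += 1.0
    # L2-normalize so a dot product equals cosine similarity
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9)

# Timestamped transcript segments, e.g. from a speech-to-text pass
segments = [
    (0.0, "calibrating the detector before the run"),
    (42.5, "we saw an unexpected spike in the waveform"),
    (97.1, "lunch break, nothing interesting happening"),
]
index = embed([text for _, text in segments])

def search(query: str, top_k: int = 2):
    scores = index @ embed([query])[0]       # cosine similarity per segment
    best = np.argsort(scores)[::-1][:top_k]  # highest-scoring segments first
    return [(segments[i][0], segments[i][1], round(float(scores[i]), 3)) for i in best]

# Segments sharing vocabulary with the query rank first,
# so the 42.5s "interesting moment" surfaces at the top.
print(search("spike in the waveform"))
```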
And it’s like, yeah, but we had to earn our license to learn in this market. We needed to establish ourselves in one product domain and be just undeniable in that domain. Then we could expand into other domains; being undeniable in one gives you the right to play. I like to revisit that every year or two at Deepgram: “Hey, have we earned the right to play again? Have we positioned ourselves well?” That early decision to set aside the search side (which, by the way, is still in our product today), switch to speech-to-text, and switch from straight-to-developer PLG to B2B as the first move was a big one. We do the PLG motion now, and the search capability is even better today. We get to do all of these things now, but we had to really put the blinders on early, and I think that’s an important lesson for any kind of company.
Pick one thing. Be really, really good at that thing, and then see how you can expand it over time. And you might think, “Hey, I might expand it over quarters or something like that.” But it’s generally actually years.
Karan: I love that. We often get into conversations about the trade-off between focus and an expansive vision, but I think it’s great that you built a position of strength around speech-to-text and then expanded from there. I remember from our conversations that the conventional thinking back then was, like you said, that the hyperscalers were coming in. Google would launch a speech-to-text product, and there was this fear of commoditization of the most important source of data inside the enterprise, which is voice. And here’s this startup called Deepgram: venture-funded, with a few customers, with the best product in the market. I’m sure you heard a lot of people talking about the commoditization of speech-to-text. Help us understand how you went around and through that. What was your approach to working with hyperscalers like AWS and Google? Then I want to get to another interesting moment that came after that in the company’s history, but let’s talk about this first.
Scott: This is always an interesting line to walk. You do want to create a general model that is good in many circumstances, and others out there will be creating that as well. There’s other technology out there, and by some measure this is moving toward commoditization, in other words, interchangeable services: “Hey, we build an API service that supports batch mode and real time.” Well, there are others out there that support batch mode and real time, and they support English, and Spanish, and all sorts of things. But there are differences between them. There are accuracy differences, latency differences (the time it takes for the API to respond), and throughput differences. There are also differences in where you run your computations. You can do it fully in the cloud hosted by us, in the cloud hosted by you in your own VPC, or air-gapped within your own four walls.
These different areas of competition and differentiation start to show up. There is a little bit of commoditization at the level where folks get together, build their first demo, and scale up a little, but then they start to feel the pain in the other areas. They start to feel it in latency, in accuracy, in the model’s adaptability, or in where it runs.
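As a concrete illustration of the batch-mode API surface Scott is describing, here is a minimal transcription request in Python. The endpoint, parameters, and response shape follow Deepgram’s public REST API as best as it can be represented here; treat the details as illustrative and confirm against the current docs. Real-time mode uses a WebSocket variant of the same endpoint.

```python
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder credential

# Batch-mode transcription: POST raw audio to the hosted endpoint
with open("call_recording.wav", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-2", "punctuate": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )

response.raise_for_status()
result = response.json()
# The transcript nests under results -> channels -> alternatives
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```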
From our perspective, we were thinking, hey, from first principles these systems can be far cheaper than they have been. A few years ago, the only way to get speech-to-text was to pay $2 an hour. I think that’s 10 times too much. If you drop the price 10x, you can get 100x or 1,000x more usage. That was one angle for us. You always have to walk the line: you have a commodity offering, but then you have this differentiation that makes it plainly not a commodity for a B2B customer that needs the best accuracy, or the best latency, or the best COGS, or whatever it is. Then you have these large areas of differentiation.
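Scott’s pricing argument reduces to back-of-envelope arithmetic, sketched below. Only the $2-per-hour figure and the 10x/100x/1,000x ratios come from the conversation; the baseline usage number is a made-up illustration.

```python
# Price-elasticity arithmetic: a 10x price cut grows total spend
# if it unlocks 100x to 1,000x more usage.
old_price, new_price = 2.00, 0.20  # $/hour of audio (the 10x drop)
base_hours = 10_000                # hypothetical baseline usage at $2/hour

print(f"old price:   {base_hours:>10,} hrs -> ${base_hours * old_price:>12,.2f}")
for multiplier in (100, 1_000):
    hours = base_hours * multiplier
    print(f"{multiplier:>4}x usage: {hours:>10,} hrs -> ${hours * new_price:>12,.2f}")
```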
Karan: One of the things I remember from our early conversations is that the conventional thinking at the time treated the four things you mentioned, speed, cost, latency, and accuracy, as in some ways mutually exclusive. If you had speed, you didn’t have cost. If you had cost, you didn’t have accuracy. One of the things Deepgram really pioneered was showing that, no, it is possible to have very accurate, low-word-error-rate products that are low latency and power real-time applications, and to do it at a price where you can still afford yourself 85% gross margins. I know it took a lot of engineering work in the back end from you and the team, but that was really interesting.
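Word error rate, the accuracy metric Karan refers to, is conventionally the word-level edit distance between a reference transcript and the model’s output, divided by the number of words in the reference. Here is a minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + sub,  # substitution or match
            )
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lights off", "turn the light off"))  # 0.25: one substitution in four words
```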
Let’s fast-forward. We have OpenAI, which is in some ways a hyperscaler and in some ways a very nimble startup, and it might be the best of both. They launched Whisper as their speech-to-text product. Walk us through that moment in time. What did it do to Deepgram? How did you react when it came out? What did you hear in the market? What did you then remind yourself and the team to do?
Scott: I remember when that came out, and we did our first testing. We were like, “Ooh, this model’s pretty good.” Their mentality on how to structure the model also stood out: “Ooh, this is an end-to-end deep learning model for real.” Up until that point, every open-source speech-to-text model was not; they glued together several different pieces, so they missed mostly on accuracy, but also on speed and latency. And if you don’t have accuracy, the other stuff doesn’t matter. We were like, “Hey, this model is actually pretty good out of the box, and it supports several languages as well.”
One thing, though: take a step back and look at us as a company. Who built the first end-to-end speech-to-text models in the world? It was Deepgram. And we did it seven years before Whisper was released. We were surprised it took that long for somebody else to put all of this together and put a model out there. I’m also glad it was OpenAI, because they have a very big marketing bullhorn; when they say something, the world listens. Now everybody is aware that end-to-end deep learning works for speech, along with some things previously thought impossible, like supporting multiple languages in one model, which people just hadn’t thought about. I’m glad they did that education and got accuracy to a reasonable level. It pushed everybody who wanted to implement AI and voice AI in their products through the learning curve faster.
From the outside, you might think, oh no, OpenAI just open-sourced this model. What’s going to happen? But like we talked about before, the other differentiators matter: latency, high throughput, low COGS, running hybrid in the cloud or on your own infrastructure. Our own models are also more accurate than Whisper, and there’s a reason for that: Whisper, models like it, and many open-source models are trained on public data. When our customers work with us, they can adapt models to their own domain. You can expose your models to different acoustic environments and audio landscapes, and the model gets good at those types of things. If you feed that type of audio to an open-source model that’s only trained on YouTube videos, it typically doesn’t do as well. From the outside, it might look like “oh no,” but for us it was “oh yes, great.” Everybody’s going to get educated around this.
When they try out the open-source stuff and try to run it themselves, they’ll say, “Wow, this is really complicated. This is expensive. It’s hard to make the model do what I want it to do.” But now they’ll be educated, and they’ll ask, “Hey, are there other products out there in the world?” And yes, there are, like Deepgram, where you can build a B2B voice AI product and have it just work. We definitely had to take a beat as a company and ask, “What’s our positioning going to be around this? The world is going to wonder what we think about it.” I wrote a blog post at the time, and I probably did a podcast about it too, saying, “We’re glad about this, because it’s moving everything forward, and now everybody will be educated about the power of end-to-end deep learning for speech.”
Karan: I remember those conversations at the board level as well. There was a little bit of a pause, a breather, and then we realized that OpenAI had just done Deepgram and this entire speech AI market a huge favor by educating the market with, like you said, the megaphone they have.
In venture, we always say the trick is to be non-consensus and right, and there are a lot of people operating on consensus thinking. What does the world believe about AI, or more specifically speech AI, today that you think is wrong? Or, conversely, what does the world not yet believe that you think is true? It would be helpful to hear a little of your vision for where this goes, and what most of us are probably getting wrong in our assumptions about AI or speech AI.
Scott: I’m smiling because the answer six months ago would have been different from a year ago, and from two years ago, and so on, because the pace of learning has been so rapid; everybody is paying their tuition on what AI is capable of and how fast it can do different things. There used to be tech companies, and now there are intelligence companies, and intelligence companies move three times faster.
I have to keep updating my own model of where the world is and what the world understands compared to what we understand. I also look for overreactions. For instance, a year ago it was really plain to see that smaller, more efficient models were going to become super important, because the cost of inference was going to matter so much. A big reason for that is that AI is actually effective, and when you scale it up, no company wants to pay a hundred-million-dollar AI bill.
Over the last year, we’ve seen that come true. COGS have become more important, and costs have dropped for LLMs and the like. There’s a more recent example, too: a couple of months ago, OpenAI did their demo of GPT-4o with voice mode. It’s a speech-to-speech model, and I think the industry probably absorbed that a little too much. They think, “Okay, everything has to be a multimodal model.” I’ll caution folks against that.
Multimodal models are great, especially in a consumer, jack-of-all-trades use case. You shape them into a single personality and let them handle normal tasks, and that works fairly well. But in a B2B use case, where you’re trying to build a voice AI agent that handles insurance claims or food ordering, et cetera, the agent has to interact with systems. Think of these voice AI agents like humans: they’re going to have to interact with the CRM, with a knowledge base, with all of these things.
A speech-to-speech model that is trained to sound likable and respond to things is not going to be able to do that. You’re going to need separate parts with different beefed-up components, and every B2B company is going to have to choose where it wants to spend its COGS. Do you need great speech-to-text? Do you need great LLMs? Do you need great text-to-speech? Do you need them to interact with a RAG system? Do you need them to interact with whatever next-gen cognitive architecture comes out?
You’re going to need controllability. The idea that multimodal models will save us all and remove all of the complexity? I don’t think that’s necessarily true, because these models need to interact with all these other pieces, and it’s going to take a while, several years, for that to shake out. Don’t get me wrong, though: three to five years from now, you’ll start to see the different areas condense into more multimodal models. It probably won’t be one single master model, but components you put together. It won’t look so much like a Swiss Army knife; it’ll look more like putting together AWS-style Lego blocks. At least for B2B right now, you need more control. You don’t want to leave your bank account resets open-ended to a speech-to-speech system. You need way more control than that.
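The “separate parts” architecture Scott describes, as opposed to a single end-to-end speech-to-speech model, might be wired together like the sketch below. Every class and function here is a hypothetical stand-in meant to show the control points (transcription, tool routing, response synthesis), not any particular vendor’s SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class VoiceAgent:
    """Modular voice agent: swappable STT, LLM, and TTS plus explicit tools."""
    stt: Callable[[bytes], str]   # audio in -> text (speech-to-text)
    llm: Callable[[str], str]     # prompt -> text (the reasoning component)
    tts: Callable[[str], bytes]   # text -> audio out (text-to-speech)
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self.stt(audio_in)              # 1. transcribe the caller
        # 2. Explicit, controllable routing: the developer, not the model,
        #    decides when the CRM or the knowledge base gets consulted.
        if "claim" in text.lower():
            context = self.tools["crm"](text)
        else:
            context = self.tools["kb"](text)
        reply = self.llm(f"Context: {context}\nCaller said: {text}\nReply:")
        return self.tts(reply)                 # 3. synthesize the response

# Wiring with trivial stand-ins; each slot can be upgraded independently,
# which is where a B2B team chooses to spend its COGS.
agent = VoiceAgent(
    stt=lambda audio: "what is the status of my claim",
    llm=lambda prompt: "Your claim was approved yesterday.",
    tts=lambda text: text.encode("utf-8"),
    tools={"crm": lambda q: "claim #123: approved",
           "kb": lambda q: "general policy information"},
)
print(agent.handle_turn(b"<caller audio bytes>"))
```

A monolithic speech-to-speech model collapses exactly these intermediate control points, which is the trade-off Scott is pointing at.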
Karan: The pace at which the space is moving makes anything that you say almost obsolete by the time you finish saying it, and so I totally resonate with that statement that you have to keep changing your own mental model and your own model by adapting to what’s happening.
One of the things I love about this space and speech as a modality or as a way of interaction with the world is that it’s not zero-sum in many ways; it’s sort of expanding the pie of what’s possible once you get to a place where Deepgram gets to and allows people to build these amazing applications. Talk a little bit about some examples of enterprise B2B applications that Deepgram is powering today. As you look forward, without divulging too much, of course, what should we expect? What should customers expect? What should the world expect from Deepgram in the near future?
Scott: We’re in a revolution right now, and I wince when people call it another industrial revolution. I think it’s different. We had an agricultural revolution. We had an industrial revolution. Then we had an information revolution. Now we’re in an intelligence revolution, and I won’t go into the specific details of each of those, but intelligence is different.
Two years ago and earlier, you had to have a human do intelligent work. That is not necessarily true anymore. Before the industrial revolution, you had to have a human swing the hammer; after it, you didn’t. Before the information revolution, you had to write things down on paper and file them away in a filing cabinet; not anymore. Now you can transmit information at the speed of light, categorize it, search it, and do all these things.
In our current situation, things are going to change drastically, but they’re not going to change all at once, and a company like Deepgram is not going to try to tackle everything at once. It’s important to recognize that everything is going to be touched by this, just like everything was touched by electricity, by cars and transportation, by the internet, et cetera. That brings us back to putting the blinders on and saying, “Hey, we believe there are fundamental infrastructure companies that need to be built that are also foundational model builders at the same time, and that is what Deepgram is.” We have a horizontal platform that anybody could ostensibly use to do anything, but we are going to focus on certain areas that will make the world more productive and where we have an advantage, either from what we have already done or from the way we think about the world.
The advantage, from our perspective, is scale. We always think about the world from first principles and about what will scale, so we’re thinking about cost per watt and about how much training data we need. We look at the world like, “Hey, call centers, food ordering, that kind of thing: this is all massive scale, and it’s all going to be disrupted, so we’re going to focus where there’s scale.”
Other areas, like doing dialogue for a Hollywood film, are not our game. There are other companies out there that will do that well. I think it’s really helpful to think about the world this way: there are certain things that scale gets you, and if you drop the price, you can change the face of work and how it happens. We’ll do our part in the voice AI area. There will be other companies doing it from a search perspective, and others doing it in text. At some point in the future, we’ll all have to figure out how we fit together, but that point is not now. Let’s just go transform our own respective areas.
Karan: One of the interesting things about the applications I hear about from you and the team is the contrast with what we hear from a lot of other AI companies, where so much is happening at the prototype stage, in sort of toy boxes. What is really unique about Deepgram is that it is a foundational AI company powering real use cases in production for large enterprises. We don’t hear a lot of examples of AI companies powering production use cases, and I think Deepgram is an exception there.
Scott: It’s kind of amazing to me at this point. I think you could assume by default that if the speech-to-text is working well in whatever product you’re using, Deepgram is under the hood. Don’t get me wrong, there are other good technologies out there, but it’s a pretty good bet. And that’s just speech-to-text. We released our text-to-speech last year as well, and that is growing at a really big clip too, powering the real-time agent side. We also have customers like embedded-device companies that many people would be familiar with. Unfortunately, I feel like I have to be cagey about this, because many of our customers don’t like us to say that Deepgram is under the hood, though I can mention some of them. Nevertheless, there’s a lot out there being powered right now.
If there’s one thing to take away, it’s that this is not coming; it is already here. It’s already being used, and now it’s being used in new ways that will be even more user-facing. If you call a call center, if you need to order something, if you’re going to talk to something online, you’re not going to dread it. You’re not going to think, “Oh, great, I’m going to spend 45 minutes, and it’s going to be a horrible process.” You’ll start to be glad. You’ll think, “Wow, that was really peppy. It understood exactly what I needed. I solved my problem. I was in and out in three minutes.” That’s way better than sending an email and waiting three days for a reply.
Karan: Is there something, Scott, about the nature of Deepgram’s product or the use cases you power that makes it, I don’t want to call it easy because nothing you do as a startup is ever easy, but easier than it is for the many other AI companies having a hard time moving from prototype to actually powering production use cases? Why is it working so well at Deepgram when so many AI companies struggle with that?
Scott: I think, partially, time is on our side. There are only a handful of foundational AI companies that were started around 2015, and we have the benefit of being one of them. Another part is coming at the problem from first principles. I like to liken this to the difference between Tesla and Ford, or between SpaceX and Lockheed Martin. All their cars have four wheels. They’ve got a steering wheel, they’re propelled through the world, they have turn signals. The rockets are tall and pointy, and they have an engine at the other end.
What is the difference? The difference is not so much the chemical makeup of the metal they use; it’s the methodology by which they arrive at their product. You have to think of your company as a factory right now. To build a true foundational AI company, efficiency matters, so bring in some of the Amazon mindset as well, because it’s really more like three companies in one. You’re a cutting-edge research company like DeepMind, and you’re a cutting-edge infrastructure company trying to compete with AWS and Google and Azure and all of that.
You also either need to partner with, or yourself be, an amazing data-labeling company. When you get all three of those right, you can have this amazing product. That’s the secret. It isn’t much of a secret, but it’s really hard to do all of that from a lean perspective and a first-principles mindset. Internally, instead of asking, “Should we hire for that?” we first ask, “Should we do it at all?” Then, “Can we automate it?” And only then, “Maybe we need to hire for it now, but can we automate it later?”
You’re always exploring and condensing, exploring and condensing, and you rely on this backbone, the spinal cord of the company, that is amazingly well suited to accomplishing the goal. For instance, the vast majority of models trained at Deepgram now are not trained by a human. They’re trained by a machine that we built ourselves to do those tasks. The frontier models that have never been trained before are partially trained by a machine and partially by a human, working in concert. If you started building today, thinking along those lines would be a very good idea, but companies that have been doing it for two years, five years, nine years in the case of Deepgram, are hard to compete with. So you have to come up with a new first-principles approach that you think will be better in the end, and it might take five years to pay off, just because of the underlying moats companies already have.
Karan: Listening to you, and continuing to talk about where the space goes, how fast this market is evolving, and how fast speech AI is evolving, I sometimes wonder whether, if you and I were doing this podcast two years from now, it would be your agent talking to my agent, powered by Deepgram voice AI. I look forward to that.
Scott: And we’ll just approve it. In the end, we’ll be like, “Yep. Yeah, that’s what I would say. Actually, it’s better than how I would say it. Okay, good.”
Karan: Yeah, that’s great. Well, on that note, I just wanted to say a huge thank you to you on behalf of all of our listeners. There are so many things I love about Deepgram. It’s such a joy to work with you. It’s a privilege to be your partner on this journey. But the one thing I’ve always said that really separates you from many of the founders that we work with is the audacity of your ambition and where you want to take Deepgram and, by extension, this whole space that you’re operating in. I really appreciate you and your time today.
Scott: Thank you, Karan. Love working with you and Madrona.