Guide to Twilio + OpenAI Realtime on Rails (Without Anycable)
Jon Sully
26 Minutes
How to add OpenAI Realtime to your Rails app so that you can do AI phone calls via Twilio!
So you want to use OpenAI’s new Realtime API to add AI-driven phone calls to your Rails app, huh? Well that’s exactly what my team wanted and I’m here to give you the rundown on how I built a fully-functioning implementation. But first, a few caveats:
Important
Caveat 1: this guide assumes you already have an application with some kind of Twilio integration. If this is true, you know that no Twilio integration is the same as another, but I’m making assumptions that you’re familiar with how an application interfaces with Twilio and uses various TwiML responses, web-hooks, and requests to make the whole machine function. If you’re not already pretty familiar with integrating Twilio into a Rails app to begin with, you will probably be lost here.
Our application already had a deep integration with Twilio for supporting phone calls between our users (via browser) and end-customers (via their phones / phone numbers). Keep that in mind. This guide is not meant to be a zero-to-done guide, it’s meant to be a toolkit for those that already have some kind of Twilio integration and just want to add in this new AI / Realtime component.
Caveat 2: the setup I’m going to show and illustrate here is for, specifically, one-to-one calls between OpenAI Realtime (the AI model) and an end-customer. What I’m not covering here is how to add a third party (like your own internal customer service rep). Three way calls in that style are going to require getting into Twilio’s Conference product (multi-person calling) and my team simply doesn’t need that so we didn’t look into it.
Caveat 3: Someone’s probably going to mention it at some point, so I’ll just say this for now — I’m using the Faye::WebSocket library for Ruby to handle the persistent websocket connection and lifecycle. This could probably also be accomplished with Ruby Async’s Websocket gem, which is newer. They would both do the same job. Faye works fine. I’m not worried about it right now. Websocket standards aren’t changing any time soon. And frankly I prefer Faye’s API style.
Caveat 4: I am not a lawyer. I’m also definitely not your lawyer. Using AI for phone conversations (or frankly, in general, as a commercial company) is a big bag of legal worms that there’s hardly any precedent or rules/rulings for yet. You do you; this is purely a technical guide and I hereby explicitly grant you no advice on the legal front.
What we are going to cover is:
- How the architecture works / calls as background “jobs” / threads
- Basic setup and install: getting your calls to connect to AI rather than your normal users/customer service reps/humans
- Carefully passing context around so AI calls can actually be useful
- Getting transcripts of the call in real time from both parties
- Carefully handling edge cases / errors
Sound good?
What’s Really Happening
Before we can dive into the integration code and getting things installed, we need to understand what’s actually happening in this architecture. Ultimately we don’t have a ton of choice here. We’re beholden to both the requirements of how OpenAI wants Realtime clients to connect and how Twilio wants its call Media Stream system to work.
On the OpenAI side we connect to Realtime in real-time via Websockets. But more specifically, we do that as a websocket client — OpenAI hosts the websocket server and we connect to it. When it comes to the OpenAI leg, we are the “client” and OpenAI is the “server”. I’m repeating this because it’s going to get confusing as we journey along. Websockets are inherently two-way connections once they’re established (either party can send frames to the other ad-hoc at any time), but in the establishment process one party must be a client and one must be a server. And we know that OpenAI is the server at face value because they provide the connection URL: wss://api.openai.com/v1/realtime.
Note
This is because the Websocket connection is preceded by a normal HTTP request (with flags saying “hey, I’m up for Websocket if you are…”) and, as per any HTTP request, there’s a requestor (client) and a requestee (server).
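For the curious, that preliminary request looks like any other HTTP GET, just with upgrade headers attached; the values below are illustrative, not captured from a real exchange:

GET /v1/realtime HTTP/1.1
Host: api.openai.com
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: (random base64 nonce)

HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Upgrade: websocket
Sec-WebSocket-Accept: (hash derived from the key)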
But let’s start building a diagram to keep things straight as we dive in:
Simple enough! Our application being a client to a third party’s Websocket server is the simplest and easiest setup.
…but Twilio is totally the opposite. 🤦♂️
Twilio’s system for programmatically connecting to call audio in a real-time, two-way fashion is called Media Streams. The thing is, the way Twilio built Media Streams requires that we issue a command to Twilio saying “connect this call to XYZ media stream”, where XYZ is a Websocket-ready URL. Which means that Twilio is actually a Websocket client and we have to be the Websocket server. We have to provide the URL! That makes our diagram look like:
So ultimately we’ve got a setup where we’re the server for the Twilio Websocket relationship but we’re the client for the OpenAI Websocket relationship. Confusing when trying to read two different sets of docs all using the words “client” and “server” over and over… but okay.
Architectural Impact
This setup actually has quite a large impact on how we need to structure our app and code. Ultimately, since Twilio requires that we give them a Websocket target URL, that means we need to expose an endpoint from our Rails web process.
Let me take a slight digression here and talk about background jobs. I generally look to background jobs to do any work which is asynchronous and/or not required to answer an incoming web request. If someone is POST’ing a request to /reset-password
with accurate credentials, I’m not going to send the email synchronously while I’m handling that response. I’m going to fire a background job to send the email and keep the web response as fast as possible (essentially just painting the markup for “Email is on the way!”).
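As a toy illustration of that pattern (the controller action and job name here are hypothetical, not from our app):

# Respond immediately; push the slow work (sending the email) to a background job
def create
  user = User.find_by(email: params[:email])
  PasswordResetEmailJob.perform_later(user.id) if user # runs later, on a worker process
  render :email_on_the_way
end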
So in an ideal world, since a phone call between an AI agent and one of my end-users has nothing to do with answering web requests, I’d prefer to have all that work done in a background job. As an added bonus, scaling background job processes/containers and segmenting various jobs across various dedicated resources is trivially easy in most background job systems. So we could, in theory, setup an ai_calls
queue on a dedicated process and apply custom auto-scaling so we can handle fluctuating call volumes. That would be neat.
But unfortunately that’s simply not possible: background job services typically aren’t public-network addressable and, since Twilio requires us to host a URL for them to connect to, we’re at an impasse. While a background job can totally connect to another server as a client (e.g. it would work with the OpenAI leg), we simply can’t give Twilio a URL that points to a web server on a background job service which is ready to set up a websocket connection.
Note
Background job systems typically aren’t web-request-handling anyway, so even if you are running a background job service on a more custom hosting solution like AWS and expose that service to the public internet, there wouldn’t be any kind of ingress in the background job system (e.g. Sidekiq) to connect an incoming web request to a particular background job. It’s a no-go!
That leaves us with only one option: our phone call system needs to run in our web service directly 😬. I’ll circle back to why that’s “😬”-worthy later.
But the end result here is that, for any given phone call that we’re going to connect to our AI agent, we’re going to use a background thread on our web server. This is separate from Puma’s worker threads — it will be its own, disparate thread. But a thread nonetheless!
So, at a high level, here’s what the architecture looks like.
- We initiate a call with the Twilio API (as usual)
- Instead of dialing our user/CS-rep, we tell Twilio to use our Media Stream endpoint
- Twilio, in a separate request, hits our Media Stream endpoint with a WSS upgrade to set up a bidirectional stream (audio going both ways)
- We spawn a new background thread, totally separate from Puma’s web threads, to manage our now-background Websocket coordination work
- The Puma thread that handled the initial HTTP request from Twilio responds with a 101 HTTP status and gets released back into Puma’s thread pool to be used for subsequent HTTP requests
- Once the Twilio Websocket connection is fired up and running, we call out to OpenAI’s WSS endpoint to set up a two-way stream to the Realtime model (our second Websocket leg), still in the same background thread
- We coordinate as a middle-man between the two two-way Websocket connections, pushing audio from one to the other as necessary and doing application tasks along the way (update this record, add notes to that record, etc.), all from our background thread
This all happens within tens of milliseconds. Modern networking 🔥
So now that we have a sense of architecture, let’s walk through the code!
Let’s See Some Ruby
As noted in the intro, I’m assuming you’ve already got a Rails application doing some kind of calling/coordination with Twilio. So I’m only going to walk through how to change / upgrade that setup to connect with OpenAI Realtime.
Our team’s setup previously involved an incoming caller being pushed to a queue (“Please hold…”) while we determined which user could handle the incoming call. Once a user opted to connect with the caller, we ran something like this:
Twilio::REST::Client.new.calls(@twilio_call_sid).update(
  twiml: Twilio::TwiML::VoiceResponse.new do |command|
    command.dial do |dial|
      dial.client do |client|
        client.identity(@user.browser_twilio_id)
      end
    end
  end
)
Note
This and subsequent examples will likely use the style of updating a call via the proactive API commands (e.g. Twilio::REST::Client....update()
) rather than providing TwiML responses to Twilio web-hooks, but this is neither here nor there — the same generated TwiML will work in response to web-hooks too. If you use a web-hook based flow, the same principles shown here will apply.
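For reference, the web-hook flavored version of the example above would look roughly like this (a hedged sketch; the action name is made up, and @user is assumed to be resolved the same way as in the snippet above):

# Twilio POSTs its voice webhook here; we answer with the same generated TwiML
# we would otherwise push proactively via the REST API.
def connect_caller
  twiml = Twilio::TwiML::VoiceResponse.new do |command|
    command.dial do |dial|
      dial.client do |client|
        client.identity(@user.browser_twilio_id)
      end
    end
  end

  render xml: twiml.to_s
end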
So let’s use that as our starting point. Instead of connecting the caller to one of our users, let’s instead connect them to our AI agent.
Step 1 is directing Twilio to setup a media stream for this call, pointing it at a media stream endpoint in our app.
Twilio::REST::Client.new.calls(@twilio_call_sid).update(
  twiml: Twilio::TwiML::VoiceResponse.new do |command|
    command.say(message: "Connecting to AI...")
    command.connect do |connection|
      connection.stream(url: call_wss_url(protocol: :wss))
    end
  end
)
…which we don’t have yet! call_wss_url
is not yet a thing. Let’s make it!
# routes.rb

# ...
get "media-stream", to: "your_controller#media_stream", as: :call_wss
# ...
(There are many ways to route and/or name this)
All of the magic begins in that new action, media_stream
, on our Twilio controller. Well, sort of. Here’s the controller action:
def media_stream
  if Faye::WebSocket.websocket?(request.env)
    AiCallJob.new.perform(request.env)
  end

  head :switching_protocols
end
So, a few things to address here. First, the action actually does very little. As long as the incoming request has the headers that indicate it’s a handshake for a Websocket connection, it simply fires off a background job then sends back a 101 status code. Second, and extremely uncommonly, yes, we’re passing the full request.env
down into the job. Third, what’s a “job”?
The truth is that I’m using the term “Job” fairly loosely here. I mentioned above how we can’t actually do this realtime-ai-calling thing outside of the web process itself, and that’s true. But I also said that a “job” is a wrapper for doing “any work which is asynchronous and/or not required to answer an incoming web request”. Which still fits this functionality. So this is a “job” (lower case ‘j’) as in, some work to be done. But it is not a Job (upper case ‘j’) in that it’s an actual Sidekiq construct that will be run in another process.
In short, I’m using a PORO to wrap all this Websocket two-way stuff into its own isolated class and calling it a “job” because it helps clarify the mental model a bit, even though it’s not a Sidekiq Job. Yes, for paradigm’s sake, I implemented the perform
method name. But again, just a PORO. And still going to run in a background thread!
Note
The ‘job’ code in the next section is heavily inspired by Daniel Friis’ exploratory proof of concept code. I thank him tremendously for his first stab into this foray!
The Job
Ultimately the complete job code can be found and read at the bottom of this article if you want to skip the steps and reasoning, but for the sake of this blog post, we’re going to walk through each chunk individually.
Let’s start with the shell and the basics.
class AiCallJob
  def perform(env)
    @twilio_websocket = Faye::WebSocket.new(env)
    @openai_websocket = Faye::WebSocket::Client.new("wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview", nil, headers: {
      "Authorization" => "Bearer #{Rails.application.credentials.dig(:open_ai, :api_key)}",
      "OpenAI-Beta" => "realtime=v1"
    })

    @stream_sid = nil

    # A whole LOT of stuff, we'll keep adding below

    @twilio_websocket.rack_response
  end
end
Keep in mind that this ‘job’ is running, in real time, in an active web request that we’re eventually going to respond to. And that we’re passing the full request.env
of that request into this job.
The first thing we do is fire up the two legs of our Websocket structure. First, using the non-client Faye configuration, we pass in the full request.env
the same way you would directly in a controller: Faye::WebSocket.new(env)
. Then, just after, we use Faye::Websocket::Client
to setup the client connection to OpenAI.
Then we simply prepare an instance variable to store our Stream ID (on the Twilio side) since we’ll need that to make updates to the call itself. And last, before we dive into all the event handling code, we must end the synchronous job code with @twilio_websocket.rack_response
. That method is ultimately what will fire up the background thread, hijack the underlying socket from Rack, and allow Puma to get its thread safely back into an available state.
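If you’ve read faye-websocket’s README, this is essentially its canonical Rack pattern, just split between a controller and our “job” PORO. A condensed reminder of that shape (not our code, just the gem’s usual usage):

# Build the socket from the Rack env, attach handlers, then hand Rack back the
# socket's (hijacked) response so the web thread is freed.
if Faye::WebSocket.websocket?(env)
  ws = Faye::WebSocket.new(env)

  ws.on :message do |event|
    ws.send(event.data) # echo, purely for illustration
  end

  ws.rack_response
end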
I said “synchronous job code” because all of the event handler code we’re about to cover is asynchronous. It only runs when appropriate events arrive in the Websocket! Let’s start to dive in and cover these. We’ll start with the Twilio Websocket connection and the easiest two events we have on file:
# continuing from previous example...

@twilio_websocket.on :open do |event|
  puts "[RT-T] Twilio Websocket Established (Open)"
end

@twilio_websocket.on :error do |event|
  puts "[RT-T] ERROR event: #{event.data} #{event.inspect}"
end

# more to come
Feel free to omit these in production use, but both proved quite helpful in development.
Then, with barely more functionality than the last two, we need to handle the close event (e.g. the callee hangs up on us!):
# continuing from previous example...

@twilio_websocket.on :close do |event|
  puts "[RT-T] Disconnecting. Killing OpenAI Websocket Too..."
  @openai_websocket.close if @openai_websocket.ready_state == Faye::WebSocket::OPEN
end

# more to come
Which understandably shuts down the OpenAI websocket connection since it’s no longer needed and we no longer have a phone call.
Okay, now for the juicy part! The main guts of what we do with the Twilio side of things:
# continuing from previous example...

@twilio_websocket.on :message do |event|
  # puts "[RT-T] Message from Twilio: #{event.data}" # NOTE: THIS IS NOISY — every 20ms of audio is sent every 20ms
  data = JSON.parse(event.data)

  case data["event"]
  when "start"
    @stream_sid = data["start"]["streamSid"]
    puts "[RT-T] Incoming phone-call data stream has started. Stream id: #{@stream_sid}"

    load_base_session_info_openai_websocket!
    load_context_into_openai_websocket!
  when "media" # e.g. "audio"
    begin
      if @openai_websocket.ready_state == Faye::WebSocket::OPEN
        @openai_websocket.send({
          type: "input_audio_buffer.append",
          audio: data["media"]["payload"]
        }.to_json)
      end
    rescue => e
      puts "Error processing Twilio audio: #{e.message}"
    end
  end
end

# more to come
There are two primary message types we’re looking out for from the Twilio Websocket: start
and media
.
In start
we store off the stream SID then fire a couple of methods that get the OpenAI Realtime API ready to roll. We wait until we receive start
from Twilio so that the OpenAI side doesn’t start responding too early. We want the Twilio Websocket fully running and ready before we start getting AI voices into the call. Otherwise the user could potentially begin the call with an AI agent half-way through a sentence!
We’ll expand more on those helper methods (load_base_session_info_openai_websocket!
and load_context_into_openai_websocket!
) below.
On the media
side of things, we’re simply passing the encoded audio data that Twilio sent directly over to OpenAI. A simple hand-off.
Changing our diagram a little bit to no longer worry about who’s the client and who’s the server, let’s illustrate the data flow paths:
Now, over on the OpenAI Websocket side of things, we need similar handlers for the various events: open
, close
, error
, and message
. You’ll note again that the first three aren’t very interesting:
# continuing from previous example...

@openai_websocket.on :open do |event|
  puts "[RT-OA] OpenAI Websocket Established (Open)"
end

@openai_websocket.on :close do |event|
  puts "[RT-OA] Disconnecting"
end

@openai_websocket.on :error do |event|
  puts "[RT-OA] ERROR event: #{event.data} #{event.inspect}"
end

# more to come
Luckily the message
handler is a bit more interesting:
@openai_websocket.on :message do |event|
  puts "[RT-OA] Message from OpenAI: #{event.data}"
  data = JSON.parse(event.data)

  case data["type"]
  when "response.audio.delta"
    if data["delta"]
      begin
        puts "[RT-OA] Pushing OpenAI audio chunk to Twilio"
        @twilio_websocket.send({
          event: "media",
          streamSid: @stream_sid,
          media: { payload: data["delta"] }
        }.to_json)
      rescue => e
        puts "Error processing audio delta: #{e.message}"
      end
    end
  when "input_audio_buffer.speech_started"
    puts "[RT-OA] VAD detected speech from Callee!"
    user_started_talking!
  end
rescue => e
  puts "Error processing OpenAI message: #{e.message}, Raw message: #{event.data}"
end
There are two types of messages
we care about on the OpenAI side: response.audio.delta
and input_audio_buffer.speech_started
. The former is a chunk of generated audio from the Realtime model which is speech. So we push it over into the call for the user to hear!
The latter is more interesting. Remember that we essentially just hand off all incoming audio from Twilio’s Websocket to OpenAI? That’s actually a completely raw audio track. There’s no voice detection, no filtering, no breaks. Every 20ms Twilio sends a frame with the last 20ms of audio encoded in it. Even if the audio itself is just white noise. There’s no smarts here.
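To make that concrete, here’s roughly the shape of a single parsed media frame (field names per Twilio’s Media Streams docs as I understand them; the values are placeholders):

{
  "event"     => "media",
  "streamSid" => "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "media"     => {
    "track"     => "inbound",
    "timestamp" => "5120",                     # ms since the stream started
    "payload"   => "base64-encoded g711_ulaw"  # ~20ms of raw call audio
  }
}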
So detecting that the human caller is talking is actually something OpenAI does on their end. And the moment they detect that the caller has begun speaking, they send us back the input_audio_buffer.speech_started
(hence the “input” speech started). We’ll get to that user_started_talking!
method momentarily.
Let’s update our illustration here…
There we go… starting to build up a little complexity here.
Now let’s round it off and talk about those helpers!
First, user_started_talking!
:
def user_started_talking!
  # If there was audio already in the Twilio buffer (e.g. audio yet to be played to the user),
  # clear that now
  if @twilio_websocket.ready_state == Faye::WebSocket::OPEN
    @twilio_websocket.send({
      streamSid: @stream_sid,
      event: "clear"
    }.to_json)
  end

  # If OpenAI is still generating audio, cancel that. It may not have been, but this is a no-op, then.
  if @openai_websocket.ready_state == Faye::WebSocket::OPEN
    @openai_websocket.send({
      type: "response.cancel"
    }.to_json)
  end
end
We know that the moment we receive the input_audio_buffer.speech_started
event from OpenAI, the human user on the call is actually trying to speak. So we want to do two things here — first, stop any AI audio from playing on the Twilio side. There may not be, which is fine, but if there is, it should immediately stop. Twilio maintains its own small audio buffer so there could be 1-2 more seconds of AI voice waiting to play when the user interrupts. We want to dump that and leave the call line open for the user to speak.
Second, we want to halt the model from continuing to work on a now-stale response. If a human is interrupting, we’ll want to get a new AI response out to them, according to the content of the interruption, as soon as possible. We don’t want to waste any time on an old response. And the moment the human starts talking, everything else is now suddenly an ‘old response’.
Moving on, let’s talk through the second helper method, load_base_session_info_openai_websocket!
:
def load_base_session_info_openai_websocket!
  @openai_websocket.send({
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        prefix_padding_ms: 500,  # prevents short sounds from triggering responses; user must make noise for at least X ms (default 500)
        silence_duration_ms: 200 # prevents brief pauses from triggering responses; user must be silent for at least X ms (default 500)
      },
      input_audio_format: "g711_ulaw",
      output_audio_format: "g711_ulaw",
      voice: "verse", # See: https://x.com/OpenAIDevs/status/1851668229938159853?mx=2
      instructions: <<~TXT,
        This is a custom prompt that you'd use from your company and ideas.
        This is definitely not ours, but I can't release that publicly!
      TXT
      tools: [], # you should definitely have tools
      modalities: ["audio", "text"],
      temperature: 0.8
    }
  }.to_json)
end
And recall that this method only fires once the Twilio Media Stream is fully start
‘ed. Just the standard setup / options hash for a new session, per OpenAI’s docs.
Great! We should have a fully working implementation for callers talking directly to OpenAI Realtime at this point. 🎉
There’s just one glaring issue once you actually try to build this into your app and connect the dots….
Who is This?
How does our Realtime model know who it’s talking to? Actually, better yet, our code at this point doesn’t even know who’s on the other end of the line! Remember that we’re operating in a background thread that was spawned off of a request thread hitting media_stream
… but that request didn’t have any other data with it. We just knew that some call was trying to connect to our Websocket. We don’t know which and we don’t know who.
So at this point we have a working implementation of a human calling and talking to your stock model (with instructions) but no actual context about who this person is. That’s… not ideal.
Luckily Twilio has a mechanism we can use to get some context passed around. Wherever we initially kick off this talk-to-AI process by running or returning the TwiML command to <Stream>
(remember this:)
Twilio::REST::Client.new.calls(@twilio_call_sid).update(
  twiml: Twilio::TwiML::VoiceResponse.new do |command|
    command.say(message: "Connecting to AI...")
    command.connect do |connection|
      connection.stream(url: call_wss_url(protocol: :wss))
    end
  end
)
We have context there! At that point we likely know who the caller is, who the callee is, we can programmatically pull up their account, etc., etc. But because HTTP is stateless, once the separate web request for media_stream
comes in, none of that context is available.
So we have to pass it along. Twilio allows us to pass ad-hoc parameters along with the <Stream>
command so that our Websocket implementer (the AiCallJob
) can know more about the call based on our system previously having routed the call to said job. That’s a mouthful. It’s easier to see:
Twilio::REST::Client.new.calls(@twilio_call_sid).update(
  twiml: Twilio::TwiML::VoiceResponse.new do |command|
    command.say(message: "Connecting to AI...")
    command.connect do |connection|
      connection.stream(url: call_wss_url(protocol: :wss)) do |stream|
        stream.parameter(name: :context_object_type, value: "Customer")
        stream.parameter(name: :context_object_id, value: 55)
      end
    end
  end
)
Note
This particular example is most relevant for an inbound phone call from Customer ID #55. If this was an outbound call I’d likely pass a reference to some kind of Task or Todo object instead, which my job can then mark complete once the phone call has ended.
But I’d recommend augmenting this slightly to encode and sign the values, so that when we decode them on the other side we can verify that nothing has been tampered with. Similar to how we double-check the validity of cookies! So, more like this:
verifier = ActiveSupport::MessageVerifier.new(Rails.application.secret_key_base)
signed_type = verifier.generate("Customer")
signed_id = verifier.generate(55)

Twilio::REST::Client.new.calls(@twilio_call_sid).update(
  twiml: Twilio::TwiML::VoiceResponse.new do |command|
    command.say(message: "Connecting to AI...")
    command.connect do |connection|
      connection.stream(url: call_wss_url(protocol: :wss)) do |stream|
        stream.parameter(name: :context_object_type, value: signed_type)
        stream.parameter(name: :context_object_id, value: signed_id)
      end
    end
  end
)
Then, of course, over on the other side (the job where we’re actually executing the phone call and coordinating audio), we can decode those parameters to get our context. I’m calling this the “context object”. Very clever, I know.
One little note here is that Twilio only passes back the ad-hoc custom parameters we’re using after the Websocket connection has been established and the start
event comes in. So we have to do this referencing in the start
event, not just inline in the job:
@twilio_websocket.on :message do |event|
  # ... previous code here

  when "start"
    @stream_sid = data["start"]["streamSid"]
    puts "[RT-T] Incoming phone-call data stream has started. Stream id: #{@stream_sid}"

    verifier = ActiveSupport::MessageVerifier.new(Rails.application.secret_key_base)

    begin
      context_object_type = verifier.verify(data.dig("start", "customParameters", "context_object_type"))&.safe_constantize
      context_object_id = verifier.verify(data.dig("start", "customParameters", "context_object_id"))
    rescue ActiveSupport::MessageVerifier::InvalidSignature
      context_object_type = nil
      context_object_id = nil
    end

    if context_object_type.present? && context_object_id.present?
      @context_object = context_object_type.find_by(id: context_object_id)
    end

    load_base_session_info_openai_websocket!
    load_context_into_openai_websocket!

  # ... previous code continues
With this setup we now have a @context_object
available throughout the Websocket handlers in this job that can be our root back to doing useful application-level things! For instance, if you exposed the proper tool to the OpenAI model for, say, “Call this Function when the user wants to reset their password”, you could now implement that function as @context_object.reset_password!
(Assuming the context object is a User and supports that method).
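As a rough sketch of how that wiring might look (this is not from our codebase; the tool schema and event names follow my reading of OpenAI’s Realtime function-calling docs, so verify them before relying on this), you’d declare the tool in the session.update payload’s tools: array and then handle the resulting event as another branch in the OpenAI message handler:

# 1) Declared inside session.update (the `tools: []` key shown earlier):
#    { type: "function", name: "reset_password",
#      description: "Reset the caller's password and email them a link",
#      parameters: { type: "object", properties: {}, required: [] } }
#
# 2) Handled as an extra `when` in the @openai_websocket :message case statement:
when "response.function_call_arguments.done"
  if data["name"] == "reset_password" && @context_object.respond_to?(:reset_password!)
    @context_object.reset_password!

    # Report the result back to the model, then ask it to keep talking
    @openai_websocket.send({
      type: "conversation.item.create",
      item: { type: "function_call_output", call_id: data["call_id"], output: "Password reset email sent." }
    }.to_json)
    @openai_websocket.send({ type: "response.create" }.to_json)
  end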
Thus, I can finally show the last helper method that I left undiscussed — load_context_into_openai_websocket!
— as it depends on having a @context_object
set.
def load_context_into_openai_websocket!
  puts "[RT-OA] Seeding context..."

  @context_object.previous_conversation_items.each do |context_item|
    @openai_websocket.send({
      type: "conversation.item.create",
      item: {
        type: :message,
        role: :user,
        input_text: context_item.body
      }
    }.to_json)
  end

  puts "[RT-OA] Kicking Off Conversation"

  @openai_websocket.send({
    type: "response.create",
    response: {
      instructions: "You are now on the phone with #{@context_object.first_name}. Greet them by name to start the conversation",
      modalities: ["audio", "text"]
    }
  }.to_json)
end
Some of this is a little domain specific to our particular team and needs, but the idea is that we first load in all previous context from conversations with the customer in the past. We essentially want to give the model the best odds of knowing what the customer might be asking about. We’re simply using conversation.item.create
for that.
Then, and last, we force a response.create
— compelling the model to start giving back audio. Since this proof-of-concept integration was primarily aimed at handling incoming calls, this is what triggers the model to say “Hey there” once the call connects. Otherwise the customer would hear the phone ringing, connect, then… nothing.
This is all great. We now have a two-way connection, everything working, and passable context between the originator of the call intent and the actual real-time call handler.
Transcripts
It’s one thing to connect customers with an AI agent for handling their concerns. It’s another to do so blindly and have no idea what happened on the call. You could look into actual call audio recording (I think Twilio can provide this somewhat out-of-the-box?), but it’s trivially easy to add two-way transcript support at this point in our build. Actually the Realtime model already does it one way! Any audio response the model generates will also natively generate the same response in text. We just need to listen for another type of inbound message and snag the copy out of it.
To generate transcriptions for the customer’s audio, there’s a simple, single flag we can activate in our session configuration. Let’s start there. We’ll simply tweak our load_base_session_info_openai_websocket!
to include a new session configuration key, input_audio_transcription
:
def load_base_session_info_openai_websocket!
  @openai_websocket.send({
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        prefix_padding_ms: 500,  # prevents short sounds from triggering responses; user must make noise for at least X ms (default 500)
        silence_duration_ms: 200 # prevents brief pauses from triggering responses; user must be silent for at least X ms (default 500)
      },
      input_audio_format: "g711_ulaw",
      output_audio_format: "g711_ulaw",
      input_audio_transcription: { model: "whisper-1" },
      voice: "verse", # See: https://x.com/OpenAIDevs/status/1851668229938159853?mx=2
      instructions: <<~TXT,
        This is a custom prompt that you'd use from your company and ideas.
        This is definitely not ours, but I can't release that publicly!
      TXT
      tools: [], # you should definitely have tools
      modalities: ["audio", "text"],
      temperature: 0.8
    }
  }.to_json)
end
That alone will configure the Realtime model to provide transcripts for audio from the customer! We just need to listen for the appropriate events and grab the text. The easiest way to do this is to set up an instance variable for storing all the text:
class AiCallJob
  def perform(env)
    @twilio_websocket = Faye::WebSocket.new(env)
    @openai_websocket = Faye::WebSocket::Client.new("wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview", nil, headers: {
      "Authorization" => "Bearer #{Rails.application.credentials.dig(:open_ai, :api_key)}",
      "OpenAI-Beta" => "realtime=v1"
    })

    @stream_sid = nil
    @context_object = nil
    @transcripts = []

    # all the websocket callbacks

    @twilio_websocket.rack_response
  end
end
Then listen for the two appropriate events coming back from the Realtime model:
@openai_websocket.on :message do |event|
  # ... previous code here

  when "conversation.item.input_audio_transcription.completed"
    @transcripts << { person: :human, text: data["transcript"] }
  when "response.audio_transcript.done"
    @transcripts << { person: :ai, text: data["transcript"] }
And, for simplicity in development, I also added a dump of the transcript to the close
event so that I can see the textual conversation in my terminal once the phone call ends:
@twilio_websocket.on :close do |event|
  puts "[RT-T] Disconnecting. Killing OpenAI Websocket Too..."
  @openai_websocket.close if @openai_websocket.ready_state == Faye::WebSocket::OPEN

  puts "Final Transcript:"
  puts @transcripts
end
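In a real app you’d probably persist the transcript rather than just printing it. A minimal sketch, assuming a hypothetical call_transcripts association on the context object:

@twilio_websocket.on :close do |event|
  puts "[RT-T] Disconnecting. Killing OpenAI Websocket Too..."
  @openai_websocket.close if @openai_websocket.ready_state == Faye::WebSocket::OPEN

  # Hypothetical persistence; swap in whatever your domain model actually provides
  if @context_object.present? && @transcripts.any?
    @context_object.call_transcripts.create!(entries: @transcripts)
  end
end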
And that’s it! We’ve done it.
Some Thoughts on Production
Now that we’ve got the architecture and implementation out of the way, it’s worth talking through some considerations about how this sort of project will go live in a production environment.
The primary factor at play here is simply Websockets in production. I’d wager that most Rails applications in the wild aren’t doing much, if anything, with Websockets currently. Those that are probably aren’t doing quite as much lifting as this implementation will take. The combination of running (potentially) lots of background threads in a web process, where each thread is going to require a non-trivial amount of CPU (though I can’t say how much yet), is… dangerous. I’ve always recommended autoscaling your production applications and, more importantly, scaling based on queue time, but moderate to heavy-duty Websockets in a Rails application feels like new territory. All that to say, be careful if and when you roll this sort of thing out. Go slowly, take your time, and watch your queue time and traffic levels carefully. You should anticipate some CPU saturation; exactly how much will depend on several factors. With CPU saturation comes increased request queue time so… just keep an eye on things.
In short, autoscaling your web process when running this call framework may be tricky. I simply don’t yet know what to expect there.
An alternative approach you could take is setting up a second production app on a subdomain. A fully second copy of the primary application, same database and all, which exists solely to handle the phone call Websocket workflows. That might work for you! Or it might not. It’s hard to say. Your mileage will vary.
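Mechanically, the only change needed to route calls to that second app would be pointing the stream at its host instead of using the current app’s URL helper (hypothetical domain shown):

connection.stream(url: "wss://calls.example.com/media-stream")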
Hope this is helpful!
- J
Final Job Code in Full
class AiCallJob
  def perform(env)
    @twilio_websocket = Faye::WebSocket.new(env)
    @openai_websocket = Faye::WebSocket::Client.new("wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview", nil, headers: {
      "Authorization" => "Bearer #{Rails.application.credentials.dig(:open_ai, :api_key)}",
      "OpenAI-Beta" => "realtime=v1"
    })

    @stream_sid = nil
    @context_object = nil
    @transcripts = []

    # ------------- Twilio Websocket -------------

    @twilio_websocket.on :open do |event|
      puts "[RT-T] Twilio Websocket Established (Open)"
    end

    @twilio_websocket.on :error do |event|
      puts "[RT-T] ERROR event: #{event.data} #{event.inspect}"
    end

    @twilio_websocket.on :close do |event|
      puts "[RT-T] Disconnecting. Killing OpenAI Websocket Too..."
      @openai_websocket.close if @openai_websocket.ready_state == Faye::WebSocket::OPEN

      puts "Final Transcript:"
      puts @transcripts
    end

    @twilio_websocket.on :message do |event|
      # puts "[RT-T] Message from Twilio: #{event.data}" # NOTE: THIS IS NOISY — every 20ms of audio is sent every 20ms
      data = JSON.parse(event.data)

      case data["event"]
      when "start"
        @stream_sid = data["start"]["streamSid"]
        puts "[RT-T] Incoming phone-call data stream has started. Stream id: #{@stream_sid}"

        verifier = ActiveSupport::MessageVerifier.new(Rails.application.secret_key_base)

        begin
          context_object_type = verifier.verify(data.dig("start", "customParameters", "context_object_type"))&.safe_constantize
          context_object_id = verifier.verify(data.dig("start", "customParameters", "context_object_id"))
        rescue ActiveSupport::MessageVerifier::InvalidSignature
          context_object_type = nil
          context_object_id = nil
        end

        if context_object_type.present? && context_object_id.present?
          @context_object = context_object_type.find_by(id: context_object_id)
        end

        load_base_session_info_openai_websocket!
        load_context_into_openai_websocket!
      when "media" # e.g. "audio"
        begin
          if @openai_websocket.ready_state == Faye::WebSocket::OPEN
            @openai_websocket.send({
              type: "input_audio_buffer.append",
              audio: data["media"]["payload"]
            }.to_json)
          end
        rescue => e
          puts "Error processing Twilio audio: #{e.message}"
        end
      end
    end

    # ------------- OpenAI Websocket -------------

    @openai_websocket.on :open do |event|
      puts "[RT-OA] OpenAI Websocket Established (Open)"
    end

    @openai_websocket.on :close do |event|
      puts "[RT-OA] Disconnecting"
    end

    @openai_websocket.on :error do |event|
      puts "[RT-OA] ERROR event: #{event.data} #{event.inspect}"
    end

    @openai_websocket.on :message do |event|
      puts "[RT-OA] Message from OpenAI: #{event.data}"
      data = JSON.parse(event.data)

      case data["type"]
      when "response.audio.delta"
        if data["delta"]
          begin
            puts "[RT-OA] Pushing OpenAI audio chunk to Twilio"
            @twilio_websocket.send({
              event: "media",
              streamSid: @stream_sid,
              media: { payload: data["delta"] }
            }.to_json)
          rescue => e
            puts "Error processing audio delta: #{e.message}"
          end
        end
      when "input_audio_buffer.speech_started"
        puts "[RT-OA] VAD detected speech from Callee!"
        user_started_talking!
      when "conversation.item.input_audio_transcription.completed"
        @transcripts << { person: :human, text: data["transcript"] }
      when "response.audio_transcript.done"
        @transcripts << { person: :ai, text: data["transcript"] }
      end
    rescue => e
      puts "Error processing OpenAI message: #{e.message}, Raw message: #{event.data}"
    end

    @twilio_websocket.rack_response
  end

  private

  def user_started_talking!
    # If there was audio already in the Twilio buffer (e.g. audio yet to be played to the user),
    # clear that now
    if @twilio_websocket.ready_state == Faye::WebSocket::OPEN
      @twilio_websocket.send({
        streamSid: @stream_sid,
        event: "clear"
      }.to_json)
    end

    # If OpenAI is still generating audio, cancel that. It may not have been, but this is a no-op, then.
    if @openai_websocket.ready_state == Faye::WebSocket::OPEN
      @openai_websocket.send({
        type: "response.cancel"
      }.to_json)
    end
  end

  def load_base_session_info_openai_websocket!
    @openai_websocket.send({
      type: "session.update",
      session: {
        turn_detection: {
          type: "server_vad",
          prefix_padding_ms: 500,  # prevents short sounds from triggering responses; user must make noise for at least X ms (default 500)
          silence_duration_ms: 200 # prevents brief pauses from triggering responses; user must be silent for at least X ms (default 500)
        },
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
        voice: "verse", # See: https://x.com/OpenAIDevs/status/1851668229938159853?mx=2
        input_audio_transcription: { model: "whisper-1" },
        instructions: <<~TXT,
          This is a custom prompt that you'd use from your company and ideas.
          This is definitely not ours, but I can't release that publicly!
        TXT
        tools: [], # you should definitely have tools
        modalities: ["audio", "text"],
        temperature: 0.8
      }
    }.to_json)
  end

  def load_context_into_openai_websocket!
    puts "[RT-OA] Seeding context..."

    @context_object.previous_conversation_items.each do |context_item|
      @openai_websocket.send({
        type: "conversation.item.create",
        item: {
          type: :message,
          role: :user,
          input_text: context_item.body
        }
      }.to_json)
    end

    puts "[RT-OA] Kicking Off Conversation"

    @openai_websocket.send({
      type: "response.create",
      response: {
        instructions: "You are now on the phone with #{@context_object.first_name}. Greet them by name to start the conversation",
        modalities: ["audio", "text"]
      }
    }.to_json)
  end
end