Browser-Based Calling with Twilio and Hotwire: A Novel Approach

14 min read | January 22 2024

Our team at Agent Pronto makes lots of phone calls directly from our platform. Whenever someone fills out our form indicating that they’re looking for a real estate agent, one of our Customer Experience agents calls them (on average) within two minutes. These calls, while appearing to the end-user as a typical phone call, are all run through our users’ browsers.

If you’ve spent any time integrating phone calls into an application using Twilio’s SDKs, particularly the Javascript SDK for browser-based calling, you’ll know that Twilio essentially spins up and runs a large client-side framework within the browser on behalf of the user. Twilio exposes several methods and events to the rest of Javascript-land for you to interface with and control, but ultimately Twilio’s Javascript establishes its own websocket connection to the Twilio servers and manages its own state very well. It’s a great library. But it’s a rich, client-side-only application.

One thing we do persistently at Agent Pronto is try to follow the Rails Way™️. This, by extension, has led us to also follow the Hotwire Way™️™️. We’re a micro-sized development team and have experienced (over and over) the benefits of following convention: we’ve been able to keep our development team tiny for over ten years now.

And thus we find our tension. How can we follow the Hotwire Way, which dictates back-end driven logic and only HTML sent over the wire, but also integrate Twilio Calling, which is a heavy-duty front-end library, into the same Rails application? Compromise on the Hotwire Way and have API endpoints and some special Javascript bundle that coordinates Twilio’s Call states with our back end? Negative.

For starters, the major headache we wanted to avoid is managing state in two places. In order to display them in our CRM over time, we log Calls in our back-end; they’re part of our domain model. So we create them, update columns on them, correlate them to the Twilio Call SID, etc. and ultimately associate them to various records in our CRM and render them in our timelines. If we went down the road of rich-javascript wrapping Twilio’s SDK, we’d ultimately be instantiating and maintaining the state of that Call in both our rich-javascript and our back-end, where we’d constantly have to push back (and forth) information to keep both up to date with one another. This is a front-end micro-app. We didn’t want that.

Fun fact: while this build was a ‘new’ build in how it was architected, we actually already had a call system in place. It’d been around since the beginning, in all its jQuery + Handlebars glory. It was the exact architecture I implied above: a lot of heavy-duty JS wrapping Twilio’s own stuff and trying, so very hard, to keep itself and our back-end in sync about the Call being made or received. It caused a lot of headaches over the years.

But, as much as we didn’t want that, there’s really no way to do browser-based calling without leveraging extensive JS APIs: calls are real-time, microphone-and-speaker based, highly interactive things. Even if we wanted to build our own custom JS that interfaces with Twilio instead of using the Twilio JS SDK (please do not do this), we’d still need to bind to all of those things in a flexible way. What’s a Rails app to do?

The Remote-Controlled Approach

Things began to click when we made a mental leap and decided to think of the user’s browser differently. Yes, the Twilio SDK will spin up and be available in Javascript-land. But what if we don’t use it? What if we consider everything the user sees and interacts with as one ‘side’ of their browser session, and Twilio’s SDK environment the other ‘side’? What if we use traditional Hotwire principles so our user can interact with our back-end and essentially let the Twilio SDK be its own, separate space?

The magic began to click.

Instead of having our own Javascript interact directly with the Twilio SDK to start calls and/or control calls in-progress, what if our back-end controls the calls via the Turbo UI elements we display to the user in user-land? We’ve dubbed this the ‘remote-controlled’ approach because instead of us having Javascript that directly interfaces with and controls the Twilio SDK calls, like so:

(Which, again, requires a lot of duplicated logic, sync-code, and Javascript) We instead have a model that looks more like this:

I’ve added more lines to this diagram, but the architecture actually feels simpler than keeping two separate Call states in sync. The key feature here is that the red line is never crossed — that we don’t have our own heavy-duty JS interacting with the Twilio SDK. Instead, we’re using Turbo Frames and Streams to push all the interaction to the back-end where the Call state already lives! Let’s walk this out in practice.

1: An Incoming Call Comes In

This part is simply out-of-the-box Twilio: when someone calls one of the numbers in our account, Twilio sends a POST to a configured endpoint in our Rails app. At this point, all we do is create a Call model/record and determine which of our users is online and can readily accept a call.

1½: Broadcast New Call to the User

As soon as we create a Call record, we use Turbo Stream Broadcasting to push HTML over a websocket to the user we determined to be the viable recipient of the call. That markup looks like this:

[Image: incoming call UI]

2: Queue ‘Em!

While the user is determining if they want to answer the call and/or finishing up other work, we push the caller to a holding queue with Twilio’s built-in <Queue> system. This essentially just gives the call somewhere to wait while our user is getting ready to pick up their call.

We have lovely hold music though. So peaceful ✨

3: User Clicks “Accept”

If the user clicks one of the big green buttons shown above, it means they’re ready to answer the call. However (and this is where the magic happens), it doesn’t actually do anything to/with the Twilio SDK that’s already running happily in that browser session. All that button does is POST back to our back-end.

4: Connect ‘Em to Browser!

Once our back-end receives word that the user wants to accept the call, it’s our back-end that actually POSTs out to Twilio to pull that call out of the holding queue and dial the user’s browser. We also update our Call record with any necessary information.

4½: Broadcast Mid-Call Controls to User

Even though the new call technically hasn’t been fully connected yet, we know that’s only milliseconds away. So we again use Turbo Streams Broadcasting to replace the markup above (a proposed call) with new markup that has its own form(s) for an in-progress call. That looks like this:

[Image: in-progress call UI]

5: Twilio Dials the User

This is where the Twilio SDK receives word, over its own websocket connection directly to Twilio’s servers, that a call is coming in to the browser.

6: Call is Auto-Started

The last piece of the puzzle here is a small bit of JS that configures the Twilio SDK to automatically accept any incoming calls. Normally the SDK would wait for the user to accept the call in the browser. Since our user already indicated they were ready for the call, we skip that step and instead just automatically connect the call.

Bonus: after the two parties chat for a while…

7: User Clicks “End”

Similar to our ‘Accept’ button from step 3, this button also doesn’t interface with the Twilio SDK at all. Once again, it simply POSTs back to our back-end indicating that the user wants to end the call.

8: End the Call!

Our server once again makes a POST to Twilio’s platform — this time to kill the call. We update our Call model record with any applicable information and…

8½: Broadcast UI Teardown

As the back-end tells Twilio to kill the call, it also broadcasts a Turbo Stream to tear down the Call UI (it simply disappears). The user just clicked the ‘end’ button, so this is what they’d expect.

9: Twilio Ends the Call

As soon as it receives the POST, Twilio’s platform ends the call between the two parties and uses its own websocket connection to the Twilio SDK in our user’s browser to disconnect and tear down the call object / constructs to reset for the next call.

And that’s it. That’s the essence of the workflow. Did you see the magic?

The magic of this process is that, while the “accept” button the user clicks (and the “end” button) looks and feels like it’s directly interfacing with the Twilio SDK (that is, clicking the button causes the connection to begin), it’s not. But all the plumbing to actually start the call (from the back-end) happens fast enough that it feels like it is. Magic!

Ultimately we end up with one source of truth for the state of any given call (the back end), one set of markup for all call states (the back-end views), far less overall code, no complex state-syncing code, and almost zero Javascript required at all, while still achieving a user experience that feels direct and native. That is magic! Turbo Streams run so quickly that even though the user’s call controls aren’t connected to the Twilio SDK at all, they feel like they are. It’s a seamless experience:

NOTE: You’ll hear some echo in that clip since the test phone is sitting on my desk. This actually, in effect, shows that the two-way connection is working well.

Taking a screen recording of a two-way audio system working is tricky!

Let’s See Some Code!

Sure, let’s do it. But beware, there’s not that much! And that’s the beauty of it.

For starters, the small window / view that handles this calling behavior is an endpoint we call “Real-Time Tools”. Within that view we have two key pieces:

<%= turbo_stream_from [current_user, :calls] %>

<div id="call_container">
</div>

The former sets up the Turbo Streaming websocket for subsequent Broadcasts from the server. We’re simply setting up a listener to listen on a channel identified as the current user’s :calls channel. As long as this matches the broadcast channel targets (we’ll get to that), the broadcasted message will land here!

The latter is an empty div with a specific id that we’ll reference in subsequent broadcasts. This will ultimately contain the markup for the inbound (or outbound) call — where the turbo_stream_from sets up the websocket channel to listen to, the id="call_container" sets up a known element for the Turbo Stream message to look for when adding (or updating) markup to the DOM. The former is knowing which road to go down, the latter is knowing which specific house you’re looking for on that road. You need both!

So whenever a user opens up their Real-Time Tools panel, the websocket listener is fired up and an empty div waits for a future broadcast.
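The road-and-house idea can be sketched in plain Ruby. This is a toy in-memory model, not how Turbo is actually implemented: `FakePage`, `stream_name`, and the `User` struct are hypothetical names made up for illustration. The only point is that a broadcast has to match both a stream the page is listening on and an element id that exists in that page.

```ruby
# Toy model of "stream name + target id" routing. Illustration only.

# Hypothetical stand-in for a user record.
User = Struct.new(:id)

# Build a channel name from a list of "streamables", loosely mirroring
# the idea behind `turbo_stream_from [current_user, :calls]`.
def stream_name(*streamables)
  streamables.map { |s| s.is_a?(User) ? "user_#{s.id}" : s.to_s }.join(":")
end

# A fake browser page: it listens on some streams and holds elements by id.
class FakePage
  attr_reader :dom

  def initialize(streams:, element_ids:)
    @streams = streams
    @dom = element_ids.to_h { |id| [id, []] }
  end

  # Delivers only if the page listens on the stream AND has the target element.
  def receive(stream:, target:, html:)
    return false unless @streams.include?(stream) # wrong road
    return false unless @dom.key?(target)         # right road, no such house

    @dom[target].unshift(html) # "prepend" into the target element
    true
  end
end

alice = User.new(1)
bob   = User.new(2)

page = FakePage.new(
  streams: [stream_name(alice, :calls)],
  element_ids: ["call_container"]
)

delivered  = page.receive(stream: stream_name(alice, :calls), target: "call_container", html: "<div>incoming call</div>")
wrong_road = page.receive(stream: stream_name(bob, :calls), target: "call_container", html: "<div>not for alice</div>")
wrong_home = page.receive(stream: stream_name(alice, :calls), target: "missing_id", html: "<div>lost</div>")
```

Only the first broadcast lands: the second goes down a road this page is not listening on, and the third finds the road but no matching element.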

Then, when Twilio POSTs to our back-end that a new call is coming in, we have a controller that 1) creates the Call record, and 2) responds to Twilio with “put them in a queue”:

module Twilio
  class CallsController < TwilioController
    def create
      Call.create!(
        external_number: params[:From],
        internal_number: params[:To],
        recipient: User.find_by(etc: :etc),
        status: :proposed,
        # etc..
      )
      
      render xml: Twilio::TwiML::VoiceResponse.new { |command|
        command.say(message: "Please hold while we connect you...")
        command.enqueue(name: :incoming)
      }
    end
  end
end
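For a sense of what Twilio receives back from that controller, here is a rough sketch that hand-builds the equivalent TwiML document. The `hold_twiml` helper is hypothetical and exists only for illustration; in the real app the twilio-ruby builder shown above generates the XML, and its exact formatting may differ.

```ruby
require "erb"

# Hypothetical helper: hand-built approximation of the TwiML response.
# <Say> reads a message to the caller; <Enqueue> parks them in a named queue.
def hold_twiml(message:, queue_name:)
  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <Say>#{ERB::Util.html_escape(message.to_s)}</Say>
      <Enqueue>#{ERB::Util.html_escape(queue_name.to_s)}</Enqueue>
    </Response>
  XML
end

xml = hold_twiml(message: "Please hold while we connect you...", queue_name: :incoming)
puts xml
```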

In the meantime, our Call model has an after_commit that coordinates the broadcasts:

class Call < ApplicationRecord
  after_commit :after_commit
  

  enum direction: {
    incoming: 0,
    outgoing: 1
  }
  
  enum status: {
    proposed: 0,
    in_progress: 1,
    ended: 2,
  }
  
  
  def after_commit
    if incoming?
      if proposed?
        # Incoming call from Twilio created new record with :proposed status
        broadcast_prepend_later_to [recipient, :calls], target: :call_container
      elsif in_progress?
        # User clicked "accept" button which updated record to `in_progress`
        # Connect the actual call and broadcast new markup for call controls
        Twilio::REST::Client.new.calls(twilio_call_sid).update(
          twiml: Twilio::TwiML::VoiceResponse.new do |command|
            command.dial do |dial|
              dial.client { |client| client.identity(recipient.id) }
            end
          end
        )
        
        broadcast_replace_later_to [broadcast_target_user, :calls]
      else
        # User clicked 'cancel' from proposed call or 'end' from active call
        # Kill the call and broadcast teardown markup
        Twilio::REST::Client.new.calls(twilio_call_sid).update(status: :completed)
        
        broadcast_remove_to [broadcast_target_user, :calls]
      end
    else
      # outgoing call logic...
    end
  end
end
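Stripped of Rails, Twilio, and Turbo, the callback above is just a mapping from the call's current status to a pair of side effects. The following plain-Ruby sketch makes that mapping explicit. `FakeCall` and the action names are illustrative labels, not part of our codebase or of any library.

```ruby
# Plain-Ruby stand-in for the model callback. Nothing here touches
# Twilio or Turbo; each symbol just names the side effect that would run.
FakeCall = Struct.new(:direction, :status) do
  def actions_after_commit
    return [:outgoing_logic] unless direction == :incoming

    case status
    when :proposed
      [:broadcast_prepend]                        # show the incoming-call UI
    when :in_progress
      [:dial_browser_client, :broadcast_replace]  # connect audio, swap controls
    else
      [:complete_twilio_call, :broadcast_remove]  # hang up, tear down the UI
    end
  end
end

call = FakeCall.new(:incoming, :proposed)
log = []

# Walk the call through its lifecycle, as the form POSTs would.
%i[proposed in_progress ended].each do |status|
  call.status = status
  log << [status, call.actions_after_commit]
end

log.each { |status, actions| puts "#{status}: #{actions.join(', ')}" }
```

Every user interaction boils down to a status change on the record, and the record decides what to tell Twilio and what to broadcast.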

And finally, a bit of tame erb view markup — the partial that gets rendered by those broadcasts:

<%= turbo_frame_tag call do %>
  <div>
    <!-- Cancel button: rendered only while the call is :proposed -->
    <% if call.proposed? %>
      <%= form_with model: call do |form| %>
        <%# :ended matches the status enum shown in the Call model above %>
        <%= form.hidden_field :status, value: :ended %>
        <%= form.button do %>
          <%# "X" button in top corner %>
        <% end %>
      <% end %>
    <% end %>
    <div>
      <!-- Call details -->
      <div>
        <%# Markup for rendering the "Jon Sully" and phone number at the top %>
      </div>
      <!-- Dialpad -->
      <div>
        <div>
          <!-- Row 1 -->
          <%# Markup for the 1, 2, 3 dial-pad buttons %>

          <!-- Row 2 -->
          <%# Markup for the 4, 5, 6 dial-pad buttons %>

          <!-- Row 3 -->
          <%# Markup for the 7, 8, 9 dial-pad buttons %>

          <!-- Row 4 -->
          <%# Markup for the *, 0, # dial-pad buttons %>

          <!-- Row 5 -->
          <% if call.proposed? %>
            <%= form_with model: call do |form| %>
              <%# :in_progress is the status the Call model callback reacts to %>
              <%= form.hidden_field :status, value: :in_progress %>
              <%= form.button do %>
                <%# "Browser" icon button %>
              <% end %>
            <% end %>

          <% elsif call.in_progress? %>
            <%# Markup for self-mute microphone icon button %>

            <!-- Red End button -->
            <%= form_with model: call do |form| %>
              <%= form.hidden_field :status, value: :ended %>
              <%= form.button do %>
                <%# Red end-call button (downward phone) icon %>
              <% end %>
            <% end %>
          <% end %>
        </div>
      </div>
    </div>
  </div>
<% end %>

So we can see that the markup changes slightly depending on the status of the call being rendered. That’s great since it allows our model to simply call broadcast_replace_later_to... and let the view figure out the UI that ought to represent the current call state.
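The same idea can be tried outside Rails with Ruby's standard-library ERB. This is a heavily reduced, hypothetical version of the partial (the `DemoCall` struct, `render_call` helper, and button labels are made up for illustration), but it shows one template producing different controls purely from the call's status.

```ruby
require "erb"

# Hypothetical, heavily reduced version of the call partial.
TEMPLATE = ERB.new(<<~HTML, trim_mode: "-")
  <div id="call_<%= call.id %>">
  <%- if call.status == :proposed -%>
    <button>Accept</button>
  <%- elsif call.status == :in_progress -%>
    <button>Mute</button>
    <button>End</button>
  <%- end -%>
  </div>
HTML

# Stand-in for the Call record; only the fields the template reads.
DemoCall = Struct.new(:id, :status)

# Render the one template for whatever state the call is in.
def render_call(call)
  TEMPLATE.result_with_hash(call: call)
end

proposed = render_call(DemoCall.new(7, :proposed))
active   = render_call(DemoCall.new(7, :in_progress))

puts proposed
puts active
```

The caller never picks a template per state. It re-renders the same one, which is what lets the model broadcast a replace without knowing anything about the UI.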

And that’s essentially it. That code is the backbone for our back-end presenting real-time call controls to our user and giving them the power to control the call without ever interfacing directly with the Twilio SDK instance running in their browser. As such, the last piece of this pie is a very small Stimulus controller that registers the Twilio SDK client and enables auto-accepting calls. That looks like this:

import { Controller } from "@hotwired/stimulus"
import { Device } from "@twilio/voice-sdk"

export default class extends Controller {
  static targets = ["status", "inputSelector", "muteButton"]

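  async connect() {
    // `twilio_token` is assumed to be a Twilio access token made available
    // to this script elsewhere (for example, rendered into the page by the server).
    this.twilioDevice = new Device(twilio_token, { closeProtection: true })
    // Auto-accept: the user already clicked "accept" in the Turbo-driven UI.
    this.twilioDevice.on("incoming", this.#incomingCall)
    await this.twilioDevice.register()
  }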

  #incomingCall = call => {
    call.accept()
  }
}

And it’s the only place where there’s any crossover between our app’s front-end code and Twilio’s JS SDK environment. But since we’re not introducing, storing, or handling any state, that’s fine. This could just as easily be a literal <script> tag on the Real-Time Tools markup itself — we just needed some JS to fire up the Twilio SDK.

Magic ✨

That’s it! All of our markup and real-time interactivity is simply erb code generated by our server, delivered over Turbo Streams websockets. The calls themselves (the actual audio connections) are managed by our back-end in response to our user’s wishes and automatically connected to the user’s Twilio SDK instance. The remote-controlled approach:

Far less code overall
No shared or synced state across different parties (back-end / browser)
Single source of truth for markup / design
Server-driven (HTML-over-the-wire) approach for real-time front-end tools
Simpler maintenance with fewer moving parts
Almost zero JS of our own for a huge JS-driven feature / implementation 🤯

And I should note again, this only works thanks to Turbo Streams Broadcasting speed. Proxying user interactions off to the back-end (and third-party APIs) is a LiveView-style play, but it works really well when your controllers and systems are running quickly!

Gimme’ gimme’ gimme!

This setup works really well for us and our particular needs. But that doesn’t mean it’s great for everything. As noted in so many other places, this sort of setup (really a Turbo Streams / back-end setup at all) wouldn’t work for many extremely client-side heavy apps (Figma, Excalidraw, that kind of thing). But this remote-controlled approach does work for a lot of features that would traditionally require extensive Javascript. We hope this writeup serves as an inspiration more than anything.