Choose Your Data Persistence

Rails Wizards Pt. 2

Jon Sully

Wed Apr 21 '21

12 Minutes

Part of the ‘Rails Wizards’ Series:
- Apr 2021: Rails Wizards Pt. 1
- Apr 2021: Rails Wizards Pt. 2 (This page)
- Apr 2021: Rails Wizards Pt. 3
- Apr 2021: Rails Wizards Pt. 4
- Apr 2021: Rails Wizards Pt. 5
- Apr 2021: Rails Wizards Pt. 6
- Apr 2021: Rails Wizards Pt. 7
- Apr 2021: Rails Wizards Pt. 8
- Apr 2021: Rails Wizards Pt. 9

Choose Your Data Persistence

The whole idea of a server-side wizard is rooted in the premise that the server persists each chunk of data as the user works through the multiple steps involved in creating the full object. That means the data has to be stored on the server somehow. I think there are a few major camps here too that I want to cover: session persistence, database persistence, and cache persistence. What’s the high-level idea behind each of those, and why might you pick one over another?

Session Persistence

At a high level, session persistence is the idea that as a user submits portions of data from each step of the wizard (ignoring how it gets validated), that data is continually added to a hash object within that user’s session store. With each step completed, the hash grows to contain all of the values the user submitted. When the user finishes the wizard, the hash should contain all of the attributes and values required to pass to Model.create so that a fully-hydrated ActiveRecord object can be created and persisted in the database. Illustratively, that would look something like:

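# In the controller action, after the user submits the last form step
puts step_params
#> { miles: 212000 }
# Everything collected from the earlier steps is already in the session
puts session[:car_attributes]
#> { make: 'ford', model: 'explorer', year: 2012 }
# Combine the earlier steps with the final one and create the record
Car.create!(session[:car_attributes].merge(step_params))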

The big issue with storing attributes and values directly in the session is twofold: scaling and replicating session data across multiple Rails servers (once your app requires more than one), and the likelihood of overflowing the session’s size limit. Even if you’re using a non-default session store like Redis, session-persisted wizards should generally just be avoided. Ideally we want to keep our sessions as clean and small as possible, regardless of the provider (directly in the cookie, or in a third-party store like Redis or Memcached), so that they hold only metadata about the current user’s session state with our application server. In other words, keep the user’s wizard data separate from their session data.
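
To make the size concern a little more concrete, here’s a rough sketch in plain Ruby (no Rails needed). It assumes a cookie-backed session and the roughly 4 KB ceiling that applies to a single cookie. The seller_notes field and its length are made up for illustration, and the JSON byte count is only a lower bound, since the real cookie also carries encryption and encoding overhead.

```ruby
require "json"

# Rough ceiling for a single cookie (assumption: about 4 KB).
COOKIE_LIMIT_BYTES = 4096

# Stand-in for session[:car_attributes]; seller_notes is a hypothetical field.
wizard_data = {}

steps = [
  { make: "ford", model: "explorer", year: 2012 },
  { miles: 212_000 },
  { seller_notes: "x" * 5_000 } # one long free-text answer
]

sizes = steps.map do |step_params|
  wizard_data.merge!(step_params)
  # Serialized size before any encryption or encoding overhead is added.
  JSON.generate(wizard_data).bytesize
end

# One boolean per step: does the accumulated data still fit in a cookie?
fits_in_cookie = sizes.map { |bytes| bytes <= COOKIE_LIMIT_BYTES }
```

The first two steps fit comfortably; a single long free-text answer is enough to blow past the limit.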

Database Persistence

When it comes to database persistence there are a few ways to make the cookie crumble. In a very simplistic sense, the idea is that the user’s submitted data is stored in the database with each step / chunk of data they submit. In practice, that can mean a number of logistical choices. Perhaps the ‘partial’ data is staged in a JSON column on the table, where the object is built up before the record is saved. Perhaps all of the columns are just NULLable and we fill each chunk in as it arrives. Perhaps we have a second table altogether that stores the object ‘candidate’ as it’s built up, and once it’s completely built we create a ‘real’ object from the ‘candidate’.

The one commonality between these methods that’s worth calling out for clarity is that a new record is created each time the wizard is used. In the JSON example, a mostly-empty record needs to be created so that the JSON column first exists to be filled; in the NULLable columns example, a record needs to be created with all NULLs to get filled in per-step; in the candidate-object example, the candidate record needs to be created to house all the candidate data. Generally this isn’t an issue of size or database bloat (we live in a world where Postgres routinely plays with billions of records in milliseconds), but more one of understanding ahead of time.
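
As a minimal sketch of the NULLable-columns / candidate idea, here’s a plain Ruby stand-in for what would really be a database row. CarCandidate and its method names are hypothetical, not part of Rails or of this series’ later code; the point is only that the record exists from the first step and is filled in as each chunk arrives.

```ruby
# Hypothetical in-memory stand-in for a "candidate" row; in a real app this
# would be a database record whose columns all start out NULL.
CarCandidate = Struct.new(:make, :model, :year, :miles, keyword_init: true) do
  # Fill in whichever attributes this step supplied.
  def apply_step(step_params)
    step_params.each { |name, value| self[name] = value }
    self
  end

  # Names of the attributes that are still empty (the NULL columns).
  def missing_attributes
    members.select { |name| self[name].nil? }
  end

  def complete?
    missing_attributes.empty?
  end
end

candidate = CarCandidate.new # created up front, every attribute nil
candidate.apply_step({ make: "ford", model: "explorer", year: 2012 })
candidate.apply_step({ miles: 212_000 })
candidate.complete? # => true
```

Only once complete? returns true would the candidate be promoted to (or treated as) a ‘real’ record.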

Now, it’s worth noting here that we want to avoid conflating a record being created in the database with a change to the URL the user sees. These are two distinct concepts, and we certainly don’t have to change the URL even if we create a new record. This gets into RESTfulness, so I’ll save it for Part 3 and we can move on.

Cache Persistence

Lastly, there’s the cache persistence route. The idea here is similar to the session-persistence route in that data is compiled step-by-step into a hash (outside of a database record) as the user submits each form step. The key difference is that the data is stored in Rails.cache rather than in the individual user’s session store. This setup does have a couple of gotchas: you need to be using a centralized third-party cache (like Redis) shared across all of your Rails servers, and you need to make sure your cache eviction policy won’t suddenly drop a user’s wizard data while they’re half-way through the form. In exchange, this route gains the benefits of the session route without the scaling drawbacks or the conflation of concerns.
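
Here’s a rough sketch of that flow. So that it runs without Rails, it uses a tiny in-memory FakeCache that only mimics the read / write (with expires_in) shape of Rails.cache. FakeCache, store_step, the key format, and the one-hour TTL are all assumptions for illustration, not this series’ actual implementation.

```ruby
require "securerandom"

# Tiny in-memory stand-in that mimics the read/write shape of Rails.cache so
# this sketch runs without Rails. It is NOT a real cache store.
class FakeCache
  def initialize
    @entries = {}
  end

  def write(key, value, expires_in:)
    @entries[key] = { value: value, expires_at: Time.now + expires_in }
    true
  end

  def read(key)
    entry = @entries[key]
    return nil if entry.nil? || Time.now >= entry[:expires_at]

    entry[:value]
  end
end

CACHE = FakeCache.new
WIZARD_TTL_SECONDS = 60 * 60 # assumed: an hour is enough to finish the form

# Merge one step's params into whatever is already cached for this wizard.
def store_step(wizard_key, step_params)
  attributes = (CACHE.read(wizard_key) || {}).merge(step_params)
  CACHE.write(wizard_key, attributes, expires_in: WIZARD_TTL_SECONDS)
  attributes
end

# Only this small key would need to live in the user's session.
wizard_key = "wizard/car/#{SecureRandom.hex(16)}"

store_step(wizard_key, { make: "ford", model: "explorer", year: 2012 })
store_step(wizard_key, { miles: 212_000 })
car_attributes = CACHE.read(wizard_key)
```

The session only has to remember wizard_key; the attributes themselves stay in the cache until the final step reads them back out.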

Which to Choose

Both database persistence and cache persistence can be great choices. It really comes down to how you intend to use your wizard, what your wizard represents, and who’s going to use it. Two vectors that can be helpful in sniffing out which persistence method to use are your domain-model and your authentication expectations.

Examine Your Domain-Model

Take a moment to really think about your business’ logical domain. What data is your application actually using to do what your application is meant to do? Yes, that’s a vague question, but put into better context here: is your application going to need to know which particular step of the wizard the user is currently on or abandoned on? Is mid-wizard progress status going to be a factor of your application’s domain-model and a data-point that you’ll need to interact with in other models? Or does your application only need to know whether the wizard was completed or not?

Let me tie this back to Agent Pronto for some added context. Our initial site wizard is critically important to our business and we care deeply about measuring the stats behind how folks move through the workflow and/or where they abandon the wizard. But whether or not folks leave the wizard early isn’t actually part of our domain model. That might sound counterintuitive, but think about it in the context of other models — none of our other models actually care about the in-progress status of a particular wizard submission. They only care once it’s actually complete and becomes a ‘full’ record. Our analytics and products teams certainly want to know about the nature of the wizard’s performance, but that’s derived from (front-end) analytics data. That’s distinctly different than being a member of our domain model.

We can contrast that with a simple, hypothetical medical services portal. Let’s assume our company, ‘MadeUp-Med’, allowed new users to register and see the portal, but required that they complete and e-sign a long multi-step disclosure form before actually showing any personal medical data to the user. At some point they realized that the wizard is quite long and determined that legally, they could show some of the content to the user as the user filled out each step in the wizard. This allowed the users to progress in the wizard over multiple sessions but still get most of the information they needed while only a few steps into the wizard. Contrived? Maybe. Regardless, that’s a good case of the domain model being very aware of the wizard’s in-progress steps so that it can know whether to render some data for the user or not. Other models changing behavior based on the wizard’s mid-progress status means that the wizard’s step-by-step status is part of your domain model. At AP, our models only care if a wizard was completed (and thus the ‘full’ record was created), so the wizard’s mid-progress status is not part of our domain model.
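
To illustrate what “other models changing behavior” might look like for MadeUp-Med, here’s a small hypothetical sketch in plain Ruby. The Disclosure and PatientPortal classes, the step names, and the section names are all invented for this example.

```ruby
# Hypothetical MadeUp-Med models, written as plain Ruby objects for brevity.
# Step and section names are invented for this illustration.
Disclosure = Struct.new(:completed_steps) do
  def step_completed?(step)
    completed_steps.include?(step)
  end
end

class PatientPortal
  # Which wizard step must be finished before each section may be shown.
  REQUIRED_STEP = {
    appointments: :contact_consent,
    lab_results: :records_release
  }.freeze

  def initialize(disclosure)
    @disclosure = disclosure
  end

  # Another part of the domain changes behavior based on mid-wizard progress.
  def visible_sections
    REQUIRED_STEP.select { |_section, step| @disclosure.step_completed?(step) }.keys
  end
end

in_progress = Disclosure.new([:contact_consent])
PatientPortal.new(in_progress).visible_sections # => [:appointments]
```

Because PatientPortal has to ask the disclosure which steps are done, the wizard’s progress is part of the domain model here.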

All of that to say, if the wizard’s mid-progress status is not part of your domain model, you may be more drawn to using cache persistence since you won’t need to derive other functionality within your domain-model from the wizard’s in-progress status. Cache persistence tends to lend itself better to more transient usage (more on that below) and the mid-progress status of your form not being relevant to your actual data-model suggests that you’re probably okay with this. Great!

On the other hand, if the wizard’s mid-progress status is part of your domain model, you’ll probably want to take a database persistence approach — both because the step-by-step status will be more easily accessible from your other models (which are database backed) and because you generally want your domain model encoded in a unified way, not stitched together between database tables and cache keys. In a way this kind of comes back to “MVC” since the “M” is really (data-)model and trying to use a split persistence between one part of your data model and another means breaking typical MVC paradigms.

Closing off this section, I like to think of it from this perspective. If a form’s mid-progress status isn’t a part of my domain model and I’m using cache persistence to get the user through the form, it’s akin to loaning them some memory space for a while. My cache is essentially loaning the user a few kilobytes of memory that could otherwise just have been in the user’s own browser (the JavaScript-only wizard equivalent) to facilitate completing the form, and nothing enters my domain model until the form is fully completed. This is just the same for a pure JavaScript wizard: batching all the values up client-side, with the server not getting any of them until the final POST, means that the mid-progress status of the wizard couldn’t be part of your domain model because the data simply wouldn’t be accessible server-side. In this route, according to our domain model, the ‘new record’ is either complete or it doesn’t exist.

On the flip side of the coin, if a wizard’s mid-progress status is paramount to our domain model and we’re using a database persistence approach, we’re not so much lending cache space to our users. We’re allowing them to interact with our domain model without needing to complete the entire ‘new record’ first. This can be a nice feature if your wizard is quite long, likely requires time away from the computer to gather data to complete the wizard, or you simply have other application-level features that may want to inject themselves in at particular steps of the wizard. Looking at you, TurboTax!

Okay, enough about domain models.

Examine Your Users

Another important factor in determining our preferred persistence approach comes down to who’s using our wizard. Let’s consider our wizard users as either being authenticated or unauthenticated (public).

For wizards serving unauthenticated users (e.g. public forms), we can reasonably expect some transient usage. Any time you put a multi-step form on the public internet, a percentage of users will abandon the form mid-way through. While this may be a general internet-truth, it’s an important factor for our persistence plan. For starters, we can presume that we won’t need to keep partially-filled-out wizard data around forever. This is in part just from knowing that a user who visits our public wizard and abandons it half-way through probably won’t come back a year later and expect all the previous data to be filled out (creepy), but also from the fact that anonymous sessions are often lost when a user quits their browser and fires it back up. In addition, a form without any authentication / user object to bind the submission to likely implies that there’s no need to keep track of mid-progress status. That’s not universally true, but if you’ve got a totally public wizard available without any prerequisite domain-model attachment, I’d wager the mid-progress status of your wizard also doesn’t need to be part of your domain model.

All of that to say, if your wizard primarily serves unauthenticated users, you will probably be better suited by a cache persistence strategy. You won’t need to worry about side effects from anonymous wizard users piling up database records that never get completed, you won’t have to worry about cleaning up old unfinished wizard data (as long as you’re using a good cache policy), and you’ll be able to keep your domain model simple by only worrying about completed wizard submissions.

When it comes to authenticated users, we find ourselves in a very different ballpark. On the topic of transient usage, the numbers shift. A user having already signed up and created an account in your application likely won’t abandon your wizard, and your wizard probably serves a more specific purpose that impacts the user in a more direct way. That’s not to say there won’t be abandoned partial-wizard data, but ideally it won’t be the majority — the point here being that cleanup of partial data should be minimal. Beyond that, wizards operating under already-authenticated users can have their data associated to that particular user at the beginning of the wizard. This is a nice way to automatically tie your wizard data into your domain model, but more specifically this foreign-key relationship helps you keep your partially-completed wizard data around in perpetuity. Where in a public form it may be odd to see partially-completed data on a site when revisiting three weeks later, in an authenticated site it may be a welcome experience: “ah, my account remembers what I was working on”. The foreign-keyable nature of authenticated users’ wizards lends itself well to the partial data sticking around long-term. For these reasons, it may often be preferable to use a database persistence strategy if your wizard is going to be for authenticated users.

Pick A Strategy

Okay, so between your domain model examination and user-strategy examination, we should be able to choose a persistence strategy for our wizard. Remember, whichever route we choose is ultimately a design choice, each with their own pros and cons. There is not a universally-correct answer here. It really depends on your project and your wizard needs.

For my Agent Pronto wizard I’m going to choose the cache persistence strategy. Our wizard is publicly available, it’s not tied to any authentication, and our domain model doesn’t require insight into mid-progress wizard status. It makes sense for our use-case. Don’t worry though, I’ll be walking through both in the next few parts of this series.

If your wizard’s use-case doesn’t fit cleanly into one persistence model or the other, I’d offer the following advice. If you felt like the better fit in the domain-model discussion was a cache persistence strategy but then found a database persistence strategy may fit better given the authentication discussion, go with the database persistence model. Having your wizard data cleanly bound to your user model and persisted correctly is more important than simplifying your completed-ness workflow. If, on the other hand, you felt like the better fit in the domain-model discussion was a database persistence strategy but really aligned with the cache strategy in the authentication discussion, still go with the database persistence model. You’ll probably need an occasionally-running background job to clean up old abandoned records for the anonymous wizard users that left half-way through, but trying to go the cache route and still wire up mid-progress status functionality will be more difficult across your domain model.
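
Boiled down, that rule of thumb can be sketched as a tiny Ruby method. The method and argument names are invented for this sketch, and it’s only a summary of the advice above, not a substitute for thinking through your own project.

```ruby
# Encodes the rule of thumb from this section; the method and argument names
# are made up for the sketch.
def choose_persistence(progress_in_domain_model:, authenticated_users:)
  # Cache persistence only when both examinations point to it; any pull
  # toward the database wins the tie.
  if progress_in_domain_model || authenticated_users
    :database
  else
    :cache
  end
end

# A public wizard whose mid-progress status isn't part of the domain model:
choose_persistence(progress_in_domain_model: false, authenticated_users: false) # => :cache
```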

Okay! We’ve got a data persistence strategy! Let’s go on to → Part 3
