On building a time machine
How we implemented time-bound snapshots for one of the biggest tables in Alasco
At Alasco, one of our most important promises to our customers is that they remain in control of the finances of their construction projects, especially costs. We have a big table called “The Cost Controlling Table” that brings joy to our customers and butterflies to the stomach of any software engineer who has ever touched it. It’s one of our biggest tables. It contains all the cost numbers at a glance, grouped hierarchically by different levels.
But having a cost controlling table that lets you stay in control of the numbers “as of right now” is only part of the solution. Our customers wanted to be able to compare the numbers between any two given dates, and in order to do that, we had to build a time machine.
On building a time machine
Time travel is a fascinating topic. It has fueled countless books, movies and personal nightmares. Being able to experience or even change the past is as exciting as it sounds. The future as well, but we humans tend to be more fixated on the past. However, the excitement quickly loses its spark as soon as time travel paradoxes arise: can you change events in the past in a way that would have prevented your time travel in the first place?
For better or worse, humans haven’t developed a true time machine yet, so these paradoxes remain theoretical problems. But in software, building a time machine is not that hard, and neither is running into a handful of paradoxes along the way…
Let’s take any data in any storage. Implementing time travel for that data can be as simple as adding a time dimension to it. We do this every time we put a version on something, be it a package, an API endpoint, or even the data itself. But versions are a very discrete time dimension: we don’t expect them to change that often, and we typically care only about the most recent one. So they are not true time travel.
If we really want to travel in time, the time dimension has to track actual time as closely as possible. The higher the precision of the desired time travel, the more precise the timestamps we need to store as that dimension.
AWS for instance has a point-in-time recovery backup service that lets you go back to a point in time with precision up to the second. However, this highly precise time travel feature has a hard limit of 35 days in the past. It’s not hard to imagine why. Keeping the metadata needed to recover up-to-the-second snapshots of the data quickly becomes a problem of storage space. Yes, even for Amazon.
The three main strategies for “snapshotting” the past are:
- Full: keeping a copy of the data as it was at any given moment in time.
- Incremental: keeping an initial copy of the data plus a copy of the actions that modified the data over time.
- Differential: a combination of the first two: full snapshots combined with incremental snapshots at intervals of time.
Number one provides faster reads at the expense of higher storage usage. Number two optimises for space at the cost of a longer time to reach the point-in-time destination, because the actions need to be replayed from the initial copy up to the desired moment. The third one is a compromise between the other two. All of them are perfectly valid for different backup needs, but not so much for our cost controlling table and our mission to keep finances under control at a glance.
Speed is crucial for us. So is development pragmatism. So we decided to take option number one, because it optimises for reading and didn’t come with the overhead of building (or outsourcing) a reliable replay mechanism. It wasn’t an easy choice, though: storing all the data at any given point in time comes with the storage problem attached. But this is time travel we’re doing here, so we did our best to optimise storage while still leaving a few potential headaches for “future us” to solve.
Slowly changing dimensions
The kind of change that happens in our cost controlling table matches that of a slowly changing dimension. The table is a byproduct of other actions that occur in the system, like uploading an invoice or creating a new version of the budget, and while these changes happen very often during working hours, they don’t happen as frequently after business hours or over the weekends. Holiday season also moves the needle toward the calmer side.
So we went with a “slowly changing dimension” of type 2, attaching valid_from and valid_to columns to each of our rows.
Here’s an example in Django:
from django.db import models

class CostControllingSnapshot(models.Model):
    valid_from = models.DateField()
    valid_to = models.DateField(blank=True, null=True)
    # sixty-something more columns here with numbers
Having valid_to=null means that the respective row is still in effect.
And this is how to query for the rows that were valid on a specific date. The negated Q makes the filter work whether valid_to is set or null.
from django.db.models import Q

def filter_for_date(target_date):
    return CostControllingSnapshot.objects.filter(
        Q(valid_from__lte=target_date)
        & ~Q(valid_to__lte=target_date)
    )
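For illustration, fetching the snapshot as it looked on a given day could then be as simple as the call below (the concrete date is just a made-up example):

from datetime import date

# All rows that were in effect on the last day of September
rows = filter_for_date(date(2023, 9, 30))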
Notice how the example uses DateField instead of DateTimeField. This is a very important design decision with major consequences. True time travel requires the time component, and it is very wise to include it by default. At the end of the day, time travel is not the only hard problem involving time. Some say that “time zoning” is an even harder one. And there is great wisdom in that.
But we are a fast-paced startup and value practicality. So we did what was best for the use case at hand and, yet again, left a few loose ends for the near future. We are optimistic that those DateFields will become DateTimeFields sooner rather than later, but the implementation of the other components of this time-aware cost controlling table was so complex, and required such a smooth transition, that we agreed to cut some corners in order to focus on what delivered the most value in the shortest amount of time.
Allow me to explain: our UI for this time machine feature doesn’t have a time picker, only a date picker. This in itself is dangerously misleading, but reveals an interesting fact about our product reality. Our customers really don’t imagine themselves time-travelling to yesterday before lunch. Yesterday is enough for them, for now. They, of course, have the privilege of being part of single-time-zone companies, as much as we have the privilege of having single-time-zone customers. (A privilege we do hope to forfeit at some point, though. 😉)
You probably see where I am going: when a project manager in Kyoto needs to agree with their counterpart in Frankfurt on how the budget execution ended at the end of last month, they first need to agree on the timezone that marks the end of the month for both. Therefore, what our pragmatic DateField currently says is: whatever the company’s timezone is, use it to infer the missing time and timezone components. When our product demands greater complexity, both the code and the database schema can follow suit.
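To illustrate what that inference could look like, here is a minimal sketch. The helper and the company_timezone value are hypothetical, not part of our actual codebase; it simply turns a plain date into the last instant of that day in the company’s timezone:

from datetime import date, datetime, time
from zoneinfo import ZoneInfo

def end_of_day(target_date: date, company_timezone: str) -> datetime:
    # Interpret a plain date as "the end of that day" in the company's timezone,
    # e.g. end_of_day(date(2023, 9, 30), "Europe/Berlin")
    # -> 2023-09-30 23:59:59.999999+02:00
    return datetime.combine(target_date, time.max, tzinfo=ZoneInfo(company_timezone))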
On the positive side, having a simple DateField has the advantage that we don’t need to track more than one set of changes per day. As long as the day hasn’t changed, data mutations overwrite one another, because we are only interested in the latest state of each day. This saves a considerable number of rows.
But when does the day change, you may ask? Well, with companies currently operating in a single timezone, the day ends when the day in their timezone ends. It doesn’t matter that users upload invoices from New York: if they do that at 2 AM Central European Time, the change will be tracked under the day that has just started in CET.
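To make the overwrite-within-a-day idea concrete, here is a simplified sketch of how a daily snapshot could be recorded. It assumes a single row per day for brevity (the real table has many rows per snapshot), and the helper name and the numbers dict are purely illustrative:

from datetime import datetime
from zoneinfo import ZoneInfo

def write_snapshot(company_timezone, numbers):
    # "Today" is whatever day it currently is in the company's timezone.
    today = datetime.now(ZoneInfo(company_timezone)).date()

    current = CostControllingSnapshot.objects.filter(valid_to__isnull=True).first()
    if current is not None and current.valid_from == today:
        # Same day: overwrite the open row, we only keep the latest state per day.
        for field, value in numbers.items():
            setattr(current, field, value)
        current.save()
        return

    if current is not None:
        # The day has changed: close the open row up to (and excluding) today...
        current.valid_to = today
        current.save(update_fields=["valid_to"])
    # ...and start a fresh row that is valid from today onwards.
    CostControllingSnapshot.objects.create(valid_from=today, **numbers)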
Careful with time travel on top of time travel
As we said, our cost controlling table is not directly edited by users, but recalculated as a side effect of other actions happening in the system. And because these recalculations can be time-consuming, we cannot afford to do them synchronously.
Every time a triggering action happens, an asynchronous task is scheduled. A good old task broker with at-least-once delivery is more than enough for us. We only care that the numbers are recomputed, and if they are recomputed twice, that’s also fine. While most brokers guarantee “at least once” delivery, the question of “at least when” is not so easy to answer. Depending on the configuration of the consumers, scheduled tasks can have a long retry (or visibility) window and still be sitting around pending execution many hours or even days after being scheduled. If this is overlooked, interesting things will happen.
Like that time when we thought it was a good idea to pass the “triggering date” as an argument to the asynchronous task. Our train of thought was: if the triggering event happened on day X, it would surely be nice to use that very day during the task execution itself, you know, just in case the day changes in the middle. A noble intention that almost ended in disaster.
Due to a misconfiguration we had, one of those tasks for day X lingered around for 3 days and executed on day X+3 using the hardcoded date of day X. Historical data corruption followed, accompanied by a series of facepalm emojis. But we learned our lesson.
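The fix, in essence: never pass the date to the task, and let the task derive it at execution time instead. Sketched here with Celery purely for illustration, reusing write_snapshot() from the earlier sketch; recalculate_numbers and company_timezone_for are hypothetical helpers, not our actual API:

from celery import shared_task

@shared_task
def recompute_cost_controlling(project_id):
    # No date argument on purpose: the snapshot date is derived inside
    # write_snapshot() when the task actually runs, so a task that sat in
    # the queue for three days can no longer overwrite historical rows.
    numbers = recalculate_numbers(project_id)  # hypothetical recomputation step
    write_snapshot(company_timezone_for(project_id), numbers)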
A jump into the future
The time machine for cost controlling has been running since the spring of this year, although we officially launched it in September. We, of course, all had to be on the same page that in the world of software, time machines first need to be “started” and left running for a while before it’s actually possible to travel back in time.
We also thought it would be cool to give our customers some travel options at the time of launch. We would have loved to give them a full quarter at least, but we had to correct the logic a few times over the summer, so we decided to move the time boundary ahead to a point where we were confident about the correctness of the data.
These logic corrections also made us realise the types of time travel paradoxes we would be solving here. Think of our little AHA moment a few paragraphs ago, and let’s generalise a bit: if on day X you encounter a bug in the logic for computing certain numbers in the table, would you go back to the day you introduced the bug and correct the data from that moment on, or would you rather keep it as is? Nontrivial dilemma: either you change the past in order to make it correct, or you keep it wrong in order to keep it consistent.
Here’s another one: imagine you actually wanted to go back to the day you introduced the bug and correct from there. Would you use the up-to-date code of day X (which can potentially even include new features that were not available back then), or would you take the trouble to actually visit every day and use the code of that day (minus the bug) in order to have a high-fidelity replay?
Sooner or later a decision must be made on those.
But let’s look into the future instead.
What problems can we foresee at this moment in time?
Well, the database keeps growing. We currently have a little over 3 million rows of time machine data, and we add roughly a million every three months, i.e. about 4 million rows per year. If the trend continues, the number of rows itself shouldn’t become a problem for a while. Disk space is similarly manageable: we’re currently adding roughly half a gigabyte per million rows, which amounts to roughly 2GB per year. Query time is what we need to keep monitoring, together with our querying patterns, so that it doesn’t degrade over time.
But thinking of hypothetical scenarios, if things were to get slower than acceptable and we couldn’t solve it with query optimisation alone, we could opt out of daily fidelity for data as it gets old. Or we could retain daily fidelity, but move very old data to a different storage with different response-time expectations. All this, of course, based on what the needs of our customers are and how our product itself evolves.
For now we are happy that the time machine is running and being used. We are also preparing ourselves for potential bug reports as interesting as “Yesterday I exported the table of last month and it’s not the same as the table of last month I exported today”. Hopefully there won’t be many of those, or any at all. We’ll see.