Time travel in ArcticDB

Alex Owens

Apr 12, 2024

Previous blog posts have talked about the immutability of data stored in ArcticDB as a means of optimising performance, improving horizontal scalability, and preventing data corruption. Another consequence of this immutability is the option to “time travel” back to earlier versions of a symbol. This is also known as “bitemporality” in the use case where the symbol being stored has a datetime index, as you can retrieve data not only for a specific date-range in the latest version of the symbol, but also from an older version as it looked at a particular point in time.
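As a rough sketch of what such a bitemporal read could look like, the helper below combines the date_range and as_of arguments of read() in a single call. The helper name read_range_as_of is hypothetical, and lib is assumed to be an ArcticDB library handle that has already been opened elsewhere.

```python
def read_range_as_of(lib, symbol, start, end, version):
    """Read the rows of `symbol` between `start` and `end` as stored in `version`.

    `lib` is assumed to be an already-opened ArcticDB library; this helper
    is illustrative and not part of ArcticDB itself.
    """
    # date_range restricts the rows by index value; as_of picks which
    # version of the symbol those rows are taken from.
    item = lib.read(symbol, date_range=(start, end), as_of=version)
    return item.data
```

For instance, read_range_as_of(lib, "AAPL", pandas.Timestamp("2024-01-02"), pandas.Timestamp("2024-01-03"), 2) would ask for the rows in that date range as they were stored in version 2 of the symbol.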

So why is this useful? Let’s work through a toy example of a real-life use case to demonstrate what makes this feature so powerful.

Let’s say you are ingesting data from a data vendor, and each day they provide CSV files containing the open, close, low, high, volume, and trades data for every minute of live trading for a universe of securities for that trading day:

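import pandas

# parse_dates=True makes the "datetime" column a DatetimeIndex rather than strings
with open("AAPL_2024-01-01.csv", "r") as f:
    daily_df = pandas.read_csv(f, index_col="datetime", parse_dates=True)
lib.append("AAPL", daily_df)
print(lib.read("AAPL").data)

                         open     close       low      high    volume  trades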
datetime                                                                     
2024-01-01 08:00:00  0.758089  0.339878  0.781040  0.098037  0.134402     346
2024-01-01 16:29:00  0.221275  0.773076  0.321803  0.586879  0.250758     201

Only the first and last minutes of the day are shown here to keep the example small, and the values are randomly generated. The general flow each day will look like the above for each security in the universe. However, the data vendor may also sometimes restate data for earlier time periods to correct errors in the original data provided. Let’s assume that we have standard daily data for the first 3 days of 2024:

                         open     close       low      high    volume  trades
datetime                                                                     
2024-01-01 08:00:00  0.758089  0.339878  0.781040  0.098037  0.134402     346
2024-01-01 16:29:00  0.221275  0.773076  0.321803  0.586879  0.250758     201
2024-01-02 08:00:00  0.543514  0.121713  0.908510  0.859530  0.673515     810
2024-01-02 16:29:00  0.363511  0.506118  0.445974  0.031269  0.743444     665
2024-01-03 08:00:00  0.054882  0.319868  0.520630  0.986617  0.812310     847
2024-01-03 16:29:00  0.237748  0.932168  0.990305  0.026718  0.109527     527

Then on the fourth day, the vendor provides both the usual daily data, and restates the data for 2024-01-02. Writing the data to ArcticDB will now include, alongside the usual daily append, code such as:

with open("AAPL_2024-01-02_restatement.csv", "r") as f:
    restated_df = pandas.read_csv(f, index_col="datetime", parse_dates=True)
# update replaces the stored rows that fall within the index range of restated_df
lib.update("AAPL", restated_df)
print(lib.read("AAPL").data)

                         open     close       low      high    volume  trades
datetime                                                                     
2024-01-01 08:00:00  0.758089  0.339878  0.781040  0.098037  0.134402     346
2024-01-01 16:29:00  0.221275  0.773076  0.321803  0.586879  0.250758     201
2024-01-02 08:00:00  0.726245  0.209358  0.382218  0.652552  0.060542     452
2024-01-02 16:29:00  0.912180  0.317210  0.460713  0.581922  0.330218     642
2024-01-03 08:00:00  0.054882  0.319868  0.520630  0.986617  0.812310     847
2024-01-03 16:29:00  0.237748  0.932168  0.990305  0.026718  0.109527     527
2024-01-04 08:00:00  0.813585  0.571894  0.813374  0.598003  0.542102     184
2024-01-04 16:29:00  0.808495  0.375533  0.665888  0.847430  0.885961     245

where the third and fourth rows are now different from what they were before. We now have a problem. Let’s say we decided to make a trade on 2024-01-03, based on the information available to us at the time. If we could only read the latest version of the symbol AAPL, then the price data from 2024-01-02 would not match the data as it was when we made the decision. This is where time travel comes to the rescue. The as_of parameter of read() can be used to specify that we want to read an earlier version, not just the latest. In this example:

print(lib.read("AAPL", as_of=2).data)

                         open     close       low      high    volume  trades
datetime                                                                     
2024-01-01 08:00:00  0.758089  0.339878  0.781040  0.098037  0.134402     346
2024-01-01 16:29:00  0.221275  0.773076  0.321803  0.586879  0.250758     201
2024-01-02 08:00:00  0.543514  0.121713  0.908510  0.859530  0.673515     810
2024-01-02 16:29:00  0.363511  0.506118  0.445974  0.031269  0.743444     665
2024-01-03 08:00:00  0.054882  0.319868  0.520630  0.986617  0.812310     847
2024-01-03 16:29:00  0.237748  0.932168  0.990305  0.026718  0.109527     527

will show us the data exactly as it looked immediately after the append on 2024-01-03 (version numbers start at 0, so the appends on the first three days created versions 0, 1, and 2). This raises the question: how do we know which version to read?

There are three options. First, all modifying operations (write, append, and update) return a VersionedItem object. Among other things, this object has a version attribute telling you the version number of the symbol that was just created. This information can be stored, either in a separate symbol or in another database entirely, depending on exact requirements.
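A minimal sketch of the first option might look like the following. The helper names and the in-memory version_log list are hypothetical stand-ins for whatever store is actually used, and lib is assumed to be an already-opened ArcticDB library. The only ArcticDB behaviour relied upon is that append returns a VersionedItem with a version attribute, and that read accepts a version number as as_of.

```python
def append_and_record(lib, symbol, df, version_log):
    """Append `df` to `symbol` and note which version the append created.

    `version_log` is a plain list used here as a stand-in for a separate
    symbol or an external database (an assumption of this sketch).
    """
    item = lib.append(symbol, df)  # modifying operations return a VersionedItem
    version_log.append({"symbol": symbol, "version": item.version})
    return item.version


def read_recorded(lib, entry):
    """Read a symbol exactly as it was at a previously recorded version."""
    return lib.read(entry["symbol"], as_of=entry["version"]).data
```

Recording the version at the moment a trading decision is made means the same entry can later be passed to read_recorded to reproduce the data the decision was based on.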

Alternatively, a timestamp (in UTC) can be provided to as_of, which will retrieve the version that was the latest at the specified point in time. If, for example, the vendor provides the data at 5pm each day, and we always write it into ArcticDB by 6pm, then

lib.read("AAPL", as_of=pandas.Timestamp("2024-01-03T18:00:00"))

will retrieve the same data as requesting version 2.
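To make the timestamp lookup concrete, here is a toy model in plain Python of "the version that was the latest at a given time". It is not ArcticDB's implementation: the write times are made up to fit the 5pm/6pm schedule described here, the order of the two writes on the fourth day is assumed, and treating a write at exactly the requested instant as visible is a choice made for this sketch only.

```python
from bisect import bisect_right
from datetime import datetime


def latest_version_at(write_times, when):
    """Return the number of the newest version written at or before `when`.

    `write_times[i]` is the (hypothetical) UTC time at which version i was
    created, in ascending order. Returns None if nothing had been written yet.
    """
    index = bisect_right(write_times, when) - 1
    return index if index >= 0 else None


# Made-up write times: one append per day, plus a restatement on day 4.
write_times = [
    datetime(2024, 1, 1, 17, 30),  # version 0: append of 2024-01-01
    datetime(2024, 1, 2, 17, 30),  # version 1: append of 2024-01-02
    datetime(2024, 1, 3, 17, 30),  # version 2: append of 2024-01-03
    datetime(2024, 1, 4, 17, 30),  # version 3: restatement of 2024-01-02
    datetime(2024, 1, 4, 17, 35),  # version 4: append of 2024-01-04
]

print(latest_version_at(write_times, datetime(2024, 1, 3, 18, 0)))  # 2
```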

Finally, we can use snapshots to define a collection of symbol-version pairs exactly as they were at a point in time. The snapshot name can then be provided to the as_of argument to retrieve the version of a symbol referenced by that snapshot:

lib.read("AAPL", as_of="snapshot-2024-01-03")

In the trivial example given here, there is no benefit to using snapshots over timestamps. However, in the case where symbols are being modified continuously, and trading decisions are based on multiple symbols rather than just one, snapshots offer a way to refer to a set of symbol-version pairs exactly as they were, without having to worry about details such as clock skew.
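One way the multi-symbol case could be handled is sketched below. The helper names and the snapshot naming scheme are assumptions made for illustration, and lib is again assumed to be an already-opened ArcticDB library. The sketch relies only on the library's snapshot method to create a named snapshot and on read accepting that name as as_of.

```python
def snapshot_end_of_day(lib, trading_date):
    """Snapshot the library once all of the day's writes have finished.

    The "snapshot-<date>" naming scheme is a convention of this sketch.
    """
    name = f"snapshot-{trading_date}"
    lib.snapshot(name)
    return name


def read_universe_as_of(lib, symbols, snapshot_name):
    """Read several symbols as they were when the snapshot was taken."""
    return {
        symbol: lib.read(symbol, as_of=snapshot_name).data
        for symbol in symbols
    }
```

Because every symbol is read through the same snapshot name, the resulting dataframes are mutually consistent even if some symbols were modified again shortly after the snapshot was taken.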

But what if you don’t need this functionality, and only ever care about the latest version? Keeping old data that will never be read again would then waste disk space. For this situation, all of the modifying operations have a prune_previous_versions argument that will delete older versions of the symbol as the new version is written, except for any versions referenced by a snapshot.
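As a small illustration, a daily ingest that only ever needs the latest data might wrap the append like this. The helper name is hypothetical and lib is assumed to be an already-opened ArcticDB library.

```python
def append_keep_latest_only(lib, symbol, df):
    """Append `df` and prune older versions of `symbol` in the same call.

    Versions referenced by a snapshot are kept, as described in the text.
    """
    item = lib.append(symbol, df, prune_previous_versions=True)
    return item.version
```

The same keyword can be passed to write and update in the same way.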