Alex Owens
Apr 12, 2024
Previous blog posts have talked about the immutability of data stored in ArcticDB as a means of optimising performance, improving horizontal scalability, and preventing data corruption. Another consequence of this immutability is the option to “time travel” back to earlier versions of a symbol. This is also known as “bitemporality” in the use case where the symbol being stored has a datetime index, as you can retrieve data not only for a specific date-range in the latest version of the symbol, but also from an older version as it looked at a particular point in time.
So why is this useful? Let’s work through a toy example of a real-life use case to demonstrate what makes this feature so powerful.
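Throughout, lib refers to an ArcticDB library handle. A minimal setup sketch, where the LMDB URI and library name are illustrative assumptions (any supported storage backend works):

import pandas
from arcticdb import Arctic

# Connect to a local LMDB-backed ArcticDB instance (illustrative URI)
ac = Arctic("lmdb:///tmp/arcticdb_demo")
# Get or create the library that will hold one symbol per security
lib = ac.get_library("vendor_data", create_if_missing=True)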
Let’s say you are ingesting data from a data vendor, and each day they provide CSV files containing the open, close, high, low, volume, and trades data for every minute of live trading for a universe of securities for that trading day:
with open("AAPL_2024-01-01.csv", "r") as f:
daily_df = pandas.read_csv(f, index_col="datetime")
lib.append("AAPL", daily_df)
print(lib.read("AAPL").data) open close low high volume trades
datetime
2024-01-01 08:00:00 0.758089 0.339878 0.781040 0.098037 0.134402 346
2024-01-01 16:29:00 0.221275 0.773076 0.321803 0.586879 0.250758 201
where I have just included the first and last minutes of the day to keep the size of the data down, and the values are randomly generated. The general flow each day will look like the above for each security in the universe (see the loop sketch after the table below). However, the data vendor may also sometimes restate data for earlier time periods to correct errors in the original data provided. Let’s assume that we have standard daily data for the first 3 days of 2024:
                         open     close       low      high    volume  trades
datetime
2024-01-01 08:00:00  0.758089  0.339878  0.781040  0.098037  0.134402     346
2024-01-01 16:29:00  0.221275  0.773076  0.321803  0.586879  0.250758     201
2024-01-02 08:00:00  0.543514  0.121713  0.908510  0.859530  0.673515     810
2024-01-02 16:29:00  0.363511  0.506118  0.445974  0.031269  0.743444     665
2024-01-03 08:00:00  0.054882  0.319868  0.520630  0.986617  0.812310     847
2024-01-03 16:29:00  0.237748  0.932168  0.990305  0.026718  0.109527     527
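As flagged above, each day’s ingestion is just this pattern repeated per security. A minimal sketch, where the universe list and file-naming scheme are illustrative assumptions:

universe = ["AAPL", "MSFT", "GOOG"]  # illustrative
for ticker in universe:
    with open(f"{ticker}_2024-01-03.csv", "r") as f:
        df = pandas.read_csv(f, index_col="datetime", parse_dates=True)
    # Each append creates a new version of the ticker's symbol
    lib.append(ticker, df)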
Then on the fourth day, the vendor provides both the usual daily data and a restatement of the data for 2024-01-02. Writing the data to ArcticDB will now include code such as:
with open("AAPL_2024-01-02_restatement.csv", "r") as f:
restated_df = pandas.read_csv(f, index_col="datetime")
lib.update("AAPL", restated_df)
print(lib.read("AAPL").data) open close low high volume trades
datetime
2024-01-01 08:00:00 0.758089 0.339878 0.781040 0.098037 0.134402 346
2024-01-01 16:29:00 0.221275 0.773076 0.321803 0.586879 0.250758 201
2024-01-02 08:00:00 0.726245 0.209358 0.382218 0.652552 0.060542 452
2024-01-02 16:29:00 0.912180 0.317210 0.460713 0.581922 0.330218 642
2024-01-03 08:00:00 0.054882 0.319868 0.520630 0.986617 0.812310 847
2024-01-03 16:29:00 0.237748 0.932168 0.990305 0.026718 0.109527 527
2024-01-04 08:00:00 0.813585 0.571894 0.813374 0.598003 0.542102 184
2024-01-04 16:29:00 0.808495 0.375533 0.665888 0.847430 0.885961 245
where the third and fourth rows are now different from what they were before. We now have a problem. Let’s say we decided to make a trade on 2024-01-03, based on the information available to us at the time. If we could only read the latest version of the symbol AAPL, then the price data for 2024-01-02 would not match the data as it was when we made the decision. This is where time travel comes to the rescue. The as_of parameter to read can be used to specify that we want to read an earlier version, not just the latest. In this example:

print(lib.read("AAPL", as_of=2).data)
                         open     close       low      high    volume  trades
datetime
2024-01-01 08:00:00  0.758089  0.339878  0.781040  0.098037  0.134402     346
2024-01-01 16:29:00  0.221275  0.773076  0.321803  0.586879  0.250758     201
2024-01-02 08:00:00  0.543514  0.121713  0.908510  0.859530  0.673515     810
2024-01-02 16:29:00  0.363511  0.506118  0.445974  0.031269  0.743444     665
2024-01-03 08:00:00  0.054882  0.319868  0.520630  0.986617  0.812310     847
2024-01-03 16:29:00  0.237748  0.932168  0.990305  0.026718  0.109527     527
will show us the data exactly as it looked immediately after the append on 2024-01-03. (Version numbers start at 0, so the appends for the first three days produced versions 0, 1, and 2, and the fourth day’s append and update produced versions 3 and 4.) This then raises the question: how do we know which version to read?
There are three options.

First, all modifying operations (write, append, and update) return a VersionedItem object. Amongst other things, this object has a version attribute telling you the version number of the symbol that was just created. This information can be stored, either in a separate symbol, or in another database entirely, depending on exact requirements.
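A sketch of capturing and reusing that version number; how you persist it between the two steps is up to you:

# Every write/append/update returns a VersionedItem
versioned_item = lib.append("AAPL", daily_df)
print(versioned_item.version)  # e.g. 2 after the 2024-01-03 append

# Later, read back exactly the version we recorded
decision_data = lib.read("AAPL", as_of=versioned_item.version).data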
Alternatively, a timestamp (in UTC) can be provided to as_of, which will retrieve the version that was the latest at the specified point in time. If, for example, the vendor provides the data at 5pm each day, and we always write it into ArcticDB by 6pm, then

lib.read("AAPL", as_of=pandas.Timestamp("2024-01-03T18:00:00"))

will retrieve the same data as requesting version 2.
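This is the bitemporal access pattern mentioned at the start: as_of can be combined with read’s date_range parameter to ask for a slice of the index as it looked at a given moment. A sketch:

# "What did the 2024-01-02 bars look like as of 6pm on 2024-01-03?"
df = lib.read(
    "AAPL",
    as_of=pandas.Timestamp("2024-01-03T18:00:00"),
    date_range=(
        pandas.Timestamp("2024-01-02"),
        pandas.Timestamp("2024-01-02 23:59:59"),
    ),
).data

This returns the original, un-restated 2024-01-02 rows, since the restatement was not written until 2024-01-04.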
Finally, we can use snapshots to define a collection of symbol-version pairs exactly as they were at a point in time. The snapshot name can then be provided to the as_of argument to retrieve the version of a symbol referenced by that snapshot:

lib.read("AAPL", as_of="snapshot-2024-01-03")
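For this to work, the snapshot must have been taken at the relevant time; a minimal sketch, assuming we snapshot at the end of each day’s ingestion (the naming convention here is illustrative):

# Freeze the current version of every symbol in the library under one name
lib.snapshot("snapshot-2024-01-03")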
In the trivial example given here, there is no benefit to using snapshots over timestamps. However, in the case where symbols are being modified continuously, and trading decisions are based on multiple symbols rather than just one, snapshots offer a way to refer to a set of symbol-version pairs exactly as they were, without having to worry about details such as clock-skew.
But what if you don’t need this functionality, and only ever care about the latest version? Keeping old data around that will never be read again would then be a waste of disk space. For this situation, all of the modifying operations have an argument prune_previous_versions that will delete older versions of the symbol as the new version is written, excepting any versions referenced by a snapshot.