ArcticDB
Aug 19, 2024
Anyone who has built models from timeseries data will know that preparation of the data is a very important part of the process.
One method that is often used in data preparation is resampling. In practice, this tends to be downsampling higher frequency data to a lower (and often regular) frequency that will be used in the model.
To cater to the needs of our user community, we have implemented a resample feature in ArcticDB that allows high-performance downsampling of data as part of the read. We achieve this performance with our modern C++ implementation, which runs multi-threaded and can use all available cores.
We will walk through a few examples below to give you an idea of what it can do and how to use it.
Setting up a database in ArcticDB is very straightforward. It just takes a few lines of Python code like this.
import arcticdb as adb
arctic = adb.Arctic("lmdb://arcticdb_resample")
lib = arctic.get_library('resample', create_if_missing=True)
We will use a synthetic data set consisting of 12,000,000 rows of 1-second timeseries data with 3 columns: an id (integer), a price (float) and a category (string). A small section of the data looks like this (the full code, including the data creation code, is in the notebook linked at the end of this article).
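The actual data creation code is in the notebook, but purely as an illustration of the shape of the data, a frame like this could be built along the following lines (the start date, id range, random-walk price and category values below are placeholders of ours, not the notebook's logic).
import numpy as np
import pandas as pd

# Illustrative sketch only: 12,000,000 rows of 1-second data with id, price and category columns.
# The values and start date are placeholders; the real generation code is in the linked notebook.
n = 12_000_000
index = pd.date_range('2024-01-01', periods=n, freq='s', name='time')
rng = np.random.default_rng(0)
mkt_data = pd.DataFrame(
    {
        'id': rng.integers(0, 1000, n),                  # integer column
        'price': 100 + rng.standard_normal(n).cumsum(),  # float column, random-walk style
        'category': rng.choice(['A', 'B', 'C'], n),      # string column
    },
    index=index,
)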
Write the data, which is in a dataframe mkt_data, into a symbol in the library:
sym = 'market_data'
lib.write(sym, mkt_data)
In this first example we will:
- Downsample to 1-minute frequency
- Use different aggregators
Resample can be thought of as a time-based groupby, where the groups are all the rows within each time interval.
# frequency and aggregator params
freq1 = '1min'
aggs1 = {'id': 'max', 'price': 'last', 'category': 'count'}
# create the resample query and apply it on the read
q1 = adb.QueryBuilder().resample(freq1).agg(aggs1)
market_data_1min_df = lib.read(sym, query_builder=q1).data
The last 5 rows of market_data_1min_df look like this.
The ArcticDB resample runs more than 4x faster than reading the full data set and resampling in Pandas. Using the notebook linked at the end of the article on a Dell XPS 15 laptop with 14 cores, ArcticDB takes 171ms vs 750ms for Pandas.
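For reference, one way to write the Pandas side of that comparison looks like this (a sketch only; it assumes mkt_data is already in memory with a datetime index, and the output variable name is ours).
# Sketch: the equivalent downsample in Pandas, with the data already in memory
pandas_1min_df = mkt_data.resample(freq1).agg(aggs1)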
In the next example we will:
- Downsample to 5-minute frequency
- Apply both the last and count aggregators to the price column
For multiple aggregators, the syntax is output_column_name: (input_column_name, aggregator).
q2 = adb.QueryBuilder()
q2 = q2.resample('5min').agg({
    'id': 'max',
    'price_last': ('price', 'last'),
    'price_count': ('price', 'count'),
    'category': 'first'
})
lib.read(sym, query_builder=q2).data
The first few and last few rows of the output look like this.
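Because resampling is a time-based groupby, the named-output syntax above has a close Pandas analogue. As a rough sketch (our variable names, and again assuming mkt_data is already in memory with a datetime index):
import pandas as pd
# Sketch: the same named aggregations in Pandas via a time-based groupby
pandas_5min_df = mkt_data.groupby(pd.Grouper(freq='5min')).agg(
    id=('id', 'max'),
    price_last=('price', 'last'),
    price_count=('price', 'count'),
    category=('category', 'first'),
)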
In this example we will:
- Downsample to a 2-minute and 30-second frequency
- Group the resampled data by the string column category
- Aggregate the id column within each category group using mean
q3 = adb.QueryBuilder()
q3 = (q3.resample('2min30s')
        .agg({'id': 'min', 'category': 'first'})
        .groupby('category')
        .agg({'id': 'mean'}))
lib.read(sym, query_builder=q3).data
The output looks like this
In the final example we will:
- Downsample to 5-minute frequency
- Use multiple aggregators on the price column
This is a simple example of how to convert tick data to OHLC bar data.
freq_ohlc = '5min'
agg_ohlc = {
    'open': ('price', 'first'),
    'high': ('price', 'max'),
    'low': ('price', 'min'),
    'close': ('price', 'last')
}
q_ohlc = adb.QueryBuilder().resample(freq_ohlc).agg(agg_ohlc)
ohlc_bars = lib.read(sym, query_builder=q_ohlc).data
The first few lines of the output look like this
In a run of the same code with 500,000,000 rows of data, this bar generation took 2.6s on the same hardware setup described above. Pandas took 44s for the same calculation with the data already in memory. You can try this yourself if you have enough memory on your machine (it may not run in the free Google Colab environment).
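For reference, one way to express that calculation on the Pandas side is shown below (a sketch; Pandas resample has a built-in ohlc aggregation, and the variable name is ours).
# Sketch: OHLC bars in Pandas, with the data already in memory
pandas_ohlc = mkt_data['price'].resample(freq_ohlc).ohlc()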
We have shown the capabilities of the new ArcticDB resample feature, in particular:
- It is very easy to use, especially if you already know Pandas
- It performs well, giving a 4x speed boost in our simple example vs reading the data and resampling in Pandas
- It can be combined with other ArcticDB query features to create processing pipelines (see the sketch below)
- It can be used to generate timeseries bar data
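As a sketch of the pipeline point above (our own illustration, not from the notebook, and assuming a projection can be chained after the resample clause), the OHLC query could be extended with a derived column like this.
# Sketch only: chain a projection onto the OHLC query to add a high-low range column.
# Assumes a projection (apply) can follow resample().agg() in the QueryBuilder pipeline.
q_range = adb.QueryBuilder().resample(freq_ohlc).agg(agg_ohlc)
q_range = q_range.apply('range', q_range['high'] - q_range['low'])
bars_with_range = lib.read(sym, query_builder=q_range).data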
Also, we have plans to add more resample and query features and further improve performance going forward.
We hope you have enjoyed this article. Here are some follow-ups if you would like to try this yourself.
The full notebook for these examples is available on our docs website at https://docs.arcticdb.io/latest/notebooks/ArcticDB_demo_resample/ and can be run on Google Colab with one click and no additional setup.
The documentation for the ArcticDB QueryBuilder, including resample, is at https://docs.arcticdb.io/latest/api/query_builder/.
Get in touch with us via our website https://arcticdb.io/ if you would like to know more.