Why you should use ArcticDB instead of CSV to save your Pandas DataFrames

ArcticDB

Jun 7, 2024


Storing time series data in Python? Forget CSVs: ArcticDB is the smarter choice. It's designed for large datasets and offers incredible efficiency. Consider a DataFrame with 1,000 columns, named `c0` to `c999`, each containing 100,000 rows of time series data at one-second intervals, with random floats between 0 and 1 as the values, as shown in the examples below. Saving that to CSV is cumbersome and requires 1.93GB of storage, while ArcticDB stores the same data in just 0.8GB.

Writing speed? On a Jupyter notebook with 2 cores and 8GB of memory, ArcticDB saves the data to local storage in an average of 5 seconds over 5 runs, while saving to CSV lags behind at 180 seconds on average. Reads are also fast with ArcticDB, taking only 1.2 seconds on average over 5 runs, compared to the CSV's 53.5-second average with the pyarrow engine.
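If you want to reproduce the comparison on your own machine, here is a minimal timing sketch. It assumes the `lib` library and `large` DataFrame from the template code further down, and your numbers will of course vary with hardware.

import time

def average_time(fn, runs=5):
    # Run fn several times and return the mean wall-clock time in seconds
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

print("ArcticDB write:", average_time(lambda: lib.write("large", large)))
print("CSV write:", average_time(lambda: large.to_csv("large.csv")))
print("ArcticDB read:", average_time(lambda: lib.read("large")))
print("CSV read:", average_time(lambda: pd.read_csv("large.csv", engine="pyarrow")))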

Also, with ArcticDB you can query the dataset in storage: for example, you can select just the time range you would like to read, whereas with CSV you would need to read the entire file back and run the query locally.
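As a quick sketch of what that looks like (using the `lib` library and `large` symbol from the template code below), a date-range read filters in storage rather than in pandas:

from datetime import datetime

# Read only one hour of the symbol; only that slice is pulled from storage
subset = lib.read(
    "large",
    date_range=(datetime(2020, 1, 2, 0, 0), datetime(2020, 1, 2, 1, 0)),
).data
print(subset.shape)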

ArcticDB works anywhere, from your local PC to cloud storage like S3 or Azure Blob Storage, and simplifies sharing, eliminating the hassle of file transfers. It's optimized for Pandas DataFrames, enabling direct storage and retrieval without complications. With its efficient columnar storage, ArcticDB reduces size and accelerates access. Moreover, its built-in versioning provides a transparent history of your data modifications, an advantage that CSVs lack.
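For instance, pointing the same code at S3 is just a different connection string, and each write creates a new version you can read back later. The endpoint, bucket, and region below are placeholders you would replace with your own:

# Connect to an S3 bucket instead of local LMDB (placeholder endpoint, bucket and region)
arctic_s3 = adb.Arctic("s3://s3.eu-west-2.amazonaws.com:my-bucket?region=eu-west-2&aws_auth=true")

# Every lib.write creates a new version; as_of=0 reads back the first version of a symbol
first_version = lib.read("large", as_of=0).data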

Below is some template code to help you get started. For more information, see the ArcticDB API guide: Introduction — ArcticDB

import numpy as np
import pandas as pd
import arcticdb as adb
   
# Connect to Arctic database using the specified URL
arctic = adb.Arctic("lmdb://arcticdb_demo")
 
# Get or create a library for storing data
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)
 
# Define the size of the DataFrame to be 100,000 x 1000
rows = 100000
cols = 1000
large = pd.DataFrame(
    np.random.rand(rows, cols),
    columns=[f'c{i}' for i in range(cols)],
    index=pd.date_range('1/2/2020', periods=rows, freq="s"),
)
 
# Write the DataFrame to the Arctic library
write_record = lib.write("large", large)
 
# Display the versioned item returned by the write
print(write_record)
 
# Read the DataFrame back from the Arctic library
read_record = lib.read("large")
 
# Display the versioned item returned by the read
print(read_record)
 
# Print the data from the read operation
print(read_record.data)

Did you read this mini-blog and think to yourself, “I have a lot of CSV files and I could benefit from switching to ArcticDB”? If so, below is an example of how to import data from CSV files into ArcticDB. The `csv_to_ADB` function can be used to automate the import: you could loop through each CSV file in a folder and call `csv_to_ADB` for each one, using the CSV filename as the symbol name or providing a new symbol name if needed, which is particularly useful for batch processing large volumes of data. A few examples of using this function are at the bottom of the code snippet, and a sketch of a folder-level batch import follows the best-practice list below.

For best practices:

  • Ensure consistent formatting across all CSV files, particularly the `DateTime` column, to avoid index errors.

  • Use the `pyarrow` engine for faster reading of CSV files, especially for large datasets.

  • Handle exceptions, such as file read errors or ArcticDB write conflicts, gracefully so the process does not stop abruptly.

  • Consider parallel processing if the number of files is large and system resources allow for it, to speed up the entire operation.

  • Maintain a log that records the success or failure of each file processed for audit and troubleshooting purposes.
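Here is a minimal sketch of such a batch import, assuming a folder of consistently formatted CSVs; `folder_to_ADB` is a hypothetical helper built on the `csv_to_ADB` function defined below.

import glob
import logging
import os

logging.basicConfig(level=logging.INFO)

def folder_to_ADB(folder, lib):
    # Import every CSV in the folder, logging the outcome of each file
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        try:
            csv_to_ADB(csv_path, lib)
            logging.info("Imported %s", csv_path)
        except Exception:
            logging.exception("Failed to import %s", csv_path)

folder_to_ADB("my_csv_folder", lib)  # 'my_csv_folder' is a placeholder path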

# pip install pandas arcticdb pyarrow
import os
import numpy as np 
import pandas as pd
import arcticdb as adb
 
   
def csv_to_ADB(csv_path, lib, index_name='DateTime', symbol_name=None, engine="pyarrow"):
    """
    Read a CSV file into a pandas DataFrame and write the DataFrame to an
    ArcticDB library, creating the symbol if it does not already exist.

    Parameters
    ----------
    csv_path: string
        The name/path of the CSV file to be read.

    lib: Arctic Library
        Where the data will be stored.

    index_name: string
        The column in the CSV to be used as the DataFrame index (defaults to 'DateTime').

    symbol_name: string
        Optional new name for the symbol when stored in the Arctic library.
        If not provided, the CSV file basename (without extension) is used.

    engine: string
        The parser engine to use for reading the CSV (defaults to 'pyarrow' for performance).
    """
    if not symbol_name:
        # Use the file basename so 'data/test_data_1.csv' becomes the symbol 'test_data_1'
        symbol_name = os.path.splitext(os.path.basename(csv_path))[0]

    df = pd.read_csv(csv_path, engine=engine)
    df.set_index(index_name, inplace=True)
    # upsert=True creates the symbol on the first write and updates it on later calls
    lib.update(symbol_name, df, upsert=True)
  
# Connect to a local Arctic database and get or create a library
arctic = adb.Arctic("lmdb://arcticdb_demo")
lib = arctic.get_library('test_adb_vscsv', create_if_missing=True)
    
# Here are some examples of using this function. First:
# Make random one-second data starting 2020-01-01, save it to CSV, then import the CSV into ArcticDB using the CSV file name as the symbol.
rows = 100000
cols = 1000
data1 = pd.DataFrame(
    np.random.rand(rows, cols),
    columns=[f'c{i}' for i in range(cols)],
    index=pd.date_range('2020-01-01', periods=rows, freq="s"),
)
data1.index.name = 'DateTime'
data1.to_csv('test_data_1.csv', index=True)
csv_to_ADB('test_data_1.csv', lib)
print(lib.read('test_data_1').data.tail())
 
# Make random one-second data starting 2020-01-02 03:46:40 (immediately after data1 ends), save it to CSV, then import it using the file name as the symbol.
data2 = pd.DataFrame(
    np.random.rand(rows, cols),
    columns=[f'c{i}' for i in range(cols)],
    index=pd.date_range('2020-01-02 03:46:40', periods=rows, freq="s"),
)
data2.index.name = 'DateTime'
data2.to_csv('test_data_2.csv', index=True)
csv_to_ADB('test_data_2.csv', lib)
print(lib.read('test_data_2').data)
   
# Import the 2020-01-01 CSV again, this time under a different symbol name, made_up_data_1.
csv_to_ADB('test_data_1.csv', lib, symbol_name='made_up_data_1')
print(lib.read('made_up_data_1').data)
   
# Import the 2020-01-02 CSV under a different symbol name, made_up_data_2.
csv_to_ADB('test_data_2.csv', lib, symbol_name='made_up_data_2')
print(lib.read('made_up_data_2').data)
   
   
# Update made_up_data_1 with the 2020-01-02 data, so made_up_data_1 holds the data for both days.
csv_to_ADB('test_data_2.csv', lib, symbol_name='made_up_data_1')
print(lib.read('made_up_data_1').data)
   
 
# Regenerate test_data_1.csv with new data starting 2020-01-02 03:46:40 and import it;
# the existing test_data_1 symbol is updated to cover the new range as well.
data2 = pd.DataFrame(
    np.random.rand(rows, cols),
    columns=[f'c{i}' for i in range(cols)],
    index=pd.date_range('2020-01-02 03:46:40', periods=rows, freq="s"),
)
data2.index.name = 'DateTime'
data2.to_csv('test_data_1.csv', index=True)
csv_to_ADB('test_data_1.csv', lib)
print(lib.read('test_data_1').data)