Getting Started
What is ArcticDB?
ArcticDB is an embedded/serverless database engine designed to integrate with Pandas and the Python Data Science ecosystem. ArcticDB enables you to store, retrieve and process DataFrames at scale, backed by commodity S3 storage.
ArcticDB requires zero additional infrastructure beyond a running Python environment and access to S3 storage and can be installed in seconds.
ArcticDB is:
- Fast: ArcticDB is incredibly fast, able to process millions of (on-disk) rows a second, and is very easy to install:
pip install arcticdb
! - Flexible: Supporting data with and without a schema, ArcticDB is also fully compatible with streaming data ingestion. The platform is bitemporal, allowing you to see all previous versions of stored data.
- Familiar: ArcticDB is the world's simplest database, designed to be immediately familiar to anyone with prior Python and Pandas experience.
What is ArcticDB not?
ArcticDB is designed for high throughput analytical workloads. It is not a transactional database and as such is not a replacement for tools such as PostgreSQL.
Getting Started
The below guide covers installation, setup and basic usage. More detailed information on advanced functionality such as snapshots and parallel writers can be found in the tutorials section.
Installation
ArcticDB supports Python 3.6 - 3.10. To install, simply run:
pip install arcticdb
Usage
ArcticDB is a storage engine designed for S3. As a result, you must have an available S3 bucket to store data using ArcticDB.
Storage
ArcticDB supports any S3 API compatible storage. It has been tested against AWS S3 and storage appliances like VAST Universal Storage.
ArcticDB also supports LMDB for local/file based storage - to use LMDB, pass an LMDB path as the URI: Arctic('lmdb://path/to/desired/database')
.
To get started, we can import ArcticDB and instantiate it:
>>> from arcticdb import Arctic
>>> ac = Arctic(<URI>)
For more information on the format of <URI>, please view the docstring (>>> help(Arctic)
). Below we'll run through some setup examples.
S3 Configuration Examples
There are two methods to configure S3 access. If you happen to know the access and secret key, simply connect as follows:
>>> from arcticdb import Arctic
>>> ac = Arctic('s3://ENDPOINT:BUCKET?region=blah&access=ABCD&secret=DCBA')
Otherwise, you can delegate authentication to the AWS SDK (obeys standard AWS configuration options):
>>> ac = Arctic('s3://ENDPOINT:BUCKET?aws_auth=true')
Same as above, but using HTTPS:
>>> ac = Arctic('s3s://ENDPOINT:BUCKET?aws_auth=true')
S3
Use s3s
if your S3 endpoint used HTTPS
Connecting to a defined storage endpoint
Connect to local storage (not AWS - HTTP endpoint of s3.local) with a pre-defined access and storage key:
>>> ac = Arctic('s3://s3.local:arcticdb-test-bucket?access=EFGH&secret=HGFE')
Connecting to AWS
Connecting to AWS with a pre-defined region:
>>> ac = Arctic('s3s://s3.eu-west-2.amazonaws.com:arcticdb-test-bucket?aws_auth=true')
Note that no explicit credential parameters are given. When aws_auth
is passed, authentication is delegated to the AWS SDK which is responsible for locating the appropriate credentials in the .config
file or
in environment variables.
Using a specific path within a bucket
You may want to restrict access for the ArcticDB library to a specific path within the bucket. To do this, you can use the path_prefix
parameter:
>>> ac = Arctic('s3s://s3.eu-west-2.amazonaws.com:arcticdb-test-bucket?path_prefix=test/&aws_auth=true')
Library Setup
ArcticDB is geared towards storing many (potentially millions) of tables. Individual tables are called symbols and are stored in collections called libraries. A single library can store an effectively unlimited number of symbols.
Libraries must first be initialized prior to use:
>>> ac.create_library('data') # fixed schema - see note below
>>> ac.list_libraries()
['data']
A library can then be retrieved:
>>> library = ac['data']
ArcticDB Schemas & the Dynamic Schema library option
ArcticDB enforces a strict schema that is defined on first write. This schema defines the name, order, index type and type of each column in the DataFrame.
If you wish to add, remove or change the type of columns via update
or append
options, please see the documentation for the dynamic_schema
option within the library_options
parameter of the create_library
method. Note that whether to use fixed or dynamic schemas must be set at
library creation time.
Reading And Writing Data(Frames)!
Now we have a library set up, we can get to reading and writing data! ArcticDB exposes a set of simple API primitives to enable DataFrame storage.
Let's first look at writing a DataFrame to storage:
# 50 columns, 25 rows, random data, datetime indexed.
>>> from datetime import datetime
>>> cols = ['COL_%d' % i for i in range(50)]
>>> df = pd.DataFrame(np.random.randint(0, 50, size=(25, 50)), columns=cols)
>>> df.index = pd.date_range(datetime(2000, 1, 1, 5), periods=25, freq="H")
>>> df.head(2)
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 35 46 4 0 17 35 33 25 ...
2000-01-01 06:00:00 9 24 18 30 0 39 43 20 ...
Write the DataFrame:
>>> lib.write('test_frame', df)
VersionedItem(symbol=test_frame,library=data,data=n/a,version=0,metadata=None,host=<host>)
ArcticDB index
When writing Pandas DataFrames, ArcticDB supports the following index types:
pandas.Index
containingint64
orfloat64
(or the corresponding dedicated typesInt64Index
,UInt64Index
andFloat64Index
)RangeIndex
with the restrictions noted belowDatetimeIndex
MultiIndex
composed of above supported types
Currently, ArcticDB only supports append()
-ing to a RangeIndex
with a continuing RangeIndex
(i.e. the appending RangeIndex.start
== RangeIndex.stop
of the existing data and they have the same RangeIndex.step
). If a DataFrame with a non-continuing RangeIndex
is passed to append()
, ArcticDB does not convert it Int64Index
like Pandas and will produce an error.
Also note, the "row" concept in head()/tail()
refers to the physical row, not the value in the pandas.Index
.
Read it back:
>>> from_storage_df = library.read('test_frame').data
>>> from_storage_df.head(2)
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 35 46 4 0 17 35 33 25 ...
2000-01-01 06:00:00 9 24 18 30 0 39 43 20 ...
Slicing and Filtering
ArcticDB enables you to slice by row and by column.
ArcticDB indexing
ArcticDB will construct a full index for ordered numerical and timeseries (e.g. DatetimeIndex) Pandas indexes. This will enable optimised slicing across index entries. If the index is unsorted or not numeric, then whilst your data can be stored, row-slicing will be slower.
Row-slicing
>>> lib.read('test_frame', date_range=(df.index[5], df.index[8])).data
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 10:00:00 43 28 36 18 10 37 31 32 ...
2000-01-01 11:00:00 36 5 30 18 44 15 31 28 ...
2000-01-01 12:00:00 6 34 0 5 19 41 17 15 ...
2000-01-01 13:00:00 14 48 6 6 2 3 44 42 ...
Column slicing
>>> _range = (df.index[5], df.index[8])
>>> _columns = ['COL_30', 'COL_31']
>>> lib.read('test_frame', date_range=_range, columns=_columns).data
COL_30 COL_31
2000-01-01 10:00:00 7 26
2000-01-01 11:00:00 29 18
2000-01-01 12:00:00 36 26
2000-01-01 13:00:00 48 42
Filtering
ArcticDB uses a Pandas-like syntax to describe how to filter data. For more details including the limitations, please view the docstring (help(QueryBuilder)
).
ArcticDB Filtering Philosphy & Restrictions
Note that in most cases this should be more memory efficient and performant than the equivalent Pandas operation as the processing is within the C++ storage engine and parallelized over multiple threads of execution.
We do not intend to re-implement the entirety of the Pandas filtering/masking operations, but instead target a maximally useful subset.
>>> _range = (df.index[5], df.index[8])
>>> _cols = ['COL_30', 'COL_31']
>>> from arcticdb import QueryBuilder
>>> q = QueryBuilder()
>>> q = q[(q["COL_30"] > 30) & (q["COL_31"] < 50)]
>>> lib.read('test_frame', date_range=_range, columns=_cols, query_builder=q).data
>>>
COL_30 COL_31
2000-01-01 12:00:00 36 26
2000-01-01 13:00:00 48 42
Modifications, Versioning (time travel!)
ArcticDB fully supports modifying stored data via two primitives: update and append.
Update
The update primitive enables you to overwrite a contiguous chunk of data. In the below example, we use update
to modify 2000-01-01 05:00:00,
remove 2000-01-01 06:00:00 and insert a duplicate entry for 2000-01-01 07:00:00.
# Recreate the DataFrame with new (and different!) random data, and filter to only the first and third row
>>> random_data = np.random.randint(0, 50, size=(25, 50))
>>> df = pd.DataFrame(random_data, columns=['COL_%d' % i for i in range(50)])
>>> df.index = pd.date_range(datetime(2000, 1, 1, 5), periods=25, freq="H")
# Filter!
>>> df = df.iloc[[0,2]]
>>> df
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 46 24 4 20 7 32 1 18 ...
2000-01-01 07:00:00 44 37 16 27 30 1 35 25 ...
>>> library.update('test_frame', df)
VersionedItem(symbol=test_frame,library=data,data=n/a,version=1,metadata=None,host=<host>)
Now let's look at the first 2 rows in the symbol:
>>> library.head('test_frame', 2) # head/tail are similar to the equivalent Pandas operations
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 05:00:00 46 24 4 20 7 32 1 18 ...
2000-01-01 07:00:00 44 37 16 27 30 1 35 25 ...
Append
Let's append data to the end of the timeseries:
>>> random_data = np.random.randint(0, 50, size=(5, 50))
>>> df_append = pd.DataFrame(random_data, columns=['COL_%d' % i for i in range(50)])
>>> df_append.index = pd.date_range(datetime(2000, 1, 2, 5), periods=5, freq="H")
>>> df_append
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-02 05:00:00 34 33 5 44 15 25 1 25 ...
2000-01-02 06:00:00 9 39 15 18 49 47 7 45 ...
2000-01-02 07:00:00 12 40 9 27 49 31 45 0 ...
2000-01-02 08:00:00 43 25 39 26 13 7 20 40 ...
2000-01-02 09:00:00 2 1 20 47 47 16 14 48 ...
Note the starting date of this DataFrame is after the final row written previously!
Let's now append that DataFrame to what was written previously, and then pull back the final 7 rows from storage:
>>> lib.append('test_frame', df_append)
VersionedItem(symbol=test_frame,library=data,data=n/a,version=2,metadata=None,host=<host>)
>>> lib.tail('test_frame', 7).data
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-02 04:00:00 4 13 8 14 25 11 11 11 ...
2000-01-02 05:00:00 14 41 24 7 16 10 15 36 ...
2000-01-02 05:00:00 34 33 5 44 15 25 1 25 ...
2000-01-02 06:00:00 9 39 15 18 49 47 7 45 ...
2000-01-02 07:00:00 12 40 9 27 49 31 45 0 ...
2000-01-02 08:00:00 43 25 39 26 13 7 20 40 ...
2000-01-02 09:00:00 2 1 20 47 47 16 14 48 ...
The final 7 rows included the 5 rows we have just appended and the last two rows that were written previously.
Versioning
You might have noticed that read calls do not return the data directly - but instead returns a VersionedItem structure. You may also have noticed that modification operations (write, append and update) increment the version counter. ArcticDB versions all modifications, which means you can retrieve earlier versions of data (ArcticDB is a bitemporal database!):
>>> lib.tail('test_frame', 7, as_of=0).data
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 ...
2000-01-01 23:00:00 26 38 12 30 25 29 47 27 ...
2000-01-02 00:00:00 12 14 42 11 44 32 19 11 ...
2000-01-02 01:00:00 12 47 4 45 28 38 35 36 ...
2000-01-02 02:00:00 22 0 12 48 37 11 18 14 ...
2000-01-02 03:00:00 14 16 38 30 19 41 29 43 ...
2000-01-02 04:00:00 4 13 8 14 25 11 11 11 ...
2000-01-02 05:00:00 14 41 24 7 16 10 15 36 ...
Note the timestamps - we've read the data prior to the append operation. Please note that you can also pass a datetime into any as_of argument.
Versioning & Prune Previous
By default, write
, append
, and update
operations will remove the previous versions to save on space.
Use the prune_previous
argument to control this behaviour.