Data access

atoti can read data from a variety of data stores, including local file systems and cloud object stores. To read from a remote store, prepend a protocol such as s3:// to the path passed to common data access functions such as atoti.session.Session.read_csv():

import atoti as tt

session = tt.create_session()

# Placeholder: replace with the name of your Azure storage account.
ACCOUNT_NAME = "mystorageaccount"

# Each call below is an independent example of a supported location.
local_csv_table = session.read_csv("local/path/to/data-*.csv", table_name="Example")
local_parquet_table = session.read_parquet("local/path/to/data-*.parquet", table_name="Example")
azure_csv_table = session.read_csv(f"https://{ACCOUNT_NAME}.blob.core.windows.net/path/to/data-*.csv", table_name="Example")
gcp_csv_table = session.read_csv("gs://bucket/path/to/data-*.csv", table_name="Example")
s3_csv_table = session.read_csv("s3://bucket/path/to/data-*.csv", table_name="Example")

When specifying a storage location, a URL should be provided using the general form protocol://path/to/data. The following protocols are available:

  • Local or network file system: The default in the absence of any protocol. Locations specified relative to the current working directory will be respected (as they would be with Python’s built-in open()).

  • Amazon S3: s3:// - atoti-aws plugin required.

  • Azure Blob Storage: https://{ACCOUNT_NAME}.blob.core.windows.net/ - atoti-azure plugin required.

  • Google Cloud Storage: gs:// - atoti-gcp plugin required.
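
As a rough illustration, the mapping above can be restated in code. The helper below is hypothetical and not part of atoti: the name required_plugin and its return convention are assumptions made for this sketch, which only inspects the path and reports which plugin from the list would be needed.

```python
from typing import Optional
from urllib.parse import urlparse


def required_plugin(path: str) -> Optional[str]:
    """Return the plugin needed to read `path`.

    Hypothetical helper, not part of atoti. Returns None when no plugin
    from the list above applies (for example, a local path).
    """
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        return "atoti-aws"
    if parsed.scheme == "gs":
        return "atoti-gcp"
    if parsed.scheme == "https" and parsed.netloc.endswith(".blob.core.windows.net"):
        return "atoti-azure"
    # No recognized protocol: treated as a local or network file system path.
    return None
```

For example, required_plugin("gs://bucket/path/to/data-*.csv") returns "atoti-gcp", while a relative path such as "local/path/to/data-*.csv" returns None.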

Internals

Cloud object stores

Nowadays, most computers have a 10 Gb/s network interface and can therefore, in theory, transfer data at about 1 GB/s (10 Gb/s is 1.25 GB/s before protocol overhead). In practice, however, the data is downloaded from cloud object stores over HTTP, and the throughput of a single HTTP transfer from these stores is far lower than that.

A special connector has been developed for atoti to overcome this limitation. It opens tens of HTTP connections to the cloud store and performs the transfer in parallel, transparently reassembling the blocks directly in memory. With this technique, atoti can download 300 GB in about 5 minutes, that is, roughly 1 GB/s.
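
The general idea of a parallel, block-wise download can be sketched as follows. This is not atoti's connector, whose implementation is not described here; it is a minimal illustration under stated assumptions. The names split_ranges, parallel_download and fetch_range are hypothetical, and fetch_range stands in for whatever issues one ranged request per connection (for instance an HTTP Range request).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(total_size: int, block_size: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into half-open (start, end) byte ranges."""
    return [
        (start, min(start + block_size, total_size))
        for start in range(0, total_size, block_size)
    ]


def parallel_download(
    fetch_range: Callable[[int, int], bytes],
    total_size: int,
    block_size: int,
    connections: int = 32,
) -> bytes:
    """Fetch all blocks concurrently and reassemble them in memory.

    `fetch_range(start, end)` is a hypothetical callable returning the bytes
    at offsets start..end-1 of the remote object.
    """
    buffer = bytearray(total_size)

    def fetch_into_buffer(byte_range: Tuple[int, int]) -> None:
        start, end = byte_range
        block = fetch_range(start, end)
        if len(block) != end - start:
            raise ValueError(f"short read for range {start}-{end}")
        # Each block is written at its own offset, so arrival order is irrelevant.
        buffer[start:end] = block

    with ThreadPoolExecutor(max_workers=connections) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(fetch_into_buffer, split_ranges(total_size, block_size)))
    return bytes(buffer)


# Usage with an in-memory stand-in for the remote object.
remote_object = bytes(range(256)) * 40
downloaded = parallel_download(
    lambda start, end: remote_object[start:end],
    total_size=len(remote_object),
    block_size=1000,
)
```

Writing each block at its own offset in a preallocated buffer is what lets the blocks arrive in any order. The number of connections (32 here) is an arbitrary value chosen for the sketch.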

Some parameters can impact the overall download speed:

  • Speed of the CPU cores. HTTPS connections and client-side encryption consume CPU resources.

  • Bandwidth of the network interface.

  • Size of the files. Small files do not reach a good download speed (less than 60 MB/s).

  • Type (Hot/Cold) of the storage. Do not use cold storage.

  • Location of the data. The host running atoti and the data should be in the same cloud region to get the best throughput.
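
To get a feel for how much these factors matter, here is a back-of-the-envelope comparison using the figures quoted in this section. It assumes decimal units (1 GB = 1000 MB) and reads the 60 MB/s figure as the throughput reached with small files. The function name transfer_seconds is made up for this example.

```python
def transfer_seconds(size_gb: float, throughput_mb_per_s: float) -> float:
    """Time to move `size_gb` gigabytes at a sustained throughput (1 GB = 1000 MB)."""
    return size_gb * 1000 / throughput_mb_per_s


parallel_s = transfer_seconds(300, 1000)   # about 1 GB/s with the parallel connector
small_files_s = transfer_seconds(300, 60)  # upper bound quoted for small files

print(f"{parallel_s / 60:.0f} min vs {small_files_s / 60:.0f} min")
# prints "5 min vs 83 min"
```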