atoti can read data from a variety of data stores including local file systems and cloud object stores.
This is done by prepending a protocol such as s3:// to the paths passed to common data access functions such as read_csv and read_parquet:

    import atoti as tt

    session = tt.create_session()
    # The protocol prefix selects the data store to read from.
    store = session.read_csv("s3://bucket/path/to/data-*.csv")
    store = session.read_parquet("gcs://bucket/path/to/data-*.parq")

The following protocols are available:
Local or Network File System - the local file system, used by default when no protocol is given
s3:// - Amazon S3 remote binary store
When specifying a storage location, provide a URL of the general form protocol://path/to/data.
If no protocol is provided, the local file system is assumed.
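The rule above can be illustrated with a small sketch. The helper `split_location` is hypothetical and not part of atoti, and the label `"file"` for the local file system is an assumption chosen for illustration. It only shows how a location splits into a protocol and a path, with the local file system as the fallback.

```python
from typing import Tuple


def split_location(location: str) -> Tuple[str, str]:
    """Split a storage location into (protocol, path).

    Hypothetical helper for illustration only; "file" is an assumed
    label for the local file system, not an atoti identifier.
    """
    protocol, separator, rest = location.partition("://")
    if not separator:
        # No "protocol://" prefix: the local file system is assumed.
        return "file", location
    return protocol.lower(), rest


print(split_location("s3://bucket/path/to/data-*.csv"))
print(split_location("data/local-file.csv"))
```

The first call yields the `s3` protocol with the remainder of the URL as the path, and the second falls back to the local file system.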
Local File System
Local files are always accessible. This is the default back-end, used if no protocol is passed at all.
Locations specified relative to the current working directory are resolved against it (as they would be with Python’s built-in open function).
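To see which file a relative location refers to, it can help to resolve it explicitly before passing it to a read function. The helper `resolve_local` below is hypothetical and uses only the standard library. It does not claim to reproduce what atoti does internally, only the usual working-directory rule.

```python
from pathlib import Path


def resolve_local(location: str) -> Path:
    """Return the absolute path a relative location points to.

    Hypothetical helper: relative locations are joined to the current
    working directory, absolute locations are returned unchanged
    (apart from normalization).
    """
    return (Path.cwd() / location).resolve()


print(resolve_local("data/sales.csv"))
```

Passing the resolved absolute path makes a script independent of the directory it is launched from.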
Amazon S3 is a web service offered by Amazon Web Services.
Authentication for S3 is handled by the underlying AWS SDK for Java library. Refer to the AWS SDK authentication documentation for the available options.
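As a quick sanity check before reading from S3, the sketch below looks for two credential sources that the AWS SDK default credential provider chain is known to consult: the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables and the shared `~/.aws/credentials` file. Whether atoti relies on the default chain is an assumption here. The function `visible_aws_credential_sources` is hypothetical, and the chain has other sources (for example instance profiles) that this sketch does not check.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def visible_aws_credential_sources(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> List[str]:
    """List common AWS credential sources visible to this process.

    Hypothetical helper for illustration; it inspects only two of the
    sources the AWS SDK default credential chain can use.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    sources = []
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    if (home / ".aws" / "credentials").is_file():
        sources.append("shared credentials file")
    return sources


print(visible_aws_credential_sources())
```

An empty list does not necessarily mean authentication will fail, since other sources may still apply.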
Cloud object stores
Nowadays, most computers have a 10 Gb/s (gigabits per second) network interface, and therefore the capacity to transfer data at about 1 GB/s (gigabytes per second). However, when the data is downloaded from a cloud object store, the throughput of a single transfer is far lower than that because the transfer is HTTP based.
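The relation between the interface speed and the achievable transfer rate is a matter of units, assuming the interface figure is 10 gigabits per second. The short sketch below converts it to gigabytes per second. The function names are illustrative only.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a rate in gigabits per second to gigabytes per second."""
    return gbit_per_s / 8  # 8 bits per byte


def transfer_seconds(size_gb: float, rate_gb_per_s: float) -> float:
    """Time needed to move size_gb gigabytes at a constant rate."""
    return size_gb / rate_gb_per_s


line_rate = gbit_to_gbyte_per_s(10)
print(line_rate)                         # 1.25 (GB/s, theoretical maximum)
print(transfer_seconds(100, line_rate))  # 80.0 (seconds for 100 GB)
```

The theoretical maximum is 1.25 GB/s. Protocol overhead reduces it in practice, hence the figure of about 1 GB/s.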
A special connector has been developed for atoti to overcome this limitation. It opens tens of HTTP connections to the cloud store and performs the transfer in parallel, transparently reassembling the blocks directly in memory. With this approach, atoti downloads 300 GB in about 5 minutes, which is roughly 1 GB/s.
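The idea of splitting an object into blocks, fetching them over several connections and reassembling them in memory can be sketched as follows. This is not atoti's connector, only a minimal illustration under stated assumptions: `fetch_range` stands for whatever retrieves one byte range, and here it reads from an in-memory stand-in rather than a real store. The block size and connection count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(total_size: int, block_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets, end exclusive, covering the object."""
    return [
        (start, min(start + block_size, total_size))
        for start in range(0, total_size, block_size)
    ]


def parallel_download(
    fetch_range: Callable[[int, int], bytes],
    total_size: int,
    block_size: int,
    connections: int,
) -> bytes:
    """Fetch all blocks concurrently and reassemble them in memory."""
    buffer = bytearray(total_size)

    def fetch_into_buffer(bounds: Tuple[int, int]) -> None:
        start, end = bounds
        block = fetch_range(start, end)
        if len(block) != end - start:
            raise ValueError(f"short read for range {start}-{end}")
        # Each block is written at its own offset, so order does not matter.
        buffer[start:end] = block

    with ThreadPoolExecutor(max_workers=connections) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(fetch_into_buffer, split_ranges(total_size, block_size)))
    return bytes(buffer)


# In-memory stand-in for a remote object (assumption, for illustration).
remote_object = bytes(range(256)) * 4000


def fake_fetch(start: int, end: int) -> bytes:
    # A real implementation would issue an HTTP GET with the header
    # "Range: bytes=<start>-<end - 1>" (HTTP ranges are inclusive).
    return remote_object[start:end]


result = parallel_download(
    fake_fetch, len(remote_object), block_size=64 * 1024, connections=16
)
print(result == remote_object)
```

Because every block has a fixed destination offset, the blocks can arrive in any order without extra bookkeeping.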
Several factors affect the overall download speed:
Speed of the CPU cores. HTTPS connections and client-side encryption consume CPU resources.
Bandwidth of the network interface.
Size of the files. Small files do not reach a good download speed (less than 60 MB/s).
Type (Hot/Cold) of the storage. Do not use cold storage.
The host running atoti and the data must be in the same cloud region.