Data sources

Stores can be fed from several sources:

CSV

[1]:
import atoti as tt

session = tt.create_session()
[2]:
csv_store = session.read_csv("data/example.csv", keys=["ID"], store_name="First store")
csv_store.head()
[2]:
     Date        Continent  Country  City      Color  Quantity  Price
ID
1    2019-01-01  Europe     France   Paris     red    1000.0    500.0
2    2019-01-02  Europe     France   Lyon      red    2000.0    400.0
3    2019-01-05  Europe     France   Paris     blue   3000.0    420.0
4    2018-01-01  Europe     France   Bordeaux  blue   1500.0    480.0
5    2019-01-01  Europe     UK       London    green  3000.0    460.0
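
The keys=["ID"] argument declares ID as the key column of the store, which is why it appears as the index in the preview above. The same call works on any CSV file with a header row and a key column; as a rough sketch (the file name and row below are made up for illustration), a file written with the standard library can be loaded in exactly the same way:

import csv

# Hypothetical file used only to illustrate the call.
with open("data/other_example.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["ID", "Date", "Continent", "Country", "City", "Color", "Quantity", "Price"])
    writer.writerow([6, "2019-01-03", "Europe", "Germany", "Berlin", "red", 1200.0, 450.0])

other_store = session.read_csv("data/other_example.csv", keys=["ID"], store_name="Other store")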

Parquet

Apache Parquet is a columnar storage format. Parquet files can be used as a source:

[3]:
parquet_store = session.read_parquet("data/example.parquet", keys=["ProductId"])
parquet_store.head()
[3]:
           IdType  City      Country         Capital          Quantity  Currency  Price    Cost     Pattern
ProductId
4          SKU     Toulouse  France          Paris            9         EUR       606.34   70.00    Tou
5          SKU     New York  USA             Washington D.C.  9         USD       1234.09  1000.00  York
2          DDS     London    United Kingdom  London           2         GBP       16.52    16.52    bbb
1          DDS     Toulouse  France          Paris            3         EUR       271.26   500.00   eee
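
If no Parquet file is at hand, one can be produced from pandas and loaded with the same call. A minimal sketch, assuming pyarrow (or fastparquet) is installed and using a made-up file name and rows:

import pandas as pd

# Illustrative data; DataFrame.to_parquet requires a Parquet engine such as pyarrow.
products = pd.DataFrame(
    {
        "ProductId": [10, 11],
        "City": ["Madrid", "Rome"],
        "Price": [99.9, 42.0],
    }
)
products.to_parquet("data/products.parquet", index=False)

products_store = session.read_parquet("data/products.parquet", keys=["ProductId"])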

Pandas

pandas is an open-source library providing easy-to-use data structures and data analysis tools. For more details about how to use pandas, refer to its cookbook.

Its DataFrame can be used as a source to feed a store:

[4]:
import pandas as pd

dataframe = pd.read_csv("data/example.csv")
pandas_store = session.read_pandas(dataframe, keys=["ID"], store_name="Second store")
pandas_store.head()
[4]:
     Date        Continent  Country  City      Color  Quantity  Price
ID
1    2019-01-01  Europe     France   Paris     red    1000.0    500.0
2    2019-01-02  Europe     France   Lyon      red    2000.0    400.0
3    2019-01-05  Europe     France   Paris     blue   3000.0    420.0
4    2018-01-01  Europe     France   Bordeaux  blue   1500.0    480.0
5    2019-01-01  Europe     UK       London    green  3000.0    460.0
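
Because the DataFrame is built in Python before being handed to the session, it can be reshaped with regular pandas operations first. A small sketch (the derived Amount column below is purely illustrative):

import pandas as pd

dataframe = pd.read_csv("data/example.csv")

# Illustrative pre-processing: derive a column before feeding the store.
dataframe["Amount"] = dataframe["Quantity"] * dataframe["Price"]

enriched_store = session.read_pandas(dataframe, keys=["ID"], store_name="Enriched store")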

Spark

Apache Spark is a unified analytics engine for large-scale data processing.

Its DataFrame can be used as a source to feed a store:

[5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Demo").getOrCreate()
[6]:
spark_df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
spark_df.show()
+---+-------------------+---------+-------+--------+-----+--------+-----+
| ID|               Date|Continent|Country|    City|Color|Quantity|Price|
+---+-------------------+---------+-------+--------+-----+--------+-----+
|  1|2019-01-01 00:00:00|   Europe| France|   Paris|  red|  1000.0|500.0|
|  2|2019-01-02 00:00:00|   Europe| France|    Lyon|  red|  2000.0|400.0|
|  3|2019-01-05 00:00:00|   Europe| France|   Paris| blue|  3000.0|420.0|
|  4|2018-01-01 00:00:00|   Europe| France|Bordeaux| blue|  1500.0|480.0|
|  5|2019-01-01 00:00:00|   Europe|     UK|  London|green|  3000.0|460.0|
|  6|2019-01-01 00:00:00|   Europe|     UK|  London|  red|  2500.0|500.0|
|  7|2019-01-02 00:00:00|     Asia|  China| Beijing| blue|  2000.0|410.0|
|  8|2019-01-05 00:00:00|     Asia|  China|HongKong|green|  4000.0|350.0|
|  9|2018-01-01 00:00:00|     Asia|  India|   Dehli|  red|  2200.0|360.0|
| 10|2019-01-01 00:00:00|     Asia|  India|  Mumbai| blue|  1500.0|400.0|
+---+-------------------+---------+-------+--------+-----+--------+-----+

[7]:
spark_store = session.read_spark(spark_df, keys=["ID"], store_name="Spark store")
spark_store.head()
[7]:
     Date              Continent  Country  City      Color  Quantity  Price
ID
1    2019-01-01T00:00  Europe     France   Paris     red    1000.0    500.0
2    2019-01-02T00:00  Europe     France   Lyon      red    2000.0    400.0
3    2019-01-05T00:00  Europe     France   Paris     blue   3000.0    420.0
4    2018-01-01T00:00  Europe     France   Bordeaux  blue   1500.0    480.0
5    2019-01-01T00:00  Europe     UK       London    green  3000.0    460.0
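
Any Spark DataFrame can be passed to read_spark, not only one read from a file. A minimal sketch building one in memory (the row and store name below are made up):

from pyspark.sql import Row

# Illustrative in-memory DataFrame fed to the same read_spark call as above.
extra_df = spark.createDataFrame(
    [
        Row(ID=11, Date="2019-01-03", Continent="Europe", Country="Spain",
            City="Madrid", Color="red", Quantity=800.0, Price=390.0),
    ]
)
extra_store = session.read_spark(extra_df, keys=["ID"], store_name="Extra Spark store")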