Data sources¶
Stores can be fed from several sources:
CSV¶
[1]:
import atoti as tt
session = tt.create_session()
[2]:
# The "keys" columns form the primary key of the store.
csv_store = session.read_csv("data/example.csv", keys=["ID"], store_name="First store")
csv_store.head()
[2]:
| ID | Date | Continent | Country | City | Color | Quantity | Price |
|---|---|---|---|---|---|---|---|
| 1 | 2019-01-01 | Europe | France | Paris | red | 1000.0 | 500.0 |
| 2 | 2019-01-02 | Europe | France | Lyon | red | 2000.0 | 400.0 |
| 3 | 2019-01-05 | Europe | France | Paris | blue | 3000.0 | 420.0 |
| 4 | 2018-01-01 | Europe | France | Bordeaux | blue | 1500.0 | 480.0 |
| 5 | 2019-01-01 | Europe | UK | London | green | 3000.0 | 460.0 |
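The path passed to read_csv is not limited to a single file: a glob pattern can load several files sharing the same structure into one store. A minimal sketch, assuming your atoti version accepts glob patterns here; the pattern and store name are hypothetical:

# Hypothetical: load every CSV under data/ into a single store,
# assuming read_csv accepts glob patterns in this atoti version.
merged_store = session.read_csv("data/*.csv", keys=["ID"], store_name="Merged store")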
Parquet¶
Apache Parquet is a columnar storage format. Parquet files can be used as a source:
[3]:
parquet_store = session.read_parquet("data/example.parquet", keys=["ProductId"])
parquet_store.head()
[3]:
| ProductId | IdType | City | Country | Capital | Quantity | Currency | Price | Cost | Pattern |
|---|---|---|---|---|---|---|---|---|---|
| 4 | SKU | Toulouse | France | Paris | 9 | EUR | 606.34 | 70.00 | Tou |
| 5 | SKU | New York | USA | Washington D.C. | 9 | USD | 1234.09 | 1000.00 | York |
| 2 | DDS | London | United Kingdom | London | 2 | GBP | 16.52 | 16.52 | bbb |
| 1 | DDS | Toulouse | France | Paris | 3 | EUR | 271.26 | 500.00 | eee |
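Parquet files are easy to produce from pandas, so an existing CSV can be converted and then loaded as Parquet. A minimal sketch, assuming pyarrow (or fastparquet) is installed; the output path is made up:

import pandas as pd

# Hypothetical round trip: convert the example CSV to Parquet with pandas,
# then feed the resulting file to a store.
pd.read_csv("data/example.csv").to_parquet("data/example_copy.parquet", index=False)
copy_store = session.read_parquet("data/example_copy.parquet", keys=["ID"])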
Pandas¶
pandas is an open-source library providing easy-to-use data structures and data analysis tools. For more details about how to use pandas, refer to its cookbook.
Its DataFrame can be used as a source to feed a store:
[4]:
import pandas as pd
dataframe = pd.read_csv("data/example.csv")
pandas_store = session.read_pandas(dataframe, keys=["ID"], store_name="Second store")
pandas_store.head()
[4]:
| ID | Date | Continent | Country | City | Color | Quantity | Price |
|---|---|---|---|---|---|---|---|
| 1 | 2019-01-01 | Europe | France | Paris | red | 1000.0 | 500.0 |
| 2 | 2019-01-02 | Europe | France | Lyon | red | 2000.0 | 400.0 |
| 3 | 2019-01-05 | Europe | France | Paris | blue | 3000.0 | 420.0 |
| 4 | 2018-01-01 | Europe | France | Bordeaux | blue | 1500.0 | 480.0 |
| 5 | 2019-01-01 | Europe | UK | London | green | 3000.0 | 460.0 |
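The DataFrame does not have to come from a file: any in-memory pandas DataFrame can feed a store. A minimal sketch with made-up rows and a hypothetical store name:

import pandas as pd

# Hypothetical in-memory data: read_pandas works the same way whether the
# DataFrame was loaded from disk or built on the fly.
inline_df = pd.DataFrame(
    {
        "ID": [11, 12],
        "Date": ["2019-02-01", "2019-02-02"],
        "Continent": ["Europe", "Asia"],
        "Country": ["Spain", "Japan"],
        "City": ["Madrid", "Tokyo"],
        "Color": ["red", "blue"],
        "Quantity": [1200.0, 800.0],
        "Price": [450.0, 520.0],
    }
)
inline_store = session.read_pandas(inline_df, keys=["ID"], store_name="Inline store")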
Spark¶
Apache Spark is a unified analytics engine for large-scale data processing.
Its DataFrame can be used as a source to feed a store:
[5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Demo").getOrCreate()
[6]:
spark_df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
spark_df.show()
+---+-------------------+---------+-------+--------+-----+--------+-----+
| ID| Date|Continent|Country| City|Color|Quantity|Price|
+---+-------------------+---------+-------+--------+-----+--------+-----+
| 1|2019-01-01 00:00:00| Europe| France| Paris| red| 1000.0|500.0|
| 2|2019-01-02 00:00:00| Europe| France| Lyon| red| 2000.0|400.0|
| 3|2019-01-05 00:00:00| Europe| France| Paris| blue| 3000.0|420.0|
| 4|2018-01-01 00:00:00| Europe| France|Bordeaux| blue| 1500.0|480.0|
| 5|2019-01-01 00:00:00| Europe| UK| London|green| 3000.0|460.0|
| 6|2019-01-01 00:00:00| Europe| UK| London| red| 2500.0|500.0|
| 7|2019-01-02 00:00:00| Asia| China| Beijing| blue| 2000.0|410.0|
| 8|2019-01-05 00:00:00| Asia| China|HongKong|green| 4000.0|350.0|
| 9|2018-01-01 00:00:00| Asia| India| Dehli| red| 2200.0|360.0|
| 10|2019-01-01 00:00:00| Asia| India| Mumbai| blue| 1500.0|400.0|
+---+-------------------+---------+-------+--------+-----+--------+-----+
[7]:
spark_store = session.read_spark(spark_df, keys=["ID"], store_name="Spark store")
spark_store.head()
[7]:
| ID | Date | Continent | Country | City | Color | Quantity | Price |
|---|---|---|---|---|---|---|---|
| 1 | 2019-01-01T00:00 | Europe | France | Paris | red | 1000.0 | 500.0 |
| 2 | 2019-01-02T00:00 | Europe | France | Lyon | red | 2000.0 | 400.0 |
| 3 | 2019-01-05T00:00 | Europe | France | Paris | blue | 3000.0 | 420.0 |
| 4 | 2018-01-01T00:00 | Europe | France | Bordeaux | blue | 1500.0 | 480.0 |
| 5 | 2019-01-01T00:00 | Europe | UK | London | green | 3000.0 | 460.0 |
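Likewise, the Spark DataFrame does not have to be read from a file. A minimal sketch using spark.createDataFrame with a made-up row and a hypothetical store name:

# Hypothetical in-memory Spark DataFrame fed to a store.
inline_spark_df = spark.createDataFrame(
    [(11, "2019-02-01", "Europe", "Spain", "Madrid", "red", 1200.0, 450.0)],
    ["ID", "Date", "Continent", "Country", "City", "Color", "Quantity", "Price"],
)
inline_spark_store = session.read_spark(
    inline_spark_df, keys=["ID"], store_name="Inline Spark store"
)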