Data sources

Stores can be fed from several sources:

CSV files
Parquet files
Pandas dataframes
Spark dataframes

CSV

[1]:

import atoti as tt

# Start a session; stores are created from it.
session = tt.create_session()

Welcome to atoti 0.4.0!

By using this community edition, you agree with the license available at https://www.atoti.io/eula.
Browse the official documentation at https://docs.atoti.io.
Join the community at https://www.atoti.io/register.

You can hide this message by setting the ATOTI_HIDE_EULA_MESSAGE environment variable to True.

[2]:

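# Load the CSV file into a store named "First", using ID as the key column.
csv_store = session.read_csv("data/example.csv", keys=["ID"], store_name="First")
# Display the first rows of the store.
csv_store.head()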

[2]:

	Date	Continent	Country	City	Color	Quantity	Price
ID
1	2019-01-01	Europe	France	Paris	red	1000.0	500.0
2	2019-01-02	Europe	France	Lyon	red	2000.0	400.0
3	2019-01-05	Europe	France	Paris	blue	3000.0	420.0
4	2018-01-01	Europe	France	Bordeaux	blue	1500.0	480.0
5	2019-01-01	Europe	UK	London	green	3000.0	460.0
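
The cell above expects a file at data/example.csv, which is not included on this page. To follow along, a similar file can be recreated with the standard library. This is a minimal sketch: the rows are copied from the outputs shown on this page, the `write_example_csv` helper is a hypothetical name, and storing dates as plain ISO strings is an assumption about the original file.

```python
import csv
from pathlib import Path

# Column names and rows copied from the outputs shown on this page.
HEADER = ["ID", "Date", "Continent", "Country", "City", "Color", "Quantity", "Price"]
ROWS = [
    [1, "2019-01-01", "Europe", "France", "Paris", "red", 1000.0, 500.0],
    [2, "2019-01-02", "Europe", "France", "Lyon", "red", 2000.0, 400.0],
    [3, "2019-01-05", "Europe", "France", "Paris", "blue", 3000.0, 420.0],
    [4, "2018-01-01", "Europe", "France", "Bordeaux", "blue", 1500.0, 480.0],
    [5, "2019-01-01", "Europe", "UK", "London", "green", 3000.0, 460.0],
    [6, "2019-01-01", "Europe", "UK", "London", "red", 2500.0, 500.0],
    [7, "2019-01-02", "Asia", "China", "Beijing", "blue", 2000.0, 410.0],
    [8, "2019-01-05", "Asia", "China", "HongKong", "green", 4000.0, 350.0],
    [9, "2018-01-01", "Asia", "India", "Dehli", "red", 2200.0, 360.0],
    [10, "2019-01-01", "Asia", "India", "Mumbai", "blue", 1500.0, 400.0],
]


def write_example_csv(path):
    """Write the sample rows to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings on every platform.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return path
```

Calling `write_example_csv("data/example.csv")` from the notebook's working directory would place the file where `session.read_csv` looks for it.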

Parquet

Apache Parquet is a columnar storage format. Parquet files can be used as a source:

[3]:

# Load the Parquet file into a store, using ProductId as the key column.
parquet_store = session.read_parquet("data/example.parquet", keys=["ProductId"])
parquet_store.head()

[3]:

	IdType	City	Country	Capital	Quantity	Currency	Price	Cost	Pattern
ProductId
4	SKU	Toulouse	France	Paris	9	EUR	606.34	70.00	Tou
5	SKU	New York	USA	Washington D.C.	9	USD	1234.09	1000.00	York
2	DDS	London	United Kingdom	London	2	GBP	16.52	16.52	bbb
1	DDS	Toulouse	France	Paris	3	EUR	271.26	500.00	eee

Pandas

pandas is an open-source library providing easy-to-use data structures and data analysis tools. For more details on how to use pandas, refer to its cookbook.

A pandas DataFrame can be used as a source to feed a store:

[4]:

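import pandas as pd

# Read the CSV into a pandas DataFrame first.
dataframe = pd.read_csv("data/example.csv")
# Load the DataFrame into a store named "Second", using ID as the key column.
pandas_store = session.read_pandas(dataframe, "Second", keys=["ID"])
pandas_store.head()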

[4]:

	Date	Continent	Country	City	Color	Quantity	Price
ID
1	2019-01-01	Europe	France	Paris	red	1000.0	500.0
2	2019-01-02	Europe	France	Lyon	red	2000.0	400.0
3	2019-01-05	Europe	France	Paris	blue	3000.0	420.0
4	2018-01-01	Europe	France	Bordeaux	blue	1500.0	480.0
5	2019-01-01	Europe	UK	London	green	3000.0	460.0
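
Because the store is fed from an in-memory DataFrame, the data can be checked or enriched with pandas before loading. The following is a minimal sketch under stated assumptions: the inline CSV text holds three rows copied from the output above, the `Amount` column is purely illustrative, and the uniqueness check assumes that key columns are meant to identify each row.

```python
import io

import pandas as pd

# Three rows copied from the output above, inlined so the sketch is self-contained.
csv_text = """ID,Date,Continent,Country,City,Color,Quantity,Price
1,2019-01-01,Europe,France,Paris,red,1000.0,500.0
2,2019-01-02,Europe,France,Lyon,red,2000.0,400.0
3,2019-01-05,Europe,France,Paris,blue,3000.0,420.0
"""

# Parse the Date column as datetimes instead of leaving it as plain strings.
dataframe = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Illustrative derived column (not part of the original example).
dataframe["Amount"] = dataframe["Quantity"] * dataframe["Price"]

# Assumption: the key column should hold one distinct value per row.
if not dataframe["ID"].is_unique:
    raise ValueError("ID column contains duplicate values")
```

The resulting `dataframe` could then be passed to `session.read_pandas` in the same way as in the cell above.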

Spark

Apache Spark is a unified analytics engine for large-scale data processing.

A Spark DataFrame can be used as a source to feed a store:

[5]:

from pyspark.sql import SparkSession

# Create a Spark session, or reuse the active one if it already exists.
spark = SparkSession.builder.appName("Demo").getOrCreate()

[6]:

# Read the CSV with Spark, treating the first line as a header and inferring column types.
spark_df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
spark_df.show()

+---+-------------------+---------+-------+--------+-----+--------+-----+
| ID|               Date|Continent|Country|    City|Color|Quantity|Price|
+---+-------------------+---------+-------+--------+-----+--------+-----+
|  1|2019-01-01 00:00:00|   Europe| France|   Paris|  red|  1000.0|500.0|
|  2|2019-01-02 00:00:00|   Europe| France|    Lyon|  red|  2000.0|400.0|
|  3|2019-01-05 00:00:00|   Europe| France|   Paris| blue|  3000.0|420.0|
|  4|2018-01-01 00:00:00|   Europe| France|Bordeaux| blue|  1500.0|480.0|
|  5|2019-01-01 00:00:00|   Europe|     UK|  London|green|  3000.0|460.0|
|  6|2019-01-01 00:00:00|   Europe|     UK|  London|  red|  2500.0|500.0|
|  7|2019-01-02 00:00:00|     Asia|  China| Beijing| blue|  2000.0|410.0|
|  8|2019-01-05 00:00:00|     Asia|  China|HongKong|green|  4000.0|350.0|
|  9|2018-01-01 00:00:00|     Asia|  India|   Dehli|  red|  2200.0|360.0|
| 10|2019-01-01 00:00:00|     Asia|  India|  Mumbai| blue|  1500.0|400.0|
+---+-------------------+---------+-------+--------+-----+--------+-----+

[7]:

# Load the Spark DataFrame into a store named "Spark", using ID as the key column.
spark_store = session.read_spark(spark_df, "Spark", keys=["ID"])
spark_store.head()

[7]:

	Date	Continent	Country	City	Color	Quantity	Price
ID
1	2019-01-01T00:00	Europe	France	Paris	red	1000.0	500.0
2	2019-01-02T00:00	Europe	France	Lyon	red	2000.0	400.0
3	2019-01-05T00:00	Europe	France	Paris	blue	3000.0	420.0
4	2018-01-01T00:00	Europe	France	Bordeaux	blue	1500.0	480.0
5	2019-01-01T00:00	Europe	UK	London	green	3000.0	460.0