Advanced Store Manipulations¶
[1]:
import atoti as tt
session = tt.create_session()
store = session.read_csv("data/example.csv", keys=["ID"], store_name="MyStore")
cube = session.create_cube(store, "FirstCube")
cube.query()
Welcome to atoti 0.4.0!
By using this community edition, you agree with the license available at https://www.atoti.io/eula.
Browse the official documentation at https://docs.atoti.io.
Join the community at https://www.atoti.io/register.
You can hide this message by setting the ATOTI_HIDE_EULA_MESSAGE environment variable to True.
[1]:
| | Price.MEAN | Price.SUM | Quantity.MEAN | Quantity.SUM | contributors.COUNT |
| --- | --- | --- | --- | --- | --- |
| 0 | 428.0 | 4280.0 | 2270.0 | 22700.0 | 10 |
Store description¶
The length of a store is its number of rows:
[2]:
len(store)
[2]:
10
You can get the shape of the store, that is, its number of rows and columns:
[3]:
store.shape
[3]:
{'rows': 10, 'columns': 8}
Displaying the store gives the description of its columns and their types:
[4]:
store
[4]:
- MyStore
- ID
- key: True
- nullable: True
- type: int
- Date
- key: False
- nullable: True
- type: LocalDate[yyyy-MM-dd]
- Continent
- key: False
- nullable: True
- type: string
- Country
- key: False
- nullable: True
- type: string
- City
- key: False
- nullable: True
- type: string
- Color
- key: False
- nullable: True
- type: string
- Quantity
- key: False
- nullable: True
- type: double
- Price
- key: False
- nullable: True
- type: double
Sampling mode¶
When loading large datasets into a store, it is possible to sample the data. There are currently 3 modes supported for sampling:

- first_lines: keeps only the given number of lines.
- first_files: keeps only the given number of files.
- FULL: loads everything.

Sampling the data keeps the project very responsive during the design phase; once the design is done, you can load everything by calling session.load_all_data.
The sampling mode can be defined in the session for all the stores, or store by store.
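Conceptually, first_lines(n) keeps only the first n data rows of each file. A minimal plain-Python sketch of that behavior (illustrative only, not the atoti implementation):

```python
import csv
from io import StringIO
from itertools import islice

# Conceptual sketch of first_lines(n): keep only the first n data rows
# of the file. Illustrative only; atoti does this while reading the CSV.
def sample_first_lines(csv_text: str, n: int):
    reader = csv.DictReader(StringIO(csv_text))
    return list(islice(reader, n))

data = "ID,Price\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 11))
sampled = sample_first_lines(data, 5)
print(len(sampled))  # 5: only 5 of the 10 rows are kept
```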
[5]:
from atoti.sampling import first_lines
sample_store = session.read_csv(
"data/example.csv", keys=["ID"], store_name="sampled", sampling_mode=first_lines(5)
)
len(sample_store)
The store has been sampled because there are more than 5 lines in the files to load. Call Session.load_all_data() to trigger the full load of the data.
[5]:
5
[6]:
# Do some design such as adding a cube, some measures...
cube = session.create_cube(sample_store, "Sampling")
cube.query()
[6]:
| | Price.MEAN | Price.SUM | Quantity.MEAN | Quantity.SUM | contributors.COUNT |
| --- | --- | --- | --- | --- | --- |
| 0 | 452.0 | 2260.0 | 2100.0 | 10500.0 | 5 |
[7]:
session.load_all_data()
len(sample_store)
[7]:
10
[8]:
len(sample_store)
cube.query()
[8]:
| | Price.MEAN | Price.SUM | Quantity.MEAN | Quantity.SUM | contributors.COUNT |
| --- | --- | --- | --- | --- | --- |
| 0 | 428.0 | 4280.0 | 2270.0 | 22700.0 | 10 |
Specify other types for the columns¶
atoti automatically detects the type of each column. It is possible to bypass this behavior and specify the types of some columns manually.
[9]:
types = {
"ID": tt.types.DOUBLE,
"City": tt.types.STRING,
"Quantity": tt.types.STRING,
"Price": tt.types.FLOAT,
}
[10]:
custom_store = session.read_csv(
"data/example.csv", keys=["ID"], store_name="Custom", types=types
)
custom_store
[10]:
- Custom
- ID
- key: True
- nullable: False
- type: double
- Date
- key: False
- nullable: True
- type: LocalDate[yyyy-MM-dd]
- Continent
- key: False
- nullable: True
- type: string
- Country
- key: False
- nullable: True
- type: string
- City
- key: False
- nullable: True
- type: string
- Color
- key: False
- nullable: True
- type: string
- Quantity
- key: False
- nullable: True
- type: string
- Price
- key: False
- nullable: False
- type: float
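The idea behind the types argument can be sketched in plain Python: each column gets the detected type unless it appears in the overrides mapping. The names below are hypothetical, not the atoti implementation:

```python
# Sketch of type inference with manual overrides. Illustrative only;
# atoti performs this while reading the CSV.
def detect(value: str) -> str:
    # Try the narrowest type first, fall back to string.
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"

overrides = {"ID": "double", "Quantity": "string"}
sample = {"ID": "1", "Quantity": "2000", "City": "Paris"}
schema = {col: overrides.get(col, detect(val)) for col, val in sample.items()}
print(schema)  # {'ID': 'double', 'Quantity': 'string', 'City': 'string'}
```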
Insert new rows¶
New records can be inserted into the store. If a record has the same key values as an existing record, the previous record is overridden.
[11]:
# New key
store.append((11, "2019-03-01", "Europe", "Germany", "Berlin", "yellow", 1000, 400))
# Existing key
store += (1, "2019-03-01", "Europe", "France", "Paris", "red", 2000, 600)
store.head()
[11]:
| ID | Date | Continent | Country | City | Color | Quantity | Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2019-03-01 | Europe | France | Paris | red | 2000.0 | 600.0 |
| 2 | 2019-01-02 | Europe | France | Lyon | red | 2000.0 | 400.0 |
| 3 | 2019-01-05 | Europe | France | Paris | blue | 3000.0 | 420.0 |
| 4 | 2018-01-01 | Europe | France | Bordeaux | blue | 1500.0 | 480.0 |
| 5 | 2019-01-01 | Europe | UK | London | green | 3000.0 | 460.0 |
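The key-based overwrite semantics can be mimicked with a plain dict keyed on ID. This is a conceptual sketch only, not how atoti stores data:

```python
# "Upsert" semantics: inserting with an existing key replaces the row,
# inserting with a new key adds one. Illustrative only.
rows = {1: {"City": "Paris", "Price": 480.0}}

def append(store: dict, key: int, record: dict) -> None:
    # A record with an existing key overrides the previous one.
    store[key] = record

append(rows, 11, {"City": "Berlin", "Price": 400.0})  # new key: row added
append(rows, 1, {"City": "Paris", "Price": 600.0})    # existing key: row replaced
print(len(rows), rows[1]["Price"])  # 2 600.0
```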
Append data to a store¶
A CSV file with the same structure as the initial source can be appended to an existing store:
[12]:
store.load_csv("data/additional_example.csv")
store.head()
[12]:
| ID | Date | Continent | Country | City | Color | Quantity | Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 21 | 2017-12-25 | Europe | UK | London | red | 1000.0 | 500.0 |
| 1 | 2019-03-01 | Europe | France | Paris | red | 2000.0 | 600.0 |
| 22 | 2017-12-25 | Europe | France | Lyon | blue | 2000.0 | 400.0 |
| 2 | 2019-01-02 | Europe | France | Lyon | red | 2000.0 | 400.0 |
| 23 | 2017-12-25 | Europe | France | Paris | green | 1600.0 | 500.0 |
[13]:
len(store)
[13]:
14
Join stores¶
Stores can be joined together using a mapping between columns. The mapping columns of the target store must be included in its key columns. If no mapping is specified, the columns with the same names are used.
[14]:
capital_store = session.read_csv(
"data/capitals.csv", keys=["Country name"], store_name="Capital"
)
capital_store.head()
[14]:
| Country name | Capital |
| --- | --- |
| France | Paris |
| UK | London |
| China | Beijing |
| India | Dehli |
[15]:
store.join(capital_store, mapping={"Country": "Country name"})
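Conceptually, the join matches each row of the base store to the target row whose key equals the mapped column. A dict-based sketch of that lookup (illustrative only; atoti performs the join inside the store):

```python
# Target store keyed on "Country name"; the mapping says which base
# column feeds that key. Illustrative only, not atoti code.
capitals = {"France": "Paris", "UK": "London"}
mapping = {"Country": "Country name"}  # base column -> target key column
base_column = next(iter(mapping))

rows = [
    {"ID": 1, "Country": "France"},
    {"ID": 5, "Country": "UK"},
]
for row in rows:
    # Look up the target row matching the mapped column value.
    row["Capital"] = capitals.get(row[base_column])
print(rows[0]["Capital"], rows[1]["Capital"])  # Paris London
```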
[16]:
cube = session.create_cube(store, "Cube")
See the store schema¶
It is possible to display the schema of the stores used by a cube, starting from its base store:
[17]:
cube.schema
[17]:
and the schema of all the stores of the session:
[18]:
session.stores.schema
[18]:
Automatic scenario creation¶
If the data you use to feed a store respects a specific directory structure, atoti can automatically split it into different source scenarios. You can control this behavior by using the load_csv method on the store's scenarios attribute.
[19]:
# Start by creating an empty store
scenario_store = session.create_store(
{
"ID": tt.types.INT,
"Date": tt.types.LOCAL_DATE,
"Continent": tt.types.STRING,
"Country": tt.types.STRING,
"Color": tt.types.STRING,
"Quantity": tt.types.DOUBLE,
"Price": tt.types.DOUBLE,
},
store_name="ScenarioStore",
)
scenario_store.scenarios.load_csv(
"data/scenario_directory/", base_scenario_directory="base",
)
As you can see, because the provided directory had two subdirectories, base and Scenario1, the Base scenario was fed with the data in the base subdirectory, and a new scenario was created with the data from the Scenario1 subdirectory.
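The directory convention can be sketched in plain Python: each subdirectory becomes a scenario, and the one matching base_scenario_directory feeds the base scenario. A minimal sketch with hypothetical helper names, not the atoti implementation:

```python
import tempfile
from pathlib import Path

# Build a toy scenario directory: one subdirectory per scenario.
root = Path(tempfile.mkdtemp())
for name in ("base", "Scenario1"):
    (root / name).mkdir()
    (root / name / "data.csv").write_text("ID,Price\n1,100\n")

def discover_scenarios(directory: Path, base_scenario_directory: str = "base"):
    # Each subdirectory maps to a scenario; the base directory feeds "Base".
    return [
        "Base" if sub.name == base_scenario_directory else sub.name
        for sub in sorted(directory.iterdir(), key=lambda p: p.name.lower())
        if sub.is_dir()
    ]

print(discover_scenarios(root))  # ['Base', 'Scenario1']
```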
[20]:
session.scenarios
[20]:
['scenario1', 'Base']