Array measures¶
atoti is designed to handle array data efficiently.
⚠️ This is an advanced tutorial, make sure to learn the basics first.
Loading arrays from CSV¶
atoti can load array from CSV files. The separator for array elements must be provided to the read_csv
method, and the CSV columns must use another separator. All the arrays in a column must have the same length.
[1]:
import atoti as tt
session = tt.create_session()
[2]:
store = session.read_csv(
"data/arrays.csv", keys=["Trade ID"], store_name="Store with arrays", array_sep=";"
)
store.head()
[2]:
Continent | Country | City | PnL | PnL array | Int array | |
---|---|---|---|---|---|---|
Trade ID | ||||||
0 | Europe | France | Paris | 3.469516 | doubleVector[10]{-0.46511920664962714, ...} | intVector[10]{1, ...} |
1 | Europe | France | Paris | 92.851919 | doubleVector[10]{10.587927935842618, ...} | intVector[10]{11, ...} |
2 | Europe | France | Lyon | 425.866214 | doubleVector[10]{68.43655816283167, ...} | intVector[10]{19, ...} |
3 | Europe | France | Bordeaux | 454.127060 | doubleVector[10]{-149.61978026139195, ...} | intVector[10]{14, ...} |
4 | Europe | UK | London | 187.015613 | doubleVector[10]{11.449089651224922, ...} | intVector[10]{13, ...} |
[3]:
cube = session.create_cube(store, "Cube")
Arrays default aggregations¶
As for scalar measures, atoti provides the default SUM
and MEAN
aggregations on array measures. They are applied element by element on the array.
[4]:
lvl = cube.levels
m = cube.measures
cube.query(m["PnL array.SUM"], levels=lvl["Continent"])
[4]:
PnL array.SUM | |
---|---|
Continent | |
Asia | [-14.696414629727794, -35.38205147348333, -62.... |
Europe | [6.711474315042935, -1.4346671713308226, -72.3... |
[5]:
cube.query(m["PnL array.MEAN"], levels=lvl["Continent"])
[5]:
PnL array.MEAN | |
---|---|
Continent | |
Asia | [-2.939282925945559, -7.076410294696666, -12.5... |
Europe | [1.1185790525071557, -0.23911119522180374, -12... |
Additional array functions¶
Arithmetic operations¶
[6]:
m["PnL +10 array"] = m["PnL array.SUM"] + 10.0
m["PnL -10 array"] = m["PnL array.SUM"] - 10.0
m["PnL x10 array"] = m["PnL array.SUM"] * 10.0
m["PnL /10 array"] = m["PnL array.SUM"] / 10.0
cube.query(
m["PnL +10 array"], m["PnL -10 array"], m["PnL x10 array"], m["PnL /10 array"]
)
[6]:
PnL +10 array | PnL -10 array | PnL x10 array | PnL /10 array | |
---|---|---|---|---|
0 | [2.0150596853151406, -26.816718644814152, -125... | [-17.98494031468486, -46.81671864481415, -145.... | [-79.8494031468486, -368.16718644814154, -1352... | [-0.798494031468486, -3.681671864481415, -13.5... |
Sum, mean, min or max of all the array elements¶
[7]:
m["PnL sum"] = tt.array.sum(m["PnL array.SUM"])
m["PnL mean"] = tt.array.mean(m["PnL array.SUM"])
m["min PnL"] = tt.array.min(m["PnL array.SUM"])
m["max PnL"] = tt.array.max(m["PnL array.SUM"])
cube.query(
m["PnL sum"], m["PnL mean"], m["min PnL"], m["max PnL"], levels=lvl["Continent"]
)
[7]:
PnL sum | PnL mean | min PnL | max PnL | |
---|---|---|---|---|
Continent | ||||
Asia | 357.870873 | 35.787087 | -222.821477 | 232.607605 |
Europe | 49.101546 | 4.910155 | -86.079550 | 195.824737 |
Length¶
[8]:
m["PnL array length"] = tt.array.len(m["PnL array.SUM"])
cube.query(m["PnL array length"])
[8]:
PnL array length | |
---|---|
0 | 10 |
Variance and Standard Deviation¶
[9]:
m["PnL array variance"] = tt.array.var(m["PnL array.SUM"])
m["PnL array standard deviation"] = tt.array.std(m["PnL array.SUM"])
cube.query(
m["PnL array variance"], m["PnL array standard deviation"], levels=lvl["Continent"]
)
[9]:
PnL array variance | PnL array standard deviation | |
---|---|---|
Continent | ||
Asia | 16149.335147 | 127.080034 |
Europe | 6952.275020 | 83.380304 |
Sort¶
[10]:
m["Sorted PnL array"] = tt.array.sort(m["PnL array.SUM"])
cube.query(m["Sorted PnL array"], levels=lvl["Continent"])
[10]:
Sorted PnL array | |
---|---|
Continent | |
Asia | [-222.82147745688468, -62.89214785725346, -35.... |
Europe | [-86.07955036608166, -72.30864853533133, -71.1... |
Quantile¶
[11]:
m["95 quantile"] = tt.array.quantile(m["PnL array.SUM"], 0.95, mode="simple")
m["95 exc quantile"] = tt.array.quantile(m["PnL array.SUM"], 0.95, mode="exc")
m["95 inc quantile"] = tt.array.quantile(m["PnL array.SUM"], 0.95, mode="inc")
m["95 centered quantile"] = tt.array.quantile(m["PnL array.SUM"], 0.95, mode="centered")
cube.query(
m["95 quantile"],
m["95 exc quantile"],
m["95 inc quantile"],
m["95 centered quantile"],
levels=[lvl["Continent"], lvl["Country"]],
)
[11]:
95 quantile | 95 exc quantile | 95 inc quantile | 95 centered quantile | ||
---|---|---|---|---|---|
Continent | Country | ||||
Asia | China | 195.220094 | 256.014079 | 201.299492 | 256.014079 |
India | 22.352927 | 25.270438 | 22.644678 | 25.270438 | |
Europe | France | 128.173379 | 163.731662 | 131.729208 | 163.731662 |
UK | 112.243790 | 146.715692 | 115.690980 | 146.715692 |
[12]:
m["95 linear quantile"] = tt.array.quantile(
m["PnL array.SUM"], 0.95, mode="inc", interpolation="linear"
)
m["95 lower quantile"] = tt.array.quantile(
m["PnL array.SUM"], 0.95, mode="inc", interpolation="lower"
)
m["95 higher quantile"] = tt.array.quantile(
m["PnL array.SUM"], 0.95, mode="inc", interpolation="higher"
)
m["95 nearest quantile"] = tt.array.quantile(
m["PnL array.SUM"], 0.95, mode="inc", interpolation="nearest"
)
m["95 midpoint quantile"] = tt.array.quantile(
m["PnL array.SUM"], 0.95, mode="inc", interpolation="midpoint"
)
cube.query(
m["95 linear quantile"],
m["95 lower quantile"],
m["95 higher quantile"],
m["95 nearest quantile"],
m["95 midpoint quantile"],
)
[12]:
95 linear quantile | 95 lower quantile | 95 higher quantile | 95 nearest quantile | 95 midpoint quantile | |
---|---|---|---|---|---|
0 | 299.288892 | 290.080802 | 306.822784 | 306.822784 | 298.451793 |
n greatest / n lowest¶
Returns an array with the n greatest/lowest values of a another array.
[13]:
m["Top 3 PnL array"] = tt.array.n_greatest(m["PnL array.SUM"], 3)
m["Bottom 2 PnL array"] = tt.array.n_lowest(m["PnL array.SUM"], 2)
cube.query(m["Top 3 PnL array"], m["Bottom 2 PnL array"])
[13]:
Top 3 PnL array | Bottom 2 PnL array | |
---|---|---|
0 | [290.0808017776343, 306.8227843309065, 195.349... | [-135.20079639258478, -308.90102782296634] |
nth greatest value / nth lowest value¶
Returns nth greatest or lowest value of a vector
[14]:
m["3rd greatest PnL"] = tt.array.nth_greatest(m["PnL array.SUM"], 3)
m["2nd lowest PnL"] = tt.array.nth_lowest(m["PnL array.SUM"], 2)
cube.query(m["3rd greatest PnL"], m["2nd lowest PnL"])
[14]:
3rd greatest PnL | 2nd lowest PnL | |
---|---|---|
0 | 195.349055 | -135.200796 |
Element at index¶
Extract the element at a given index:
[15]:
m["First element"] = m["PnL array.SUM"][0]
cube.query(m["First element"], m["PnL array.SUM"])
[15]:
First element | PnL array.SUM | |
---|---|---|
0 | -7.98494 | [-7.984940314684859, -36.81671864481415, -135.... |
With the create_parameter_hierarchy
function, it is possible to create a hierarchy corresponding to the indices of the array. This hierarchy can then be used to “slice” this array and create a measure which depends on the selected index.
[16]:
cube.create_parameter_hierarchy("Index", list(range(0, 10)))
m["PnL at index"] = m["PnL array.SUM"][lvl["Index"]]
cube.query(m["PnL at index"], levels=lvl["Index"])
[16]:
PnL at index | |
---|---|
Index | |
0 | -7.984940 |
1 | -36.816719 |
2 | -135.200796 |
3 | 30.371913 |
4 | 195.349055 |
5 | 306.822784 |
6 | 96.072594 |
7 | -308.901028 |
8 | 290.080802 |
9 | -22.821246 |
You can also build non-integer hierarchies and map each member to its index in the hierarchy using the index_measure
argument:
[17]:
from datetime import date, timedelta
cube.create_parameter_hierarchy(
"Dates",
[date(2020, 1, 1) + timedelta(days=x) for x in range(0, 10)],
index_measure="Date index",
)
m["PnL at date"] = m["PnL array.SUM"][m["Date index"]]
cube.query(m["Date index"], m["PnL at date"], levels=lvl["Dates"])
[17]:
Date index | PnL at date | |
---|---|---|
Dates | ||
2020-01-01 | 0 | -7.984940 |
2020-01-02 | 1 | -36.816719 |
2020-01-03 | 2 | -135.200796 |
2020-01-04 | 3 | 30.371913 |
2020-01-05 | 4 | 195.349055 |
2020-01-06 | 5 | 306.822784 |
2020-01-07 | 6 | 96.072594 |
2020-01-08 | 7 | -308.901028 |
2020-01-09 | 8 | 290.080802 |
2020-01-10 | 9 | -22.821246 |
In cases the indices need to be of arbitrary order or range, it is also possible to manually provide them as a list.
[18]:
cube.create_parameter_hierarchy(
"Custom dates",
[date(2020, 1, 1) + timedelta(days=x) for x in range(0, 10)],
indices=[9, 8, 7, 6, 5, 0, 1, 2, 3, 4],
index_measure="Custom date index",
)
m["PnL at custom date"] = m["PnL array.SUM"][m["Custom date index"]]
cube.query(m["Custom date index"], m["PnL at custom date"], levels=lvl["Custom dates"])
[18]:
Custom date index | PnL at custom date | |
---|---|---|
Custom dates | ||
2020-01-01 | 9 | -22.821246 |
2020-01-02 | 8 | 290.080802 |
2020-01-03 | 7 | -308.901028 |
2020-01-04 | 6 | 96.072594 |
2020-01-05 | 5 | 306.822784 |
2020-01-06 | 0 | -7.984940 |
2020-01-07 | 1 | -36.816719 |
2020-01-08 | 2 | -135.200796 |
2020-01-09 | 3 | 30.371913 |
2020-01-10 | 4 | 195.349055 |
Array slices¶
Extract a slice of the array:
[19]:
m["First 2 elements"] = m["PnL array.SUM"][0:2]
cube.query(m["First 2 elements"], m["PnL array.SUM"])
[19]:
First 2 elements | PnL array.SUM | |
---|---|---|
0 | [-7.984940314684859, -36.81671864481415] | [-7.984940314684859, -36.81671864481415, -135.... |
Load DataFrame with lists¶
atoti can load a pandas DataFrame containing NumPy arrays and Python lists
[20]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"Index": [0, 1, 2],
"NumPy array": [
np.array([3.2, 1.0, 8, 9, 4.5, 7, 6, 18]),
np.array([4.2, 4.0, 4, 9, 4.5, 8, 7, 8]),
np.array([12, 1.0, 8, 9, 4.5, 7, 6, 18]),
],
"Python list": [
[3.2, 1.0, 8, 9, 4.5, 7, 6, 18],
[4.2, 4.0, 4, 9, 4.5, 8, 7, 8],
[12, 1.0, 8, 9, 4.5, 7, 6, 18],
],
}
)
df
[20]:
Index | NumPy array | Python list | |
---|---|---|---|
0 | 0 | [3.2, 1.0, 8.0, 9.0, 4.5, 7.0, 6.0, 18.0] | [3.2, 1.0, 8, 9, 4.5, 7, 6, 18] |
1 | 1 | [4.2, 4.0, 4.0, 9.0, 4.5, 8.0, 7.0, 8.0] | [4.2, 4.0, 4, 9, 4.5, 8, 7, 8] |
2 | 2 | [12.0, 1.0, 8.0, 9.0, 4.5, 7.0, 6.0, 18.0] | [12, 1.0, 8, 9, 4.5, 7, 6, 18] |
[21]:
pd_store = session.read_pandas(df, "pandas")
pd_store
[21]:
- pandas
- Index
- key: False
- nullable: False
- type: long
- NumPy array
- key: False
- nullable: True
- type: atoti_numpy_double[][ ]
- Python list
- key: False
- nullable: True
- type: atoti_list_double[][,]
- Index
[22]:
pd_store.head()
[22]:
Index | NumPy array | Python list | |
---|---|---|---|
0 | 0 | doubleVector[8]{3.2, ...} | doubleVector[8]{3.2, ...} |
1 | 1 | doubleVector[8]{4.2, ...} | doubleVector[8]{4.2, ...} |
2 | 2 | doubleVector[8]{12.0, ...} | doubleVector[8]{12.0, ...} |
[ ]: