atoti.Session.read_csv()
- Session.read_csv(path, /, *, keys=frozenset({}), table_name=None, separator=',', encoding='utf-8', process_quotes=True, partitioning=None, types={}, columns={}, array_separator=None, date_patterns={}, default_values={}, client_side_encryption=None, **kwargs)
Read a CSV file into a table.
- Parameters:
path –
The path to the CSV file to load.
.gz, .tar.gz and .zip files containing compressed CSV(s) are also supported.
The path can also be a glob pattern (e.g. path/to/directory/**.*.csv).
keys (Set[str] | Sequence[str]) –
The columns that will become keys of the table.
If a Set is given, the keys will be ordered as the table columns.
table_name (str | None) – The name of the table to create. Required when path is a glob pattern. Otherwise, defaults to the capitalized final component of the path argument.
separator (str | None) –
The character separating the values of each line.
If None, the separator will be inferred in a preliminary partial read.
encoding (str) – The encoding to use to read the CSV.
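The idea of inferring the separator from a partial read (separator=None) can be illustrated with Python's standard library. This is only a sketch: `csv.Sniffer` stands in for the inference step and is not atoti's own algorithm, which this page does not describe. The sample text and the candidate delimiter list are made up for the illustration.

```python
import csv

# Hypothetical first lines of a file whose separator is unknown.
sample = (
    "city;country;population\n"
    "Tokyo;Japan;14094034\n"
    "Johannesburg;South Africa;4803262\n"
)

# Sniff the dialect from the sample only, restricting the candidates
# to a few common separators (an assumption of this sketch).
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
inferred_separator = dialect.delimiter

print(inferred_separator)
```

When the inference is not reliable for a given file, passing the separator explicitly avoids the preliminary read.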
process_quotes (bool | None) –
Whether double quotes should be processed to follow the official CSV specification:
True:
- Each field may or may not be enclosed in double quotes (however, some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed in double quotes, then double quotes may not appear inside the fields.
- A double quote appearing inside a field must be escaped by preceding it with another double quote.
- Fields containing line breaks, double quotes, and commas should be enclosed in double quotes.
False: all double quotes within a field will be treated as any regular character, following Excel’s behavior. In this mode, it is expected that fields are not enclosed in double quotes. It is also not possible to have a line break inside a field.
None: the behavior will be inferred in a preliminary partial read.
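The difference between the two process_quotes modes can be seen on a single line of CSV. The sketch below uses Python's `csv` module as a stand-in (it is not atoti's parser): `csv.QUOTE_MINIMAL` approximates process_quotes=True and `csv.QUOTE_NONE` approximates process_quotes=False. The `parse` helper and the sample data are hypothetical.

```python
import csv
import io

# One header line and one record whose second field is quoted,
# contains an escaped double quote ("") and a comma.
SAMPLE = 'id,comment\n1,"She said ""hi"", then left"\n'


def parse(text, process_quotes):
    """Parse CSV text with or without double-quote processing."""
    quoting = csv.QUOTE_MINIMAL if process_quotes else csv.QUOTE_NONE
    return list(csv.reader(io.StringIO(text, newline=""), quoting=quoting))


# Quotes processed: the enclosing quotes are removed, "" becomes ",
# and the comma inside the quoted field is kept as data.
quoted = parse(SAMPLE, process_quotes=True)

# Quotes not processed: every double quote is an ordinary character,
# so the comma inside the field splits it in two.
raw = parse(SAMPLE, process_quotes=False)

print(quoted[1])
print(raw[1])
```

With quote processing the record has two fields; without it the same line yields three, which is why the False mode expects fields that are not enclosed in double quotes.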
partitioning (str | None) –
The description of how the data will be split across partitions of the table.
Default rules:
- Only non-joined tables are automatically partitioned.
- Tables are automatically partitioned by hashing their key columns. If there are no key columns, all the dictionarized columns are hashed.
- Joined tables can only use a sub-partitioning of the table referencing them.
- Automatic partitioning is done modulo the number of available cores.
Example
hash4(country) splits the data across 4 partitions based on the country column’s hash value.
types (Mapping[str, DataType]) – Types for some or all columns of the table. Types for unspecified columns will be inferred from the first 1,000 lines.
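As a rough picture of what a partitioning such as hash4(country) does, the sketch below assigns rows to one of 4 partitions from a hash of the country value. It is an illustration only: the hash function atoti uses is not stated here, so CRC32 is an assumption, and `hash_partition` and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, column, partition_count):
    """Group rows by hash(row[column]) modulo partition_count."""
    partitions = defaultdict(list)
    for row in rows:
        # CRC32 is a stand-in for whatever hash the engine really uses.
        digest = zlib.crc32(str(row[column]).encode("utf-8"))
        partitions[digest % partition_count].append(row)
    return dict(partitions)


rows = [
    {"city": "Tokyo", "country": "Japan"},
    {"city": "Osaka", "country": "Japan"},
    {"city": "Johannesburg", "country": "South Africa"},
]

partitions = hash_partition(rows, "country", 4)
```

Rows sharing the same country always land in the same partition, while the number of non-empty partitions depends on how the hashed values are distributed.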
columns –
Mapping from file column names to table column names. When the mapping is not empty, columns of the file absent from the mapping keys will not be loaded. Other parameters accepting column names expect to be passed table column names (i.e. values of this mapping) and not file column names.
>>> import csv
>>> from pathlib import Path
>>> from tempfile import mkdtemp
>>> directory = mkdtemp()
>>> file_path = Path(directory) / "largest-cities.csv"
>>> with open(file_path, "w") as csv_file:
...     writer = csv.writer(csv_file)
...     writer.writerows(
...         [
...             ("city", "area", "country", "population"),
...             ("Tokyo", "Kantō", "Japan", 14_094_034),
...             ("Johannesburg", "Gauteng", "South Africa", 4_803_262),
...             (
...                 "Barcelona",
...                 "Community of Madrid",
...                 "Madrid",
...                 3_223_334,
...             ),
...         ]
...     )
Dropping the population column and renaming and reordering the remaining ones:
>>> table = session.read_csv(
...     file_path,
...     columns={"country": "Country", "area": "Region", "city": "City"},
...     keys={"Country"},
... )
>>> table.head().sort_index()
                           Region          City
Country
Japan                       Kantō         Tokyo
Madrid        Community of Madrid     Barcelona
South Africa              Gauteng  Johannesburg
array_separator (str | None) –
The character separating array elements.
If not None, any field containing this separator will be parsed as an array.
date_patterns (Mapping[str, str]) – A column name to date pattern mapping that can be used when the built-in date parsers fail to recognize the formatted dates in the passed files.
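The array_separator rule (a field becomes an array as soon as it contains the separator) can be sketched in a few lines. `parse_field` is a hypothetical helper written for this illustration, not part of atoti, and it leaves the elements as strings instead of converting them to the column's element type.

```python
def parse_field(field, array_separator=None):
    """Split a raw CSV field into a list when it contains the separator."""
    if array_separator is not None and array_separator in field:
        return field.split(array_separator)
    # No separator configured, or none found: keep the field as a scalar.
    return field


as_array = parse_field("1.5;2.5;3.0", array_separator=";")
as_scalar = parse_field("Tokyo", array_separator=";")
untouched = parse_field("1.5;2.5;3.0")
```

Because any field containing the separator is affected, the chosen character should not appear in ordinary scalar values of the file.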
default_values (Mapping[str, ConstantValue | None]) – Mapping from column name to column default_value.
client_side_encryption (ClientSideEncryptionConfig | None) – The client side encryption configuration to use when loading data.
- Return type: