Dataitem kinds
At the moment, we support the following kinds:
- table: represents a table
- croissant: represents an ML Croissant dataset
For each different kind, the Dataitem object has its own subclass with different spec and status attributes.
Table
The table kind indicates that the dataitem is a generic table. It is useful if you intend to manipulate the dataitem as a dataframe, since it provides methods to do so. The default dataframe framework used to represent a table is pandas.
Table spec parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| path | str | Path of the dataitem, can be a local path or a remote path, a single filepath or a directory/partition. | required |
| schema | TableSchema | Frictionless table schema | None |
Table methods
The table kind has the following additional methods:
as_df
Read the dataitem file (csv or parquet) from spec.path as a DataFrame. Additional keyword arguments are passed to the DataFrame reader function, such as pandas's read_csv or read_parquet.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_format | str | Format of the file to read. By default, it will be inferred from the extension of the file. | None |
| engine | str | Dataframe framework, by default pandas. | 'pandas' |
| **kwargs | dict | Keyword arguments passed to the read_df function. | {} |
Returns:
| Type | Description |
|---|---|
| Any | DataFrame. |
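The file_format default described above (inferred from the file extension) can be illustrated with a small standard-library sketch. This is not the library's implementation: the helper name `infer_file_format` and the `SUPPORTED_FORMATS` set are hypothetical, and the only assumption taken from the text is that csv and parquet are the readable formats.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical set: the text above only mentions csv and parquet.
SUPPORTED_FORMATS = {"csv", "parquet"}


def infer_file_format(path: str, file_format: Optional[str] = None) -> str:
    """Return the explicit format if given, else the lowercased file extension."""
    if file_format is not None:
        return file_format
    # urlparse exposes the object key or file path in .path for both
    # remote (s3://bucket/key) and local paths.
    suffix = PurePosixPath(urlparse(path).path).suffix
    fmt = suffix.lstrip(".").lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Cannot infer a supported format from {path!r}")
    return fmt
```

Passing file_format explicitly is the way around a path whose extension is missing or misleading, for example a directory/partition path.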
write_df
Write a DataFrame as parquet/csv/table to the dataitem spec.path.
Keyword arguments are passed to the DataFrame writer function, such as
pandas's to_csv or to_parquet.
Note that by default the index is dropped when writing the dataframe. To
keep the index, you can pass index=True as a keyword argument.
If the dataitem path uses the sql scheme, the dataframe will be written to the
table specified in the path (sql://
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Any | DataFrame to write. | required |
| extension | str | Extension of the file (supported parquet and csv). | None |
| **kwargs | dict | Keyword arguments passed to the write_df function. | {} |
Returns:
| Type | Description |
|---|---|
| str | Path to the written dataframe. |
Examples:
>>> import digitalhub as dh
>>> import pandas as pd
>>>
>>> p = dh.get_project("my_project")
>>> df = pd.read_csv("data/my_data.csv")
>>> di = p.new_dataitem(
... name="my_dataitem",
... kind="table",
... path="s3://my-bucket/my-data.parquet",
... )
>>> di.write_df(
... df,
... extension="parquet",
... index=True,
... )
's3://my-bucket/my-data.parquet'
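To make the note about the dropped index concrete, the following sketch uses plain pandas (no digitalhub calls) to show what the index keyword changes in the written output. It assumes that write_df forwards index to the pandas writer as described; the in-memory buffers stand in for the real destination.

```python
import io

import pandas as pd

df = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])

# index=False mirrors the documented default: the index is not written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

# index=True keeps the index as the first column of the output.
with_index = io.StringIO()
df.to_csv(with_index, index=True)

csv_without_index = without_index.getvalue()
csv_with_index = with_index.getvalue()
```

If the index carries meaningful labels (timestamps, identifiers), pass index=True or move it into a regular column with reset_index() before writing.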
Croissant
The croissant kind indicates that the dataitem stores an ML Croissant dataset, defined by a
metadata.json file and its referenced local files. Use this kind when you want to load the
dataset through the mlcroissant library.
When logging a Croissant dataitem, ensure the metadata file is named metadata.json. If you
set an explicit path, it must be an S3 partition path ending with /.
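The two constraints above can be checked before logging. The sketch below is a hypothetical pre-flight check written with the standard library only. The helper names are not part of the digitalhub API, and the S3 rule is encoded exactly as stated in the text (s3 scheme, trailing slash).

```python
from pathlib import Path
from urllib.parse import urlparse

METADATA_FILENAME = "metadata.json"


def is_s3_partition_path(path: str) -> bool:
    """True when path uses the s3 scheme, names a bucket, and ends with '/'."""
    parsed = urlparse(path)
    return parsed.scheme == "s3" and bool(parsed.netloc) and path.endswith("/")


def has_metadata_file(directory: str) -> bool:
    """True when the local directory contains a file named metadata.json."""
    return (Path(directory) / METADATA_FILENAME).is_file()
```

For example, `is_s3_partition_path("s3://my-bucket/croissant/")` is accepted, while the same path without the trailing slash is rejected.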
Croissant spec parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the Croissant dataset location (directory/partition containing metadata.json). | required |
Croissant methods
The croissant kind has the following additional methods:
as_dataset
Get the Croissant Dataset object from the Dataitem.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| overwrite | bool | Flag to indicate overwrite of local files. | False |
Returns:
| Type | Description |
|---|---|
| Dataset | Croissant Dataset object. |
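Before iterating over a Croissant dataset it helps to know which record sets its metadata.json declares. The sketch below reads them with the standard library only. The metadata is a made-up minimal example, and `list_record_sets` is a hypothetical helper, not part of digitalhub or mlcroissant. It assumes the Croissant layout in which record sets are listed under the top-level recordSet key.

```python
import json
from typing import List

# Minimal, made-up metadata used only for illustration.
EXAMPLE_METADATA = """
{
  "@type": "sc:Dataset",
  "name": "toy_dataset",
  "recordSet": [
    {"@type": "cr:RecordSet", "name": "default"},
    {"@type": "cr:RecordSet", "name": "labels"}
  ]
}
"""


def list_record_sets(metadata_text: str) -> List[str]:
    """Return the names of the record sets declared in a Croissant metadata document."""
    metadata = json.loads(metadata_text)
    return [record_set["name"] for record_set in metadata.get("recordSet", [])]


record_set_names = list_record_sets(EXAMPLE_METADATA)
```

With mlcroissant, one of these names is what would typically be passed as record_set when requesting records from the Dataset object returned by as_dataset.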