Dataitem kinds

At the moment, we support the following kinds:

table: represents a table
croissant: represents an ML Croissant dataset

Each kind has its own Dataitem subclass, with kind-specific spec and status attributes.

Table

The table kind indicates that the dataitem is a generic table. It is useful if you intend to manipulate the dataitem as a dataframe, and it provides methods to do so. By default, a table is represented as a pandas DataFrame.

Table spec parameters

Parameter	Type	Description	Default
`path`	str	Path of the dataitem, can be a local path or a remote path, a single filepath or a directory/partition.	required
`schema`	TableSchema	Frictionless table schema	`None`

Table methods

The table kind has the following additional methods:

`as_df`

Read the dataitem file (csv or parquet) from spec.path as a DataFrame. Additional keyword arguments can be passed to this function; they are forwarded to the DataFrame reader function, such as pandas's read_csv or read_parquet.

Parameters:

Name	Type	Description	Default
`file_format`	`str`	Format of the file to read. By default, it will be inferred from the extension of the file.	`None`
`engine`	`str`	Dataframe framework, by default pandas.	`'pandas'`
`**kwargs`	`dict`	Keyword arguments passed to the read_df function.	`{}`

Returns:

Type	Description
`Any`	DataFrame.
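
The format inference and keyword forwarding described above can be pictured with a minimal sketch. This is not the library's implementation: `infer_file_format` and `read_table` are hypothetical helpers, and the sketch assumes that only the csv and parquet extensions are recognised and that pandas is the engine.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: only these two formats are supported, as stated for as_df.
SUPPORTED_FORMATS = ("csv", "parquet")


def infer_file_format(path: str, file_format: Optional[str] = None) -> str:
    """Return the format to read, preferring an explicit file_format."""
    # Fall back to the extension of the last path component.
    candidate = file_format or PurePosixPath(path).suffix.lstrip(".")
    candidate = candidate.lower()
    if candidate not in SUPPORTED_FORMATS:
        raise ValueError(f"Cannot determine a supported format for {path!r}")
    return candidate


def read_table(path: str, file_format: Optional[str] = None, **kwargs):
    """Hypothetical reader: pick a pandas reader and forward kwargs to it."""
    import pandas as pd  # imported lazily so the helper above needs only the stdlib

    readers = {"csv": pd.read_csv, "parquet": pd.read_parquet}
    return readers[infer_file_format(path, file_format)](path, **kwargs)
```

With these helpers, a call such as `read_table("data/my_data.csv", sep=";")` would forward `sep` to `pandas.read_csv`. A directory or partition path has no extension, so in this sketch it needs an explicit `file_format`.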

`write_df`

Write a DataFrame as parquet, csv, or a SQL table to the dataitem's spec.path. Keyword arguments are passed to the DataFrame writer function, such as pandas's to_csv or to_parquet. Note that by default the index is dropped when writing the dataframe. To keep the index, pass index=True as a keyword argument. If the dataitem path uses the SQL scheme, the dataframe is written to the table specified in the path (sql://(/)/).

Parameters:

Name	Type	Description	Default
`df`	`Any`	DataFrame to write.	required
`extension`	`str`	Extension of the file (supported: parquet and csv).	`None`
`**kwargs`	`dict`	Keyword arguments passed to the write_df function.	`{}`

Returns:

Type	Description
`str`	Path to the written dataframe.

Examples:

>>> import digitalhub as dh
>>> import pandas as pd
>>>
>>> p = dh.get_project("my_project")
>>> df = pd.read_csv("data/my_data.csv")
>>> di = p.new_dataitem(
...     name="my_dataitem",
...     kind="table",
...     path="s3://my-bucket/my-data.parquet",
... )
>>> di.write_df(
...     df,
...     extension="parquet",
...     index=True,
... )
's3://my-bucket/my-data.parquet'
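
To see what dropping or keeping the index means in practice, the following sketch uses plain pandas only, without digitalhub. It passes `index` explicitly to `to_csv`, which is the kind of keyword argument that `write_df` is described as forwarding. The sample data is made up for illustration.

```python
import pandas as pd

# A small frame with a meaningful (non-default) index.
df = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])

# index=False drops the row labels; index=True writes them as the first column.
without_index = df.to_csv(index=False).splitlines()
with_index = df.to_csv(index=True).splitlines()

print(without_index)  # ['value', '10', '20']
print(with_index)     # [',value', 'a,10', 'b,20']
```

If the row labels carry information, such as timestamps or identifiers, keeping the index avoids losing it on write.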

Croissant

The croissant kind indicates that the dataitem stores an ML Croissant dataset, defined by a metadata.json file and its referenced local files. Use this kind when you want to load the dataset through the mlcroissant library.

When logging a Croissant dataitem, ensure the metadata file is named metadata.json. If you set an explicit path, it must be an S3 partition path ending with /.
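
The two rules above can be checked before logging with a small sketch. Both helpers are hypothetical and not part of the SDK. They assume that an explicit path counts as an S3 partition path when it starts with `s3://` and ends with `/`, and that the dataset directory is available locally.

```python
from pathlib import Path


def is_valid_croissant_path(path: str) -> bool:
    """Check that an explicit path looks like an S3 partition path."""
    prefix = "s3://"
    return path.startswith(prefix) and path.endswith("/") and len(path) > len(prefix)


def has_metadata_file(directory: str) -> bool:
    """Check that the local dataset directory contains metadata.json."""
    return (Path(directory) / "metadata.json").is_file()
```

For example, `is_valid_croissant_path("s3://my-bucket/my-dataset/")` returns `True`, while the same path without the trailing slash returns `False`.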

Croissant spec parameters

Parameter	Type	Description	Default
`path`	str	Path to the Croissant dataset location (directory/partition containing `metadata.json`).	required

Croissant methods

The croissant kind has the following additional methods:

`as_dataset`

Get the Croissant Dataset object from the Dataitem.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	Flag to indicate overwrite of local files.	`False`

Returns:

Type	Description
`Dataset`	Croissant Dataset object.