Dataitem kinds

At the moment, we support the following kinds:

  • table: represents a table
  • croissant: represents an ML Croissant dataset

Each kind has its own Dataitem subclass, with its own spec and status attributes.

Table

The table kind indicates that the dataitem is a generic table. It is useful if you intend to manipulate the dataitem as a dataframe, and it provides methods to do so. By default, a table is represented as a pandas DataFrame.

Table spec parameters

Parameter | Type | Description | Default
path | str | Path of the dataitem, can be a local path or a remote path, a single filepath or a directory/partition. | required
schema | TableSchema | Frictionless table schema | None
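
As a rough illustration of what a Frictionless table schema describes, the sketch below builds a Table Schema descriptor as a plain Python dict: a `fields` list where each entry has a `name` and a `type`. The column names and the `field_names` helper are hypothetical, and how the descriptor is turned into a TableSchema object for the spec is not covered here.

```python
import json

# A Frictionless Table Schema descriptor is a mapping with a "fields" list.
# The column names below are made up for illustration.
schema = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "score", "type": "number"},
    ]
}


def field_names(descriptor: dict) -> list:
    """Return the column names declared in a Table Schema descriptor."""
    return [field["name"] for field in descriptor["fields"]]


# Print the descriptor as JSON, the form in which such schemas are usually stored.
print(json.dumps(schema, indent=2))
```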

Table methods

The table kind has the following additional methods:

as_df

Read the dataitem file (csv or parquet) from spec.path as a DataFrame. Additional keyword arguments are passed to the DataFrame reader function, such as pandas's read_csv or read_parquet.

Parameters:

Name | Type | Description | Default
file_format | str | Format of the file to read. By default, it will be inferred from the extension of the file. | None
engine | str | Dataframe framework, by default pandas. | 'pandas'
**kwargs | dict | Keyword arguments passed to the read_df function. | {}

Returns:

Type | Description
Any | DataFrame.
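
The parameters above can be combined as in the following minimal sketch. It assumes `dataitem` is a table dataitem whose file is a semicolon-separated CSV. The helper name `read_semicolon_csv` is hypothetical, and `sep` is a pandas `read_csv` keyword that is expected to be forwarded through `**kwargs`.

```python
from typing import Any


def read_semicolon_csv(dataitem: Any) -> Any:
    """Read a table dataitem stored as a semicolon-separated CSV file."""
    return dataitem.as_df(
        file_format="csv",  # explicit format, so the file extension is not consulted
        engine="pandas",    # the documented default, spelled out for clarity
        sep=";",            # extra keyword, forwarded to the reader (pandas.read_csv)
    )
```

Passing `file_format` explicitly is mostly useful when the file has no extension or a misleading one. Otherwise it can be left out and inferred.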

write_df

Write a DataFrame as parquet, csv, or a SQL table to the dataitem's spec.path. Keyword arguments are passed to the DataFrame writer function, such as pandas's to_csv or to_parquet. Note that by default the index is dropped when writing the dataframe. To keep the index, pass index=True as a keyword argument. If the dataitem path uses the SQL scheme, the dataframe is written to the table specified in the path (sql://(/)/).

Parameters:

Name | Type | Description | Default
df | Any | DataFrame to write. | required
extension | str | Extension of the file (parquet and csv are supported). | None
**kwargs | dict | Keyword arguments passed to the write_df function. | {}

Returns:

Type | Description
str | Path to the written dataframe.

Examples:

>>> import digitalhub as dh
>>> import pandas as pd
>>>
>>> p = dh.get_project("my_project")
>>> df = pd.read_csv("data/my_data.csv")
>>> di = p.new_dataitem(
...     name="my_dataitem",
...     kind="table",
...     path="s3://my-bucket/my-data.parquet",
... )
>>> di.write_df(
...     df,
...     extension="parquet",
...     index=True,
... )
's3://my-bucket/my-data.parquet'

Croissant

The croissant kind indicates that the dataitem stores an ML Croissant dataset, defined by a metadata.json file and its referenced local files. Use this kind when you want to load the dataset through the mlcroissant library.

When logging a Croissant dataitem, ensure the metadata file is named metadata.json. If you set an explicit path, it must be an S3 partition path ending with /.
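
The two constraints above can be checked before logging with a small pre-flight sketch such as the one below. Both helper names are hypothetical and not part of the SDK, and the checks are deliberately shallow: they only look at the shape of the path string and at the presence of the file.

```python
from pathlib import Path


def is_valid_croissant_path(path: str) -> bool:
    """Check that an explicit path looks like an S3 partition path (s3://... ending with '/')."""
    prefix = "s3://"
    return path.startswith(prefix) and path.endswith("/") and len(path) > len(prefix) + 1


def has_metadata_file(source_dir: str) -> bool:
    """Check that the local dataset directory contains a file named metadata.json."""
    return (Path(source_dir) / "metadata.json").is_file()
```

For example, `is_valid_croissant_path("s3://my-bucket/croissant/")` passes, while the same path without the trailing `/` does not.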

Croissant spec parameters

Parameter | Type | Description | Default
path | str | Path to the Croissant dataset location (directory/partition containing metadata.json). | required

Croissant methods

The croissant kind has the following additional methods:

as_dataset

Get the Croissant Dataset object from the Dataitem.

Parameters:

Name | Type | Description | Default
overwrite | bool | Flag to indicate overwrite of local files. | False

Returns:

Type | Description
Dataset | Croissant Dataset object.
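
A minimal sketch of how the returned object might be consumed is shown below. It assumes the returned Dataset is an mlcroissant `Dataset`, which exposes `records(record_set=...)`. The helper name `first_records` is hypothetical, and the record set name must match one declared in the dataset's metadata.json.

```python
from itertools import islice
from typing import Any


def first_records(dataitem: Any, record_set: str, limit: int = 3) -> list:
    """Return the first `limit` records of one record set of a croissant dataitem."""
    # overwrite=False is the documented default: existing local files are kept.
    dataset = dataitem.as_dataset(overwrite=False)
    # records() yields records lazily, so islice avoids reading the whole record set.
    return list(islice(dataset.records(record_set=record_set), limit))
```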