Data and transformations
The platform supports data of different types, which are stored and operated on by the underlying storage subsystems.
Digital Hub natively supports two types of storage:
- persistent object storage (datalake S3 Minio), which manages immutable data in the form of files.
- operational relational data storage (PostgreSQL database), which is used for efficient querying of mutable data. Postgres is rich with extensions, most notably for geo-spatial and time-series data.
The data is represented in the platform as entities of different types, depending on its usage and format. More specifically, we distinguish:
- data items, which represent immutable tabular datasets resulting from different transformation operations and ready for use in different types of analysis. Data items are enriched with metadata (e.g., versions, lineage, stats, profiling, schema) and unique keys, and are managed and persisted to the datalake directly by the platform in the form of Apache Parquet files.
- artifacts, which represent arbitrary files stored to the datalake with some extra metadata, but are not limited to tabular formats.
Each data entity may be accessed and manipulated via the UI or using the API, e.g., with the SDK.
Manipulating data via UI
Artifacts
Artifacts can be created and managed as entities with the console. This can be done through the user's menu or using the shortcut on the dashboard.
Pressing the Artifact button in the side menu shows the paginated list of artifacts. From this page it is possible to:
- create a new artifact
- expand an artifact and see the last 5 versions
- show the details of an artifact
- edit an artifact
- delete an artifact
- filter the artifacts by name and kind
In the next section, we will see how to create, read, update and delete artifacts.
CRUD
Here we analyze how to Create, Read, Update and Delete Artifacts using the UI, similarly to what happens with the SDK.
Create
An artifact is created by pressing the CREATE button on the Artifacts list page. After pressing the button, a dialog asking for the artifact's parameters is shown.
The mandatory parameters are:
- name: the name of the artifact
- kind: the kind of the artifact
The only mandatory Metadata parameter is:
- path: the remote path where the artifact is stored
The other Metadata parameters are optional and mutable after creation:
- name: the name of the artifact
- version: the version of the artifact
- description: a human readable description of the artifact
- updated: the date of the last modification made to the artifact
- src_path: local path of the artifact, used in case of upload into remote storage
- labels: the labels of the artifact
Read
To read an artifact, click the SHOW button.
The page shows the following details:
- id: the id of the artifact
- kind: the kind of the artifact
- Key: the unique URL that identifies the resource
The Metadata values are:
- name: the name of the artifact
- description: a human readable description of the artifact
- version: the version of the artifact
- created: the creation date of the artifact
- updated: the date of the last modification made to the artifact
- labels: the labels of the artifact
- path: the remote path where the artifact is stored
- src_path: local path of the artifact, used in case of upload into remote storage
All versions of the resource are listed on the right side of the page, with the current version highlighted. Selecting a different entry shows the corresponding version.
From the menu at the top it is possible to EDIT, DELETE, INSPECT or EXPORT the current artifact. The first two options are described in dedicated sections of this document.
Clicking INSPECTOR opens a dialog that shows the artifact in JSON format.
Clicking the EXPORT button downloads the artifact as a yaml file.
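As an illustration of what the JSON view might contain, the sketch below assembles a record from the fields listed above and serialises it with Python's json module. The nesting of the Metadata values under a metadata key and every value shown are assumptions made for this example; the actual output of the console may be organised differently.

```python
import json

# Hypothetical artifact record: field names come from the lists above,
# all values are placeholders and the nesting is an assumption.
artifact_view = {
    "id": "<artifact-id>",
    "kind": "artifact",
    "key": "<unique-url-of-the-resource>",
    "metadata": {
        "name": "my-artifact",
        "description": "A human readable description",
        "version": "<version>",
        "labels": ["example"],
        "path": "s3://my-bucket/my-artifact.ext",
    },
}

# Serialise to indented JSON, similar in spirit to an inspector dialog.
text = json.dumps(artifact_view, indent=2)
print(text)

# Parsing the text back yields the same structure.
restored = json.loads(text)
```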
Update
You can update an artifact's Metadata by pressing the EDIT button in the list or in the show page. All the Metadata values can be modified:
- name: the name of the artifact
- description: a human readable description of the artifact
- version: the version of the artifact
- updated: the date of the last modification made to the artifact
- labels: the labels of the artifact
- path: the remote path where the artifact is stored
- src_path: local path of the artifact, used in case of upload into remote storage
Delete
You can delete an artifact from the list or from the detail page by pressing the DELETE button. A dialog asking for confirmation is shown.
Dataitems
Dataitems can be created and managed as entities with the console. This can be done through the user's menu or using the shortcut on the dashboard.
Pressing the Data items button in the side menu shows the paginated list of dataitems. From this page it is possible to:
- create a new dataitem
- expand a dataitem and see the last 5 versions
- show the details of a dataitem
- edit a dataitem
- delete a dataitem
- filter the dataitems by name and kind
In the next section, we will see how to create, read, update and delete dataitems.
CRUD
Here we analyze how to Create, Read, Update and Delete Dataitems using the UI, similarly to what happens with the SDK.
Create
A dataitem is created by pressing the CREATE button on the Dataitems list page. After pressing the button, a dialog asking for the dataitem's parameters is shown.
The mandatory parameters are:
- name: the name of the dataitem
- kind: the kind of the dataitem
The only mandatory Metadata parameter is:
- path: the remote path where the dataitem is stored
The other Metadata parameters are optional and mutable after creation:
- name: the name of the dataitem
- version: the version of the dataitem
- description: a human readable description of the dataitem
- updated: the date of the last modification made to the dataitem
- src_path: local path of the dataitem, used in case of upload into remote storage
- labels: the labels of the dataitem
Kind
There are 2 possible kinds for dataitems:
- dataitem: indicates that the dataitem is a generic dataitem. There are no specific attributes in the creation page.
- table: indicates that the dataitem points to a table. The optional parameter is the schema of the table in table_schema format.
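The exact layout expected for the schema is not spelled out here. Assuming that table_schema refers to the Frictionless Table Schema convention (a descriptor with a fields list, where each field has a name and a type), a minimal descriptor could be built as sketched below. The helper make_table_schema and the column names are hypothetical and only serve the example.

```python
import json

def make_table_schema(columns):
    """Build a Table Schema-style descriptor from (name, type) pairs.

    Assumption: the platform accepts the Frictionless Table Schema layout.
    """
    return {"fields": [{"name": name, "type": col_type} for name, col_type in columns]}

# Hypothetical columns of a small tabular dataset.
schema = make_table_schema([
    ("id", "integer"),
    ("city", "string"),
    ("temperature", "number"),
])

print(json.dumps(schema, indent=2))
```

The resulting JSON text is what one would paste into the schema field, if the assumption about the format holds.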
Read
To read a dataitem, click the SHOW button.
The page shows the following details:
- id: the id of the dataitem
- kind: the kind of the dataitem
- Key: the unique URL that identifies the resource
The Metadata values are:
- name: the name of the dataitem
- description: a human readable description of the dataitem
- version: the version of the dataitem
- created: the creation date of the dataitem
- updated: the date of the last modification made to the dataitem
- labels: the labels of the dataitem
- path: the remote path where the dataitem is stored
Depending on the kind of the dataitem, there may also be a schema, which indicates that the dataitem points to a table.
All versions of the resource are listed on the right side of the page, with the current version highlighted. Selecting a different entry shows the corresponding version.
From the menu at the top it is possible to EDIT, DELETE, INSPECT or EXPORT the current dataitem. The first two options are described in dedicated sections of this document.
Clicking INSPECTOR opens a dialog that shows the dataitem in JSON format.
Clicking the EXPORT button downloads the dataitem as a yaml file.
Update
You can update a dataitem's Metadata by pressing the EDIT button in the list or in the show page. All the Metadata values can be modified:
- name: the name of the dataitem
- description: a human readable description of the dataitem
- version: the version of the dataitem
- updated: the date of the last modification made to the dataitem
- labels: the labels of the dataitem
- path: the remote path where the dataitem is stored
Depending on the kind of the dataitem, there may also be a schema, which indicates that the dataitem points to a table.
Delete
You can delete a dataitem from the list or from the detail page by pressing the DELETE button. A dialog asking for confirmation is shown.
Managing data with SDK
Artifacts
Artifacts (ARTIFACT) are (binary) objects stored in one of the artifact stores of the platform, and available to every process, module and component as files (or data streams).
Artifacts can be created and managed as entities with the SDK CRUD methods. This can be done directly from the package or through the Project object.
To manage artifacts, you need to have the digitalhub_core layer installed.
In the first section, we will see how to create, read, update and delete artifacts.
In the second section, we will see what can be done with the Artifact object.
CRUD
An artifact entity can be managed with the following methods:
- new_artifact: create a new artifact
- get_artifact: get an artifact
- update_artifact: update an artifact
- delete_artifact: delete an artifact
- list_artifacts: list all artifacts
This can be done in two ways: directly through the SDK, or through the Project object.
Example:
import digitalhub as dh

project = dh.get_or_create_project("my-project")

# From library
artifact = dh.new_artifact(project="my-project",
                           name="my-artifact",
                           kind="artifact",
                           path="s3://my-bucket/my-artifact.ext")

# From project
artifact = project.new_artifact(name="my-artifact",
                                kind="artifact",
                                path="s3://my-bucket/my-artifact.ext")
The syntax is the same for all CRUD methods. The following sections describe how to create, read, update and delete an artifact, focusing on managing artifacts from the library. If you want to manage artifacts from the project, you can use the Project object and omit the project parameter.
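The relation between the two call styles can be pictured with a toy model in which the project-level method simply fills in the project parameter and delegates to the library-level function. This is only a sketch of the idea: new_entity and ToyProject are hypothetical stand-ins and do not reflect how the SDK is actually implemented.

```python
from dataclasses import dataclass

def new_entity(project: str, name: str, kind: str, path: str) -> dict:
    """Stand-in for a library-level function that needs the project name."""
    return {"project": project, "name": name, "kind": kind, "path": path}

@dataclass
class ToyProject:
    """Stand-in for a project object that already knows its own name."""
    name: str

    def new_entity(self, name: str, kind: str, path: str) -> dict:
        # The project supplies its own name, so the caller can omit it.
        return new_entity(project=self.name, name=name, kind=kind, path=path)

from_library = new_entity(project="my-project", name="my-artifact",
                          kind="artifact", path="s3://my-bucket/my-artifact.ext")
from_project = ToyProject("my-project").new_entity(name="my-artifact", kind="artifact",
                                                   path="s3://my-bucket/my-artifact.ext")
```

In this model both calls describe the same entity, which is why the remaining sections only show the library form.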
Create
To create an artifact you can use the new_artifact() method.
The mandatory parameters are:
- project: the project in which the artifact will be created
- name: the name of the artifact
- kind: the kind of the artifact
- path: the remote path where the artifact is stored
The optional parameters are:
- uuid: the uuid of the artifact (automatically generated if not provided). Must be a valid uuid v4.
- description: the description of the artifact
- source: the remote source of the artifact (git repository)
- labels: the labels of the artifact
- embedded: whether the artifact is embedded or not. If True, the artifact is embedded (all the spec details are expressed) in the project. If False, the artifact is not embedded in the project
- src_path: local path of the artifact, used in case of upload into remote storage
- kwargs: keyword arguments passed to the spec constructor
Example:
artifact = dh.new_artifact(project="my-project",
                           name="my-artifact",
                           kind="artifact",
                           path="s3://my-bucket/my-artifact.ext")
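Since a user-supplied uuid must be a valid uuid v4, it can be useful to check a candidate value before passing it on. The helper below is a sketch based on Python's standard uuid module; is_valid_uuid4 is a name chosen for this example and is not part of the SDK.

```python
import uuid

def is_valid_uuid4(value: str) -> bool:
    """Return True if value parses as a version-4 UUID."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Not a string, or not a well-formed UUID.
        return False
    return parsed.version == 4

# A freshly generated v4 identifier passes the check.
candidate = str(uuid.uuid4())
print(is_valid_uuid4(candidate))

# A name-based (version 5) UUID is well-formed but not v4.
other = str(uuid.uuid5(uuid.NAMESPACE_DNS, "example.org"))
print(is_valid_uuid4(other))
```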
Read
To read an artifact you can use the get_artifact() or import_artifact() methods. The first one searches for the artifact in the backend, the second one loads it from a local yaml file.
Get
The mandatory parameters are:
- project: the project the artifact belongs to
The optional parameters are:
- entity_name: to use the name of the artifact as identifier. It returns the latest version of the artifact
- entity_id: to use the uuid of the artifact as identifier. It returns the specified version of the artifact
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
artifact = dh.get_artifact(project="my-project",
                           entity_name="my-artifact")

artifact = dh.get_artifact(project="my-project",
                           entity_id="uuid-of-my-artifact")
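The difference between the two identifiers can be illustrated with a small in-memory model: looking up by name yields the most recent version, while looking up by id yields exactly the version requested. ToyRegistry and Version are hypothetical and stand in for the backend; treating the last stored version as the latest one is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Version:
    id: str
    name: str
    description: str

class ToyRegistry:
    """In-memory stand-in for the backend; versions keep insertion order."""

    def __init__(self) -> None:
        self._by_name: Dict[str, List[Version]] = {}

    def add(self, version: Version) -> None:
        self._by_name.setdefault(version.name, []).append(version)

    def get(self, entity_name: Optional[str] = None,
            entity_id: Optional[str] = None) -> Version:
        if entity_id is not None:
            # An id pins one specific version.
            for versions in self._by_name.values():
                for version in versions:
                    if version.id == entity_id:
                        return version
            raise KeyError(entity_id)
        if entity_name is not None:
            # A name resolves to the most recently stored version.
            return self._by_name[entity_name][-1]
        raise ValueError("provide entity_name or entity_id")

registry = ToyRegistry()
registry.add(Version("id-1", "my-artifact", "first"))
registry.add(Version("id-2", "my-artifact", "second"))

latest = registry.get(entity_name="my-artifact")
pinned = registry.get(entity_id="id-1")
```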
Import
The mandatory parameters are:
- file: file path to the artifact yaml
Example:
artifact = dh.import_artifact(file="./some-path/my-artifact.yaml")
Update
To update an artifact you can use the update_artifact() method.
The mandatory parameters are:
- artifact: artifact object to be updated
The optional parameters are:
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
artifact = dh.new_artifact(project="my-project",
                           name="my-artifact",
                           kind="artifact",
                           path="s3://my-bucket/my-artifact.ext")

artifact.metadata.description = "My new description"

artifact = dh.update_artifact(artifact=artifact)
Delete
To delete an artifact you can use the delete_artifact() method.
The mandatory parameters are:
- project: the project the artifact belongs to
The optional parameters are:
- entity_name: to use the name of the artifact as identifier
- entity_id: to use the uuid of the artifact as identifier
- delete_all_versions: if True, all versions of the artifact will be deleted
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
artifact = dh.new_artifact(project="my-project",
                           name="my-artifact",
                           kind="artifact",
                           path="s3://my-bucket/my-artifact.ext")

dh.delete_artifact(project="my-project",
                   entity_id=artifact.id)
List
To list all artifacts you can use the list_artifacts() method.
The mandatory parameters are:
- project: the project whose artifacts are listed
The optional parameters are:
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
artifacts = dh.list_artifacts(project="my-project")
Artifact object
The Artifact object is built using the new_artifact() method. There are several variations of the Artifact object based on the kind of the artifact. The SDK supports the following kinds:
- artifact: represents a generic artifact
For each kind, the Artifact object has a different set of methods and a different spec, status and metadata.
All the Artifact kinds have a save() and an export() method, to save the artifact entity into the backend or to export it locally as yaml.
To create a specific artifact, you must use the desired kind in the new_artifact() method.
Artifact
The artifact kind indicates that the artifact is a generic artifact.
There are no specific spec parameters.
The artifact kind has the following methods:
- as_file(): collects the artifact into a local temporary file
- download(): downloads the artifact into a specified path
- upload(): uploads the artifact to a specified path
As file
The as_file() method collects the artifact into a local temporary file. The file is not automatically deleted when the program ends.
The method returns the path of the downloaded artifact.
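Because the temporary file is not removed automatically, the caller is responsible for cleaning it up once it is no longer needed. The sketch below shows one way to do this with a try/finally block. fetch_to_temp_file is a hypothetical stand-in that mimics a function returning the path of a temporary file; it is not the SDK method itself.

```python
import os
import tempfile

def fetch_to_temp_file(content: bytes) -> str:
    """Stand-in for a call that materialises data in a temp file and returns its path."""
    fd, path = tempfile.mkstemp(suffix=".ext")
    with os.fdopen(fd, "wb") as handle:
        handle.write(content)
    return path

path = fetch_to_temp_file(b"example bytes")
try:
    # Work with the file while it exists.
    with open(path, "rb") as handle:
        data = handle.read()
finally:
    # Explicit cleanup, since nothing else removes the file.
    os.remove(path)
```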
Download
The download() method downloads the artifact into a specified path.
The method returns the path of the downloaded artifact.
The method accepts the following parameters:
- target: remote path of the artifact to be downloaded (e.g. s3://my-bucket/my-artifact.ext). By default, the spec path is used
- dst: local path where the artifact will be downloaded. By default, it is in the current working directory
- overwrite: if True, the target path will be overwritten if it already exists
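The effect of an overwrite flag can be sketched with a small local helper. What happens when the destination exists and overwrite is not set is not stated here; raising FileExistsError in that case is an assumption of this example, and write_download is a hypothetical function, not part of the SDK.

```python
import tempfile
from pathlib import Path

def write_download(dst: Path, content: bytes, overwrite: bool = False) -> Path:
    """Write content to dst, refusing to replace an existing file unless asked."""
    if dst.exists() and not overwrite:
        raise FileExistsError(f"{dst} already exists; pass overwrite=True to replace it")
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(content)
    return dst

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my-artifact.ext"
    write_download(target, b"v1")

    # A second write without the flag is refused in this sketch.
    try:
        write_download(target, b"v2")
        refused = False
    except FileExistsError:
        refused = True

    # With the flag set, the existing file is replaced.
    write_download(target, b"v2", overwrite=True)
    final_content = target.read_bytes()
```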
Upload
The upload() method uploads the artifact to a specified path.
The method returns the path of the uploaded artifact.
The method accepts the following parameters:
- source: local path of the artifact to be uploaded
- target: remote path of the artifact to be uploaded (e.g. s3://my-bucket/my-artifact.ext). By default, the spec path is used
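Remote paths such as s3://my-bucket/my-artifact.ext combine a bucket name and an object key. When a script needs the two parts separately, for instance to log where an upload will land, they can be split with the standard library as sketched below. split_s3_uri is a helper written for this example and is not provided by the SDK.

```python
from urllib.parse import urlsplit

def split_s3_uri(uri: str):
    """Split an s3:// URI into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # The path component starts with a slash that is not part of the key.
    return parts.netloc, parts.path.lstrip("/")

bucket, key = split_s3_uri("s3://my-bucket/my-artifact.ext")
print(bucket, key)
```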
Dataitems
Data items (DATAITEM) are data objects which contain a dataset of a given type, stored in an addressable repository and accessible to every component able to understand the type (kind) and the source (path). Do note that data items could be stored in the artifact store as artifacts, but that is not a dependency or a requirement.
Dataitems can be created and managed as entities with the SDK CRUD methods. This can be done directly from the package or through the Project object.
To manage dataitems, you need to have the digitalhub_data layer installed.
In the first section, we will see how to create, read, update and delete dataitems.
In the second section, we will see what can be done with the Dataitem object.
CRUD
A dataitem entity can be managed with the following methods:
- new_dataitem: create a new dataitem
- get_dataitem: get a dataitem
- update_dataitem: update a dataitem
- delete_dataitem: delete a dataitem
- list_dataitems: list all dataitems
This can be done in two ways: directly through the SDK, or through the Project object.
Example:
import digitalhub as dh

project = dh.get_or_create_project("my-project")

# From library
dataitem = dh.new_dataitem(project="my-project",
                           name="my-dataitem",
                           kind="dataitem",
                           path="s3://my-bucket/my-dataitem.ext")

# From project
dataitem = project.new_dataitem(name="my-dataitem",
                                kind="dataitem",
                                path="s3://my-bucket/my-dataitem.ext")
The syntax is the same for all CRUD methods. The following sections describe how to create, read, update and delete a dataitem, focusing on managing dataitems from the library. If you want to manage dataitems from the project, you can use the Project object and omit the project parameter.
Create
To create a dataitem you can use the new_dataitem() method.
The mandatory parameters are:
- project: the project in which the dataitem will be created
- name: the name of the dataitem
- kind: the kind of the dataitem
- path: the remote path where the dataitem is stored
The optional parameters are:
- uuid: the uuid of the dataitem (automatically generated if not provided). Must be a valid uuid v4.
- description: the description of the dataitem
- source: the remote source of the dataitem (git repository)
- labels: the labels of the dataitem
- embedded: whether the dataitem is embedded or not. If True, the dataitem is embedded (all the spec details are expressed) in the project. If False, the dataitem is not embedded in the project
- kwargs: keyword arguments passed to the spec constructor
Example:
dataitem = dh.new_dataitem(project="my-project",
                           name="my-dataitem",
                           kind="dataitem",
                           path="s3://my-bucket/my-dataitem.ext")
Read
To read a dataitem you can use the get_dataitem() or import_dataitem() methods. The first one searches for the dataitem in the backend, the second one loads it from a local yaml file.
Get
The mandatory parameters are:
- project: the project the dataitem belongs to
The optional parameters are:
- entity_name: to use the name of the dataitem as identifier. It returns the latest version of the dataitem
- entity_id: to use the uuid of the dataitem as identifier. It returns the specified version of the dataitem
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
dataitem = dh.get_dataitem(project="my-project",
                           entity_name="my-dataitem")

dataitem = dh.get_dataitem(project="my-project",
                           entity_id="uuid-of-my-dataitem")
Import
The mandatory parameters are:
- file: file path to the dataitem yaml
Example:
dataitem = dh.import_dataitem(file="./some-path/my-dataitem.yaml")
Update
To update a dataitem you can use the update_dataitem() method.
The mandatory parameters are:
- dataitem: dataitem object to be updated
The optional parameters are:
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
dataitem = dh.new_dataitem(project="my-project",
                           name="my-dataitem",
                           kind="dataitem",
                           path="s3://my-bucket/my-dataitem.ext")

dataitem.metadata.description = "My new description"

dataitem = dh.update_dataitem(dataitem=dataitem)
Delete
To delete a dataitem you can use the delete_dataitem() method.
The mandatory parameters are:
- project: the project the dataitem belongs to
The optional parameters are:
- entity_name: to use the name of the dataitem as identifier
- entity_id: to use the uuid of the dataitem as identifier
- delete_all_versions: if True, all versions of the dataitem will be deleted
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
dataitem = dh.new_dataitem(project="my-project",
                           name="my-dataitem",
                           kind="dataitem",
                           path="s3://my-bucket/my-dataitem.ext")

dh.delete_dataitem(project="my-project",
                   entity_id=dataitem.id)
List
To list all dataitems you can use the list_dataitems() method.
The mandatory parameters are:
- project: the project whose dataitems are listed
The optional parameters are:
- kwargs: keyword arguments passed to the client that communicates with the backend
Example:
dataitems = dh.list_dataitems(project="my-project")
Dataitem object
The Dataitem object is built using the new_dataitem() method. There are several variations of the Dataitem object based on the kind of the dataitem. The SDK supports the following kinds:
- dataitem: represents a generic dataitem
- table: represents a table dataitem
For each kind, the Dataitem object has a different set of methods and a different spec, status and metadata.
To create a specific dataitem, you must use the desired kind in the new_dataitem() method.
All the Dataitem kinds have a save() and an export() method, to save the dataitem entity into the backend or to export it locally as yaml.
Dataitem
The dataitem kind indicates that the dataitem is a generic dataitem.
There are no specific spec parameters and no specific methods exposed. It acts as a generic dataitem.
Table
The table kind indicates that the dataitem points to a table.
The optional spec parameters are:
- schema: the schema of the table in table_schema format
The table kind also has the following methods:
- as_df(): to collect the data in a pandas dataframe
- write_df(): to write the dataitem as parquet
Read table
The as_df() method returns the data in a pandas dataframe.
The method accepts the following parameters:
- format: the format of the data. If not provided, the format will be inferred from the file extension. Only parquet and csv are supported.
- kwargs: keyword arguments passed to the pandas read_parquet or read_csv method
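Inferring the format from the file extension can be pictured as in the sketch below, which accepts an explicit format or falls back to the path suffix and rejects anything other than parquet or csv. infer_format is a hypothetical helper written for this example; the SDK's own inference logic and error handling may differ.

```python
from pathlib import PurePosixPath
from typing import Optional

SUPPORTED_FORMATS = ("parquet", "csv")

def infer_format(path: str, fmt: Optional[str] = None) -> str:
    """Return the data format, taken from fmt or from the path extension."""
    if fmt is None:
        # "s3://my-bucket/data.parquet" -> ".parquet" -> "parquet"
        fmt = PurePosixPath(path).suffix.lstrip(".")
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {SUPPORTED_FORMATS}")
    return fmt

inferred = infer_format("s3://my-bucket/my-dataitem.parquet")
explicit = infer_format("s3://my-bucket/my-dataitem.bin", fmt="csv")
print(inferred, explicit)
```

An explicit format takes precedence over the extension in this sketch, which matches the description of the parameter as optional.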
Write table
The write_df() method writes the dataitem as parquet.
The method accepts the following parameters:
- target_path: the path of the target parquet file. If not provided, the target path will be created by the SDK and the dataitem will be stored in the default store
- kwargs: keyword arguments passed to the pandas to_parquet method