Scikit-learn ML scenario introduction
This scenario gives a quick overview of developing and deploying a scikit-learn machine learning application with the platform.
We will prepare data, train a generic model and expose it as a service. Access Jupyter from your Coder instance and create a new notebook. Alternatively, you can find the final notebook file for this scenario in the tutorial repository.
Set-up
First, import the necessary libraries and create a project to host the functions and executions:
import digitalhub as dh
project = dh.get_or_create_project("project-ml-ci")
Create folder for source code:
from pathlib import Path
Path("src").mkdir(exist_ok=True)
Generate data
Define the following function, which generates the dataset as required by the model:
%%writefile "src/data-prep.py"
import pandas as pd
from sklearn.datasets import load_breast_cancer
from digitalhub_runtime_python import handler
@handler(outputs=["dataset"])
def breast_cancer_generator():
    """
    A function which generates the breast cancer dataset
    """
    breast_cancer = load_breast_cancer()
    breast_cancer_dataset = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["target"])
    breast_cancer_dataset = pd.concat(
        [breast_cancer_dataset, breast_cancer_labels], axis=1
    )
    return breast_cancer_dataset
Register it:
data_gen_fn = project.new_function(name="data-prep",
                                   kind="python",
                                   python_version="PYTHON3_10",
                                   code_src="src/data-prep.py",
                                   handler="breast_cancer_generator")
Run it locally:
gen_data_run = data_gen_fn.run("job", local_execution=True)
You can view the state of the execution with gen_data_run.status or its output with gen_data_run.outputs(). You can see a few records from the output artifact:
gen_data_run.output("dataset").as_df().head()
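Before training, it can help to sanity-check the shape and class balance of the generated data. The following is a minimal sketch, assuming pandas and scikit-learn are installed in the notebook environment. It rebuilds the same table directly from load_breast_cancer, bypassing the platform run, so the variable names here (raw, features, labels, dataset) are illustrative and not part of the tutorial code.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the table the same way the data-prep function does,
# but without the platform handler (illustrative only).
raw = load_breast_cancer()
features = pd.DataFrame(data=raw.data, columns=raw.feature_names)
labels = pd.DataFrame(data=raw.target, columns=["target"])
dataset = pd.concat([features, labels], axis=1)

# Basic checks: size, label distribution, and missing values.
n_rows, n_cols = dataset.shape
class_counts = dataset["target"].value_counts().to_dict()
missing = int(dataset.isna().sum().sum())

print(f"{n_rows} rows, {n_cols} columns")
print(f"class counts: {class_counts}")
print(f"missing values: {missing}")
```

The same checks could be applied to the DataFrame returned by gen_data_run.output("dataset").as_df(), which should hold the feature columns plus a final target column.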
We will now proceed to training a model.