Collect the data
Create a new folder to store the function's code in:
import os

new_folder = 'src'
if not os.path.exists(new_folder):
    os.makedirs(new_folder)
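As a side note, os.makedirs also accepts exist_ok=True, which makes the call idempotent and replaces the explicit existence check. A minimal sketch (the temporary base directory is hypothetical, used only to keep the example self-contained):

```python
import os
import tempfile

# Hypothetical temporary base directory, only so this sketch is self-contained
base = tempfile.mkdtemp()
new_folder = os.path.join(base, 'src')

# exist_ok=True makes the call safe to repeat, replacing the os.path.exists check
os.makedirs(new_folder, exist_ok=True)
os.makedirs(new_folder, exist_ok=True)  # no error on the second call
```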
Define a function that downloads the data as-is and persists it in the data lake:
%%writefile "src/download-data.py"
import mlrun


@mlrun.handler(outputs=["dataset"])
def downloader(context, url: mlrun.DataItem):
    # read the raw semicolon-separated CSV and return it as a dataset artifact
    df = url.as_df(format='csv', sep=";")
    return df
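To see what the sep=";" option does outside of MLRun, here is a plain-pandas sketch with a hypothetical semicolon-separated sample (the real dataset's columns will differ):

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample; the actual dataset's columns differ
raw = "name;value\nalpha;1\nbeta;2\n"

# This mirrors what url.as_df(format='csv', sep=';') does:
# parse the text with pandas using ';' as the field separator
df = pd.read_csv(io.StringIO(raw), sep=";")
print(df.shape)  # (2, 2)
```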
Register the function in MLRun:
project.set_function("src/download-data.py", name="download-data", kind="job", image="mlrun/mlrun", handler="downloader")
Then, execute it (locally) as a test. Note that it may take a few minutes.
project.run_function("download-data", inputs={'url':URL}, local=True)
The result is saved as an artifact in the data store, versioned and addressable with a unique key. By default, this key follows the format <function-name>-<handler>_<output>.
Store this key in a variable so you can read the artifact:
DF_KEY = 'store://datasets/demo-etl/download-data-downloader_dataset'
Load the data item and convert it into a data frame:
di = mlrun.get_dataitem(DF_KEY)
df = di.as_df()
Run df.head(); if it prints a few records, you can confirm that the data was stored properly. It's time to process this data.