Process the data
Raw data, as ingested from the remote API, is usually not suitable for consumption. We'll define a set of functions to process it.
Define a function to derive the dataset, group information about spires (id, geolocation, address, name...) and save the result in the store:
%%writefile "src/process-spire.py"
from digitalhub_runtime_python import handler

# Columns that describe a spire (as opposed to its hourly measurements)
KEYS = ['codice spira', 'longitudine', 'latitudine',
        'Livello', 'tipologia', 'codice', 'codice arco',
        'codice via', 'Nome via', 'stato', 'direzione',
        'angolo', 'geopoint']

@handler(outputs=["dataset-spire"])
def process(project, di):
    df = di.as_df()
    # Keep one row per spire, taking the first value found in each column
    sdf = df.groupby(['codice spira']).first().reset_index()[KEYS]
    return sdf
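To see what the `groupby(...).first()` expression does, the following minimal sketch applies it to a small, made-up data frame. The values and the shortened column list (`toy_keys`) are illustrative assumptions, not real data. Each spire appears on several rows, one per day, and the result keeps a single row per spire.

```python
import pandas as pd

# Toy data: two spires, the first one measured on two different days.
# All values are invented for illustration only.
toy = pd.DataFrame({
    "data": ["2023-03-25", "2023-03-26", "2023-03-25"],
    "codice spira": ["A 1", "A 1", "B 2"],
    "longitudine": [11.37, 11.37, 11.34],
    "latitudine": [44.50, 44.50, 44.49],
    "Nome via": ["VIA EMILIA", "VIA EMILIA", "VIA ROMA"],
})

# Shortened, hypothetical stand-in for the KEYS list used in the function.
toy_keys = ["codice spira", "longitudine", "latitudine", "Nome via"]

# One row per spire: first() takes the first non-null value of each column.
spires = toy.groupby(["codice spira"]).first().reset_index()[toy_keys]
print(spires)
```

The `data` column is dropped by the final column selection, since it describes a measurement rather than the spire itself.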
Register the function in Core:
process_func = project.new_function(name="process-spire",
                                    kind="python",
                                    python_version="PYTHON3_10",
                                    code_src="src/process-spire.py",
                                    handler="process")
Run it locally:
process_run = process_func.run("job",
                               inputs={'di': dataset_di.key},
                               wait=True)
The result has been saved as an artifact in the data store:
spire_di = project.get_dataitem('dataset-spire')
spire_df = spire_di.as_df()
Now you can view the results with spire_df.head().
Let's transform the data. We will extract a new data frame, where each record contains the identifier of the spire and how much traffic it detected on a specific date and time slot.
A record that looks like this:
data | codice spira | 00:00-01:00 | 01:00-02:00 | ... | Nodo a | ordinanza | stato | codimpsem | direzione | angolo | longitudine | latitudine | geopoint | giorno settimana |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2023-03-25 | 0.127 3.88 4 1 | 90 | 58 | ... | 15108 | 4000/343434 | A | 125 | NO | 355.0 | 11.370234 | 44.509137 | 44.5091367043883, 11.3702339463537 | Sabato |
Will become 24 records, each containing the spire's code and the traffic recorded in one time slot on that date:
time | codice spira | value |
---|---|---|
2023-03-25 00:00 | 0.127 3.88 4 1 | 90 |
... | ... | ... |
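This wide-to-long reshaping can also be sketched with pandas' `melt`, shown here as an alternative illustration rather than the approach this tutorial follows. The toy frame below uses only two of the 24 hourly columns and invented values; the names `wide`, `slots` and `long_df` are assumptions made for the example.

```python
import pandas as pd

# Toy wide-format frame with two hourly columns (invented values).
wide = pd.DataFrame({
    "data": ["2023-03-25", "2023-03-26"],
    "codice spira": ["A 1", "A 1"],
    "00:00-01:00": [90, 70],
    "01:00-02:00": [58, 41],
})
slots = ["00:00-01:00", "01:00-02:00"]

# melt() turns each hourly column into its own set of rows.
long_df = wide.melt(id_vars=["data", "codice spira"], value_vars=slots,
                    var_name="slot", value_name="value")

# Build the timestamp string from the date and the start of the slot.
long_df["time"] = long_df["data"] + " " + long_df["slot"].str.split("-").str[0]
long_df = long_df[["time", "codice spira", "value"]]
print(long_df)
```

Each input row yields one output row per time slot, so two rows with two slots produce four records here, and a full row with 24 slots produces 24.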
Load the data item into a data frame and remove all columns except for date, spire identifier and recorded values for each time slot:
keys = ['00:00-01:00', '01:00-02:00', '02:00-03:00', '03:00-04:00', '04:00-05:00', '05:00-06:00', '06:00-07:00', '07:00-08:00', '08:00-09:00', '09:00-10:00', '10:00-11:00', '11:00-12:00', '12:00-13:00', '13:00-14:00', '14:00-15:00', '15:00-16:00', '16:00-17:00', '17:00-18:00', '18:00-19:00', '19:00-20:00', '20:00-21:00', '21:00-22:00', '22:00-23:00', '23:00-24:00']
columns=['data','codice spira'] + keys
rdf = dataset_df[columns]
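Derive the dataset of traffic recorded in each time slot for each spire:
import pandas as pd

ls = []
for key in keys:
    # Start of the time slot, e.g. '00:00' from '00:00-01:00'
    k = key.split("-")[0]
    # Copy the slice so that adding columns does not modify (or warn about) rdf
    xdf = rdf[['data', 'codice spira', key]].copy()
    xdf['time'] = xdf.data.apply(lambda x: x + ' ' + k)
    xdf['value'] = xdf[key]
    vdf = xdf[['time', 'codice spira', 'value']]
    ls.append(vdf)
# Stack the 24 per-slot frames into a single long-format frame
edf = pd.concat(ls)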
You can verify with edf.head() that the derived dataset matches our goal.
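Beyond looking at the first rows, a few automated checks can confirm that the reshaping behaved as intended. The sketch below is one possible way to do this: `check_long_format` is a hypothetical helper, not part of the tutorial or of any library, and the toy data is invented. It rebuilds a long-format frame with the same loop as above and checks the row count, the columns and the total traffic.

```python
import pandas as pd

def check_long_format(wide_df, long_df, slot_columns):
    """Hypothetical helper: basic consistency checks on the reshaped frame."""
    # Every input row should yield one output row per time slot.
    assert len(long_df) == len(wide_df) * len(slot_columns)
    # Only the three target columns should remain, in this order.
    assert list(long_df.columns) == ["time", "codice spira", "value"]
    # The total recorded traffic must be unchanged by the reshaping.
    assert long_df["value"].sum() == wide_df[slot_columns].sum().sum()
    return True

# Toy wide-format data with two hourly columns (invented values).
slots = ["00:00-01:00", "01:00-02:00"]
wide = pd.DataFrame({
    "data": ["2023-03-25", "2023-03-26"],
    "codice spira": ["A 1", "B 2"],
    "00:00-01:00": [90, 12],
    "01:00-02:00": [58, 7],
})

# Same reshaping loop as in the tutorial, applied to the toy data.
parts = []
for slot in slots:
    part = wide[["data", "codice spira", slot]].copy()
    part["time"] = part["data"] + " " + slot.split("-")[0]
    part["value"] = part[slot]
    parts.append(part[["time", "codice spira", "value"]])
long_df = pd.concat(parts)

ok = check_long_format(wide, long_df, slots)
print(ok)
```

With the variables defined in this section, the equivalent call would be `check_long_format(rdf, edf, keys)`.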
Let's put this into a function:
%%writefile "src/process-measures.py"
from digitalhub_runtime_python import handler
import pandas as pd

# Hourly time-slot columns of the raw dataset
KEYS = ['00:00-01:00', '01:00-02:00', '02:00-03:00', '03:00-04:00',
        '04:00-05:00', '05:00-06:00', '06:00-07:00', '07:00-08:00',
        '08:00-09:00', '09:00-10:00', '10:00-11:00', '11:00-12:00',
        '12:00-13:00', '13:00-14:00', '14:00-15:00', '15:00-16:00',
        '16:00-17:00', '17:00-18:00', '18:00-19:00', '19:00-20:00',
        '20:00-21:00', '21:00-22:00', '22:00-23:00', '23:00-24:00']
COLUMNS = ['data', 'codice spira']

@handler(outputs=["dataset-measures"])
def process(project, di):
    df = di.as_df()
    rdf = df[COLUMNS + KEYS]
    ls = []
    for key in KEYS:
        # Start of the time slot, e.g. '00:00' from '00:00-01:00'
        k = key.split("-")[0]
        # Copy the slice so that adding columns does not affect rdf
        xdf = rdf[COLUMNS + [key]].copy()
        xdf['time'] = xdf.data.apply(lambda x: x + ' ' + k)
        xdf['value'] = xdf[key]
        ls.append(xdf[['time', 'codice spira', 'value']])
    # Stack the 24 per-slot frames into a single long-format frame
    edf = pd.concat(ls)
    return edf
Register it:
process_measures_func = project.new_function(name="process-measures",
                                             kind="python",
                                             python_version="PYTHON3_10",
                                             code_src="src/process-measures.py",
                                             handler="process")
Run it locally:
process_measures_run = process_measures_func.run("job",
                                                 inputs={'di': dataset_di.key},
                                                 wait=True)
Inspect the resulting data artifact:
measures_di = project.get_dataitem('dataset-measures')
measures_df = measures_di.as_df()
measures_df.head()
Now that we have defined three functions to collect data, process it and extract information, let's put them in a pipeline.