DataDrift.report_drift

DataDrift.report_drift(psi_nbins=1000, psi_bin_min_pct=0.04, stat='psi', drift_missing=True, return_meta_ref=False, dim_threshold=5000)

Returns a report with the PSI or p-value of each feature across the two datasets (historical and current), and raises warnings when a PSI exceeds the predefined thresholds or a p-value does not exceed the significance level (alpha).

The function builds the data drift report by comparing the historical (reference) dataset with the current one. Drift is measured with the Population Stability Index (PSI) or with p-values, depending on the value of the stat parameter.
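For reference, the PSI is conventionally defined over bins (or categories) as shown below, where e_i is the share of the reference dataset falling in bin i, a_i is the share of the current dataset in the same bin, and B is the number of bins. This is the standard textbook definition. That the library applies exactly this form is an assumption, although it is consistent with the common_psi values reported in the examples on this page.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left(a_i - e_i\right) \ln\frac{a_i}{e_i}
```

A bin with identical shares in both datasets contributes zero, so the index grows only where the two distributions disagree.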

Parameters:

psi_nbins (int, optional) – Number of bins into which features are split for the PSI calculation. Default: 1000.
psi_bin_min_pct (float, optional) – Minimum share of observations in each bin (bucket). Default: 0.04 (4%).
stat (str, optional) – Statistic to use. Either "psi" (Population Stability Index) or "pval" (p-value from the Kolmogorov-Smirnov test for numerical features or from the Chi-square test for categorical features). Default: "psi".
drift_missing (bool, optional) – If True, the report also includes the drift of missing values. Default: True.
return_meta_ref (bool, optional) – If True, stores the reference metadata dictionary as a class attribute. Default: False.
dim_threshold (int, optional) – Maximum size of the test set considered significant for the Chi-square test. Default: 5000.
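To make the "pval" option for numerical features more concrete, here is a minimal standard-library sketch of a two-sample Kolmogorov-Smirnov test. The function name ks_two_sample, the asymptotic p-value approximation, and the small-sample correction are illustrative assumptions; the library most likely delegates to a statistics package and may return slightly different p-values.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_two_sample(reference, current, terms=100):
    """Return (D, approximate p-value) for a two-sample KS test.

    Hypothetical helper for illustration only, not the library's code.
    """
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    # D is the largest gap between the two empirical CDFs.
    d = max(
        abs(bisect_right(ref, x) / n - bisect_right(cur, x) / m)
        for x in ref + cur
    )
    # Asymptotic Kolmogorov approximation with a small-sample correction.
    n_eff = n * m / (n + m)
    lam = (sqrt(n_eff) + 0.12 + 0.11 / sqrt(n_eff)) * d
    if lam < 0.3:
        # The series converges slowly here and the p-value is ~1 anyway.
        return d, 1.0
    series = sum(
        (-1) ** (k - 1) * exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return d, min(1.0, max(0.0, 2.0 * series))


# A one-unit shift on ten points is not detectable...
d_small, p_small = ks_two_sample(range(1, 11), range(2, 12))
# ...while two non-overlapping samples clearly are.
d_large, p_large = ks_two_sample(range(1, 11), range(11, 21))
```

With a significance level of 0.05, only the second comparison would be flagged: its p-value falls well below alpha, whereas the first stays close to 1.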

Returns:

The report generated by the class, containing the data drift information for each feature. The DataFrame includes the PSI or p-value of each feature, any warnings and, if requested, the drift of missing values.

Return type:

pd.DataFrame

Note

If stat is set to "psi", the function computes the PSI for both numerical and categorical features.
If stat is set to "pval", the function runs the Kolmogorov-Smirnov test on numerical features and the Chi-square test on categorical ones.
Warnings are raised when a PSI exceeds the predefined maximum threshold or a p-value does not exceed the significance level (alpha).
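The warning rule described above can be sketched as a small decision function. Everything numeric here is an assumption: the page does not state the thresholds, so 0.1 and 0.2 are the common rule-of-thumb PSI cut-offs and 0.05 is a conventional alpha. The "Yellow Alert" label is hypothetical; only "Red Alert" and None appear in the sample reports on this page.

```python
def drift_warning(value, stat="psi", psi_mid=0.1, psi_max=0.2, alpha=0.05):
    """Map a PSI or p-value to a warning label (illustrative sketch only)."""
    if stat == "psi":
        if value > psi_max:
            return "Red Alert"
        if value > psi_mid:
            return "Yellow Alert"  # hypothetical intermediate label
        return None
    if stat == "pval":
        # A p-value that does not exceed alpha signals drift.
        return "Red Alert" if value <= alpha else None
    raise ValueError(f"unknown stat: {stat!r}")


labels = [drift_warning(1.220596), drift_warning(0.069315)]
```

Applied to the two PSI values in the sample reports, this rule gives the same labels as the warning column: "Red Alert" for 1.220596 and None for 0.069315.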

Example 1: Usage with pandas DataFrames

>>> import pandas as pd
>>> from model_monitoring.data_drift import DataDrift
>>> data_storico = pd.DataFrame({
...  'feature_num': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...  'feature_cat': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C']
...  })
>>> data_corrente = pd.DataFrame({
...  'feature_num': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11], # Slight shift
...  'feature_cat': ['B', 'A', 'C', 'B', 'A', 'C', 'D', 'E', 'C', 'A'] # Different distribution
...  })
>>> drift_detector = DataDrift(data_storico, data_corrente, type_data="data")
>>> report = drift_detector.report_drift(stat="psi", drift_missing=True)
>>> report

feature	common_psi	warning	total_psi	proportion_new_data	proportion_old-fashioned_data	validity_warning	drift_perc_missing
feature_cat	0.069315	None	2.510517	0.2	0.0	Red Alert - new categorical data	0.0
feature_num	1.220596	Red Alert	1.220596	0.0	0.0	Information - values above the upper bound in the new data	0.0
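The proportion columns for feature_cat can be checked by hand from the raw categories. The sketch below uses only the standard library. Reading proportion_new_data as the share of current rows whose category never occurs in the historical data, and proportion_old-fashioned_data as the share of historical rows whose category has disappeared, is an interpretation of the column names rather than something the page states.

```python
from collections import Counter

historical = ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C']
current = ['B', 'A', 'C', 'B', 'A', 'C', 'D', 'E', 'C', 'A']


def frequencies(values):
    """Relative frequency of each label in a list of categories."""
    counts = Counter(values)
    return {label: count / len(values) for label, count in counts.items()}


hist_freq = frequencies(historical)
curr_freq = frequencies(current)

# Categories seen only in the current data ('D' and 'E').
new_labels = sorted(set(curr_freq) - set(hist_freq))
proportion_new = sum(curr_freq[label] for label in new_labels)

# Categories seen only in the historical data (none in this example).
gone_labels = sorted(set(hist_freq) - set(curr_freq))
proportion_gone = sum(hist_freq[label] for label in gone_labels)
```

Under that interpretation, proportion_new comes out at 0.2 and proportion_gone at 0, matching the 0.2 and 0.0 shown in the report row for feature_cat.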

Example 2: Usage with metadata (simplified structure)

>>> from math import inf  # inf marks the open-ended bin bounds below
>>> meta_storico = {
    'feature_num': {'type': 'numerical',
      'min_val': 1,
      'max_val': 10,
      'not_missing_values': 10,
      'bin_0': {'min': -inf, 'max': 1.0, 'freq': 0.1},
      'bin_1': {'min': 1.0, 'max': 2.0, 'freq': 0.1},
      'bin_2': {'min': 2.0, 'max': 3.0, 'freq': 0.1},
      'bin_3': {'min': 3.0, 'max': 4.0, 'freq': 0.1},
      'bin_4': {'min': 4.0, 'max': 5.0, 'freq': 0.1},
      'bin_5': {'min': 5.0, 'max': 6.0, 'freq': 0.1},
      'bin_6': {'min': 6.0, 'max': 7.0, 'freq': 0.1},
      'bin_7': {'min': 7.0, 'max': 8.0, 'freq': 0.1},
      'bin_8': {'min': 8.0, 'max': 9.0, 'freq': 0.1},
      'bin_9': {'min': 9.0, 'max': inf, 'freq': 0.1},
      'missing_values': 0.0},
    'feature_cat': {'type': 'categorical',
      'not_missing_values': 10,
      'A': {'labels': ['A'], 'freq': 0.4},
      'B': {'labels': ['B'], 'freq': 0.3},
      'C': {'labels': ['C'], 'freq': 0.3},
      'missing_values': 0.0}}
>>> meta_corrente = {
    'feature_cat': {'type': 'categorical',
      'A': {'labels': ['A'], 'freq': 0.3},
      'B': {'labels': ['B'], 'freq': 0.2},
      'C': {'labels': ['C'], 'freq': 0.3},
      '_other_': {'labels': ['D', 'E'], 'freq': 0.2},
      'missing_values': 0.0,
      'not_missing_values': 10},
     'feature_num': {'type': 'numerical',
      'min_val': 2,
      'max_val': 11,
      'bin_0': {'min': -inf, 'max': 1.0, 'freq': 0.0},
      'bin_1': {'min': 1.0, 'max': 2.0, 'freq': 0.1},
      'bin_2': {'min': 2.0, 'max': 3.0, 'freq': 0.1},
      'bin_3': {'min': 3.0, 'max': 4.0, 'freq': 0.1},
      'bin_4': {'min': 4.0, 'max': 5.0, 'freq': 0.1},
      'bin_5': {'min': 5.0, 'max': 6.0, 'freq': 0.1},
      'bin_6': {'min': 6.0, 'max': 7.0, 'freq': 0.1},
      'bin_7': {'min': 7.0, 'max': 8.0, 'freq': 0.1},
      'bin_8': {'min': 8.0, 'max': 9.0, 'freq': 0.1},
      'bin_9': {'min': 9.0, 'max': inf, 'freq': 0.2},
      'missing_values': 0.0,
      'not_missing_values': 10}}
>>> drift_detector_meta = DataDrift(meta_storico, meta_corrente, type_data="metadata")
>>> report_meta = drift_detector_meta.report_drift(stat="psi", drift_missing=True)
>>> report_meta

feature	common_psi	warning	total_psi	proportion_new_data	proportion_old-fashioned_data	validity_warning	drift_perc_missing
feature_cat	0.069315	None	2.510517	0.2	0.0	Red Alert - new categorical data	0.0
feature_num	1.220596	Red Alert	1.220596	0.0	0.0	Information - values above the upper bound in the new data	0.0
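The PSI figures in this report can be reproduced from the freq values in the metadata with a few lines of standard-library Python. The psi helper below is a sketch, not the library's implementation. In particular, replacing zero frequencies with 1e-6 is an assumption that happens to reproduce the numbers above; the library's actual handling of empty bins may differ.

```python
from math import log


def psi(expected, actual, eps=1e-6):
    """PSI over aligned frequency lists; zeros are floored at eps (assumed)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * log(a / e)
    return total


# feature_cat: categories A, B, C are common to both datasets.
cat_ref = [0.4, 0.3, 0.3]
cat_cur = [0.3, 0.2, 0.3]
common_psi_cat = psi(cat_ref, cat_cur)

# Adding the '_other_' group (absent historically, 0.2 in current data).
total_psi_cat = psi(cat_ref + [0.0], cat_cur + [0.2])

# feature_num: ten bins, bin_0 empties out and bin_9 doubles.
num_ref = [0.1] * 10
num_cur = [0.0] + [0.1] * 8 + [0.2]
psi_num = psi(num_ref, num_cur)
```

Rounded to six decimals, the three results are 0.069315, 2.510517 and 1.220596. Most of the total_psi for feature_cat comes from the single new-category group, which is why it is so much larger than common_psi.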
