d:\workspace\repos\deepfix\.venv\Lib\site-packages\deepchecks\core\serialization\dataframe\html.py:16: UserWarning:
pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
No categorical features provided, will automatically detect them. (Not Recommended)
#train_data.data.head()
# Fit model
model_name = "HistGradientBoostingClassifier"
clf = HistGradientBoostingClassifier(max_depth=3)
clf = clf.fit(train_data.X, train_data.y)
FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
deepchecks - WARNING - Could not find built-in feature importance on the model, using permutation feature importance calculation instead
deepchecks - INFO - Calculating permutation feature importance. Expected to finish in 15 seconds
Downloading artifacts: 0%| | 0/1 [00:00<?, ?it/s]
✓ Analysis complete!
# Visualize results
result.to_text(verbose=False)
DEEPFIX ANALYSIS RESULT
Summary
The cross-artifact analysis reveals a concerning disconnect between excellent current model performance (AUC >0.99)
and significant underlying data quality and configuration issues. The Deepchecks analysis identified severe
multicollinearity (27 feature pairs with correlation >0.9) and potential data leakage (3 features with PPS >0.7),
while the ModelCheckpoint analysis shows restrictive model parameters (max_depth=3) with no regularization and
critical missing deployment metadata. These issues are compounded by non-deterministic configuration and unused
high-variance features. Despite the current strong performance, the model appears to be leveraging redundant feature
relationships in a potentially unsustainable way. Immediate actions should focus on feature selection to address
multicollinearity, model reconfiguration with proper regularization, and ensuring complete production metadata
before deployment.
Summary Statistics
Metric                 Value
Total Findings         4
Severity Distribution  HIGH: 2  MEDIUM: 2
HIGH Severity Issues (2)

1. Finding: Severe multicollinearity and potential data leakage in features
      Evidence: 27 feature pairs with correlation >0.9 identified by Deepchecks, combined with 3 features
      exceeding 0.7 predictive power score threshold, suggesting redundant features and possible label
      information leakage into features
   Action: Implement comprehensive feature selection to remove redundant features and investigate
   high-PPS features for data leakage
      High correlation causes multicollinearity issues while high PPS scores may indicate improper data
      separation or target information leakage

2. Finding: Critical production readiness gaps in model deployment metadata
      Evidence: Missing classes_, feature_names_in_, n_iter_, and training metrics from checkpoint,
      combined with non-deterministic random_state=None configuration
   Action: Include complete model metadata (classes_, feature_names_in_, training metrics) and set fixed
   random_state before production deployment
      Essential metadata is required for correct inference, model interpretation, and reproducible
      results in production environments
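
The first HIGH finding recommends removing redundant features. As a rough illustration of what that could look like, the sketch below lists numeric column pairs whose absolute Pearson correlation exceeds 0.9 and greedily drops the second column of each pair. The helper names `highly_correlated_pairs` and `drop_redundant` are hypothetical, the toy DataFrame is made up, and this is not how Deepchecks or DeepFix computes its own counts; PPS-based leakage checks are not covered.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (feature_a, feature_b, |corr|) for numeric column pairs above threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is reported once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) never pass this comparison.
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, float(v)) for (a, b), v in pairs.items()]


def drop_redundant(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Greedy heuristic: drop the second feature of every pair above threshold."""
    to_drop = {b for _, b, _ in highly_correlated_pairs(df, threshold)}
    return df.drop(columns=sorted(to_drop))


# Toy data (hypothetical): x_dup is nearly a scaled copy of x, z is unrelated.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_dup": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    "z": [3.0, -1.0, 4.0, -1.0, 5.0, -9.0],
})
print(highly_correlated_pairs(toy))
print(list(drop_redundant(toy).columns))
```

Which member of a correlated pair to keep is a modelling decision; dropping the later column is only a convenient default for the sketch.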
MEDIUM Severity Issues (2)

1. Finding: Suboptimal model configuration with restrictive parameters and missing regularization
      Evidence: Model uses max_depth=3 with no L2 regularization (l2_regularization=0.0) while ignoring
      16 high-variance features, creating underfitting risk despite current excellent performance
   Action: Increase model capacity (max_depth=6-10) and add moderate regularization (L2=0.1-1.0) while
   implementing proper feature selection
      More flexible model architecture with proper regularization can better capture complex patterns
      while preventing overfitting on the correlated feature set

2. Finding: Excellent current performance masks underlying data quality and configuration issues
      Evidence: AUC >0.99 with minimal drift despite high feature correlation, restrictive model
      parameters, and missing regularization - suggesting the model is leveraging redundant feature
      relationships
   Action: Address fundamental data quality and model configuration issues before relying on current
   performance metrics for production decisions
      Current excellent performance may be unsustainable and mask underlying problems that could cause
      failures when data distributions shift or features change