# Changelog

All notable changes to ds-toolkit are documented here. The format follows [Keep a Changelog](https://keepachangelog.com/), and this project adheres to [Semantic Versioning](https://semver.org/).
## Stage 1 — Data Understanding & Validation
- **DataProfiler** — one-call dataset summary: shape, dtypes, memory, missing %, cardinality, skew, kurtosis, per-column outlier flag
- **SchemaValidator** — Pydantic-backed schema enforcement with dtype, range, null, uniqueness, and regex checks; `strict=True` mode raises on the first violation
- **DistributionReport** — auto-generates histograms, KDE plots, QQ plots, box plots, and a correlation heatmap; exports self-contained HTML

## Stage 2 — Data Cleaning & Preprocessing
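The CV-safe fit/transform split used by the cleaning components in this stage can be illustrated with a toy median imputer (names and API are illustrative, not ds-toolkit's actual classes): statistics are learned on the training split only, so cross-validation folds never leak test information.

```python
import pandas as pd

class MedianImputer:
    """Sketch of a fit/transform imputer with no train/test leakage."""

    def fit(self, df: pd.DataFrame) -> "MedianImputer":
        # learn column medians from the training data only
        self.medians_ = df.median(numeric_only=True)
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # fill any split with statistics learned in fit()
        return df.fillna(self.medians_)

train = pd.DataFrame({"x": [1.0, 3.0, None, 5.0]})
test = pd.DataFrame({"x": [None, 2.0]})
filled = MedianImputer().fit(train).transform(test)
```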
- **MissingHandler** — 7 imputation strategies (mean, median, mode, constant, KNN, MICE, none); per-column overrides; CV-safe fit/transform split
- **OutlierDetector** — IQR, Z-score, Isolation Forest, and LOF detection; flag/cap/drop actions per column
- **TypeCaster** — auto datetime parsing, int64/float64 downcasting, low-cardinality → category conversion; full change log
- **Deduplicator** — exact dedup on key columns plus fuzzy dedup via rapidfuzz

## Stage 3 — Feature Engineering
- **EncoderFactory** — auto-selects OHE / OrdinalEncoder / TargetEncoder (smoothed, CV-safe) / HashingEncoder by cardinality and task
- **DatetimeDecomposer** — year/month/day/dow/quarter/week + sin/cos cyclical encodings + holiday flag; auto-detects datetime columns
- **InteractionBuilder** — product (A×B), ratio (A/B), and polynomial interactions; variance pruning; optional RF-based top-k selection
- **FeatureSelector** — 4-stage pipeline: variance threshold → correlation filter → RFECV → SHAP; full per-feature drop report
- **Scaler** — standard / minmax / robust; auto-detects numeric columns; excludes booleans; `inverse_transform` support

## Stage 4 — Model Training & Selection
- **ModelRegistry** — catalogue of pre-configured estimators for classification and regression tasks; optional boosting libraries auto-detected
- **CVHarness** — stratified/KFold/TimeSeriesSplit CV across multiple models; automatic class weighting on imbalanced data; ranked summary
- **TunerOptuna** — Optuna hyperparameter search with pre-built search spaces for LR, Ridge, RF, ExtraTrees, GBM, XGBoost, and LightGBM
- **EnsembleBuilder** — stacking (configurable meta-learner), soft voting, and weighted blending (CV-derived weights)

## Stage 5 — Evaluation & Diagnostics
- **MetricsReport** — auto-detects classification vs regression; full suite of metrics for each task; tidy DataFrame output
- **ExplainerSHAP** — auto-selects TreeExplainer / KernelExplainer; summary, bar, waterfall, and dependence plots
- **DiagnosticPlotter** — confusion matrix, ROC, PR curve, calibration (classification); residuals, Q-Q, scale-location, Cook’s distance (regression)
- **ErrorAnalyser** — worst-prediction cohort segmentation; feature distribution shift detection; worst-vs-rest comparison

## Stage 6 — Experiment Tracking & Reproducibility
- **ExperimentLogger** — MLflow context manager; auto-logs params, metrics, model, SHAP plots, requirements.txt, and git hash
- **ConfigManager** — YAML/JSON loader with ${ENV_VAR} override syntax; dot-access; required-key validation; version stamp
- **PipelineSerialiser** — versioned .pkl with SHA-256 checksum + metadata sidecar; raises ChecksumError on tampered files

## Stage 7 — Reporting & Notebook Output
- **NotebookReporter** — metric cards, ranked model table, and SHAP bar chart, all inline in Jupyter
- **HTMLExporter** — single self-contained HTML export (base64-embedded images); safe to email
- **ModelCard** — Mitchell et al. model-card format; .display() / .to_html() / .to_md() outputs; generate_model_card() convenience function

## Infrastructure
- pyproject.toml with optional dependency groups: boosting, fuzzy, explain, tune, track, all, dev, docs
- ruff + black + mypy toolchain configuration
- docs/ MkDocs site with per-module API reference