Who is this for? This document is written for anyone - grower, student, analyst, developer, or stakeholder - with zero prior machine learning knowledge. It explains the full notebook in plain language first, then adds deeper technical details with direct mapping to notebook variables and outputs.
Disease progression in greenhouse tomatoes is not random. Environmental conditions like humidity, leaf wetness, airflow, temperature, and crop stage influence how quickly diseases appear and spread.
The operational challenge is timing: by the time damage is visible, the best intervention window has often passed. This notebook turns raw hourly greenhouse data into actionable foresight, so instead of reacting after visible damage, teams can make proactive control decisions.
For every timestamp window, the pipeline produces disease-wise predictions.
| Output | Type | Plain-English Meaning |
|---|---|---|
| Presence flags | Multi-label classification | “Is disease X active right now?” (yes/no per disease) |
| Future severity (24h) | Multi-output regression | “What infection percentage will disease X have in 24 hours?” |
| Future trend (24h) | Rule-derived class | “Will disease X be absent, emerging, reducing, stable, or worsening?” |
Important: trend is derived from current severity and predicted future severity, not trained as a separate neural-network softmax output.
The notebook works over the disease set found in the synthetic disease progression dataset (for example, early blight, late blight, leaf mold, septoria leaf spot, and spider mites).
At runtime, diseases are discovered directly from the dataset and encoded into disease-specific column groups:
- `PRES_COLS` for current presence columns
- `SEV_COLS` for current severity columns
- `future_severity_24h__*` for 24-hour future targets
- `trend_24h__*` for trend labels

This dynamic approach means the pipeline can adapt if the disease list changes, as long as source columns remain consistent.
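As a rough illustration of this discovery step, here is a minimal sketch that derives the column groups from wide-format column names (assumed helper logic, not the notebook's exact code; the tiny DataFrame is illustrative only):

```python
import pandas as pd

# Minimal sketch: derive disease column groups from a pivoted wide frame.
df = pd.DataFrame({
    "disease_present__early_blight": [1],
    "disease_present__late_blight": [0],
    "current_infection_pct__early_blight": [3.1],
    "current_infection_pct__late_blight": [0.0],
})

PRES_COLS = sorted(c for c in df.columns if c.startswith("disease_present__"))
SEV_COLS = sorted(c for c in df.columns if c.startswith("current_infection_pct__"))
DISEASES = [c.replace("disease_present__", "") for c in PRES_COLS]
print(DISEASES)  # ['early_blight', 'late_blight']
```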
Primary dataset path:
data/processed/Disease Progression/tomato_disease_progression_synthetic_hourly.csv
The source is synthetic but physically plausible hourly greenhouse data in long format.
One timestamp has multiple rows, one per disease.
Example rows:
- 2025-01-01 10:00, disease=early_blight, current_infection_pct=3.1
- 2025-01-01 10:00, disease=late_blight, current_infection_pct=0.0

After pivot, one timestamp becomes one row with disease-specific columns.
Example shape concept:
- current_infection_pct__early_blight=3.1
- current_infection_pct__late_blight=0.0
- disease_present__early_blight=1
- disease_present__late_blight=0

Why this transformation matters: machine learning models require fixed-size feature vectors per sample.
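A minimal pandas sketch of this long-to-wide pivot (column names follow the examples above; the notebook's actual code may differ):

```python
import pandas as pd

# Long format: one row per (timestamp, disease).
long_df = pd.DataFrame({
    "timestamp": ["2025-01-01 10:00", "2025-01-01 10:00"],
    "disease": ["early_blight", "late_blight"],
    "current_infection_pct": [3.1, 0.0],
})

# Wide format: one row per timestamp, one column per disease.
wide = long_df.pivot(index="timestamp", columns="disease",
                     values="current_infection_pct")
wide.columns = [f"current_infection_pct__{d}" for d in wide.columns]
print(wide.iloc[0].to_dict())
# {'current_infection_pct__early_blight': 3.1,
#  'current_infection_pct__late_blight': 0.0}
```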
The notebook has Sections 0 to 27. Each section has a clear purpose.
What it does:
- Creates a unique `RUN_ID`
- Builds the run `CONFIG`

Key configuration values:
- `history_window = 24`
- `forecast_horizon = 24`
- `severity_delta = 3.0`
- `severity_floor = 0.5`
- `presence_threshold = 0.5`

Why it matters: reproducibility and clean experiment tracking without artifact overwrites.
What it does:
- Sets global random seeds (`SEED`)
- Imports the core libraries (pandas, numpy, scikit-learn, tensorflow, plotting)

Why it matters: makes runs reproducible and comparable.
What it does:
Why it matters: sequence learning depends on correct time order.
What it does:
Why it matters: catches malformed data before expensive training.
What it does:
- Builds `PRES_COLS`, `SEV_COLS`, and disease list metadata

Why it matters: creates stable tabular structure for baseline and sequence paths.
What it does:
- Defines `compute_trend(...)`
- Applies `severity_floor` to avoid low-value rounding confusion

Why it matters: this is where supervised labels are created.
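A minimal sketch of such a floor-aware rule, written to be consistent with the worked examples below (the notebook's exact implementation may differ in detail):

```python
def compute_trend(current: float, future: float,
                  delta: float = 3.0, floor: float = 0.5) -> str:
    """Derive a trend label from current and predicted 24h severity."""
    if current < floor and future < floor:
        return "absent"      # effectively zero now and in 24h
    if current < floor and future >= floor:
        return "emerging"    # crosses the floor within 24h
    if current - future > delta:
        return "reducing"    # meaningful drop
    if future - current > delta:
        return "worsening"   # meaningful rise
    return "stable"          # change stays within +/- delta
```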
Rule behavior example with delta=3.0, floor=0.5:
- current=0.2, future=0.1 -> absent
- current=0.0, future=2.0 -> emerging
- current=15, future=10 -> reducing
- current=12, future=13 -> stable
- current=8, future=14 -> worsening

What it does:
Why it matters: prevents leakage and ensures all model inputs are numeric.
What it does:
Why it matters: avoids runtime fitting errors.
What it does:
Why it matters: simulates real forecasting where future cannot influence past training.
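A minimal sketch of a chronological split (the split fractions here are assumptions; the notebook defines its own ratios):

```python
# df must already be sorted by timestamp (Section 2 guarantees this).
def chrono_split(df, train_frac=0.7, val_frac=0.15):
    n = len(df)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]
```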
What it does:
Why it matters: provides fair temporal context for tabular baselines.
What it does:
Why it matters: strong non-deep benchmark for multi-label classification.
What it does:
Why it matters: stable baseline for regression and physically valid outputs.
What it does:
Why it matters: keeps trend interpretation consistent with severity behavior.
What it does:
Why it matters: standardized evaluation across all model families.
What they do: run full baseline metrics on validation and test splits.
Why they matter: baseline context is required to interpret deep model gains.
What it does:
- Builds tensors shaped `samples x time_steps x features`

Why it matters: LSTM/GRU need sequence-shaped inputs.
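A minimal sliding-window sketch of this reshaping (shapes are illustrative; names are not the notebook's):

```python
import numpy as np

def make_sequences(X: np.ndarray, y: np.ndarray, window: int = 24):
    """Stack the past `window` rows of features for each target row."""
    xs, ys = [], []
    for t in range(window, len(X)):
        xs.append(X[t - window:t])  # past 24 hourly feature rows
        ys.append(y[t])             # targets aligned to the window end
    return np.asarray(xs), np.asarray(ys)

X = np.random.rand(100, 92)  # illustrative: 100 hours x 92 features
y = np.random.rand(100, 5)   # illustrative: 5 disease targets
X_seq, y_seq = make_sequences(X, y)
print(X_seq.shape, y_seq.shape)  # (76, 24, 92) (76, 5)
```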
What it does:
Why it matters: stable optimization with no validation/test leakage.
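A minimal sketch of leak-free scaling (fit on training data only, then reuse the fitted statistics everywhere; shapes are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(70, 24, 92)  # illustrative shapes only
X_val = np.random.rand(15, 24, 92)

n_feats = X_train.shape[-1]
scaler = StandardScaler().fit(X_train.reshape(-1, n_feats))  # train stats only
X_train_s = scaler.transform(X_train.reshape(-1, n_feats)).reshape(X_train.shape)
X_val_s = scaler.transform(X_val.reshape(-1, n_feats)).reshape(X_val.shape)
```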
What it does:
- `presence_head` for multi-label disease presence
- `future_head` for future severity regression

Why it matters: one shared temporal encoder supports both tasks.
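A minimal Keras sketch of the shared-encoder, two-head idea (layer sizes and dimensions are assumptions; the head names match the notebook's):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_diseases, window, n_features = 5, 24, 92  # illustrative dimensions

inputs = tf.keras.Input(shape=(window, n_features))
x = layers.LSTM(64)(inputs)  # shared temporal encoder
presence = layers.Dense(n_diseases, activation="sigmoid",
                        name="presence_head")(x)  # multi-label probabilities
future = layers.Dense(n_diseases, name="future_head")(x)  # 24h severity
model = tf.keras.Model(inputs, [presence, future])
model.compile(optimizer="adam",
              loss={"presence_head": "binary_crossentropy",
                    "future_head": "mse"})
```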
What it does:
Why it matters: captures best validation epoch and reduces overfitting.
What it does:
Why it matters: automatic, reproducible model-family selection.
What it does:
Why it matters: fast visual diagnostics for convergence and overfitting.
What it does:
Important enhancement included: a validation-tuned blend of LSTM and baseline future-severity predictions (detailed in the blending section below).
Blend interpretation:
- `alpha = 1.0` -> pure LSTM future prediction
- `alpha = 0.0` -> pure baseline future prediction
- `0 < alpha < 1` -> weighted hybrid

What it does:
Why it matters: reveals calibration, spread, and regression bias patterns.
What it does:
Why it matters: stakeholder-friendly decision summary.
What it does:
Why it matters: mirrors deployment-style inference workflow.
What it does:
Why it matters: reproducibility and production handoff.
What it does:
Why it matters: converts model output into practical agronomy narratives.
Presence baseline pattern:
```python
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor

presence_baseline = MultiOutputClassifier(
    HistGradientBoostingClassifier(
        learning_rate=0.05,
        max_depth=6,
        max_iter=200,
        random_state=SEED,
    )
)
```
Future severity baseline pattern:
```python
future_sev_baseline = MultiOutputRegressor(
    HistGradientBoostingRegressor(
        learning_rate=0.05,
        max_depth=6,
        max_iter=250,
        random_state=SEED,
    )
)
```
Interpretation:
- Predictions are clipped to `[0, 100]` for physical realism.

Sequence models process the past 24 hourly steps and learn temporal dependencies.
Intuition:
Two-head concept:
- `presence_head`: sigmoid probabilities for disease presence
- `future_head`: regression-style output for next-24h severity

Compile logic in the notebook uses:
```python
loss = {
    "presence_head": "binary_crossentropy",
    "future_head": "mse",
}
```
Meaning:

- `binary_crossentropy` fits the multi-label presence task (independent yes/no per disease)
- `mse` fits the future-severity regression task
The training uses callbacks such as early stopping and best-checkpoint saving.
These improve stability and ensure best checkpoints (best_lstm_*.keras, best_gru_*.keras) are preserved.
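One plausible configuration consistent with that behavior (argument values are assumptions; only the checkpoint filename pattern is taken from the artifact list):

```python
import tensorflow as tf

RUN_ID = "demo"  # placeholder; the notebook generates its own RUN_ID
callbacks = [
    # Stop when validation loss stalls and keep the best-epoch weights.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Persist the best checkpoint, e.g. best_lstm_<RUN_ID>.keras.
    tf.keras.callbacks.ModelCheckpoint(f"best_lstm_{RUN_ID}.keras",
                                       monitor="val_loss",
                                       save_best_only=True),
]
```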
Trend is derived from (current_severity, predicted_future_severity) and rule thresholds.
Main controls:
- `severity_delta = 3.0`
- `severity_floor = 0.5`

Why floor is critical:
- Tiny non-zero severities can render as `0.0` in display.
- Without a floor, negligible noise near zero could be mislabeled `emerging`.

With floor-aware logic:
- absent
- emerging
- reducing
- stable
- worsening

Example decisions:
- current=0.00, future=0.20 -> absent
- current=0.00, future=1.80 -> emerging
- current=22.0, future=17.0 -> reducing
- current=10.0, future=11.0 -> stable
- current=7.0, future=13.0 -> worsening

R2 interpretation:

- 1.0 -> perfect
- 0.0 -> equal to mean predictor
- < 0.0 -> worse than mean predictor

If future R2 is negative, future-severity predictions are noisier than a simple average baseline.
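For reference, R2 is the standard coefficient of determination:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]

so a model that always predicts the mean of the targets scores exactly 0.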
The notebook now includes validation-tuned blending during evaluation:
\[\hat{y}_{blend} = \alpha\,\hat{y}_{LSTM} + (1-\alpha)\,\hat{y}_{baseline}\]

Workflow:
- Sweep `alpha` on the validation set for the best mean R2.

Key point: no LSTM retraining is required, so checkpoint behavior and training loss dynamics remain unchanged.
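A minimal sketch of that sweep (grid resolution and variable names are assumptions, not the notebook's exact code):

```python
import numpy as np
from sklearn.metrics import r2_score

def tune_alpha(y_val, pred_lstm, pred_base, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the blend weight that maximizes mean R2 on the validation set."""
    best_alpha, best_r2 = 0.0, -np.inf
    for a in grid:
        blended = a * pred_lstm + (1 - a) * pred_base
        score = r2_score(y_val, blended)  # averages R2 across outputs
        if score > best_r2:
            best_alpha, best_r2 = a, score
    return best_alpha, best_r2
```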
Artifacts are saved under:
src/agritwin_gh/models/artifacts/disease_progression_<RUN_ID>/
| File | What it contains |
|---|---|
| `baseline_presence_hgb.joblib` | Baseline presence model |
| `baseline_future_severity_hgb.joblib` | Baseline future-severity model |
| `best_lstm_<RUN_ID>.keras` | Best LSTM checkpoint |
| `best_gru_<RUN_ID>.keras` | Best GRU checkpoint |
| `lstm_disease_progression_best.keras` | Exported best LSTM model |
| `gru_disease_progression_best.keras` | Exported best GRU model |
| `sequence_feature_scaler.joblib` | Sequence feature scaler |
| `deep_model_metrics.json` | Detailed deep-model metrics |
| `model_comparison.csv` | Baseline/LSTM/GRU summary comparison |
| `config.json` | Run configuration snapshot |
| `run_<RUN_ID>.log` | Run log |
| File | What it shows |
|---|---|
| `lstm_training_curves.png` | LSTM training metrics over epochs |
| `gru_training_curves.png` | GRU training metrics over epochs |
| `lstm_roc_pr_curves.png` | Presence ROC/PR diagnostics |
| `lstm_future_residual_histograms.png` | Future-severity residual distribution |
| `lstm_future_parity_plots.png` | Predicted vs actual future severity |
src/agritwin_gh/models/disease_progression_<RUN_ID>.keras
From the project root:
```powershell
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
Open `notebooks/tomato_disease_progression.ipynb` and select the virtual environment (`.venv`) as the kernel.

```text
Hourly disease CSV (long format)
        |
        v
[Section 2] Load + sort + parse timestamps
        |
        v
[Section 3] Sanity checks
        |
        v
[Section 4] Pivot long -> wide disease matrix
        |
        v
[Section 5] Create 24h future targets + trend labels
        |
        v
[Section 6-7] Encode and validate feature matrix
        |
        v
[Section 8] Chronological split (train/val/test)
        |
        +--> [Sections 9-15] Baseline models + baseline evaluation
        |
        +--> [Sections 16-22] Sequence tensors + LSTM/GRU train/eval
        |
        v
[Section 23] Deep diagnostics plots
        |
        v
[Section 24] Model comparison summary
        |
        v
[Section 25] Real-time inference demo
        |
        v
[Section 26] Export models, scaler, metrics, config
        |
        v
[Section 27] Scenario simulation for practical interpretation
```
Q: Why are there two model families (baseline and deep)? A: Baselines provide a strong reference. Deep models must beat or complement them to justify complexity.
Q: Why is the split chronological instead of random? A: Random split leaks future patterns into training. Chronological split is the realistic forecasting setup.
Q: Why can trend still show “emerging” when values look near zero?
A: Display rounding can hide small non-zero values. The severity_floor is used to prevent misleading low-noise transitions.
Q: Is the R2 improvement a retraining trick? A: No. It is an evaluation-time calibrated blend of future predictions, selected on validation data.
Q: Which model should be deployed? A: The notebook logs and exports the selected global best checkpoint and supporting config/metrics.
Q: Can this run on CPU only? A: Yes. Training is slower but still valid.
| Term | Plain-English Definition |
|---|---|
| Baseline model | A simpler model used for comparison against complex models |
| Brier score | A metric for probability calibration; lower is better |
| Chronological split | Train/validation/test split by time order |
| Deep learning | Neural-network-based machine learning |
| Feature | Input variable used by model |
| Forecast horizon | How far ahead prediction is made (24h here) |
| GRU | Gated Recurrent Unit, a recurrent sequence model |
| HistGradientBoosting | Tree ensemble method used for strong tabular baselines |
| LSTM | Long Short-Term Memory recurrent model |
| MAE | Mean absolute error |
| Multi-label classification | Predicting multiple yes/no labels simultaneously |
| Multi-output regression | Predicting multiple numeric outputs simultaneously |
| PR-AUC | Precision-recall area under curve |
| ROC-AUC | Receiver operating characteristic area under curve |
| R2 | Coefficient of determination; measures regression goodness of fit |
| Severity delta | Minimum change required to call trend reducing/worsening |
| Severity floor | Value below which severity is treated as effectively zero for trend logic |
| Time window | Fixed history length used for sequence inputs |
| Trend class | Derived label: absent, emerging, reducing, stable, worsening |
If you are learning this topic for the first time, follow this order.
Outcome: you can interpret predictions and make action decisions confidently.
Outcome: you can explain how the notebook works end to end and why each metric matters.
Outcome: you can safely modify, rerun, and ship the pipeline.
Use this map when you are confused about where to look next.
If you do not understand trend labels: Read Section 7 first, then the Section 5 target construction, then Section 17.
If you do not understand why R2 changed: Read Section 8 (R2 meaning), then Section 9 (blend logic), then the Section 22 notes in Section 5.
If model outputs feel contradictory: Check the Section 17 checklist, then the Section 13 FAQ, then the exported plots in Section 10.
If training and inference behavior feel different: Read the Section 6.4 callbacks, the Section 9 blend workflow, and the Section 26 export notes in Section 5.
If you are unsure what to trust for reporting: Use Section 8 metrics definitions and Section 10 artifact files as the source of truth.
Use this 6-step checklist every time you read an inference output row.
- Check the trend label against the `severity_delta` and `severity_floor` rules.

Mini example:
- current=4.0, future=12.0 -> worsening (the rise exceeds delta)

Interpretation: risk is rising and action should be planned before the next 24-hour cycle.
Validation checklist before sharing outputs:
Try these quickly after reading the guide.
Given severity_floor=0.5 and severity_delta=3.0, classify each case:
Expected labels:
Which statement is correct?
Correct answer: 2.
If a disease is currently low but predicted to rise above delta in 24h, what should happen operationally?
Suggested answer: mark as proactive intervention candidate and monitor associated climate factors (humidity, wetness, airflow) before the next cycle.
This diagram shows how information flows from raw data to decisions.
```mermaid
flowchart TD
    A[Hourly Disease Dataset\nLong Format] --> B[Sanity Checks and Time Ordering]
    B --> C[Long to Wide Transformation]
    C --> D[Future Target Creation\n24h Severity + Trend]
    D --> E[Feature Encoding]
    E --> F[Chronological Split\nTrain / Validation / Test]
    F --> G1[Baseline Path\nHistGradientBoosting]
    F --> G2[Sequence Path\nLSTM and GRU]
    G1 --> H1[Presence + Future Severity Predictions]
    G2 --> H2[Presence + Future Severity Predictions]
    H1 --> I[Trend Derivation Rules\nFloor + Delta]
    H2 --> I
    I --> J[Evaluation Metrics\nF1, ROC-AUC, PR-AUC, MAE, RMSE, R2]
    J --> K[Model Comparison]
    K --> L[Best Model Selection]
    L --> M[Artifact Export\nModel, Scaler, Metrics, Plots]
    M --> N[Operational Use\nMonitoring and Intervention Planning]
```
How to use this diagram:
test_disease_progression.py

File location: `scripts/test_disease_progression.py`
Purpose:
Standalone test script to validate the trained Disease Progression model (LSTM/GRU) across 10 diverse scenarios covering disease absence, outbreak conditions, environmental stress, treatment effects, and trend verification.
Why it exists:
The model predicts disease presence (yes/no per disease), future severity (24h ahead, per disease), and trend labels (absent/emerging/reducing/stable/worsening). This script exercises the model with synthetic disease scenarios without requiring the training notebook or live sensor data — enabling rapid validation and confidence checks.
```bash
# Run all 10 scenarios
python scripts/test_disease_progression.py

# Run a specific scenario (1-10)
python scripts/test_disease_progression.py --scenario 3
```
| # | Scenario | What it validates |
|---|---|---|
| 1 | Healthy greenhouse – optimal conditions | Model correctly predicts all diseases absent; presence flags = 0 |
| 2 | High humidity / poor ventilation – Leaf Mold risk | Model identifies emerging Leaf Mold; other diseases remain absent |
| 3 | Hot dry stress – Spider Mites + Powdery Mildew risk | Model detects multiple disease risks under stress conditions |
| 4 | Seedling stage – moderate conditions baseline | Model calibrated for early growth stage; low disease pressure |
| 5 | Ripe stage – damp late-season conditions | Model identifies late-season disease risk (Late Blight in humid conditions) |
| 6 | Worsening – Early Blight severity ramps 5→40% | Model predicts worsening trend; future severity should increase |
| 7 | Recovery – Leaf Mold drops 45→5% with treatment | Treatment control flag active; model predicts reducing trend |
| 8 | All diseases at 60% severity – multi-disease outbreak | Model handles simultaneous multi-disease simulation; validates independence |
| 9 | Nocturnal damp spell – night humidity peak | Day/night cycle test; night conditions favour fungal diseases |
| 10 | Post-treatment – Spider Mites 30% + treatment active | Validates model response to control action flags |
For each scenario, the script prints a table:
```text
──────────────────────────────────────────────────────────────────────
Scenario 6: Worsening — Early Blight severity ramps 5→40%

Disease          Presence   Current %   Future 24h %   Trend
──────────────────────────────────────────────────────────────────
early_blight     1.0        5.0         15.3           worsening
late_blight      0.0        0.0         0.0            absent
leaf_mold        0.0        0.0         0.0            absent
powdery_mildew   0.0        0.0         0.0            absent
spider_mites     0.0        0.0         0.0            absent
```
Columns:
- `Presence` – (0 or 1) Is the disease currently active? (binary classification output)
- `Current %` – Current severity/infection percentage (0–100)
- `Future 24h %` – Model-predicted 24-hour-ahead severity
- `Trend` – Derived label: absent | emerging | reducing | stable | worsening
Trend derivation rules (from Section 7 of main notebook):
- absent — current % < floor (0.5%) AND future < floor
- emerging — current < floor AND future ≥ floor + delta (3.0)
- reducing — current > future + delta
- worsening — current < future − delta
- stable — all else (neither reducing nor worsening significantly)

Each scenario constructs a 24-timestep sequence (HISTORY_WINDOW) where:
Key features set per scenario (92 total):
- `indoor_temp`, `indoor_humidity`, `air_velocity` (feats 10–12) – environment
- `leaf_wetness_proxy`, `vpd` (feats 18, 16) – moisture and stress indicators
- `stage_name_*` (feats 56–61) – one-hot encoded growth stage
- `control_action_flag__*` (feats 26, 31, 36, 41, 46) – treatment active? (per disease)
- `current_infection_pct__*` (feats 82–86) – current severity per disease
- `disease_present_flag__*` (feats 87–91) – presence indicator per disease
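A hypothetical sketch of how a scenario window could be assembled from those indices (the helper name, default values, and the exact index-to-disease mapping are illustrative, not the script's actual code):

```python
import numpy as np

HISTORY_WINDOW, N_FEATURES = 24, 92  # from the notebook CONFIG / feature list

def make_scenario_window(indoor_temp=24.0, indoor_humidity=95.0,
                         early_blight_pct=5.0):
    window = np.zeros((HISTORY_WINDOW, N_FEATURES), dtype=np.float32)
    window[:, 10] = indoor_temp       # indoor_temp (feat 10)
    window[:, 11] = indoor_humidity   # indoor_humidity (feat 11)
    window[:, 82] = early_blight_pct  # current_infection_pct__early_blight
    window[:, 87] = float(early_blight_pct > 0)  # disease_present_flag__early_blight
    return window[np.newaxis, ...]    # (1, 24, 92), ready for model.predict

print(make_scenario_window().shape)  # (1, 24, 92)
```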
Troubleshooting:

All trends show "absent":

Unexpected future severity (e.g., increases despite treatment) – check that:
- `control_action_flag` is set to 1.0 for the treated disease
- `current_infection_pct` is above zero (the model may not predict recovery from zero)

Import errors:
“Feature count mismatch” errors:
- Check the feature index mappings in `_set_disease()` and `_set_stage()`

This script provides a standalone diagnostic interface to the Disease Progression model:
For greenhouse deployment, live sensor data flows through src/agritwin_gh/models/disease_inference.py → REST API → greenhouse control logic.
Use this when Mermaid rendering is not available.
```text
Raw hourly disease logs
  -> quality checks
  -> long-to-wide pivot
  -> 24h future labels + trend labels
  -> feature encoding
  -> time-based split
  -> [baseline models] and [LSTM/GRU models]
  -> predictions (presence + 24h severity)
  -> trend rules (floor + delta)
  -> metrics + comparison
  -> best model + artifact export
  -> greenhouse action planning
```
Memory anchor for beginners:
“Observe -> Prepare -> Predict -> Explain -> Decide”