AgriTwin-GH

Tomato Disease Progression Synthetic Dataset — Technical Documentation

Generated by: tomato_disease_progression_synthetic_generator.ipynb
Input backbone: tomato_growth_progression_synthetic_hourly.csv
Output CSV: tomato_disease_progression_synthetic_hourly.csv
Random seed: 42
Total rows: 54,240 (10,848 hourly timesteps × 5 diseases)
Timeline: 2024-07-01 to 2025-10-21


1. Overview and Objective

This synthetic dataset provides a realistic, sequential record of tomato disease progression in a controlled greenhouse environment. It uses hourly environmental and crop-stage backbone data as the spine and overlays multi-cycle, multi-severity disease dynamics for five diseases.

Intended use: Sequence model training for disease severity nowcasting and progression modelling. Future prediction targets (24h, 48h look-ahead) are intentionally excluded — they will be generated separately by a downstream process, allowing full flexibility in horizon and label design.


2. Diseases Simulated

| Disease | Pathogen / Organism | Primary Trigger Conditions |
|---|---|---|
| early_blight | Alternaria solani | Warm (25–32 °C), moderate-high humidity (55–80%) |
| late_blight | Phytophthora infestans | Cool (10–22 °C), very high humidity (>85%) |
| leaf_mold | Passalora fulva | Moderate temp (16–24 °C), very high humidity (>82%) |
| powdery_mildew | Leveillula taurica | Moderate temp (20–28 °C), moderate humidity (40–70%) |
| spider_mites | Tetranychus urticae | Hot (28–38 °C), dry conditions (<52% RH) |

3. Environmental Favorability Model

Each disease is scored using a piecewise-linear favorability function per environmental variable:

score(x) = {
    1.0,                                           if opt_low <= x <= opt_high
    (x - stress_low) / (opt_low - stress_low),     if stress_low < x < opt_low
    (stress_high - x) / (stress_high - opt_high),  if opt_high < x < stress_high
    0.0,                                           if x <= stress_low or x >= stress_high
}
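The piecewise definition above can be sketched in Python as follows. This is a minimal illustration, not the generator's actual code: the function name `favorability_score` and the stress limits in the usage lines (15 and 40 °C) are assumptions, since this document only lists the optimal ranges.

```python
def favorability_score(x, stress_low, opt_low, opt_high, stress_high):
    """Piecewise-linear score in [0, 1] for one environmental variable."""
    if opt_low <= x <= opt_high:
        return 1.0  # inside the optimal band
    if stress_low < x < opt_low:
        return (x - stress_low) / (opt_low - stress_low)  # rising edge
    if opt_high < x < stress_high:
        return (stress_high - x) / (stress_high - opt_high)  # falling edge
    return 0.0  # at or beyond a stress limit


# Optimal band taken from the early_blight temperature range (25-32 °C);
# the stress limits 15 and 40 are hypothetical values for illustration.
print(favorability_score(28.0, 15.0, 25.0, 32.0, 40.0))  # 1.0
print(favorability_score(20.0, 15.0, 25.0, 32.0, 40.0))  # 0.5
print(favorability_score(36.0, 15.0, 25.0, 32.0, 40.0))  # 0.5
print(favorability_score(45.0, 15.0, 25.0, 32.0, 40.0))  # 0.0
```

Because the optimal-band check runs first, a configuration where a stress limit coincides with an optimal bound (for example humidity capped at 100%) never reaches a division by zero.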

The overall favorability is a weighted sum:

fav = w_temp * temp_score + w_hum * hum_score + w_vpd * vpd_score
fav += lw_bonus * leaf_wetness_proxy      # bonus or penalty for leaf wetness
fav  = max(fav, 0)
if air_velocity > airflow_threshold:
    fav *= airflow_penalty_factor
fav  = clip(fav, 0, 1)
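As a runnable sketch of the steps above, the function below combines three precomputed scores into one favorability value. The weights (0.4, 0.4, 0.2), the airflow threshold (0.5) and the airflow penalty factor (0.85) are hypothetical defaults chosen only for illustration; this document does not state the values the generator uses. The `lw_bonus` values in the usage lines come from the configuration table.

```python
def overall_favorability(temp_score, hum_score, vpd_score,
                         leaf_wetness_proxy, air_velocity, lw_bonus,
                         w_temp=0.4, w_hum=0.4, w_vpd=0.2,
                         airflow_threshold=0.5, airflow_penalty_factor=0.85):
    """Combine per-variable scores into one favorability value in [0, 1].

    All keyword defaults are assumed values, not the generator's settings.
    """
    fav = w_temp * temp_score + w_hum * hum_score + w_vpd * vpd_score
    fav += lw_bonus * leaf_wetness_proxy   # bonus (or penalty if negative)
    fav = max(fav, 0.0)                    # a penalty cannot push fav below 0
    if air_velocity > airflow_threshold:
        fav *= airflow_penalty_factor      # strong airflow reduces favorability
    return min(max(fav, 0.0), 1.0)         # clip to [0, 1]


# late_blight (lw_bonus = 0.30): ideal conditions plus wet leaves saturate at 1.0
print(overall_favorability(1.0, 1.0, 1.0, 1.0, 0.2, lw_bonus=0.30))
# spider_mites (lw_bonus = -0.18): wet leaves and strong airflow both lower fav
print(overall_favorability(0.5, 0.5, 0.5, 1.0, 1.0, lw_bonus=-0.18))
```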

Disease Configuration Parameters

| Disease | temp_opt (°C) | hum_opt (% RH) | vpd_opt (kPa) | r (growth) | d (decline) | lw_bonus |
|---|---|---|---|---|---|---|
| early_blight | (25.0, 32.0) | (55.0, 80.0) | (0.7, 2.0) | 0.180 | 0.042 | 0.22 |
| late_blight | (10.0, 22.0) | (85.0, 100.0) | (0.0, 0.6) | 0.220 | 0.048 | 0.30 |
| leaf_mold | (16.0, 24.0) | (82.0, 100.0) | (0.0, 0.5) | 0.160 | 0.038 | 0.28 |
| powdery_mildew | (20.0, 28.0) | (40.0, 70.0) | (0.6, 1.8) | 0.140 | 0.032 | 0.00 |
| spider_mites | (28.0, 38.0) | (28.0, 52.0) | (2.0, 4.5) | 0.200 | 0.044 | -0.18 |

4. Stage Susceptibility

Disease progression speed is modulated by the crop growth stage. A per-stage susceptibility score (0 = fully resistant, 1 = fully susceptible) multiplies the instantaneous growth rate:

| Stage | early_blight | late_blight | leaf_mold | powdery_mildew | spider_mites |
|---|---|---|---|---|---|
| seedling | 0.35 | 0.45 | 0.30 | 0.50 | 0.40 |
| early_vegetative | 0.55 | 0.60 | 0.50 | 0.70 | 0.62 |
| flowering_initiation | 0.72 | 0.78 | 0.68 | 0.78 | 0.75 |
| flowering | 0.85 | 0.90 | 0.88 | 0.82 | 0.85 |
| unripe | 0.88 | 0.92 | 0.80 | 0.65 | 0.80 |
| ripe | 0.60 | 0.55 | 0.45 | 0.40 | 0.50 |
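To make the multiplication concrete, the sketch below stores two columns of the table as a nested dictionary and scales a growth rate by the stage score. The names `STAGE_SUSCEPTIBILITY` and `effective_growth_rate` are illustrative assumptions rather than identifiers from the generator notebook.

```python
# Susceptibility scores copied from the table above (two diseases shown).
STAGE_SUSCEPTIBILITY = {
    "early_blight": {
        "seedling": 0.35, "early_vegetative": 0.55,
        "flowering_initiation": 0.72, "flowering": 0.85,
        "unripe": 0.88, "ripe": 0.60,
    },
    "spider_mites": {
        "seedling": 0.40, "early_vegetative": 0.62,
        "flowering_initiation": 0.75, "flowering": 0.85,
        "unripe": 0.80, "ripe": 0.50,
    },
}


def effective_growth_rate(disease, stage, r):
    """Scale the growth rate r by the susceptibility of the current stage."""
    return r * STAGE_SUSCEPTIBILITY[disease][stage]


# early_blight has r = 0.180: the same outbreak grows more than twice as
# fast at flowering (score 0.85) as at the seedling stage (score 0.35).
print(effective_growth_rate("early_blight", "seedling", 0.180))
print(effective_growth_rate("early_blight", "flowering", 0.180))
```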

5. Outbreak Trigger Logic

Outbreaks are seeded at pre-planned positions for each disease. Each seed defines the exact row index where the initial infection is injected. Each disease receives 6 seeds across the full timeline: two mild, two moderate, and two severe.

This mix ensures training diversity across the full infection severity spectrum.

| Disease | Start Row | Initial Seed (%) | Peak Severity (%) | Category | Max Duration (h) |
|---|---|---|---|---|---|
| early_blight | 300 | 0.8 | 16.0 | mild | 420 |
| early_blight | 1400 | 2.0 | 82.0 | severe | 1080 |
| early_blight | 3100 | 1.0 | 38.0 | moderate | 580 |
| early_blight | 5600 | 2.5 | 91.0 | severe | 1150 |
| early_blight | 7600 | 0.6 | 20.0 | mild | 440 |
| early_blight | 9200 | 1.4 | 46.0 | moderate | 620 |
| late_blight | 420 | 1.5 | 78.0 | severe | 1040 |
| late_blight | 3000 | 0.5 | 14.0 | mild | 360 |
| late_blight | 4200 | 1.2 | 42.0 | moderate | 600 |
| late_blight | 6400 | 0.6 | 18.0 | mild | 400 |
| late_blight | 8100 | 2.2 | 88.0 | severe | 1120 |
| late_blight | 9700 | 1.0 | 36.0 | moderate | 540 |
| leaf_mold | 200 | 1.8 | 74.0 | severe | 1000 |
| leaf_mold | 2200 | 0.7 | 19.0 | mild | 420 |
| leaf_mold | 3800 | 1.2 | 44.0 | moderate | 620 |
| leaf_mold | 5800 | 2.0 | 86.0 | severe | 1100 |
| leaf_mold | 8300 | 0.5 | 15.0 | mild | 380 |
| leaf_mold | 9800 | 1.3 | 40.0 | moderate | 560 |
| powdery_mildew | 650 | 0.7 | 21.0 | mild | 450 |
| powdery_mildew | 1900 | 2.2 | 85.0 | severe | 1100 |
| powdery_mildew | 3600 | 1.0 | 40.0 | moderate | 580 |
| powdery_mildew | 6000 | 2.0 | 92.0 | severe | 1150 |
| powdery_mildew | 7900 | 0.8 | 18.0 | mild | 430 |
| powdery_mildew | 9400 | 1.5 | 44.0 | moderate | 600 |
| spider_mites | 350 | 2.0 | 80.0 | severe | 1060 |
| spider_mites | 2000 | 0.6 | 17.0 | mild | 400 |
| spider_mites | 3700 | 1.2 | 42.0 | moderate | 600 |
| spider_mites | 6100 | 2.5 | 93.0 | severe | 1200 |
| spider_mites | 8000 | 1.3 | 46.0 | moderate | 640 |
| spider_mites | 9900 | 0.5 | 16.0 | mild | 380 |
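One possible way to hold these seeds in code is a small immutable record per row. The sketch below encodes the six early_blight rows from the table and checks the category mix. The class name `OutbreakSeed` and its field names are hypothetical; the generator notebook may structure its seed configuration differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OutbreakSeed:
    """One pre-planned outbreak (field names are illustrative)."""
    disease: str
    start_row: int         # row index where the seed infection is injected
    initial_pct: float     # initial seed infection (%)
    peak_pct: float        # peak severity (%)
    category: str          # mild / moderate / severe
    max_duration_h: int    # maximum cycle duration in hours


# The six early_blight rows from the table above.
EARLY_BLIGHT_SEEDS = [
    OutbreakSeed("early_blight", 300, 0.8, 16.0, "mild", 420),
    OutbreakSeed("early_blight", 1400, 2.0, 82.0, "severe", 1080),
    OutbreakSeed("early_blight", 3100, 1.0, 38.0, "moderate", 580),
    OutbreakSeed("early_blight", 5600, 2.5, 91.0, "severe", 1150),
    OutbreakSeed("early_blight", 7600, 0.6, 20.0, "mild", 440),
    OutbreakSeed("early_blight", 9200, 1.4, 46.0, "moderate", 620),
]

# Count seeds per severity category and collect the trigger rows.
category_counts = Counter(seed.category for seed in EARLY_BLIGHT_SEEDS)
trigger_rows = sorted(seed.start_row for seed in EARLY_BLIGHT_SEEDS)
print(dict(category_counts))
print(trigger_rows)
```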

6. Disease Cycle Phases and Progression Equations

Each outbreak cycle advances through up to six sequential phases:

| Phase | Entry Condition | Growth Behaviour |
|---|---|---|
| latent | Cycle seed injected | Very slow fixed growth; pathogen establishing |
| onset | elapsed >= 24 h OR infection >= 2.5% | Moderate logistic growth at 50% of full rate |
| spread | infection >= max(12% of K, 5%) | Full aggressive logistic growth (environment-responsive) |
| severe | infection >= 70% of K | Fluctuation near K; control events cause decline |
| stabilization | elapsed >= 76% of max_duration OR 24 h of low environmental favorability | Gentle decline from peak |
| decline | elapsed >= 88% of max_duration | Exponential decay until infection < 0.5% |
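The entry conditions in the table can be read as a set of threshold checks. The sketch below evaluates them from the latest phase to the earliest and returns the first match. It is a stateless simplification under stated assumptions: the function name `cycle_phase` and the `low_fav_hours` counter are hypothetical, this document does not define what counts as "low" favorability, and the generator may track the phase as explicit state instead.

```python
def cycle_phase(infection, elapsed_h, K, max_duration_h, low_fav_hours=0):
    """Pick a phase from the entry conditions, checking later phases first.

    infection      current infection (%)
    elapsed_h      hours since the seed was injected
    K              carrying capacity of the cycle (%)
    max_duration_h maximum cycle duration (hours)
    low_fav_hours  consecutive hours of low favorability (assumed counter)
    """
    if elapsed_h >= 0.88 * max_duration_h:
        return "decline"
    if elapsed_h >= 0.76 * max_duration_h or low_fav_hours >= 24:
        return "stabilization"
    if infection >= 0.70 * K:
        return "severe"
    if infection >= max(0.12 * K, 5.0):
        return "spread"
    if elapsed_h >= 24 or infection >= 2.5:
        return "onset"
    return "latent"


# Example cycle with K = 82% and a 1080 h maximum duration.
for infection, elapsed in [(2.0, 10), (2.0, 30), (12.0, 100),
                           (60.0, 300), (60.0, 850), (30.0, 960)]:
    print(elapsed, infection, cycle_phase(infection, elapsed, 82.0, 1080))
```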

Spread Phase Equation (Logistic Growth)

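\[\Delta I = r \cdot I \cdot \left(1 - \frac{I}{K}\right) \cdot \text{fav}^{0.80} \cdot \text{susc}\]

where \(I\) is the current infection (%), \(r\) the disease growth rate, \(K\) the carrying capacity of the cycle, fav the environmental favorability, and susc the stage susceptibility score.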

When a control action is active during spread/onset: \(\Delta I_{\text{ctrl}} = \Delta I \cdot (1 - \text{intervention\_reduction})\)

Gaussian noise is added for realism at every step: \(I_{t+1} = \text{clip}\left( I_t + \Delta I + \mathcal{N}(0,\, \sigma), \; 0, 100 \right)\)

where $\sigma = \max(I \cdot 0.03,\; 0.08)$ during spread.
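A single hourly update of the spread phase might look like the sketch below. It assumes that σ is the standard deviation of the Gaussian noise and that infection and K are both expressed in percent; the function names `logistic_increment` and `spread_step` are illustrative, not taken from the notebook.

```python
import random


def logistic_increment(I, r, K, fav, susc, intervention_reduction=0.0):
    """Deterministic part of the update: logistic growth scaled by
    favorability and susceptibility, damped while a control is active."""
    delta = r * I * (1.0 - I / K) * fav ** 0.80 * susc
    return delta * (1.0 - intervention_reduction)


def spread_step(I, r, K, fav, susc, rng, intervention_reduction=0.0):
    """One hourly step: increment plus Gaussian noise, clipped to [0, 100]."""
    sigma = max(I * 0.03, 0.08)  # noise scale grows with infection level
    noisy = I + logistic_increment(I, r, K, fav, susc, intervention_reduction)
    noisy += rng.gauss(0.0, sigma)  # sigma treated as a standard deviation
    return min(max(noisy, 0.0), 100.0)


# early_blight (r = 0.180) at the flowering stage (susc = 0.85), K = 82%.
rng = random.Random(42)
infection = 10.0
for hour in range(5):
    infection = spread_step(infection, 0.180, 82.0, 1.0, 0.85, rng)
    print(hour + 1, round(infection, 3))
```

The increment vanishes as infection approaches K, which is what keeps a cycle near its ceiling rather than growing without bound.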

Decline Phase

\[I_{t+1} = I_t \cdot (1 - d_r) + \mathcal{N}(0, 0.12)\]

where $d_r$ is the decline rate (column d in the configuration table), multiplied by 2.5 while a control action is active.
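To give a feel for the timescale, the sketch below iterates the decay without the noise term and counts the hours until infection drops below the 0.5% cutoff. The function name `hours_to_clear` is hypothetical, and leaving out the noise is a simplification, so real cycles in the dataset will deviate from these counts.

```python
def hours_to_clear(I0, d, control_active=False, threshold=0.5):
    """Hours of noise-free exponential decay until infection < threshold."""
    rate = d * 2.5 if control_active else d  # control multiplies the rate
    if not 0.0 < rate < 1.0:
        raise ValueError("decline rate must lie strictly between 0 and 1")
    infection, hours = I0, 0
    while infection >= threshold:
        infection *= (1.0 - rate)
        hours += 1
    return hours


# early_blight (d = 0.042), starting the decline at 16% infection.
print(hours_to_clear(16.0, 0.042))                       # 81
print(hours_to_clear(16.0, 0.042, control_active=True))  # 32
```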


7. Intervention / Control Logic

| Parameter | Value |
|---|---|
| Trigger level | infection >= 15% |
| Trigger probability | 35% per eligible hour |
| Active duration | 48 hours per event |
| Cooldown period | 72 hours between events |
| Events per cycle | Multiple allowed (cooldown-gated) |

During a control event, the growth increment is reduced by intervention_reduction (55–68%) and the decline rate is multiplied by 2.5.
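The trigger, duration and cooldown rules can be sketched as a simple scheduler over an hourly infection series. One assumption is made explicit here: the 72-hour cooldown is counted from the end of the previous event, which this document does not specify. The function name `schedule_control` is illustrative.

```python
import random

TRIGGER_LEVEL = 15.0         # infection (%) at which control becomes eligible
TRIGGER_PROBABILITY = 0.35   # chance per eligible hour
ACTIVE_HOURS = 48            # length of one control event
COOLDOWN_HOURS = 72          # assumed to start when the event ends


def schedule_control(infection_series, rng,
                     trigger_probability=TRIGGER_PROBABILITY):
    """Return a 0/1 flag per hour marking active control windows."""
    flags = []
    active_until = 0   # first hour index after the current event
    next_allowed = 0   # first hour index at which a new event may start
    for hour, infection in enumerate(infection_series):
        if hour < active_until:
            flags.append(1)  # still inside an active window
            continue
        eligible = hour >= next_allowed and infection >= TRIGGER_LEVEL
        if eligible and rng.random() < trigger_probability:
            active_until = hour + ACTIVE_HOURS
            next_allowed = active_until + COOLDOWN_HOURS
            flags.append(1)
        else:
            flags.append(0)
    return flags


# With the probability forced to 1.0 the schedule is deterministic:
# active for hours 0-47, cooling down for 48-119, active again from 120.
flags = schedule_control([20.0] * 200, random.Random(42),
                         trigger_probability=1.0)
print(sum(flags))
```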

Control Action Pools per Disease

| Disease | Control Action Options |
|---|---|
| early_blight | fungicide_spray, copper_treatment, reduced_irrigation |
| late_blight | fungicide_spray, ventilation_increase, humidity_reduction |
| leaf_mold | ventilation_increase, humidity_reduction, leaf_pruning |
| powdery_mildew | sulfur_treatment, potassium_bicarbonate, improved_airflow |
| spider_mites | acaricide_spray, humidity_increase, biological_release |

8. Synthetic Column Descriptions

| Column | Type | Description |
|---|---|---|
| disease_name | str | One of the five simulated diseases |
| disease_present_flag | int (0/1) | 1 if infection > 0% at this hour |
| disease_cycle_id | int | Sequential cycle number (1–6 per disease); 0 = no active cycle |
| disease_cycle_stage | str | Phase: latent / onset / spread / severe / stabilization / decline / none |
| outbreak_trigger_flag | int (0/1) | 1 at the exact hour an outbreak seed was injected |
| control_action_flag | int (0/1) | 1 during an active control intervention window |
| control_action_type | str | Action applied (e.g., fungicide_spray); 'none' otherwise |
| stage_susceptibility_score | float | Susceptibility of current growth stage $\in [0, 1]$ |
| disease_risk_score | float | fav × susc; combined hourly disease risk $\in [0, 1]$ |
| hours_since_disease_onset | float | Hours elapsed since current cycle onset; NaN if no active cycle |
| current_infection_pct | float | Simulated crop infection percentage $\in [0, 100]$ |
| infection_growth_rate_hourly | float | Delta infection per hour (positive = spreading, negative = declining) |

9. Target Policy — Future Targets Intentionally Excluded

This dataset does not contain direct prediction target columns such as predicted_infection_pct_24h or predicted_infection_pct_48h.

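Rationale: targets are generated separately by a downstream process, which leaves the prediction horizon and label design fully flexible.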

How to derive targets from this dataset:

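# Example: create 24 h and 48 h look-ahead targets for one disease
import pandas as pd

df_final = pd.read_csv("tomato_disease_progression_synthetic_hourly.csv", parse_dates=["timestamp"])
df_disease = df_final[df_final["disease_name"] == "early_blight"].copy()
df_disease = df_disease.sort_values("timestamp").reset_index(drop=True)
# Rows are hourly, so shifting by -24 / -48 rows looks 24 h / 48 h ahead;
# the final 24 / 48 rows have no future value and become NaN.
df_disease["target_24h"] = df_disease["current_infection_pct"].shift(-24)
df_disease["target_48h"] = df_disease["current_infection_pct"].shift(-48)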

10. Fallback Logic for Missing Columns

| Missing Column | Fallback Action |
|---|---|
| leaf_wetness_proxy | Derived as (indoor_humidity > 95).astype(float) |
| vpd | Uses vpd_proxy if available; otherwise 1.2 kPa |
| Any environmental column | Default neutral value; printed as warning |
| stage_name | Defaults to early_vegetative (susceptibility 0.5) |
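The three concrete fallbacks in the table can be sketched row by row as below. This is an illustration on plain dictionaries, whereas the generator presumably operates on DataFrame columns; the function name `apply_fallbacks` is hypothetical. The generic "default neutral value" rule is left out because the neutral values are not listed in this document.

```python
def apply_fallbacks(row):
    """Return a copy of a row dict with missing columns filled in.

    Expects 'indoor_humidity' to be present when leaf_wetness_proxy
    has to be derived.
    """
    filled = dict(row)
    if "leaf_wetness_proxy" not in filled:
        # Leaves are treated as wet when relative humidity exceeds 95%.
        filled["leaf_wetness_proxy"] = float(filled["indoor_humidity"] > 95)
    if "vpd" not in filled:
        # Prefer vpd_proxy; otherwise fall back to 1.2 kPa.
        filled["vpd"] = filled.get("vpd_proxy", 1.2)
    if "stage_name" not in filled:
        filled["stage_name"] = "early_vegetative"
    return filled


print(apply_fallbacks({"indoor_humidity": 97.0}))
print(apply_fallbacks({"indoor_humidity": 80.0, "vpd_proxy": 0.9}))
```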

11. Key Assumptions

  1. The backbone hourly time-series is monotonically increasing in time with 1-hour gaps.
  2. Disease simulations are statistically independent — co-infection dynamics are not modelled.
  3. The logistic growth model uses a single carrying capacity K per outbreak cycle.
  4. Environmental favorability is computed independently at each hour (no memory of past conditions).
  5. Control actions are probabilistically triggered; their probability and duration are simplified approximations of real greenhouse management practices.
  6. Severe cycles can approach but typically do not exceed 95% infection due to logistic ceiling K.
  7. Multiple active cycles of the same disease may overlap; the maximum infection value is retained.

Generated using Python · pandas · numpy · AgriTwin-GH project