Who is this for?
This document is written for anyone — farmer, student, developer, or curious reader — with zero prior knowledge of machine learning or data science. Every concept is explained from the ground up, with analogies and plain language throughout.
A tomato plant does not grow randomly — it moves through a fixed sequence of biological stages, from a tiny seedling all the way to a ripe, harvestable fruit. At every stage along that journey, the plant has different needs:
The problem: In a large greenhouse, a grower cannot watch every plant every hour. By the time a human notices that a plant is transitioning into a critical stage, it may already be too late to respond optimally.
The solution this model provides: Using only the greenhouse’s existing sensor readings (temperature, humidity, light, CO₂, etc.), the model:
This turns a greenhouse from a passive environment into a proactive, predictive system — giving operators time to prepare the right intervention before a critical stage transition occurs.
The model makes five simultaneous predictions for every hour of sensor data it receives:
| Output | Type | Plain-English Meaning |
|---|---|---|
| Current Stage | Classification | “Right now this plant is in the flowering stage.” |
| Next Stage | Classification | “After flowering, it will enter the unripe stage.” |
| Hours to Transition | Regression (a number) | “The transition will happen in approximately 38.5 hours.” |
| Transition within 24h | Binary (yes/no) | “Yes, there will be a stage change in the next 24 hours.” |
| Transition within 48h | Binary (yes/no) | “Yes, there will be a stage change in the next 48 hours.” |
All five predictions come from a single model in one pass — this is called multi-task learning (explained in Section 7).
The model works with six sequential stages. A tomato plant always passes through them in this exact order — it cannot skip or go backwards.
| # | Stage Name | Plain Description |
|---|---|---|
| 0 | `seedling` | Germinated seed, first tiny leaves |
| 1 | `early_vegetative` | True leaves growing, plant building structure |
| 2 | `flowering_initiation` | First flower buds becoming visible |
| 3 | `flowering` | Open yellow flowers, pollination occurring |
| 4 | `unripe` | Green fruits developing |
| 5 | `ripe` | Fruits colouring, ready to harvest |
For a full description of each stage including visual features, management decisions, and environmental requirements, see the Tomato Growth Stage Classification document.
File: data/processed/Growth Progression/tomato_growth_progression_synthetic_hourly.csv
This is a synthetic (computer-generated, but physically realistic) hourly time-series dataset. “Hourly” means there is one data row for every hour of a tomato plant’s life, across multiple complete growth cycles.
| Column | What it means | Example value |
|---|---|---|
| `timestamp` | The exact date and hour this reading was taken | `2025-03-01 14:00:00` |
| `cycle_id` | Which growth cycle this belongs to (each cycle = one full plant life) | `3` |
| `stage_name` | The growth stage label at this hour | `flowering` |
| `indoor_temp` (→ `temperature`) | Air temperature inside the greenhouse in °C | `22.4` |
| `indoor_humidity` (→ `humidity`) | Relative humidity in the greenhouse in % | `71.2` |
| `solarradiation` (→ `light`) | Light intensity reaching the plants | `345.0` |
| `indoor_CO2` (→ `co2`) | Carbon dioxide concentration in ppm | `812.0` |

Note: Column names in the raw file vary (e.g., `indoor_temp` vs `temperature`). The notebook's standardisation step (Section 3) handles all these aliases automatically — you do not need to rename anything manually.
Think of a cycle as one complete life of a single tomato plant — from germination all the way to harvest. The dataset contains multiple cycles, each representing a separate plant (or planting season). The model learns patterns from all cycles combined, then is asked to predict on cycles it has never seen before.
The notebook is divided into 19 numbered sections (Sections 0–18). Each section has a specific job. Here is what each one does and why.
What it does:
This is the very first thing that runs. It implements an intelligent Run ID detection system:
- Scans `src/agritwin_gh/models/artifacts/` for directories matching the pattern `growth_stage_progression_*`
- Extracts the timestamp Run ID from any match (e.g. `20260310_142305`)

This means:
Why it matters:
Without Run ID detection, you would have to manually track which model was trained and either reuse it by hard-coding the path or retrain from scratch. The automatic detection makes the workflow reproducible and efficient.
What it sets up:
```
E:\AgriTwin-GH\
├── src\agritwin_gh\models\
│   ├── growth_stage_progression_<RUN_ID>.keras        ← the trained model
│   └── artifacts\growth_stage_progression_<RUN_ID>\
│       ├── plots\      ← all generated PNG charts
│       ├── metrics\    ← JSON/CSV metric files
│       ├── logs\       ← execution log file
│       └── reports\    ← text reports and prediction CSVs
```
All folder creation happens automatically — you do not need to create anything by hand.
Implementation details:
The detection scans the artifacts folder and looks for directories like growth_stage_progression_20260310_192038. When found, it extracts the Run ID (20260310_192038) and verifies the model file exists before reusing.
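The scan described above can be sketched as follows. This is a hypothetical simplification: the directory name, pattern, and function name are assumptions, and the real notebook additionally verifies that the corresponding `.keras` model file exists before reusing a Run ID.

```python
import re
from pathlib import Path
from typing import Optional

# Run directories look like growth_stage_progression_20260310_192038
RUN_DIR_PATTERN = re.compile(r"growth_stage_progression_(\d{8}_\d{6})$")

def detect_latest_run_id(artifacts_dir: Path) -> Optional[str]:
    """Return the most recent Run ID found under artifacts_dir, or None."""
    candidates = artifacts_dir.iterdir() if artifacts_dir.exists() else []
    run_ids = [m.group(1)
               for d in candidates if d.is_dir()
               for m in [RUN_DIR_PATTERN.match(d.name)] if m]
    # Timestamp-formatted IDs sort lexicographically, so max() is the newest
    return max(run_ids) if run_ids else None
```

If `detect_latest_run_id` returns `None`, a fresh Run ID would be generated and a new training run started.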
What it does:
Loads all the Python libraries (tools) the notebook needs and sets random seeds to make results reproducible — meaning if you run the notebook twice with the same data, you get the same results.
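The seeding step might look like the following sketch (the `SEED` value and function name are assumptions, not taken from the notebook):

```python
import os
import random

import numpy as np

SEED = 42  # assumed value; any fixed integer gives reproducibility

def set_global_seeds(seed: int = SEED) -> None:
    """Seed Python, NumPy (and TensorFlow, if installed) so two runs match."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # optional: only seeded if TF is available
        tf.random.set_seed(seed)
    except ImportError:
        pass
```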
Key libraries loaded:
| Library | What it does (plain English) |
|---|---|
| `pandas` | Works with tabular data (like Excel, but in Python) |
| `numpy` | Does fast numerical calculations on arrays of numbers |
| `matplotlib` / `seaborn` | Creates charts and plots |
| `scikit-learn` | Provides classic ML algorithms and preprocessing tools |
| `xgboost` | A powerful tree-based algorithm used for baseline comparison |
| `tensorflow` / `keras` | The deep learning framework used to build and train the LSTM |
Also defines the save_fig() helper — every plot in the notebook is automatically saved as a PNG file to the artifacts folder when this function is called.
What it does:
Reads the CSV dataset from disk and performs a quick health check — printing the shape (number of rows and columns), listing all column names, and counting missing values.
Key output: dataset_summary.json — a JSON file recording the dataset’s basic statistics, saved to the artifacts folder as a permanent record of what data went into this run.
What “loading” looks like (simplified):
```python
# Read the file
df = pd.read_csv("...tomato_growth_progression_synthetic_hourly.csv")

# First 5 rows
df.head()
```
What it does:
Real datasets rarely arrive in perfect, consistent format. This section handles three problems:
Column name aliases — The raw file may call temperature indoor_temp, air_temperature, temp_c, or simply temperature. The notebook maps all known variants to a single canonical name automatically.
Stage label normalisation — The dataset might spell early_vegetative as "early vegetative" or "Early Vegetative". All variants are mapped to the exact lowercase underscore form the model expects.
Sorting and deduplication — Rows are sorted chronologically within each cycle and any exact duplicates are removed.
Why this matters:
If stage labels are inconsistent, the model might think "Flowering" and "flowering" are two different classes — which would corrupt the training data completely.
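The alias mapping and label normalisation can be sketched as below. The alias dictionary here is an illustrative subset with assumed function names; the notebook's full map covers more variants.

```python
import pandas as pd

# Illustrative subset of the alias map (assumption; the real map is larger)
COLUMN_ALIASES = {
    "indoor_temp": "temperature",
    "air_temperature": "temperature",
    "temp_c": "temperature",
    "indoor_humidity": "humidity",
    "solarradiation": "light",
    "indoor_CO2": "co2",
}

def standardise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename known aliases to canonical names; unknown columns pass through."""
    return df.rename(columns={c: COLUMN_ALIASES.get(c, c) for c in df.columns})

def normalise_stage_labels(s: pd.Series) -> pd.Series:
    """Map 'Early Vegetative' / 'early vegetative' → 'early_vegetative'."""
    return s.str.strip().str.lower().str.replace(" ", "_")
```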
What it does:
Before training, the notebook validates every growth cycle to catch data quality problems. For each cycle it checks:
| Check | What it’s looking for |
|---|---|
| Minimum rows | The cycle must have at least 10 hourly readings |
| Monotonic timestamps | Time must always go forward, never backwards |
| Duplicate timestamps | No two rows with the same date-hour in one cycle |
| Stage reversals | A plant in flowering cannot go back to seedling |
| Missing stages | Each complete cycle should pass through all 6 stages |
What happens to bad cycles?
- Cycles that begin mid-way (e.g. at `flowering_initiation` rather than `seedling`) are flagged but kept — the model can still learn from partial cycles.

Key output: `cycle_summary.csv` — a CSV file with one row per cycle and a column listing any issues found.
What it does:
This is where the five “answers” the model needs to learn are computed. For every row in the dataset, the code looks ahead in time (within the same cycle) to calculate:
- `next_stage` — What is the next stage the plant will enter?
- `hours_to_next_stage` — How many hours until that transition?
- `transition_within_24h` — Is that transition within 24 hours? (1 = yes, 0 = no)
- `transition_within_48h` — Is that transition within 48 hours? (1 = yes, 0 = no)
- `stage_progress_fraction` — How far through the current stage is the plant? (0.0 = just entered, 1.0 = about to leave)

Analogy: Imagine the plant’s life as a road trip with 6 cities as stops. At any given point on that road, you can calculate: What city are you currently in? What’s the next city? How many hours until you arrive? Will you get there in 24 hours? In 48 hours? How far through the current leg of the journey are you?
Rows in a cycle’s final recorded stage (`ripe`) will have NaN (not a number) for `next_stage` and the related target columns, because once the plant has entered the final stage there is no further transition to look ahead to.
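The look-ahead can be sketched as follows. This is a minimal illustration with assumed function and column conventions, not the notebook's exact implementation (it omits `stage_progress_fraction` for brevity):

```python
import numpy as np
import pandas as pd

def add_progression_targets(df: pd.DataFrame) -> pd.DataFrame:
    """For each hourly row, look ahead within the same cycle to the next
    stage transition. Rows in a cycle's final stage get NaN targets.
    Assumes columns: cycle_id, timestamp (hourly), stage_name."""
    df = df.sort_values(["cycle_id", "timestamp"]).copy()
    out = []
    for _, g in df.groupby("cycle_id", sort=False):
        stage = g["stage_name"].to_numpy()
        ts = pd.to_datetime(g["timestamp"]).to_numpy()
        # change[i] is True when the stage switches between row i and i+1
        change = np.r_[stage[1:] != stage[:-1], False]
        next_stage = np.full(len(g), None, dtype=object)
        hours = np.full(len(g), np.nan)
        nxt_idx = -1  # index where the nearest future stage begins
        for i in range(len(g) - 1, -1, -1):  # walk backwards in time
            if change[i]:
                nxt_idx = i + 1
            if nxt_idx >= 0:
                next_stage[i] = stage[nxt_idx]
                hours[i] = (ts[nxt_idx] - ts[i]) / np.timedelta64(1, "h")
        g = g.assign(next_stage=next_stage, hours_to_next_stage=hours)
        hrs = g["hours_to_next_stage"]
        g["transition_within_24h"] = (hrs <= 24).astype(float).where(hrs.notna())
        g["transition_within_48h"] = (hrs <= 48).astype(float).where(hrs.notna())
        out.append(g)
    return pd.concat(out)
```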
What it does:
The raw sensor readings (temperature, humidity, light, CO₂) are useful, but the model learns much better if it also has access to derived features — things like:
Rolling statistics — “What was the average temperature over the last 6 hours? 12 hours? 24 hours?” These capture trends that a single snapshot cannot.
Lag features — “What was the temperature 1 hour ago? 2 hours ago? 6 hours ago?” This gives the model a sense of direction — is the temperature rising or falling?
Computed biologically-meaningful features:
| Feature | Biological Meaning |
|---|---|
| `elapsed_hours` | How long has this cycle been running? |
| `hour_of_day` | What hour of the day is it (for circadian rhythm patterns)? |
| `cumulative_gdh` | Growing Degree Hours — cumulative heat units the plant has received (plants need a certain amount of heat to advance stages) |
| `cumulative_light` | How much total light has the plant received (important for the flowering trigger) |
| `vpd` | Vapour Pressure Deficit — a measure of how effectively the plant can transpire; computed from temperature and humidity |
Result: Typically 60–80 features per timestep, depending on how many raw sensor columns exist.
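A minimal sketch of these derived features, assuming hourly rows for one cycle. The window/lag choices, the GDH base temperature, and the use of the Tetens formula for VPD are illustrative assumptions, not the notebook's exact parameters:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, base_temp: float = 10.0) -> pd.DataFrame:
    """Expects hourly rows for ONE cycle with columns:
    timestamp, temperature (°C), humidity (%), light."""
    df = df.sort_values("timestamp").copy()
    # Rolling statistics: average temperature over the last 6/12/24 hours
    for w in (6, 12, 24):
        df[f"temp_roll_mean_{w}h"] = df["temperature"].rolling(w, min_periods=1).mean()
    # Lag features: temperature 1, 2 and 6 hours ago
    for lag in (1, 2, 6):
        df[f"temp_lag_{lag}h"] = df["temperature"].shift(lag)
    # Time features: cycle age and circadian position
    ts = pd.to_datetime(df["timestamp"])
    df["elapsed_hours"] = (ts - ts.iloc[0]).dt.total_seconds() / 3600.0
    df["hour_of_day"] = ts.dt.hour
    # Growing Degree Hours: cumulative heat above an (assumed) base temperature
    df["cumulative_gdh"] = (df["temperature"] - base_temp).clip(lower=0).cumsum()
    df["cumulative_light"] = df["light"].cumsum()
    # Vapour Pressure Deficit (kPa) via the Tetens saturation-pressure formula
    svp = 0.6108 * np.exp(17.27 * df["temperature"] / (df["temperature"] + 237.3))
    df["vpd"] = svp * (1 - df["humidity"] / 100.0)
    return df
```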
What it does:
LSTM models do not take one row at a time — they take a window of consecutive rows.
Imagine sliding a 24-hour window across the timeline:
Each window becomes one training sample. The model learns: “Given the last 24 hours of sensor readings, what growth state is the plant in and where is it going next?”
The window size is SEQ_LEN = 24 (24 hours). This is a design choice — a 24-hour window captures a full day-night cycle, which is biologically meaningful for tomato growth.
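The sliding-window construction can be sketched like this (a simplified illustration with an assumed function name; the notebook's version also attaches the five targets to each window):

```python
import numpy as np

SEQ_LEN = 24  # matches the notebook's window size

def build_windows(features: np.ndarray, seq_len: int = SEQ_LEN) -> np.ndarray:
    """Slide a seq_len-hour window over one cycle's feature matrix (T, F)
    and stack the windows into a 3D array (N, seq_len, F)."""
    n = features.shape[0] - seq_len + 1
    if n <= 0:  # cycle shorter than one window
        return np.empty((0, seq_len, features.shape[1]))
    return np.stack([features[i:i + seq_len] for i in range(n)])
```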
Output shape:
`X.shape = (N_samples, 24, N_features)` — a 3D array where:

- `N_samples` = total number of windows built
- `24` = time steps in each window
- `N_features` = number of features at each timestep

What it does:
Divides the data into three sets. This is done at the cycle level (not the row level) to prevent data leakage.
| Set | Fraction | Purpose |
|---|---|---|
| Training | 70% of cycles | The model learns from this data |
| Validation | 15% of cycles | Used during training to check for overfitting |
| Test | 15% of cycles | Held back completely until final evaluation |
Why split by cycle, not by row?
If you split by row, sequences from the same cycle could appear in both training and test sets — the model would effectively be “previewing” the answer. Splitting by cycle ensures the model is evaluated on completely unseen plants.
Key output: split_summary.json — records exactly how many cycles and sequences are in each split.
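The cycle-level split can be sketched as follows (function name and seed are assumptions; the point is that whole cycles, never individual rows, are assigned to a split):

```python
import numpy as np

def split_cycles(cycle_ids, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle the unique cycle IDs, then assign whole cycles to
    train/val/test so no cycle can leak across splits."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.unique(cycle_ids))
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))
```

Rows (and the windows built from them) are then routed to whichever split their `cycle_id` belongs to.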
What it does:
Raw sensor values have very different numerical ranges:
Neural networks train much better when all inputs are on the same scale. A StandardScaler transforms every feature so that across the training set, each feature has a mean of 0 and a standard deviation of 1.
Analogy: Imagine judging athletes where one runs 100m in 10 seconds and another lifts 200kg. The numbers mean completely different things. Scaling puts everything on a comparable “effort” scale.
Important rule: The scaler is fitted only on training data and then applied to validation and test data. This prevents information from the test set from leaking into the model during training.
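A minimal sketch of the fit-on-train-only rule. The sequence arrays are 3D `(N, 24, F)` while `StandardScaler` works on 2D data, so this sketch flattens and reshapes (the function name is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_and_scale(X_train: np.ndarray, X_other: np.ndarray):
    """Fit the scaler on training sequences ONLY, then apply it to both."""
    n_feat = X_train.shape[-1]
    scaler = StandardScaler().fit(X_train.reshape(-1, n_feat))  # train only!
    scale = lambda X: scaler.transform(X.reshape(-1, n_feat)).reshape(X.shape)
    return scaler, scale(X_train), scale(X_other)
```

After scaling, each feature has mean ≈ 0 and standard deviation ≈ 1 across the training set; validation/test data are transformed with those same training statistics.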
Key output: scaler_details.json — records the mean and standard deviation for every feature so the scaler can be reconstructed later for inference.
What it does:
Before training an expensive deep learning model, we establish what simpler models can achieve. If a simple model is nearly as good, we do not need the complex one.
Two baseline models are trained:
Random Forest Classifier — for predicting current_stage:
XGBoost Regressor — for predicting hours_to_next_stage:
Key outputs:
- `baseline_metrics.json` — accuracy, MAE, RMSE of the baseline models
- `rf_feature_importances.csv` — which features the Random Forest found most predictive
- `baseline_rf_classification_report.txt` — precision, recall, F1 per stage class

The LSTM’s performance is later compared to these baselines to quantify how much the temporal deep learning model adds.
What it does:
Builds the main deep learning model. The architecture has three parts:
1. Shared LSTM Backbone:
```
Input (24 timesteps × N features)
        ↓
LSTM Layer 1 (128 units, returns sequences)
        ↓
LSTM Layer 2 (64 units, outputs one vector)
        ↓
Batch Normalisation
        ↓
Dense Layer (64 units, ReLU activation)
        ↓
Dropout (30% of connections randomly disabled during training)
```
Think of the LSTM backbone as a “reader” that digests the last 24 hours of sensor history and compresses it into a single rich summary vector.
2. Five Specialised Output Heads:
After the shared backbone, five separate networks branch off, each specialised for one prediction task:
| Head | Output | Loss Function | Notes |
|---|---|---|---|
| `current_stage` | 6 probabilities (one per stage) | Categorical cross-entropy | What stage is the plant in now? |
| `next_stage` | 6 probabilities (one per stage) | Categorical cross-entropy | What stage comes after the current one? |
| `hours_to_next` | A single number (hours) | Huber loss (robust to outliers) | How many hours until transition? |
| `trans_24h` | A probability between 0 and 1 | Binary cross-entropy | Will a transition occur in the next 24h? |
| `trans_48h` | A probability between 0 and 1 | Binary cross-entropy | Will a transition occur in the next 48h? |
3. Class Weighting:
Some growth stages (like ripe) have fewer hours than others (like early_vegetative). Without correction, the model would learn to mostly predict the majority stages. Class weights are computed and applied so that the model pays proportionally more attention to rare stages during training.
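The weighting uses the standard "balanced" formula (the same one scikit-learn's `compute_class_weight(class_weight='balanced', ...)` implements); the helper below is an illustrative sketch, not the notebook's exact code:

```python
import numpy as np

def compute_class_weights(stage_labels: np.ndarray, n_classes: int = 6) -> dict:
    """weight_c = n_samples / (n_classes * count_c):
    rare stages get proportionally larger weights."""
    counts = np.bincount(stage_labels, minlength=n_classes).astype(float)
    counts[counts == 0] = np.nan  # avoid divide-by-zero for absent classes
    weights = len(stage_labels) / (n_classes * counts)
    return {c: float(w) for c, w in enumerate(weights) if not np.isnan(w)}
```

A stage with 10× fewer samples ends up with a 10× larger weight, so its training errors count just as much in the loss.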
4. Loss Function Configuration:
The model is trained with a multi-task loss function that includes all five outputs:
```python
loss = {
    'current_stage': 'categorical_crossentropy',
    'next_stage': 'categorical_crossentropy',
    'hours_to_next': 'huber',               # robust to outliers (see table above)
    'trans_24h': 'binary_crossentropy',
    'trans_48h': 'binary_crossentropy',
}
```
Each loss component is weighted equally during backpropagation. This ensures all five tasks learn simultaneously and reinforce each other.
Key outputs:
- `model_config.json` — all hyperparameters (architecture choices, layer sizes, learning rate)
- `class_weights.json` — the weight given to each stage class

What it does:
Runs the actual training loop — the model sees the training data repeatedly (one pass through all data = one “epoch”) and adjusts its internal parameters to improve its predictions.
Training safeguards:
| Callback | What it does |
|---|---|
| EarlyStopping | Stops training automatically if the model is no longer improving on the validation set (prevents wasted compute and overfitting) |
| ReduceLROnPlateau | Reduces the learning rate if progress stalls — like taking smaller steps when you’re close to the answer |
| ModelCheckpoint | Saves the single best model weights seen during training (even if later epochs get worse) |
Data pipeline: A tf.data.Dataset pipeline is used to feed data to the model in efficiently batched, shuffled form — the training data is shuffled every epoch so the model cannot memorise order.
Key outputs:
- `training_history.csv` and `training_history.json` — full record of loss and accuracy for every epoch
- `training_history_curves.png` — a plot showing how all metrics evolved during training

What it does:
Runs the trained model on the test set (data it has never seen) and computes all performance metrics.
For current_stage and next_stage classification:
For hours_to_next_stage regression:
For transition_within_24h and transition_within_48h:
Key outputs:
- `confusion_matrix_current_stage.png` and `confusion_matrix_next_stage.png`
- `confusion_matrix_trans_24h.png` and `confusion_matrix_trans_48h.png`
- `lstm_current_stage_classification_report.txt`
- `lstm_next_stage_classification_report.txt`
- `hours_actual_vs_predicted.png` — scatter plot of predicted vs actual hours
- `lstm_metrics.json` — all computed metrics in one file

What it does:
Picks 8 random samples from the test set and runs the model on each, printing side-by-side comparisons of predicted vs actual values. This gives a human-readable sanity check — you can read individual examples to build intuition for how the model behaves.
Critical logic: Conditional Transition Flags
The inference function implements a crucial safeguard for the transition probability outputs:
```python
# If hours to next stage > 24, set trans_24h to 0; otherwise use predicted probability
trans_24h_prob = 0.0 if hours_pred > 24 else raw_trans_24h

# If hours to next stage > 48, set trans_48h to 0; otherwise use predicted probability
trans_48h_prob = 0.0 if hours_pred > 48 else raw_trans_48h
```
This ensures the transition flags are conditional probabilities — they only produce meaningful values when the time window is biologically relevant. For example:
- If `hours_to_next = 220h`, both `trans_24h` and `trans_48h` will be set to 0.0 (no chance of transition in those windows)
- If `hours_to_next = 18h`, `trans_24h` will use the model’s prediction (likely high), and `trans_48h` will also use its prediction (likely very high)
- If `hours_to_next = 50h`, `trans_24h` will be 0.0, and `trans_48h` will use its prediction

Example output row (correct logic in action):
```
Sample #3
Actual current stage     : flowering
Predicted current stage  : flowering ✓
Actual next stage        : unripe
Predicted next stage     : unripe ✓
Actual hours to next     : 41.0 h
Predicted hours to next  : 38.7 h (Δ 2.3 h)
Trans prob 24h (raw)     : 0.03 → Output: 0.0  (hours > 24)
Trans prob 48h (raw)     : 0.87 → Output: 0.87 (hours < 48) ✓
```
Notice how the raw model outputs are post-processed based on hours_to_next to produce biologically sensible predictions.
Key output: prediction_samples.csv — a CSV file with all 8 example predictions for later review.
What it does:
Generates 5 publication-quality charts for understanding the dataset and model behaviour:
| Plot | What it shows |
|---|---|
| `stage_distribution.png` | How many hours of data exist per stage (are classes balanced?) |
| `cycle_duration_distribution.png` | How long each complete growth cycle lasted (in hours) |
| `stage_progression_timelines.png` | A timeline view of 6 random cycles showing stage transitions |
| `transition_probability_histograms.png` | Distribution of how often 24h and 48h transitions occur |
| `rf_feature_importances.png` | Which sensor features matter most (from the Random Forest) |
All plots are saved as PNG files in the artifacts folder.
What it does:
Persists all key outputs of this run to disk in an organised way.
Specifically:
- `growth_stage_progression_<RUN_ID>.keras` — can be loaded later for inference without retraining
- `feature_scaler.pkl` — required to preprocess new sensor data in exactly the same way as training data
- `inference_config.json` — a single file containing all information needed to run the model on new data: feature list, stage mappings, scaling parameters, sequence length
- `per_stage_distribution.json`
- `test_predictions_full.csv` — every test sequence with all 5 predicted and actual values

What it does:
Defines five reusable Python functions that make it easy to use the trained model on new, unseen sensor data — without having to understand the internal preprocessing pipeline:
| Function | What it does |
|---|---|
| `decode_stage(idx)` | Converts a stage number (0–5) back to its name (`"flowering"`) |
| `preprocess_new_data(df_raw)` | Takes a raw sensor DataFrame and runs the full pipeline (standardise columns → engineer features → fill gaps) |
| `build_recent_sequence(df_feats)` | Takes the last 24 rows of preprocessed data and returns a scaled array ready for the model |
| `predict_with_inference(model, sample_data_dict, scaler)` | Runs the model on a sample and applies the conditional transition flag logic (see below) |
| `predict_progression(sequence_3d)` | Runs the model and returns a human-readable dict with all 5 predictions |
Conditional Transition Flag Processing:
The predict_with_inference function applies critical post-processing to ensure stable, biologically valid predictions:
```python
def predict_with_inference(model, sample_data_dict, scaler):
    # Get raw model outputs
    sample_scaled = scaler.transform(sample_data_dict['X'])
    preds = model.predict(sample_scaled, verbose=0)
    hours_pred = preds[2][0]
    raw_trans_24h = preds[3][0]
    raw_trans_48h = preds[4][0]

    # Apply conditional logic: transition flags should be ~0 if hours > threshold
    trans_24h_prob = 0.0 if hours_pred > 24 else raw_trans_24h
    trans_48h_prob = 0.0 if hours_pred > 48 else raw_trans_48h

    return trans_24h_prob, trans_48h_prob  # Now logically valid
```
This ensures that:
- The transition flags carry non-zero probabilities only when `hours_to_next` is actually within their respective windows

A demonstration is run at the end using the first test sequence, printing the results and saving them to `inference_demo.json`.
What it does:
Computes and prints a complete summary of the run — the final performance numbers, how much the LSTM improved over the baselines, and a listing of every artifact produced. Saved as both a machine-readable JSON file and a human-readable text file.
Key outputs:
- `run_summary.json` — all metrics and metadata in structured format
- `run_summary.txt` — a formatted text table of the same information

LSTM stands for Long Short-Term Memory. It is a type of neural network specifically designed to work with sequences — data where the order matters.
Analogy: Think of a person reading a story. When they encounter the word “but”, they know it reverses the meaning of what was just said. They remember the previous sentences to understand each new word. An LSTM works the same way:
Standard neural networks cannot do this — they treat each input independently, without any notion of “what came before”. LSTMs were designed specifically to capture these temporal dependencies.
What “long short-term memory” means:
Traditional machine learning trains a separate model for each question. This notebook uses multi-task learning — one model that answers all five questions simultaneously.
How it works: The LSTM backbone is shared — it learns a general understanding of tomato growth dynamics from all five tasks at once. Then five small specialised “heads” branch off the end, each fine-tuned for one specific prediction.
Why is this better than 5 separate models?
More data efficiency — The backbone learns from all five signals at once. The hours_to_next_stage regression task provides gradient signal that also helps the stage classification tasks, and vice versa.
Shared knowledge — Understanding that a plant is likely to transition soon (from the 24h/48h heads) naturally reinforces the hours-to-next regression head, and vice versa. The tasks are related, and learning them together improves all of them.
Single inference — At prediction time, you run the model once and get all five answers — much faster and simpler than running five separate models.
A baseline model is the simplest reasonable approach we can compare against. If the LSTM does not significantly outperform the baseline, it was not worth the added complexity.
This notebook uses two baselines:
Random Forest for stage classification:
XGBoost for hours-to-transition regression:
Why these baselines matter: If the LSTM’s MAE for hours-to-transition is 5.2h compared to the XGBoost baseline’s 8.7h, we know the temporal learning is providing a real improvement of 3.5 hours per prediction — a meaningful operational advantage.
Accuracy — The simplest metric. If the model correctly classifies 92 out of 100 sequences, accuracy = 92%.
Confusion Matrix — A table showing what the model predicted vs what was actually true. Each row is an actual class, each column is a predicted class. Diagonal entries are correct predictions; off-diagonal entries are mistakes. This shows which stages get confused with which — far more informative than a single number.
F1 Score — Since some stages have fewer examples than others (class imbalance), accuracy alone is misleading. F1 balances precision and recall and is reported per-class and as a weighted average.
MAE (Mean Absolute Error) — “On average, predictions are X hours off.” MAE = 3.5 means predictions are 3.5 hours off on average. Easy to interpret.
RMSE (Root Mean Squared Error) — Similar to MAE but squares the errors before averaging, meaning large errors are penalised disproportionately. RMSE > MAE indicates there are occasional very large prediction errors.
MAPE (Mean Absolute Percentage Error) — Errors expressed as a percentage. MAPE = 12% means predictions are 12% off relative to the actual value.
F1 Score — Whether the plant is within 24 hours of transitioning is a yes/no question. F1 score measures how well the model identifies the positive case (transition will happen).
ROC-AUC — A threshold-independent measure of how well the model’s probability score separates “will transition” from “won’t transition”. Perfect = 1.0, random = 0.5.
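The classification and regression metrics above can be computed with scikit-learn. The sketch below uses tiny made-up arrays purely to show which function produces which number (the values are illustrative, not from this model):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error)

# Toy stage labels: 0/1/2 stand in for stage classes
y_true_cls = np.array([0, 1, 1, 2, 2, 2])
y_pred_cls = np.array([0, 1, 2, 2, 2, 2])

acc = accuracy_score(y_true_cls, y_pred_cls)               # fraction correct
cm = confusion_matrix(y_true_cls, y_pred_cls)              # rows = actual class
f1 = f1_score(y_true_cls, y_pred_cls, average="weighted")  # imbalance-aware

# Toy hours-to-transition values
y_true_hrs = np.array([10.0, 40.0, 100.0])
y_pred_hrs = np.array([12.0, 35.0, 90.0])
mae = mean_absolute_error(y_true_hrs, y_pred_hrs)                  # avg |error|
rmse = float(np.sqrt(mean_squared_error(y_true_hrs, y_pred_hrs)))  # penalises big misses
```

Note how `rmse` exceeds `mae` here: the single 10-hour miss dominates once errors are squared.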
After a complete run, all outputs live under:
E:\AgriTwin-GH\src\agritwin_gh\models\artifacts\growth_stage_progression_<RUN_ID>\
Metrics (`metrics/`)

| File | What it contains |
|---|---|
| `dataset_summary.json` | Row count, column list, missing values |
| `cycle_summary.csv` | Per-cycle quality audit results |
| `feature_list.json` | All feature names, rolling windows, lag steps |
| `split_summary.json` | Train/val/test cycle and sequence counts |
| `scaler_details.json` | Mean and std for every feature |
| `baseline_metrics.json` | RF accuracy, XGBoost MAE and RMSE |
| `rf_feature_importances.csv` | Feature importance from the Random Forest |
| `class_weights.json` | Per-stage class weights and training distribution |
| `model_config.json` | All model hyperparameters |
| `training_history.csv` | Loss and accuracy per epoch |
| `training_history.json` | Same, in JSON format |
| `lstm_metrics.json` | All final test-set evaluation metrics |
| `per_stage_distribution.json` | Stage class counts across train/val/test |
Reports (`reports/`)

| File | What it contains |
|---|---|
| `baseline_rf_classification_report.txt` | Per-class precision/recall/F1 for the Random Forest |
| `lstm_current_stage_classification_report.txt` | Per-class precision/recall/F1 for the LSTM current-stage head |
| `lstm_next_stage_classification_report.txt` | Per-class precision/recall/F1 for the LSTM next-stage head |
| `prediction_samples.csv` | The 8 inference examples from Section 14, with predictions |
| `test_predictions_full.csv` | All test sequences with all 5 predicted + actual values |
| `inference_demo.json` | Output of the inference demo call |
Plots (`plots/`)

| File | What it shows |
|---|---|
| `training_history_curves.png` | Loss and accuracy curves during training |
| `confusion_matrix_current_stage.png` | Stage classification confusion matrix |
| `confusion_matrix_next_stage.png` | Next-stage prediction confusion matrix |
| `confusion_matrix_trans_24h.png` | 24h transition binary classification CM |
| `confusion_matrix_trans_48h.png` | 48h transition binary classification CM |
| `hours_actual_vs_predicted.png` | Scatter: actual vs predicted hours to transition |
| `stage_distribution.png` | Dataset class balance visualisation |
| `cycle_duration_distribution.png` | Distribution of cycle lengths |
| `stage_progression_timelines.png` | Stage-over-time plots for sample cycles |
| `transition_probability_histograms.png` | 24h/48h transition distribution |
| `rf_feature_importances.png` | Random Forest feature importance bar chart |
| File | What it contains |
|---|---|
| `best_model_<RUN_ID>.keras` | Best checkpoint saved during training |
| `feature_scaler.pkl` | Serialised StandardScaler for inference |
| `inference_config.json` | All config needed to run inference on new data |
| `run_summary.json` | Complete run metadata and final metrics |
| `run_summary.txt` | Human-readable summary table |
| `run_<RUN_ID>.log` | Full execution log with timestamps |
| File | What it contains |
|---|---|
| `growth_stage_progression_<RUN_ID>.keras` | Final saved trained model |
Ensure the project virtual environment is active and all requirements are installed:
```shell
# From E:\AgriTwin-GH\
.venv\Scripts\activate          # Windows
# or
source .venv/bin/activate       # macOS/Linux

pip install -r requirements.txt
```
Open `notebooks/tomato_growth_stage_progression.ipynb`.
Select the project virtual environment kernel (.venv).
Run all cells top to bottom using Run All (Ctrl+F9 in VS Code).
The notebook detects or creates a `RUN_ID` automatically and writes all artifacts under `src/agritwin_gh/models/artifacts/growth_stage_progression_<RUN_ID>/`.
TensorFlow automatically uses a CUDA-compatible GPU if one is available and the CUDA drivers are installed. The notebook confirms GPU availability in Section 1:
```
GPU: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
If the list is empty, training runs on CPU — still correct, but slower.
```
CSV Dataset
     │
     ▼
[Section 2]  Load CSV
     │
     ▼
[Section 3]  Standardise Columns & Stage Labels
     │
     ▼
[Section 4]  Cycle Integrity Checks → Remove Bad Cycles
     │
     ▼
[Section 5]  Compute 5 Target Variables per Row
     │        (next_stage, hours_to_next, trans_24h, trans_48h, stage_frac)
     ▼
[Section 6]  Feature Engineering → 60-80 Features per Hour
     │        (rolling stats, lag features, GDH, VPD, light cumsum)
     ▼
[Section 7]  Slide 24-Hour Windows → 3D Tensor (N, 24, F)
     │
     ▼
[Section 8]  Split by Cycle → Train / Val / Test
     │
     ▼
[Section 9]  StandardScaler (fit on train only)
     │
     ├──▶ [Section 10] Baselines (RF + XGBoost) → baseline_metrics.json
     │
     ▼
[Section 11] Build LSTM Multi-Task Model (5 heads)
     │
     ▼
[Section 12] Train with EarlyStopping + ReduceLR + Checkpoint
     │        → training_history_curves.png
     ▼
[Section 13] Evaluate on Test Set
     │        → confusion matrices, classification reports, lstm_metrics.json
     ▼
[Section 14] Print 8 Inference Examples → prediction_samples.csv
     │
     ▼
[Section 15] Generate 5 Dataset & Model Visualisation Plots
     │
     ▼
[Section 16] Save Model (.keras) + Scaler (.pkl) + Config (JSON)
     │
     ▼
[Section 17] Define Inference Utilities + Run Demo
     │
     ▼
[Section 18] Compute & Save Final Run Summary
              → run_summary.json + run_summary.txt
```
Q: Can I use my own real greenhouse sensor data instead of the synthetic dataset?
A: Yes. As long as your CSV has a timestamp column, a growth cycle identifier, a stage label column, and at least one environmental sensor column (temperature, humidity, light, or CO₂), the notebook will handle it. Column names do not need to match exactly — the alias system in Section 3 maps common variants automatically.
Q: What if my data does not have all six growth stages?
A: The notebook flags cycles with missing stages but keeps them for training. You can also adjust the REQUIRE_START_STAGE and REQUIRE_END_STAGE constants in Section 4 to change which stages are required.
Q: How long does training take?
A: On a modern GPU, approximately 5–15 minutes with early stopping. On CPU, 30–90 minutes depending on dataset size.
Q: Can I re-use a model trained in a previous run without retraining?
A: Yes. Use the inference_config.json file from the artifacts folder to load the scaler and model paths, then use the inference utilities from Section 17. All the information needed to reconstruct the inference pipeline is saved.
Q: What does the solarradiation column get renamed to?
A: It gets renamed to light by the column alias map in Section 3. All such renamings are printed during the “Detected columns” step.
Q: Why does the last row of each stage have NaN for some target columns?
A: Because hours_to_next_stage, transition_within_24h, and transition_within_48h require looking ahead in time. For rows that are at the very last recorded stage (ripe), there is no future transition to look forward to, so these values are NaN. The model handles this with sample weights — NaN rows are given zero weight during training and are excluded from evaluation metrics.
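The sample-weight trick can be sketched in NumPy (the variable names are illustrative):

```python
import numpy as np

# Regression targets; the last rows of a cycle have no future transition.
hours_to_next = np.array([30.0, 12.0, np.nan, np.nan])

# Zero-weight the NaN rows so they contribute nothing to the loss, and
# replace NaN with a harmless filler value -- Keras cannot take NaN
# targets even at zero weight, because NaN * 0 is still NaN.
weights = np.where(np.isnan(hours_to_next), 0.0, 1.0)
targets = np.nan_to_num(hours_to_next, nan=0.0)
```

The `(targets, weights)` pair is then passed to training via `sample_weight`, so the filler values never influence the gradients.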
Q: What is the difference between the .keras checkpoint and the final model?
A: The checkpoint (best_model_<RUN_ID>.keras) is saved by the ModelCheckpoint callback and contains the weights from the single best validation epoch — even if later epochs were worse. The final model (growth_stage_progression_<RUN_ID>.keras) is saved at the end of Section 16 and contains whatever state the model was in when training finished. In most cases these are identical, but if you use the saved model for deployment, prefer the checkpoint.
test_growth_stage_progression.py

File location: scripts/test_growth_stage_progression.py
Purpose:
Standalone test script to validate the trained Growth Stage Progression model (multi-task LSTM) across 10 diverse scenarios covering all six growth stages, transition boundaries, stress conditions, and day/night comparisons.
Why it exists:
The model makes five simultaneous predictions (current stage, next stage, hours-to-transition, 24h probability, 48h probability). This script exercises the model without requiring the training notebook or integration with the full digital twin — enabling quick validation, debugging, and confidence checks.
```bash
# Run all 10 scenarios
python scripts/test_growth_stage_progression.py

# Run a specific scenario (1–10)
python scripts/test_growth_stage_progression.py --scenario 5
```
| # | Scenario | What it validates |
|---|---|---|
| 1 | Seedling Day 1 – freshly transplanted | Model correctly identifies stage 0 (seedling) at cycle onset |
| 2 | Seedling near transition – 90% progress, ~20h to next | High t24_prob expected; model should detect imminent transition |
| 3 | Early Vegetative – stable mid-stage (50% progress) | Low transition probabilities; model should predict stable state |
| 4 | Flowering Initiation – first flower buds appearing | Model correctly identifies stage 2 (flowering initiation) |
| 5 | Full Flowering – optimal conditions, peak anthesis | Model identifies stage 3 (flowering) with moderate hrs_to_next |
| 6 | Unripe → Ripe transition – 85% progress, 15h remaining | High t24_prob and t48_prob; imminent stage transition |
| 7 | Ripe final phase – 95% cycle progress | Model identifies stage 5 (ripe); t24_prob and t48_prob should be ≈0 |
| 8 | Cold stress – Early Veg at 10°C + low light | Model should predict slower development (higher hrs_to_next vs warm scenario) |
| 9 | Heat stress – Flowering at 38°C (pollen viability risk) | Model should detect stress condition; may affect transition timing |
| 10 | Day vs Night – same Flowering stage, toggle day/night flag | Comparison: daytime vs nighttime should show model responsiveness to diurnal cycle |
For each scenario, the script prints:
```text
======================================================================
Scenario 5: Full Flowering — mid-stage optimal conditions
Current stage     : Stage 4 – Flowering
Next stage        : Stage 5 – Unripe
Hrs to transition : 248.3 h (~10.3 days)
Transition in 24h : 0.015
Transition in 48h : 0.042
```
Interpretation:
- Current stage – the model’s classification of the current 24-timestep sequence
- Next stage – the predicted next stage (deterministic: it always follows the fixed stage order)
- Hrs to transition – regression output: hours until the transition occurs (may be negative if the plant is in the final stage)
- Transition in 24h – probability (0–1) that a transition occurs within 24 hours
- Transition in 48h – probability (0–1) that a transition occurs within 48 hours

Validation tips:
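The five fields map one-to-one onto the model's five heads. A minimal sketch of decoding a single prediction — the head ordering and all numeric values below are assumptions, not the script's real code:

```python
import numpy as np

STAGES = ["Seedling", "Early Vegetative", "Flowering Initiation",
          "Flowering", "Unripe", "Ripe"]

def interpret(outputs):
    """Turn raw head outputs for ONE sequence into a readable dict.

    `outputs` mimics the list a five-head model's predict() returns:
    [stage_probs, next_stage_probs, hours, p24, p48]. This ordering is
    an assumption -- match it to your model's actual output names.
    """
    stage_probs, next_probs, hours, p24, p48 = outputs
    return {
        "current_stage": STAGES[int(np.argmax(stage_probs))],
        "next_stage": STAGES[int(np.argmax(next_probs))],
        "hours_to_transition": float(hours[0]),
        "p_transition_24h": float(p24[0]),
        "p_transition_48h": float(p48[0]),
    }

# Fabricated head outputs shaped like a "full flowering" prediction:
demo = interpret([
    np.array([0.01, 0.02, 0.05, 0.85, 0.05, 0.02]),  # current-stage softmax
    np.array([0.02, 0.02, 0.03, 0.05, 0.86, 0.02]),  # next-stage softmax
    np.array([248.3]),                               # hours-to-transition
    np.array([0.015]),                               # P(transition < 24h)
    np.array([0.042]),                               # P(transition < 48h)
])
```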
- Scenarios 2 and 6 should show high t24_prob (≥ 0.7), since they model near-transition conditions.

Each scenario constructs a static 24-timestep sequence. Key features set per scenario:
- stage_index (feat 6) – numeric stage ID (0–5)
- hours_in_current_stage (feat 7) – how long the plant has been in the current stage
- stage_progress_pct (feat 11) – 0–100, representing where in the stage cycle the plant is
- estimated_hrs_to_next_stage (feat 14) – the model’s target for regression
- indoor_temp, humidity, solarradiation, vpd (feats 16, 17, 20, 22) – environmental drivers
- day_night_flag (feat 21) – 1.0 for day, 0.0 for night

Troubleshooting:
- All predictions are “ripe” (Stage 6): check that the correct scaler is loaded (n_features_in_: 358).
- Import errors (TensorFlow, joblib, etc.): run pip install -r requirements.txt.
- “Model input shape mismatch” errors: check the _fill_sequence() output shape.
- Unexpected transition probabilities (e.g., high t24_prob at seedling day 1): verify that the scenario’s stage and progress features are set consistently.
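A static scenario sequence of this shape can be sketched in NumPy. This is an illustration only — the feature count (23) and fill values are assumptions, not the script's real constructor:

```python
import numpy as np

def make_scenario(stage_index, hours_in_stage, progress_pct, temp,
                  n_timesteps=24, n_features=23):
    """Build one static (1, n_timesteps, n_features) input sequence.

    Every timestep carries the same feature vector, mirroring the test
    script's scenarios. The feature positions (6, 7, 11, 16) follow the
    list above; n_features=23 is an assumption for illustration only.
    """
    row = np.zeros(n_features, dtype=np.float32)
    row[6] = stage_index        # stage_index
    row[7] = hours_in_stage     # hours_in_current_stage
    row[11] = progress_pct      # stage_progress_pct
    row[16] = temp              # indoor_temp
    # Repeat the row 24 times, then add a leading batch dimension.
    return np.tile(row, (n_timesteps, 1))[None, ...]

# e.g. a mid-stage flowering plant at 24 °C:
seq = make_scenario(stage_index=3, hours_in_stage=120.0,
                    progress_pct=50.0, temp=24.0)
```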
This script is a standalone diagnostic tool — it does not interface with the REST API, database, or digital twin renderer. It is used for quick validation, debugging, and confidence checks on the trained model.
For live greenhouse deployment, sensor data flows through src/agritwin_gh/models/growth_stage_inference.py → REST API → digital twin.
| Term | Plain-English Definition |
|---|---|
| Accuracy | Percentage of correct predictions out of all predictions |
| AUC / ROC-AUC | A measure of how well a binary classifier separates positive and negative cases; 1.0 is perfect, 0.5 is random |
| Batch | A small group of samples processed together during training (more efficient than one at a time) |
| Baseline model | A simple comparison model — if your complex model cannot beat it, the complex model is probably not worth using |
| Callback | A function that runs automatically at certain points during training (e.g., after each epoch) |
| Class imbalance | When some categories have many more examples than others; this can cause a model to ignore rare classes |
| Class weight | A multiplier applied during training to compensate for class imbalance — rare classes get a higher weight |
| CO₂ (ppm) | Carbon dioxide concentration in parts per million; plants absorb CO₂ for photosynthesis |
| Confusion matrix | A grid showing actual vs predicted class labels; diagonal = correct, off-diagonal = errors |
| Cross-entropy | A loss function used for classification tasks; measures how wrong probability predictions are |
| Cycle | One complete growth period of a tomato plant, from germination to harvest |
| Deep learning | Machine learning using neural networks with many layers |
| Dropout | A training technique where random connections are disabled during each training step, preventing over-reliance on any single feature |
| Early stopping | A training safeguard that halts training when validation performance stops improving |
| Epoch | One complete pass through the entire training dataset |
| F1 Score | Harmonic mean of precision and recall; useful when classes are imbalanced |
| Feature | A single measurable property used as input to the model (e.g., temperature, humidity) |
| Feature engineering | Computing new derived features from raw data that help the model learn better |
| Feature importance | A measure of how much each input feature contributes to a model’s predictions |
| GDH (Growing Degree Hours) | Cumulative heat above a base temperature (10°C); a biological “thermal clock” for plant development |
| Gradient | The signal used to update model weights during training; shows which direction to adjust parameters |
| Huber loss | A regression loss function that behaves like MAE for large errors and MSE for small ones — robust to outliers |
| Lag feature | The value of a variable from a previous timestep (e.g., temperature 3 hours ago) |
| LSTM | Long Short-Term Memory — a type of recurrent neural network designed to capture long-range patterns in sequential data |
| MAE | Mean Absolute Error — the average of \|predicted − actual\| |
| MAPE | Mean Absolute Percentage Error — errors as a percentage of actual values |
| Multi-task learning | Training one model to predict multiple outputs simultaneously, sharing knowledge between tasks |
| NaN | “Not a Number” — a placeholder for missing or undefined values |
| Overfitting | When a model memorises the training data and performs poorly on new, unseen data |
| Precision | Of all times the model predicted “positive”, what fraction were actually positive? |
| Recall | Of all actual positives, what fraction did the model correctly identify? |
| Regression | Predicting a continuous numerical value (e.g., hours until a transition) |
| RMSE | Root Mean Squared Error — like MAE but penalises large errors more heavily |
| Rolling statistics | Statistics (mean, std) computed over a sliding time window |
| Run ID | A unique timestamp string appended to all output file names so multiple runs do not overwrite each other |
| Scaler | A preprocessing tool that normalises feature values to a common scale |
| Sequence | An ordered series of sensor readings over time (here: 24 consecutive hourly readings) |
| Sigmoid | An S-shaped activation function that squashes output to the range [0, 1]; used for probability predictions |
| Softmax | An activation function that converts raw scores into probabilities that sum to 1; used for multi-class classification |
| TFT | Temporal Fusion Transformer — an alternative temporal sequence model (not used here, but referenced in other notebooks) |
| VPD (Vapour Pressure Deficit) | A measure of how “thirsty” the air is; high VPD drives more plant transpiration and can cause stress |
| Validation set | Data held out during training (not used for weight updates) to monitor for overfitting |
| Window / Sliding window | A fixed-length subsequence extracted by sliding a frame across a time series |
| XGBoost | eXtreme Gradient Boosting — a powerful, widely-used tree-based machine learning algorithm |