Complete database schema and usage guide for AgriTwin-GH project
AgriTwin-GH uses a single PostgreSQL 15 database (agritwin_db) with TimescaleDB extension to store two distinct types of agricultural data:
Database: agritwin_db
Engine: PostgreSQL 15 with TimescaleDB extension
Container: agritwin-timescaledb
Port: 5432
┌─────────────────────────────────────────────────────────┐
│ PostgreSQL 15 (agritwin_db) │
│ + TimescaleDB Extension │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────┐ ┌──────────────────────────┐ │
│ │ TIMESERIES DATA │ │ IMAGE METADATA │ │
│ │ │ │ │ │
│ │ • weather_data │ │ • image_metadata │ │
│ │ • greenhouse_data │ │ • image_annotations │ │
│ │ • disease_progression │ │ • image_access_log │ │
│ │ • growth_progression_hourly │ │ │ │
│ │ • growth_progression_stage_* │ │ │ │
│ │ • growth_progression_cycle_* │ │ │ │
│ │ • growth_progression_metadata │ │ │ │
│ │ │ │ │ │
│ │ (Hypertables for hourly data) │ │ │ │
│ └────────────────────────────────┘ └──────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
↓
┌───────────────┐
│ MinIO Storage│
│ (Images) │
└───────────────┘
Design Rationale:
Stores outdoor weather conditions from external weather API.
Schema:
CREATE TABLE weather_data (
id INTEGER PRIMARY KEY,
datetime TIMESTAMP NOT NULL,
datetime_epoch INTEGER,
temp FLOAT, -- Temperature (°C)
humidity FLOAT, -- Humidity (%)
windspeed FLOAT, -- Wind speed (km/h)
solarradiation FLOAT, -- Solar radiation (W/m²)
sunrise VARCHAR(20),
sunrise_epoch FLOAT,
sunset VARCHAR(20),
sunset_epoch FLOAT,
conditions VARCHAR(100),
description VARCHAR(500)
);
-- Converted to TimescaleDB hypertable
SELECT create_hypertable('weather_data', 'datetime',
chunk_time_interval => INTERVAL '1 month');
Key Features:
datetime for fast time-range queriesSample Data:
SELECT * FROM weather_data LIMIT 1;
| id | datetime | temp | humidity | windspeed | solarradiation | conditions | |—-|———-|——|———-|———–|—————-|————| | 1 | 2024-01-01 00:00:00 | 22.5 | 65.3 | 12.8 | 0.0 | Clear |
Common Queries:
-- Daily temperature extremes
SELECT
DATE(datetime) as date,
MIN(temp) as min_temp,
MAX(temp) as max_temp,
AVG(temp) as avg_temp
FROM weather_data
WHERE datetime >= '2025-01-01'
GROUP BY DATE(datetime)
ORDER BY date;
-- High solar radiation days
SELECT DATE(datetime), MAX(solarradiation) as max_radiation
FROM weather_data
WHERE solarradiation > 500
GROUP BY DATE(datetime)
ORDER BY max_radiation DESC;
Stores indoor greenhouse environmental conditions (synthetic or sensor data).
Schema:
CREATE TABLE greenhouse_data (
id INTEGER PRIMARY KEY,
datetime TIMESTAMP NOT NULL,
indoor_temp FLOAT, -- Indoor temperature (°C)
indoor_humidity FLOAT, -- Indoor humidity (%)
indoor_air_velocity FLOAT, -- Air velocity (m/s)
indoor_co2 FLOAT, -- CO2 concentration (ppm)
solarradiation FLOAT, -- Solar radiation (W/m²)
day_night_flag INTEGER, -- Day (1) or Night (0)
vpd FLOAT, -- Vapor Pressure Deficit (kPa)
dew_point FLOAT, -- Dew point (°C)
leaf_wetness_proxy FLOAT -- Leaf wetness indicator
);
-- Converted to TimescaleDB hypertable
SELECT create_hypertable('greenhouse_data', 'datetime',
chunk_time_interval => INTERVAL '1 week');
Key Features:
day_night_flag for day/night cycle analysisImportant Metrics:
Sample Data:
SELECT * FROM greenhouse_data ORDER BY datetime DESC LIMIT 1;
| datetime | indoor_temp | indoor_humidity | indoor_co2 | vpd | day_night_flag | |———-|————-|—————–|————|—–|—————-| | 2025-01-15 14:30:00 | 24.5 | 68.2 | 950.0 | 0.95 | 1 |
Common Queries:
-- Compare indoor vs outdoor conditions
SELECT
g.datetime,
g.indoor_temp,
w.temp as outdoor_temp,
(g.indoor_temp - w.temp) as temp_difference
FROM greenhouse_data g
JOIN weather_data w ON DATE_TRUNC('hour', g.datetime) = DATE_TRUNC('hour', w.datetime)
WHERE g.datetime >= NOW() - INTERVAL '7 days'
ORDER BY g.datetime;
-- High disease risk periods (high humidity + leaf wetness)
SELECT
DATE_TRUNC('hour', datetime) as hour,
AVG(indoor_humidity) as avg_humidity,
AVG(leaf_wetness_proxy) as avg_wetness
FROM greenhouse_data
WHERE indoor_humidity > 80 AND leaf_wetness_proxy > 0.7
GROUP BY hour
ORDER BY hour DESC;
-- CO2 supplementation effectiveness
SELECT
day_night_flag,
AVG(indoor_co2) as avg_co2,
MIN(indoor_co2) as min_co2,
MAX(indoor_co2) as max_co2
FROM greenhouse_data
WHERE datetime >= CURRENT_DATE
GROUP BY day_night_flag;
Hourly synthetic disease risk and outbreak records for 5 disease types across 4 crop cycles (Jul 2024 – Oct 2025). Each row represents one disease × one hour.
Schema:
CREATE TABLE disease_progression (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL, -- Hourly bucket
-- Cycle / stage identity
cycle_id INTEGER NOT NULL, -- 1–4
cycle_label VARCHAR(50), -- kharif_2024, rabi_2024, …
season_label VARCHAR(50),
stage_name VARCHAR(50), -- seedling … ripe
stage_index INTEGER, -- 0–5
-- Time features
days_from_cycle_start FLOAT,
day_of_year INTEGER,
week_of_year INTEGER,
hour INTEGER,
hours_in_current_stage FLOAT,
stage_progress_pct FLOAT,
total_cycle_progress_pct FLOAT,
is_stage_transition BOOLEAN,
-- Snapshot indoor environment
indoor_temp FLOAT,
indoor_humidity FLOAT,
indoor_air_velocity FLOAT,
indoor_co2 FLOAT,
solarradiation FLOAT,
day_night_flag FLOAT,
vpd FLOAT,
dew_point FLOAT,
leaf_wetness_proxy FLOAT,
-- Rolling / derived features
temperature_rolling_mean_24h FLOAT,
humidity_rolling_mean_24h FLOAT,
vpd_proxy FLOAT,
cumulative_gdd_like_index FLOAT,
-- Disease-specific columns
disease_name VARCHAR(50) NOT NULL, -- early_blight | late_blight | …
disease_present_flag INTEGER, -- 1 = active outbreak
disease_cycle_id INTEGER,
disease_cycle_stage VARCHAR(30), -- none | latent | active | decline
outbreak_trigger_flag INTEGER,
control_action_flag INTEGER,
control_action_type VARCHAR(50), -- none | fungicide | pruning | …
stage_susceptibility_score FLOAT,
disease_risk_score FLOAT,
hours_since_disease_onset FLOAT,
current_infection_pct FLOAT,
infection_growth_rate_hourly FLOAT
);
-- TimescaleDB hypertable (monthly chunks)
SELECT create_hypertable('disease_progression', 'timestamp',
chunk_time_interval => INTERVAL '1 month');
Key Features:
early_blight, late_blight, leaf_mold, powdery_mildew, spider_mitestimestamp (monthly chunks)(cycle_id, disease_name) for per-disease querieshours_since_disease_onset is NULL when no outbreak is activeCommon Queries:
-- Active outbreak periods by disease
SELECT disease_name,
MIN(timestamp) AS outbreak_start,
MAX(timestamp) AS outbreak_end,
MAX(current_infection_pct) AS peak_infection_pct
FROM disease_progression
WHERE disease_present_flag = 1
GROUP BY disease_name, disease_cycle_id
ORDER BY outbreak_start;
-- Risk score time-series for a specific cycle and disease
SELECT timestamp, disease_risk_score, control_action_type
FROM disease_progression
WHERE cycle_id = 1 AND disease_name = 'late_blight'
ORDER BY timestamp;
-- Hours with high risk but no control action
SELECT COUNT(*) AS uncontrolled_high_risk_hours
FROM disease_progression
WHERE disease_risk_score > 0.7
AND control_action_flag = 0;
Hourly time-series of tomato growth stage progression across 4 crop cycles, with full environmental context and GDD accumulation. 10,848 rows total.
Schema:
CREATE TABLE growth_progression_hourly (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL,
-- Cycle / identity
cycle_id INTEGER NOT NULL,
cycle_label VARCHAR(50),
season_window VARCHAR(100),
real_or_synthetic_flag VARCHAR(20),
-- Time features
year INTEGER, month INTEGER, day_of_year INTEGER,
week_of_year INTEGER, hour INTEGER,
season_label VARCHAR(50),
days_from_cycle_start FLOAT,
-- Stage / progression
stage_name VARCHAR(50),
stage_index INTEGER,
hours_in_current_stage FLOAT,
days_in_current_stage FLOAT,
stage_duration_hours INTEGER,
stage_duration_days INTEGER,
stage_progress_pct FLOAT,
total_cycle_progress_pct FLOAT,
estimated_days_to_next_stage FLOAT,
estimated_hours_to_next_stage FLOAT,
is_stage_transition BOOLEAN,
-- Environment
indoor_temp FLOAT, indoor_humidity FLOAT,
indoor_air_velocity FLOAT, indoor_co2 FLOAT,
solarradiation FLOAT, day_night_flag FLOAT,
vpd FLOAT, dew_point FLOAT, leaf_wetness_proxy FLOAT,
-- Engineered features
temperature_rolling_mean_24h FLOAT,
humidity_rolling_mean_24h FLOAT,
vpd_proxy FLOAT,
light_period_flag INTEGER,
cumulative_gdd_like_index FLOAT
);
SELECT create_hypertable('growth_progression_hourly', 'timestamp',
chunk_time_interval => INTERVAL '1 month');
Key Features:
Tbase = 10 °C, Topt = 30 °C) accumulated hourlyCommon Queries:
-- Duration spent in each stage per cycle
SELECT cycle_id, stage_name, MAX(days_in_current_stage) AS stage_days
FROM growth_progression_hourly
GROUP BY cycle_id, stage_name
ORDER BY cycle_id, MIN(stage_index);
-- GDD at stage transitions
SELECT timestamp, stage_name, cumulative_gdd_like_index
FROM growth_progression_hourly
WHERE is_stage_transition = TRUE
ORDER BY timestamp;
-- Hourly temp and GDD for cycle 1
SELECT timestamp, indoor_temp, cumulative_gdd_like_index
FROM growth_progression_hourly
WHERE cycle_id = 1
ORDER BY timestamp;
Per-stage aggregated statistics (mean / std of environmental variables) for each cycle. 24 rows (4 cycles × 6 stages).
Schema:
CREATE TABLE growth_progression_stage_summary (
id SERIAL PRIMARY KEY,
cycle_id INTEGER NOT NULL,
stage_index INTEGER NOT NULL,
stage_name VARCHAR(50) NOT NULL,
hourly_rows INTEGER,
start_timestamp TIMESTAMP,
end_timestamp TIMESTAMP,
actual_days FLOAT,
max_stage_prog FLOAT,
mean_gdd FLOAT,
mean_indoor_temp FLOAT,
mean_indoor_humidity FLOAT,
mean_vpd FLOAT,
mean_solarradiation FLOAT,
std_indoor_temp FLOAT,
std_indoor_humidity FLOAT,
std_vpd FLOAT,
std_solarradiation FLOAT
);
Common Queries:
-- Compare mean temperature across stages for all cycles
SELECT stage_name, ROUND(AVG(mean_indoor_temp)::numeric, 2) AS avg_temp
FROM growth_progression_stage_summary
GROUP BY stage_name
ORDER BY MIN(stage_index);
-- Cycles where ripe stage had high VPD
SELECT cycle_id, mean_vpd, actual_days
FROM growth_progression_stage_summary
WHERE stage_name = 'ripe' AND mean_vpd > 1.5;
One row per crop cycle with start/end dates and stage durations. 4 rows.
Schema:
CREATE TABLE growth_progression_cycle_summary (
id SERIAL PRIMARY KEY,
cycle_id INTEGER NOT NULL UNIQUE,
cycle_label VARCHAR(50),
season_window VARCHAR(100),
cycle_start DATE,
cycle_end DATE,
total_days INTEGER,
days_seedling INTEGER,
days_early_vegetative INTEGER,
days_flowering_initiation INTEGER,
days_flowering INTEGER,
days_unripe INTEGER,
days_ripe INTEGER,
total_hourly_rows INTEGER
);
Common Queries:
-- Overview of all cycles
SELECT cycle_label, cycle_start, cycle_end, total_days, days_ripe
FROM growth_progression_cycle_summary
ORDER BY cycle_start;
-- Which cycle had the longest flowering stage?
SELECT cycle_label, days_flowering
FROM growth_progression_cycle_summary
ORDER BY days_flowering DESC
LIMIT 1;
Single-row table holding the full dataset metadata JSON (project config, agronomic notes, column groups, etc.).
Schema:
CREATE TABLE growth_progression_metadata (
id SERIAL PRIMARY KEY,
project VARCHAR(100),
crop VARCHAR(100),
location VARCHAR(100),
greenhouse_system VARCHAR(100),
notebook_name VARCHAR(200),
created_on TIMESTAMP,
total_hourly_rows INTEGER,
earliest_timestamp TIMESTAMP,
latest_timestamp TIMESTAMP,
metadata_json JSONB, -- Full raw metadata blob
loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Common Queries:
-- Retrieve agronomic notes
SELECT jsonb_array_elements_text(metadata_json->'agronomic_notes') AS note
FROM growth_progression_metadata;
-- Stage duration ranges from metadata
SELECT key AS stage, value AS duration_range_days
FROM growth_progression_metadata,
jsonb_each(metadata_json->'stage_duration_ranges_days');
Main table storing metadata for agricultural images stored in MinIO.
Schema:
CREATE TABLE image_metadata (
id SERIAL PRIMARY KEY,
-- MinIO Storage Info
image_key VARCHAR(255) NOT NULL UNIQUE,
bucket_name VARCHAR(100) NOT NULL DEFAULT 'agritwin-images',
file_name VARCHAR(255) NOT NULL,
file_size BIGINT,
mime_type VARCHAR(50) DEFAULT 'image/jpeg',
etag VARCHAR(255),
-- Image Classification
category VARCHAR(50) NOT NULL, -- 'disease', 'growth_stage', 'healthy'
subcategory VARCHAR(100), -- 'early_blight', 'stage1_seedling', etc.
label VARCHAR(100),
-- Agricultural Context
crop_type VARCHAR(50) DEFAULT 'tomato',
source_dataset VARCHAR(100),
original_path TEXT,
-- Image Properties
width INTEGER,
height INTEGER,
color_space VARCHAR(20),
-- Processing Status
is_uploaded BOOLEAN DEFAULT FALSE,
upload_date TIMESTAMP,
is_processed BOOLEAN DEFAULT FALSE,
processed_date TIMESTAMP,
processing_status VARCHAR(50) DEFAULT 'pending',
-- Analysis Results (JSON)
analysis_results JSONB,
-- Metadata
tags TEXT[],
notes TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_by VARCHAR(100),
CONSTRAINT valid_category CHECK (category IN ('disease', 'growth_stage', 'healthy')),
CONSTRAINT valid_status CHECK (processing_status IN ('pending', 'processing', 'completed', 'failed', 'skipped'))
);
Key Features:
image_key points to object in MinIO bucketanalysis_results stores ML model outputsupdated_at automatically updated on changesDataset Distribution (69,607 total images):
Sample Data:
SELECT id, file_name, category, subcategory, file_size, width, height
FROM image_metadata LIMIT 3;
| id | file_name | category | subcategory | file_size | width | height | |—-|———–|———-|————-|———–|——-|——–| | 1 | IMG_001.jpg | disease | tomato_early_blight | 245120 | 1024 | 768 | | 2 | STAGE2_045.jpg | growth_stage | stage2_early_vegetative | 512340 | 2048 | 1536 | | 3 | HEALTHY_089.jpg | healthy | NULL | 189760 | 800 | 600 |
Common Queries:
-- Images by category
SELECT category, COUNT(*) as count, SUM(file_size) as total_bytes
FROM image_metadata
GROUP BY category;
-- Recently uploaded disease images
SELECT file_name, subcategory, upload_date
FROM image_metadata
WHERE category = 'disease'
ORDER BY upload_date DESC
LIMIT 10;
-- Search by tags
SELECT file_name, tags
FROM image_metadata
WHERE tags @> ARRAY['high_quality', 'annotated']
LIMIT 20;
-- Unprocessed images
SELECT COUNT(*) as pending_count
FROM image_metadata
WHERE processing_status = 'pending';
-- ML analysis results (JSON query)
SELECT file_name, analysis_results->>'disease' as predicted_disease,
(analysis_results->>'confidence')::float as confidence
FROM image_metadata
WHERE analysis_results IS NOT NULL
AND (analysis_results->>'confidence')::float > 0.90
ORDER BY confidence DESC;
Stores ML training labels and bounding boxes for images.
Schema:
CREATE TABLE image_annotations (
id SERIAL PRIMARY KEY,
image_id INTEGER NOT NULL REFERENCES image_metadata(id) ON DELETE CASCADE,
annotation_type VARCHAR(50) NOT NULL, -- 'bounding_box', 'segmentation', 'classification', 'keypoint'
annotation_data JSONB NOT NULL,
confidence FLOAT,
annotator VARCHAR(100),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT valid_annotation_type CHECK (annotation_type IN
('bounding_box', 'segmentation', 'classification', 'keypoint'))
);
Annotation Data Examples:
Bounding Box (disease detection):
{
"boxes": [
{"x": 120, "y": 80, "width": 200, "height": 150, "label": "early_blight", "confidence": 0.92}
],
"image_width": 1024,
"image_height": 768
}
Classification (growth stage):
{
"predicted_class": "stage4_flowering",
"probabilities": {
"stage3_flowering_initiation": 0.05,
"stage4_flowering": 0.89,
"stage5_unripe": 0.06
}
}
Segmentation (leaf mask):
{
"mask_url": "s3://agritwin-masks/mask_12345.png",
"num_pixels": 45678,
"mask_type": "binary"
}
Common Queries:
-- Images with bounding box annotations
SELECT im.file_name, ia.annotation_data
FROM image_metadata im
JOIN image_annotations ia ON im.id = ia.image_id
WHERE ia.annotation_type = 'bounding_box'
LIMIT 5;
-- High-confidence annotations
SELECT im.file_name, ia.annotation_type, ia.confidence
FROM image_metadata im
JOIN image_annotations ia ON im.id = ia.image_id
WHERE ia.confidence > 0.95
ORDER BY ia.confidence DESC;
-- Annotations per image
SELECT im.file_name, COUNT(ia.id) as annotation_count
FROM image_metadata im
LEFT JOIN image_annotations ia ON im.id = ia.image_id
GROUP BY im.file_name
HAVING COUNT(ia.id) > 0
ORDER BY annotation_count DESC;
Tracks image access for usage analytics and audit trails.
Schema:
CREATE TABLE image_access_log (
id SERIAL PRIMARY KEY,
image_id INTEGER NOT NULL REFERENCES image_metadata(id) ON DELETE CASCADE,
accessed_by VARCHAR(100),
access_type VARCHAR(50), -- 'view', 'download', 'process', 'train'
accessed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
metadata JSONB
);
Use Cases:
Sample Log Entry:
INSERT INTO image_access_log (image_id, accessed_by, access_type, metadata)
VALUES (
12345,
'ml_pipeline_v2',
'train',
'{"epoch": 45, "batch_id": 102, "model": "resnet50"}'::jsonb
);
Common Queries:
-- Most accessed images
SELECT im.file_name, COUNT(*) as access_count
FROM image_access_log ial
JOIN image_metadata im ON ial.image_id = im.id
GROUP BY im.file_name
ORDER BY access_count DESC
LIMIT 20;
-- Access patterns by type
SELECT access_type, COUNT(*) as count
FROM image_access_log
WHERE accessed_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY access_type;
-- Images used for training
SELECT DISTINCT im.file_name, im.category, im.subcategory
FROM image_access_log ial
JOIN image_metadata im ON ial.image_id = im.id
WHERE ial.access_type = 'train'
AND ial.accessed_at >= '2025-01-01';
Pre-computed aggregations for quick statistics.
Definition:
CREATE VIEW image_summary AS
SELECT
category,
subcategory,
COUNT(*) as total_images,
SUM(file_size) as total_size_bytes,
ROUND(SUM(file_size) / 1024.0 / 1024.0, 2) as total_size_mb,
COUNT(CASE WHEN is_uploaded THEN 1 END) as uploaded_count,
COUNT(CASE WHEN is_processed THEN 1 END) as processed_count,
MIN(upload_date) as first_upload,
MAX(upload_date) as last_upload
FROM image_metadata
GROUP BY category, subcategory;
Usage:
-- Get statistics for all categories
SELECT * FROM image_summary ORDER BY category, subcategory;
-- Disease images summary
SELECT * FROM image_summary WHERE category = 'disease';
Sample Output: | category | subcategory | total_images | total_size_mb | uploaded_count | processed_count | |———-|————-|————–|—————|—————-|—————–| | disease | tomato_early_blight | 3,828 | 58.12 | 3,828 | 3,205 | | disease | tomato_late_blight | 4,105 | 62.43 | 4,105 | 3,891 | | growth_stage | stage1_seedling | 7,234 | 1905.23 | 7,234 | 6,892 |
-- weather_data
CREATE INDEX idx_weather_datetime ON weather_data(datetime);
-- greenhouse_data
CREATE INDEX idx_greenhouse_datetime ON greenhouse_data(datetime);
-- disease_progression
CREATE INDEX idx_disease_prog_timestamp ON disease_progression (timestamp DESC);
CREATE INDEX idx_disease_prog_cycle_disease ON disease_progression (cycle_id, disease_name);
CREATE INDEX idx_disease_prog_present ON disease_progression (disease_present_flag)
WHERE disease_present_flag = 1;
-- growth_progression_hourly
CREATE INDEX idx_growth_hourly_timestamp ON growth_progression_hourly (timestamp DESC);
CREATE INDEX idx_growth_hourly_cycle_stage ON growth_progression_hourly (cycle_id, stage_index);
-- growth_progression_stage_summary
CREATE INDEX idx_stage_summary_cycle_stage ON growth_progression_stage_summary (cycle_id, stage_index);
TimescaleDB Advantages:
-- Primary lookup
CREATE INDEX idx_image_bucket_key ON image_metadata(bucket_name, image_key);
-- Category filtering
CREATE INDEX idx_image_category ON image_metadata(category);
CREATE INDEX idx_image_subcategory ON image_metadata(subcategory);
CREATE INDEX idx_image_crop_type ON image_metadata(crop_type);
-- Time-based queries
CREATE INDEX idx_image_upload_date ON image_metadata(upload_date DESC);
-- Status filtering
CREATE INDEX idx_image_processing_status ON image_metadata(processing_status);
-- Array and JSON (GIN indexes)
CREATE INDEX idx_image_tags ON image_metadata USING GIN(tags);
CREATE INDEX idx_image_analysis ON image_metadata USING GIN(analysis_results);
Performance Tips:
WHERE is_processed = false)Link growth stages to environmental conditions:
SELECT
im.subcategory as growth_stage,
COUNT(*) as image_count,
AVG(g.indoor_temp) as avg_temp,
AVG(g.vpd) as avg_vpd
FROM image_metadata im
CROSS JOIN greenhouse_data g
WHERE im.category = 'growth_stage'
AND g.datetime BETWEEN im.upload_date - INTERVAL '7 days' AND im.upload_date
GROUP BY im.subcategory;
Disease occurrence vs weather conditions:
SELECT
im.subcategory as disease,
DATE(im.upload_date) as date,
AVG(w.humidity) as avg_humidity,
AVG(w.temp) as avg_temp,
COUNT(*) as image_count
FROM image_metadata im
JOIN weather_data w ON DATE(w.datetime) = DATE(im.upload_date)
WHERE im.category = 'disease'
GROUP BY disease, date
HAVING AVG(w.humidity) > 70
ORDER BY date DESC;
Hourly greenhouse trends:
SELECT
time_bucket('1 hour', datetime) as hour,
AVG(indoor_temp) as avg_temp,
AVG(indoor_humidity) as avg_humidity,
AVG(indoor_co2) as avg_co2
FROM greenhouse_data
WHERE datetime >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;
Daily temperature variance:
SELECT
DATE(datetime) as date,
MAX(temp) - MIN(temp) as temp_range,
STDDEV(temp) as temp_stddev
FROM weather_data
WHERE datetime >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(datetime)
ORDER BY temp_range DESC;
Processing pipeline status:
SELECT
processing_status,
category,
COUNT(*) as count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY category), 2) as percentage
FROM image_metadata
GROUP BY processing_status, category
ORDER BY category, count DESC;
Storage usage by category:
SELECT
category,
COUNT(*) as file_count,
pg_size_pretty(SUM(file_size)::bigint) as total_size,
pg_size_pretty(AVG(file_size)::bigint) as avg_file_size
FROM image_metadata
GROUP BY category;
┌─────────────────┐
│ weather_data │
│ (Hypertable) │
│ │
│ • datetime (PK) │
│ • temp │
│ • humidity │
└─────────────────┘
┌──────────────────┐
│ greenhouse_data │
│ (Hypertable) │
│ │
│ • datetime (PK) │
│ • indoor_temp │
│ • indoor_co2 │
│ • vpd │
└──────────────────┘
┌─────────────────────┐ ┌──────────────────────┐
│ image_metadata │ │ image_annotations │
│ │ 1 ∞ │ │
│ • id (PK) │◄─────────│ • id (PK) │
│ • image_key (UK) │ │ • image_id (FK) │
│ • category │ │ • annotation_data │
│ • subcategory │ └──────────────────────┘
│ • analysis_results │
│ │ ┌──────────────────────┐
│ │ 1 ∞ │ image_access_log │
│ │◄─────────│ │
└─────────────────────┘ │ • id (PK) │
│ • image_id (FK) │
│ • access_type │
└──────────────────────┘
Stores ML model predictions and metrics.
Example Structure:
{
"model": "efficientnet_b3",
"version": "v2.1.0",
"timestamp": "2025-02-25T10:30:00Z",
"disease": "tomato_early_blight",
"confidence": 0.94,
"probabilities": {
"tomato_early_blight": 0.94,
"tomato_late_blight": 0.03,
"healthy": 0.02
},
"bounding_boxes": [
{
"x": 120, "y": 85, "width": 180, "height": 160,
"label": "lesion", "score": 0.89
}
],
"features": {
"color_histogram": [0.12, 0.23, 0.45, ...],
"texture_score": 0.67
}
}
Query Examples:
-- Extract specific JSON fields
SELECT
file_name,
analysis_results->>'disease' as disease,
(analysis_results->>'confidence')::float as confidence
FROM image_metadata
WHERE analysis_results IS NOT NULL;
-- Filter by JSON values
SELECT file_name
FROM image_metadata
WHERE (analysis_results->>'confidence')::float > 0.9
AND analysis_results->>'disease' = 'tomato_early_blight';
-- Count predictions by disease
SELECT
analysis_results->>'disease' as disease,
COUNT(*) as count
FROM image_metadata
WHERE analysis_results IS NOT NULL
GROUP BY disease
ORDER BY count DESC;
Flexible structure based on annotation_type.
Querying Nested JSON:
-- Extract bounding box coordinates
SELECT
im.file_name,
(ia.annotation_data->'boxes'->0->>'x')::int as box_x,
(ia.annotation_data->'boxes'->0->>'y')::int as box_y
FROM image_annotations ia
JOIN image_metadata im ON ia.image_id = im.id
WHERE ia.annotation_type = 'bounding_box';
SELECT time_bucket('1 hour', datetime), AVG(temp)
FROM weather_data
GROUP BY 1;
CREATE MATERIALIZED VIEW daily_weather
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', datetime) as day,
AVG(temp) as avg_temp,
MAX(temp) as max_temp
FROM weather_data
GROUP BY day;
Partition by time for large datasets (automatically handled by TimescaleDB)
cursor.executemany(
"INSERT INTO image_metadata (...) VALUES (...)",
batch_data
)
Use JSONB (not JSON) for indexable fields
Tag strategy: Use arrays for fixed tags, JSONB for dynamic metadata
Cascade deletes: Rely on FK constraints to clean up annotations/logs
updated_atConnection pooling: Use pgBouncer or SQLAlchemy pools for high concurrency
Regular VACUUM: Run VACUUM ANALYZE weekly on active tables
Monitor index bloat: Check with pg_stat_user_indexes
LIMIT for exploratory queriesSELECT * in production code-- Total database size
SELECT pg_size_pretty(pg_database_size('agritwin_db'));
-- Size per table
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
| Table | Records | Size | Notes |
|---|---|---|---|
| weather_data | ~17,520 | ~2 MB | Hourly outdoor data, 2024–2025 |
| greenhouse_data | ~17,520 | ~2 MB | Hourly indoor conditions, 2024–2025 merged |
| disease_progression | ~175,000+ | ~60 MB | 5 diseases × 4 cycles, hourly |
| growth_progression_hourly | 10,848 | ~5 MB | 4 crop cycles, hourly (Jul 2024–Oct 2025) |
| growth_progression_stage_summary | 24 | <1 MB | 4 cycles × 6 stages |
| growth_progression_cycle_summary | 4 | <1 MB | One row per crop cycle |
| growth_progression_metadata | 1 | <1 MB | Dataset config JSON |
| image_metadata | 69,607 | ~15 MB | 3 categories of tomato images |
| image_annotations | varies | <1 MB | ML training labels |
| image_access_log | varies | <5 MB | Usage tracking |
Total Image Storage (MinIO): ~11.8 GB
examples/query_timeseries_examples.py, examples/query_images_examples.pydatabase/schema/image_metadata.sql, database/schema/timeseries_data.sql (complete timeseries schema — all 7 tables)src/agritwin_gh/models/timeseries.pyscripts/load_timeseries_to_postgres.pyscripts/verify_setup.pyDocument Version: 2.0
Last Updated: 2026-03-30
Database Version: PostgreSQL 15 + TimescaleDB
New in v2.0: disease_progression, growth_progression_hourly, growth_progression_stage_summary, growth_progression_cycle_summary, growth_progression_metadata