# Database Deployment Guide for Thailand Water Monitor

This guide covers deployment options for storing water monitoring data in production environments.

## 🏆 Recommendation Summary

| Database | Best For | Performance | Complexity | Cost |
|----------|----------|-------------|------------|------|
| **InfluxDB** | Time-series data, dashboards | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| **VictoriaMetrics** | High-performance metrics | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **PostgreSQL** | Complex queries, reliability | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| **MySQL** | Familiar, existing infrastructure | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |

## 1. InfluxDB Deployment (Recommended for Time-Series)

### Why InfluxDB?
- **Purpose-built** for time-series data
- **Excellent compression** (10:1 typical ratio)
- **Built-in retention policies** and downsampling
- **Great Grafana integration** for dashboards
- **High write throughput** (100k+ points/second)

### Docker Deployment

```yaml
# docker-compose.yml
version: '3.8'
services:
  influxdb:
    image: influxdb:1.8
    container_name: water_influxdb
    ports:
      - "8086:8086"
    volumes:
      - influxdb_data:/var/lib/influxdb
      - ./influxdb.conf:/etc/influxdb/influxdb.conf:ro
    environment:
      - INFLUXDB_DB=water_monitoring
      - INFLUXDB_ADMIN_USER=admin
      - INFLUXDB_ADMIN_PASSWORD=your_secure_password
      - INFLUXDB_USER=water_user
      - INFLUXDB_USER_PASSWORD=water_password
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: water_grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin_password
    restart: unless-stopped

volumes:
  influxdb_data:
  grafana_data:
```

### Environment Variables
```bash
# .env file
DB_TYPE=influxdb
INFLUX_HOST=localhost
INFLUX_PORT=8086
INFLUX_DATABASE=water_monitoring
INFLUX_USERNAME=water_user
INFLUX_PASSWORD=water_password
```

### InfluxDB Configuration
```toml
# influxdb.conf
[meta]
  dir = "/var/lib/influxdb/meta"

[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  
  # Optimize for time-series data
  cache-max-memory-size = "1g"
  cache-snapshot-memory-size = "25m"
  cache-snapshot-write-cold-duration = "10m"
  
  # Retention and compression
  compact-full-write-cold-duration = "4h"
  max-series-per-database = 1000000
  max-values-per-tag = 100000

[coordinator]
  write-timeout = "10s"
  max-concurrent-queries = 0
  query-timeout = "0s"

[retention]
  enabled = true
  check-interval = "30m"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = true
  max-body-size = "25000000"
  max-concurrent-requests = 0
  max-enqueued-requests = 0
```

### Production Setup Commands
```bash
# Start services
docker-compose up -d

# Create retention policies
docker exec -it water_influxdb influx -username admin -password your_secure_password -execute "
CREATE RETENTION POLICY \"raw_data\" ON \"water_monitoring\" DURATION 90d REPLICATION 1 DEFAULT;
CREATE RETENTION POLICY \"downsampled\" ON \"water_monitoring\" DURATION 730d REPLICATION 1;
"

# Create continuous queries for downsampling
docker exec -it water_influxdb influx -username admin -password your_secure_password -execute "
CREATE CONTINUOUS QUERY \"downsample_hourly\" ON \"water_monitoring\" 
BEGIN 
  SELECT mean(water_level) AS water_level, mean(discharge) AS discharge, mean(discharge_percent) AS discharge_percent
  INTO \"downsampled\".\"water_data_hourly\" 
  FROM \"water_data\" 
  GROUP BY time(1h), station_code, station_name_en, station_name_th
END
"
```

## 2. VictoriaMetrics Deployment (High Performance)

### Why VictoriaMetrics?
- **Extremely fast** and resource-efficient
- **Better compression** than InfluxDB
- **Prometheus-compatible** API
- **Lower memory usage**
- **Built-in clustering**

### Docker Deployment
```yaml
# docker-compose.yml
version: '3.8'
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    container_name: water_victoriametrics
    ports:
      - "8428:8428"
    volumes:
      - vm_data:/victoria-metrics-data
    command:
      - '--storageDataPath=/victoria-metrics-data'
      - '--retentionPeriod=2y'
      - '--httpListenAddr=:8428'
      - '--maxConcurrentInserts=16'
    restart: unless-stopped

volumes:
  vm_data:
```

### Environment Variables
```bash
# .env file
DB_TYPE=victoriametrics
VM_HOST=localhost
VM_PORT=8428
```

## 3. PostgreSQL Deployment (Relational + Time-Series)

### Why PostgreSQL?
- **Mature and reliable**
- **Excellent for complex queries**
- **TimescaleDB extension** for time-series optimization
- **Strong consistency guarantees**
- **Rich ecosystem**

### Docker Deployment with TimescaleDB
```yaml
# docker-compose.yml
version: '3.8'
services:
  postgres:
    image: timescale/timescaledb:latest-pg14
    container_name: water_postgres
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    environment:
      - POSTGRES_DB=water_monitoring
      - POSTGRES_USER=water_user
      - POSTGRES_PASSWORD=secure_password
    restart: unless-stopped

volumes:
  postgres_data:
```

### Database Initialization
```sql
-- init.sql
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

-- Create hypertable for time-series optimization
CREATE TABLE water_measurements (
    id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    station_id INT NOT NULL,
    water_level NUMERIC(10,3),
    discharge NUMERIC(10,2),
    discharge_percent NUMERIC(5,2),
    status VARCHAR(20) DEFAULT 'active',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Convert to hypertable (TimescaleDB)
SELECT create_hypertable('water_measurements', 'timestamp', chunk_time_interval => INTERVAL '1 day');

-- Create indexes
CREATE INDEX idx_water_measurements_station_time ON water_measurements (station_id, timestamp DESC);
CREATE INDEX idx_water_measurements_timestamp ON water_measurements (timestamp DESC);

-- Create retention policy (keep raw data for 2 years)
SELECT add_retention_policy('water_measurements', INTERVAL '2 years');

-- Create continuous aggregates for performance
CREATE MATERIALIZED VIEW water_measurements_hourly
WITH (timescaledb.continuous) AS
SELECT 
    time_bucket('1 hour', timestamp) AS bucket,
    station_id,
    AVG(water_level) as avg_water_level,
    MAX(water_level) as max_water_level,
    MIN(water_level) as min_water_level,
    AVG(discharge) as avg_discharge,
    MAX(discharge) as max_discharge,
    MIN(discharge) as min_discharge,
    AVG(discharge_percent) as avg_discharge_percent
FROM water_measurements
GROUP BY bucket, station_id;

-- Refresh policy for continuous aggregates
SELECT add_continuous_aggregate_policy('water_measurements_hourly',
    start_offset => INTERVAL '1 day',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

### Environment Variables
```bash
# .env file
DB_TYPE=postgresql
POSTGRES_CONNECTION_STRING=postgresql://water_user:secure_password@localhost:5432/water_monitoring
```

## 4. MySQL Deployment (Traditional Relational)

### Docker Deployment
```yaml
# docker-compose.yml
version: '3.8'
services:
  mysql:
    image: mysql:8.0
    container_name: water_mysql
    ports:
      - "3306:3306"
    volumes:
      - mysql_data:/var/lib/mysql
      - ./mysql.cnf:/etc/mysql/conf.d/mysql.cnf
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    environment:
      - MYSQL_ROOT_PASSWORD=root_password
      - MYSQL_DATABASE=water_monitoring
      - MYSQL_USER=water_user
      - MYSQL_PASSWORD=water_password
    restart: unless-stopped

volumes:
  mysql_data:
```

### MySQL Configuration
```ini
# mysql.cnf
[mysqld]
# Optimize for time-series data
innodb_buffer_pool_size = 1G
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT

# Partitioning support
partition = ON

# Query cache
query_cache_type = 1
query_cache_size = 128M

# Connection settings
max_connections = 200
connect_timeout = 10
wait_timeout = 600
```

### Environment Variables
```bash
# .env file
DB_TYPE=mysql
MYSQL_CONNECTION_STRING=mysql://water_user:water_password@localhost:3306/water_monitoring
```

## 5. Installation and Dependencies

### Required Python Packages
```bash
# Base requirements
pip install requests schedule

# Database-specific packages
pip install influxdb                    # For InfluxDB
pip install sqlalchemy pymysql         # For MySQL
pip install sqlalchemy psycopg2-binary # For PostgreSQL
# VictoriaMetrics uses HTTP API (no extra packages needed)
```

### Updated requirements.txt
```txt
requests>=2.28.0
schedule>=1.2.0
pandas>=1.5.0

# Database adapters (install as needed)
influxdb>=5.3.1
sqlalchemy>=1.4.0
pymysql>=1.0.2
psycopg2-binary>=2.9.0
```

## 6. Production Deployment Examples

### Using InfluxDB (Recommended)
```bash
# Set environment variables
export DB_TYPE=influxdb
export INFLUX_HOST=your-influx-server.com
export INFLUX_PORT=8086
export INFLUX_DATABASE=water_monitoring
export INFLUX_USERNAME=water_user
export INFLUX_PASSWORD=your_secure_password

# Run the scraper
python water_scraper_v3.py
```

### Using PostgreSQL with TimescaleDB
```bash
# Set environment variables
export DB_TYPE=postgresql
export POSTGRES_CONNECTION_STRING=postgresql://water_user:password@your-postgres-server.com:5432/water_monitoring

# Run the scraper
python water_scraper_v3.py
```

### Using VictoriaMetrics
```bash
# Set environment variables
export DB_TYPE=victoriametrics
export VM_HOST=your-vm-server.com
export VM_PORT=8428

# Run the scraper
python water_scraper_v3.py
```

## 7. Monitoring and Alerting

### Grafana Dashboard Setup
1. **Add Data Source**: Configure your database as a Grafana data source
2. **Import Dashboard**: Use pre-built water monitoring dashboards
3. **Set Alerts**: Configure alerts for abnormal water levels or discharge rates

### Example Grafana Queries

#### InfluxDB Queries
```sql
-- Current water levels
SELECT last("water_level") FROM "water_data" GROUP BY "station_code"

-- Discharge trends (last 24h)
SELECT mean("discharge") FROM "water_data" WHERE time >= now() - 24h GROUP BY time(1h), "station_code"
```

#### PostgreSQL/TimescaleDB Queries
```sql
-- Current water levels
SELECT DISTINCT ON (station_id) 
    station_id, water_level, discharge, timestamp
FROM water_measurements 
ORDER BY station_id, timestamp DESC;

-- Hourly averages (last 24h)
SELECT 
    time_bucket('1 hour', timestamp) as hour,
    station_id,
    AVG(water_level) as avg_level,
    AVG(discharge) as avg_discharge
FROM water_measurements 
WHERE timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY hour, station_id
ORDER BY hour DESC;
```

## 8. Performance Optimization Tips

### For All Databases
- **Batch inserts**: Insert multiple measurements at once
- **Connection pooling**: Reuse database connections
- **Indexing**: Ensure proper indexes on timestamp and station_id
- **Retention policies**: Automatically delete old data

### InfluxDB Specific
- Use **tags** for metadata (station codes, names)
- Use **fields** for numeric values (water levels, discharge)
- Configure **retention policies** and **continuous queries**
- Enable **compression** for long-term storage

### PostgreSQL/TimescaleDB Specific
- Use **hypertables** for automatic partitioning
- Create **continuous aggregates** for common queries
- Configure **compression** for older chunks
- Use **parallel queries** for large datasets

### VictoriaMetrics Specific
- Use **labels** efficiently (similar to Prometheus)
- Configure **retention periods** appropriately
- Use **downsampling** for long-term storage
- Enable **deduplication** if needed

This deployment guide provides production-ready configurations for all supported database backends. Choose the one that best fits your infrastructure and requirements.