
Gap Filling and Data Integrity Guide

This guide explains the enhanced gap-filling functionality that addresses data gaps and missing timestamps in the Northern Thailand Ping River Monitor.

Issues Resolved

1. Data Gaps Problem

  • Before: The tool fetched only the current day's data, leaving gaps in historical records
  • After: Automatically detects and fills missing timestamps for the last 7 days

2. Missing Midnight Timestamps

  • Before: Records jumped from 23:00 straight to 01:00, missing the 00:00 midnight reading
  • After: Specifically checks for and fills midnight hour gaps

3. Changed Values

  • Before: No mechanism to update existing data if values changed on the server
  • After: Compares existing data with fresh API data and updates changed values

🔧 New Features

Command Line Interface

# Check for missing data gaps
python water_scraper_v3.py --check-gaps [days]

# Fill missing data gaps
python water_scraper_v3.py --fill-gaps [days]

# Update existing data with latest values
python water_scraper_v3.py --update-data [days]

# Run single test cycle
python water_scraper_v3.py --test

# Show help
python water_scraper_v3.py --help

Automatic Gap Detection

The system now automatically performs these steps (sketched in code after this list):

  • Generates expected hourly timestamps for the specified time range
  • Compares with existing database records
  • Identifies missing timestamps
  • Groups missing data by date for efficient API calls
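
A minimal sketch of the detection steps above (the function name and the shape of its input are illustrative, not the tool's actual API):

from collections import defaultdict
from datetime import datetime, timedelta

def find_missing_timestamps(existing, days_back=7):
    """Return missing hourly timestamps grouped by date.

    `existing` is an iterable of datetimes already stored in the database.
    """
    now = datetime.now().replace(minute=0, second=0, microsecond=0)
    expected = {now - timedelta(hours=h) for h in range(days_back * 24)}
    missing = sorted(expected - set(existing))
    gaps = defaultdict(list)
    for ts in missing:
        gaps[ts.date()].append(ts.hour)  # grouped by date for efficient API calls
    return gaps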

Intelligent Gap Filling

  • Historical Data Fetching: Retrieves data for specific dates to fill gaps
  • Selective Insertion: Only inserts data for actually missing timestamps
  • API Rate Limiting: Includes delays between API calls to be respectful
  • Error Handling: Continues processing even if some dates fail
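
The fill loop itself might look like this simplified sketch (the real fill_data_gaps takes a days_back argument and derives the gaps itself; fetch_water_data_for_date and self.logger are hypothetical names):

import time

def fill_gaps(self, gaps, api_delay=1.0):
    filled = 0
    for date, hours in gaps.items():
        try:
            day_data = self.fetch_water_data_for_date(date)  # hypothetical helper
            rows = [r for r in day_data if r["timestamp"].hour in hours]
            if rows and self.db_adapter.save_measurements(rows):
                filled += len(rows)  # selective insertion: missing hours only
        except Exception as exc:
            self.logger.warning("Gap fill failed for %s: %s", date, exc)
            continue  # keep processing the remaining dates
        time.sleep(api_delay)  # rate limiting between API calls
    return filled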

Data Update Mechanism

  • Change Detection: Compares water levels, discharge rates, and percentages
  • Precision Checking: Uses appropriate thresholds (0.001 m for water level, 0.1 m³/s for discharge)
  • Selective Updates: Only updates records where values have actually changed
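
In code, the comparison amounts to something like this (field names and the percentage threshold are assumptions):

def measurement_changed(old, new):
    return (
        abs(old["water_level"] - new["water_level"]) > 0.001        # metres
        or abs(old["discharge"] - new["discharge"]) > 0.1           # m³/s
        or abs(old["discharge_percent"] - new["discharge_percent"]) > 0.1
    )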

📊 Test Results

Before Enhancement

Found 22 missing timestamps in the last 2 days:
  2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
  2025-07-24: Missing hours [0, 20, 21, 22, 23]
  2025-07-25: Missing hours [0, 9]

After Gap Filling

Gap filling completed. Filled 96 missing data points

Remaining gaps:
  2025-07-24: Missing hours [10]
  2025-07-25: Missing hours [0, 10]

Improvement: Missing timestamps dropped from 22 to 3, an 86% reduction

🚀 Enhanced Scraping Cycle

The regular scraping cycle now includes three phases:

Phase 1: Current Data Collection

# Fetch and save current data
water_data = self.fetch_water_data()
success = self.save_to_database(water_data)

Phase 2: Gap Filling (Last 7 Days)

# Check for and fill missing data
filled_count = self.fill_data_gaps(days_back=7)

Phase 3: Data Updates (Last 2 Days)

# Update existing data with latest values
updated_count = self.update_existing_data(days_back=2)
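
Put together, one cycle flows roughly as follows (a sketch: the method and logger names here are assumptions, while fetch_water_data, save_to_database, fill_data_gaps, and update_existing_data are the calls shown above):

def run_scraping_cycle(self):
    # Phase 1: collect and persist the current data
    water_data = self.fetch_water_data()
    self.save_to_database(water_data)
    # Phase 2: backfill any holes from the last week
    filled_count = self.fill_data_gaps(days_back=7)
    # Phase 3: refresh recent records that changed upstream
    updated_count = self.update_existing_data(days_back=2)
    self.logger.info("Cycle complete: %d filled, %d updated",
                     filled_count, updated_count)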

🔧 Technical Improvements

Database Connection Handling

  • SQLite Optimization: Added timeout and thread safety parameters
  • Retry Logic: Exponential backoff for database lock errors
  • Transaction Management: Proper use of engine.begin() for automatic commits
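
With SQLAlchemy, these three points combine into a setup like this (parameter values are illustrative):

from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite:///water_monitoring.db",
    connect_args={
        "timeout": 30,               # wait on locks instead of failing immediately
        "check_same_thread": False,  # allow use from worker threads
    },
)

# engine.begin() commits on success and rolls back on any exception
with engine.begin() as conn:
    conn.execute(text("SELECT COUNT(*) FROM water_measurements"))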

Error Recovery

# Retry logic with exponential backoff (assumes `import time` at module scope)
max_retries = 3
for attempt in range(max_retries):
    try:
        success = self.db_adapter.save_measurements(water_data)
        if success:
            return True
    except Exception as e:
        if "database is locked" in str(e).lower():
            time.sleep(2 ** attempt)  # 1s, 2s, 4s delays between attempts
            continue
        raise  # non-transient errors should surface immediately
return False  # all retries exhausted

Memory Efficiency

  • Selective Data Processing: Only processes data for missing timestamps
  • Batch Processing: Groups operations by date to minimize API calls
  • Resource Management: Proper cleanup and connection handling

📋 Usage Examples

Daily Maintenance

# Check for gaps in the last week
python water_scraper_v3.py --check-gaps 7

# Fill any found gaps
python water_scraper_v3.py --fill-gaps 7

# Update recent data for accuracy
python water_scraper_v3.py --update-data 2

Historical Data Recovery

# Check for gaps in the last month
python water_scraper_v3.py --check-gaps 30

# Fill gaps for the last month (be patient, this takes time)
python water_scraper_v3.py --fill-gaps 30

Production Monitoring

# Quick test to ensure system is working
python water_scraper_v3.py --test

# Check for recent gaps
python water_scraper_v3.py --check-gaps 1

🔍 Monitoring and Alerts

Gap Detection Output

Found 22 missing timestamps:
  2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
  2025-07-24: Missing hours [0, 20, 21, 22, 23]
  2025-07-25: Missing hours [0, 9]

Gap Filling Progress

Fetching data for 2025-07-24 to fill 5 missing timestamps
Successfully fetched 368 data points from API for 2025-07-24
Filled 80 data points for 2025-07-24
Gap filling completed. Filled 96 missing data points

Update Detection

Checking for updates on 2025-07-24
Update needed for P.1 at 2025-07-24 15:00:00
Updated 5 measurements for 2025-07-24
Data update completed. Updated 5 measurements

⚙️ Configuration Options

Environment Variables

# Database configuration
export DB_TYPE=sqlite
export WATER_DB_PATH=water_monitoring.db

# Gap filling settings (can be added to config.py)
export GAP_FILL_DAYS=7        # Days to check for gaps
export UPDATE_DAYS=2          # Days to check for updates
export API_DELAY=1            # Seconds between API calls
export MAX_RETRIES=3          # Database retry attempts
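
Since these settings are not yet wired into config.py, here is one way they could be read there (a sketch; the variable names simply mirror the exports above):

import os

GAP_FILL_DAYS = int(os.getenv("GAP_FILL_DAYS", "7"))
UPDATE_DAYS = int(os.getenv("UPDATE_DAYS", "2"))
API_DELAY = float(os.getenv("API_DELAY", "1"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))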

Customizable Parameters

  • Gap Check Period: Default 7 days, configurable via command line
  • Update Period: Default 2 days, configurable via command line
  • API Rate Limiting: 1-second delay between calls (configurable)
  • Retry Logic: 3 attempts with exponential backoff (configurable)

🛠️ Troubleshooting

Common Issues

Database Locked Errors

ERROR - Error saving to SQLITE: database is locked

Solution: The retry logic now handles this automatically with exponential backoff.

API Rate Limiting

WARNING - Too many requests to API

Solution: Increase delay between API calls or reduce the number of days processed at once.

Missing Data Still Present

Found X missing timestamps after gap filling

Possible Causes:

  • Data not available on the Thai government server for those timestamps
  • Network issues during API calls
  • API returned empty data for those specific times

Debug Commands

# Enable debug logging
export LOG_LEVEL=DEBUG
python water_scraper_v3.py --check-gaps 1

# Test specific date range
python water_scraper_v3.py --fill-gaps 1

# Check database directly
sqlite3 water_monitoring.db "SELECT COUNT(*) FROM water_measurements;"
sqlite3 water_monitoring.db "SELECT timestamp, COUNT(*) FROM water_measurements GROUP BY timestamp ORDER BY timestamp DESC LIMIT 10;"

📈 Performance Metrics

Gap Filling Efficiency

  • API Calls: Grouped by date to minimize requests
  • Processing Speed: ~100-400 data points per API call
  • Success Rate: 86% gap reduction in test case
  • Resource Usage: Minimal memory footprint with selective processing

Database Performance

  • SQLite Optimization: Connection pooling and timeout handling
  • Transaction Efficiency: Batch inserts with proper transaction management
  • Retry Success: Automatic recovery from temporary lock conditions

🎯 Best Practices

Regular Maintenance

  1. Daily: Run --check-gaps 1 to monitor recent data quality
  2. Weekly: Run --fill-gaps 7 to catch any missed data
  3. Monthly: Run --update-data 7 to ensure data accuracy

Production Deployment

  1. Automated Scheduling: Use cron or systemd timers for regular gap checks (a sample crontab follows this list)
  2. Monitoring: Set up alerts for excessive missing data
  3. Backup: Regular database backups before major gap-filling operations
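
For item 1, a crontab matching the maintenance schedule above could look like this (the install path is an assumption):

# m h dom mon dow  command
0 6 * * *  cd /opt/ping-river-monitor && python water_scraper_v3.py --check-gaps 1
0 7 * * 0  cd /opt/ping-river-monitor && python water_scraper_v3.py --fill-gaps 7
0 8 1 * *  cd /opt/ping-river-monitor && python water_scraper_v3.py --update-data 7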

Data Quality Assurance

  1. Validation: Check for reasonable value ranges after gap filling
  2. Comparison: Compare filled data with nearby timestamps for consistency
  3. Documentation: Log all gap-filling activities for audit trails
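
For step 1, even a coarse range check helps flag implausible filled values (the limits here are placeholders to tune per station):

def plausible(row):
    return 0.0 <= row["water_level"] <= 30.0 and row["discharge"] >= 0.0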

This enhanced gap-filling system ensures comprehensive and accurate water level monitoring with minimal data loss and automatic recovery capabilities.