# Gap Filling and Data Integrity Guide

This guide explains the enhanced gap-filling functionality that addresses data gaps and missing timestamps in the Thailand Water Monitor.

## ✅ **Issues Resolved**

### **1. Data Gaps Problem**

- **Before**: Tool only fetched current day data, leaving gaps in historical records
- **After**: Automatically detects and fills missing timestamps for the last 7 days

### **2. Missing Midnight Timestamps**

- **Before**: Jump from 23:00 to 01:00 (missing 00:00 midnight data)
- **After**: Specifically checks for and fills midnight hour gaps

### **3. Changed Values**

- **Before**: No mechanism to update existing data if values changed on the server
- **After**: Compares existing data with fresh API data and updates changed values

## 🔧 **New Features**

### **Command Line Interface**

```bash
# Check for missing data gaps
python water_scraper_v3.py --check-gaps [days]

# Fill missing data gaps
python water_scraper_v3.py --fill-gaps [days]

# Update existing data with latest values
python water_scraper_v3.py --update-data [days]

# Run single test cycle
python water_scraper_v3.py --test

# Show help
python water_scraper_v3.py --help
```

### **Automatic Gap Detection**

The system now automatically:

- Generates expected hourly timestamps for the specified time range
- Compares with existing database records
- Identifies missing timestamps
- Groups missing data by date for efficient API calls

### **Intelligent Gap Filling**

- **Historical Data Fetching**: Retrieves data for specific dates to fill gaps
- **Selective Insertion**: Only inserts data for actually missing timestamps
- **API Rate Limiting**: Includes delays between API calls to be respectful
- **Error Handling**: Continues processing even if some dates fail

### **Data Update Mechanism**

- **Change Detection**: Compares water levels, discharge rates, and percentages
- **Precision Checking**: Uses appropriate thresholds (0.001m for water level, 0.1 cms for discharge)
- **Selective Updates**:
Only updates records where values have actually changed

## 📊 **Test Results**

### **Before Enhancement**

```
Found 22 missing timestamps in the last 2 days:
  2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
  2025-07-24: Missing hours [0, 20, 21, 22, 23]
  2025-07-25: Missing hours [0, 9]
```

### **After Gap Filling**

```
Gap filling completed. Filled 96 missing data points
Remaining gaps:
  2025-07-24: Missing hours [10]
  2025-07-25: Missing hours [0, 10]
```

**Improvement**: Reduced from 22 missing timestamps to 3 (86% improvement)

## 🚀 **Enhanced Scraping Cycle**

The regular scraping cycle now includes three phases:

### **Phase 1: Current Data Collection**

```python
# Fetch and save current data
water_data = self.fetch_water_data()
success = self.save_to_database(water_data)
```

### **Phase 2: Gap Filling (Last 7 Days)**

```python
# Check for and fill missing data
filled_count = self.fill_data_gaps(days_back=7)
```

### **Phase 3: Data Updates (Last 2 Days)**

```python
# Update existing data with latest values
updated_count = self.update_existing_data(days_back=2)
```

## 🔧 **Technical Improvements**

### **Database Connection Handling**

- **SQLite Optimization**: Added timeout and thread safety parameters
- **Retry Logic**: Exponential backoff for database lock errors
- **Transaction Management**: Proper use of `engine.begin()` for automatic commits

### **Error Recovery**

```python
import time

# Retry logic with exponential backoff (excerpt from a method body)
for attempt in range(max_retries):
    try:
        success = self.db_adapter.save_measurements(water_data)
        if success:
            return True
    except Exception as e:
        if "database is locked" in str(e).lower():
            time.sleep(2 ** attempt)  # 1s, 2s, 4s delays
            continue
        raise  # Not a lock error: surface it instead of retrying silently
return False  # All attempts failed
```

### **Memory Efficiency**

- **Selective Data Processing**: Only processes data for missing timestamps
- **Batch Processing**: Groups operations by date to minimize API calls
- **Resource Management**: Proper cleanup and connection handling

## 📋 **Usage Examples**

### **Daily Maintenance**

```bash
# Check for gaps in the last week
python water_scraper_v3.py --check-gaps 7

# Fill any found gaps
python water_scraper_v3.py --fill-gaps 7

# Update recent data for accuracy
python water_scraper_v3.py --update-data 2
```

### **Historical Data Recovery**

```bash
# Check for gaps in the last month
python water_scraper_v3.py --check-gaps 30

# Fill gaps for the last month (be patient, this takes time)
python water_scraper_v3.py --fill-gaps 30
```

### **Production Monitoring**

```bash
# Quick test to ensure system is working
python water_scraper_v3.py --test

# Check for recent gaps
python water_scraper_v3.py --check-gaps 1
```

## 🔍 **Monitoring and Alerts**

### **Gap Detection Output**

```
Found 22 missing timestamps:
  2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
  2025-07-24: Missing hours [0, 20, 21, 22, 23]
  2025-07-25: Missing hours [0, 9]
```

### **Gap Filling Progress**

```
Fetching data for 2025-07-24 to fill 5 missing timestamps
Successfully fetched 368 data points from API for 2025-07-24
Filled 80 data points for 2025-07-24
Gap filling completed. Filled 96 missing data points
```

### **Update Detection**

```
Checking for updates on 2025-07-24
Update needed for P.1 at 2025-07-24 15:00:00
Updated 5 measurements for 2025-07-24
Data update completed. Updated 5 measurements
```

## ⚙️ **Configuration Options**

### **Environment Variables**

```bash
# Database configuration
export DB_TYPE=sqlite
export WATER_DB_PATH=water_monitoring.db

# Gap filling settings (can be added to config.py)
export GAP_FILL_DAYS=7    # Days to check for gaps
export UPDATE_DAYS=2      # Days to check for updates
export API_DELAY=1        # Seconds between API calls
export MAX_RETRIES=3      # Database retry attempts
```

### **Customizable Parameters**

- **Gap Check Period**: Default 7 days, configurable via command line
- **Update Period**: Default 2 days, configurable via command line
- **API Rate Limiting**: 1-second delay between calls (configurable)
- **Retry Logic**: 3 attempts with exponential backoff (configurable)

## 🛠️ **Troubleshooting**

### **Common Issues**

#### **Database Locked Errors**

```
ERROR - Error saving to SQLITE: database is locked
```

**Solution**: The retry logic now handles this automatically with exponential backoff.

#### **API Rate Limiting**

```
WARNING - Too many requests to API
```

**Solution**: Increase delay between API calls or reduce the number of days processed at once.
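
One way to picture this solution is a loop that sleeps between per-date requests and reads the delay from the `API_DELAY` variable listed under Environment Variables. The sketch below is illustrative only: `fetch_dates_with_delay` and the `fetch` callable are hypothetical names, not functions from `water_scraper_v3.py`, and it assumes the tool honors `API_DELAY`, which the configuration section describes only as a setting that can be added.

```python
import os
import time
from datetime import date
from typing import Callable, Dict, Iterable, List, Optional


def fetch_dates_with_delay(
    dates: Iterable[date],
    fetch: Callable[[date], List[dict]],
    delay_seconds: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[date, List[dict]]:
    """Call fetch(day) once per date, pausing between consecutive calls.

    `fetch` is a hypothetical stand-in for whatever retrieves one day's data.
    """
    if delay_seconds is None:
        # Assumed setting name, taken from the Environment Variables section.
        delay_seconds = float(os.environ.get("API_DELAY", "1"))

    results: Dict[date, List[dict]] = {}
    for index, day in enumerate(dates):
        if index > 0:
            sleep(delay_seconds)  # pause before every call except the first
        try:
            results[day] = fetch(day)
        except Exception as exc:
            # Keep going so that one failed date does not stop the others.
            print(f"Skipping {day}: {exc}")
    return results
```

Raising `API_DELAY` (for example `export API_DELAY=3`) or passing fewer dates per run would both lower the request rate in this sketch.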
#### **Missing Data Still Present**

```
Found X missing timestamps after gap filling
```

**Possible Causes**:

- Data not available on the Thai government server for those timestamps
- Network issues during API calls
- API returned empty data for those specific times

### **Debug Commands**

```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
python water_scraper_v3.py --check-gaps 1

# Test specific date range
python water_scraper_v3.py --fill-gaps 1

# Check database directly
sqlite3 water_monitoring.db "SELECT COUNT(*) FROM water_measurements;"
sqlite3 water_monitoring.db "SELECT timestamp, COUNT(*) FROM water_measurements GROUP BY timestamp ORDER BY timestamp DESC LIMIT 10;"
```

## 📈 **Performance Metrics**

### **Gap Filling Efficiency**

- **API Calls**: Grouped by date to minimize requests
- **Processing Speed**: ~100-400 data points per API call
- **Success Rate**: 86% gap reduction in test case
- **Resource Usage**: Minimal memory footprint with selective processing

### **Database Performance**

- **SQLite Optimization**: Connection pooling and timeout handling
- **Transaction Efficiency**: Batch inserts with proper transaction management
- **Retry Success**: Automatic recovery from temporary lock conditions

## 🎯 **Best Practices**

### **Regular Maintenance**

1. **Daily**: Run `--check-gaps 1` to monitor recent data quality
2. **Weekly**: Run `--fill-gaps 7` to catch any missed data
3. **Monthly**: Run `--update-data 7` to ensure data accuracy

### **Production Deployment**

1. **Automated Scheduling**: Use cron or systemd timers for regular gap checks
2. **Monitoring**: Set up alerts for excessive missing data
3. **Backup**: Regular database backups before major gap-filling operations

### **Data Quality Assurance**

1. **Validation**: Check for reasonable value ranges after gap filling
2. **Comparison**: Compare filled data with nearby timestamps for consistency
3. **Documentation**: Log all gap-filling activities for audit trails

This enhanced gap-filling system ensures comprehensive and accurate water level monitoring with minimal data loss and automatic recovery capabilities.
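
As a closing illustration, the steps listed under Automatic Gap Detection (generate expected hourly timestamps, compare with existing records, group missing ones by date) can be sketched roughly as follows. `find_missing_hours` is a hypothetical helper written for this guide, not the tool's actual implementation, and it assumes timestamps are naive `datetime` values that belong to whole-hour slots.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def find_missing_hours(
    existing: Iterable[datetime], start: datetime, end: datetime
) -> Dict[str, List[int]]:
    """Return {ISO date: [missing hours]} for hourly slots in [start, end)."""
    # Truncate existing timestamps to the hour so lookups are exact.
    present = {ts.replace(minute=0, second=0, microsecond=0) for ts in existing}

    missing: Dict[str, List[int]] = defaultdict(list)
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        if current not in present:
            # Group by date so each date can be fetched with one API call.
            missing[current.date().isoformat()].append(current.hour)
        current += timedelta(hours=1)
    return dict(missing)


# Example: a window that spans midnight, with 22:00 and 00:00 absent.
window_start = datetime(2025, 7, 24, 22)
window_end = datetime(2025, 7, 25, 2)
stored = [datetime(2025, 7, 24, 23), datetime(2025, 7, 25, 1)]
print(find_missing_hours(stored, window_start, window_end))
# {'2025-07-24': [22], '2025-07-25': [0]}
```

Because the loop steps through every hour of the window, the 00:00 slot is checked like any other, which is how a midnight gap such as the one in the example shows up under the following date.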