Gap Filling and Data Integrity Guide
This guide explains the enhanced gap-filling functionality that addresses data gaps and missing timestamps in the Thailand Water Monitor.
✅ Issues Resolved
1. Data Gaps Problem
- Before: The tool only fetched current-day data, leaving gaps in historical records
- After: Automatically detects and fills missing timestamps for the last 7 days
2. Missing Midnight Timestamps
- Before: Jump from 23:00 to 01:00 (missing 00:00 midnight data)
- After: Specifically checks for and fills midnight hour gaps
3. Changed Values
- Before: No mechanism to update existing data if values changed on the server
- After: Compares existing data with fresh API data and updates changed values
🔧 New Features
Command Line Interface
# Check for missing data gaps
python water_scraper_v3.py --check-gaps [days]
# Fill missing data gaps
python water_scraper_v3.py --fill-gaps [days]
# Update existing data with latest values
python water_scraper_v3.py --update-data [days]
# Run single test cycle
python water_scraper_v3.py --test
# Show help
python water_scraper_v3.py --help
Automatic Gap Detection
The system now automatically:
- Generates expected hourly timestamps for the specified time range
- Compares with existing database records
- Identifies missing timestamps
- Groups missing data by date for efficient API calls
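A minimal sketch of the detection step, assuming a helper that receives the set of timestamps already stored in the database (the function name and signature are illustrative, not the actual code in water_scraper_v3.py):

```python
from datetime import datetime, timedelta

def find_missing_timestamps(existing_timestamps, days_back=7):
    """Return expected hourly timestamps that are absent from the database."""
    now = datetime.now().replace(minute=0, second=0, microsecond=0)
    start = now - timedelta(days=days_back)
    total_hours = int((now - start).total_seconds() // 3600)
    expected = [start + timedelta(hours=h) for h in range(total_hours + 1)]
    return [ts for ts in expected if ts not in existing_timestamps]
```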
Intelligent Gap Filling
- Historical Data Fetching: Retrieves data for specific dates to fill gaps
- Selective Insertion: Only inserts data for actually missing timestamps
- API Rate Limiting: Includes delays between API calls to be respectful
- Error Handling: Continues processing even if some dates fail
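A rough sketch of how those pieces fit together, assuming fetch_water_data() accepts a date argument and each data point carries a timestamp field (both are assumptions for illustration, not the project's verified API):

```python
import logging
import time

logger = logging.getLogger(__name__)

def fill_gaps_for_dates(scraper, missing_by_date, api_delay=1.0):
    """Fetch each affected date, insert only the missing hours, and continue past failures."""
    filled = 0
    for date_str, missing_hours in missing_by_date.items():
        try:
            day_data = scraper.fetch_water_data(date=date_str)   # historical fetch (assumed signature)
            new_points = [p for p in day_data if p["timestamp"].hour in missing_hours]
            if new_points and scraper.save_to_database(new_points):
                filled += len(new_points)                        # selective insertion
        except Exception as exc:
            logger.warning("Gap fill failed for %s: %s", date_str, exc)  # keep processing other dates
        time.sleep(api_delay)                                    # rate limiting between API calls
    return filled
```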
Data Update Mechanism
- Change Detection: Compares water levels, discharge rates, and percentages
- Precision Checking: Uses appropriate thresholds (0.001 m for water level, 0.1 cms, i.e. m³/s, for discharge)
- Selective Updates: Only updates records where values have actually changed
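The comparison can be expressed roughly as follows; the field names water_level, discharge, and discharge_percent are placeholders for whatever the measurement records actually use:

```python
WATER_LEVEL_THRESHOLD = 0.001   # metres
DISCHARGE_THRESHOLD = 0.1       # cms (cubic metres per second)

def value_changed(existing, fresh):
    """Return True when the fresh API value differs beyond the documented thresholds."""
    return (
        abs(existing["water_level"] - fresh["water_level"]) > WATER_LEVEL_THRESHOLD
        or abs(existing["discharge"] - fresh["discharge"]) > DISCHARGE_THRESHOLD
        or existing.get("discharge_percent") != fresh.get("discharge_percent")
    )
```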
📊 Test Results
Before Enhancement
Found 22 missing timestamps in the last 2 days:
2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
2025-07-24: Missing hours [0, 20, 21, 22, 23]
2025-07-25: Missing hours [0, 9]
After Gap Filling
Gap filling completed. Filled 96 missing data points
Remaining gaps:
2025-07-24: Missing hours [10]
2025-07-25: Missing hours [0, 10]
Improvement: Reduced from 22 missing timestamps to 3 (86% improvement)
🚀 Enhanced Scraping Cycle
The regular scraping cycle now includes three phases:
Phase 1: Current Data Collection
# Fetch and save current data
water_data = self.fetch_water_data()
success = self.save_to_database(water_data)
Phase 2: Gap Filling (Last 7 Days)
# Check for and fill missing data
filled_count = self.fill_data_gaps(days_back=7)
Phase 3: Data Updates (Last 2 Days)
# Update existing data with latest values
updated_count = self.update_existing_data(days_back=2)
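Put together, a single cycle looks roughly like the sketch below. The method name run_scraping_cycle and the self.logger attribute are assumptions; the three called methods are the ones shown in the phases above:

```python
def run_scraping_cycle(self):
    """Illustrative outline of one full cycle combining the three phases above."""
    # Phase 1: fetch and store the current readings
    water_data = self.fetch_water_data()
    self.save_to_database(water_data)

    # Phase 2: backfill any gaps from the last 7 days
    filled_count = self.fill_data_gaps(days_back=7)

    # Phase 3: refresh values that changed on the server in the last 2 days
    updated_count = self.update_existing_data(days_back=2)

    self.logger.info("Cycle complete: %d filled, %d updated", filled_count, updated_count)
```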
🔧 Technical Improvements
Database Connection Handling
- SQLite Optimization: Added timeout and thread safety parameters
- Retry Logic: Exponential backoff for database lock errors
- Transaction Management: Proper use of engine.begin() for automatic commits
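A sketch of what that SQLite setup can look like with SQLAlchemy; the exact timeout value and test query are assumptions, not the project's verified settings:

```python
from sqlalchemy import create_engine

# Longer lock timeout and cross-thread access for SQLite (values are assumptions).
engine = create_engine(
    "sqlite:///water_monitoring.db",
    connect_args={"timeout": 30, "check_same_thread": False},
)

# engine.begin() opens a transaction and commits automatically when the block exits cleanly.
with engine.begin() as conn:
    conn.exec_driver_sql("SELECT 1")
```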
Error Recovery
# Retry logic with exponential backoff
for attempt in range(max_retries):
    try:
        success = self.db_adapter.save_measurements(water_data)
        if success:
            return True
    except Exception as e:
        if "database is locked" in str(e).lower():
            time.sleep(2 ** attempt)  # 1s, 2s, 4s delays
            continue
Memory Efficiency
- Selective Data Processing: Only processes data for missing timestamps
- Batch Processing: Groups operations by date to minimize API calls
- Resource Management: Proper cleanup and connection handling
📋 Usage Examples
Daily Maintenance
# Check for gaps in the last week
python water_scraper_v3.py --check-gaps 7
# Fill any found gaps
python water_scraper_v3.py --fill-gaps 7
# Update recent data for accuracy
python water_scraper_v3.py --update-data 2
Historical Data Recovery
# Check for gaps in the last month
python water_scraper_v3.py --check-gaps 30
# Fill gaps for the last month (be patient, this takes time)
python water_scraper_v3.py --fill-gaps 30
Production Monitoring
# Quick test to ensure system is working
python water_scraper_v3.py --test
# Check for recent gaps
python water_scraper_v3.py --check-gaps 1
🔍 Monitoring and Alerts
Gap Detection Output
Found 22 missing timestamps:
2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
2025-07-24: Missing hours [0, 20, 21, 22, 23]
2025-07-25: Missing hours [0, 9]
Gap Filling Progress
Fetching data for 2025-07-24 to fill 5 missing timestamps
Successfully fetched 368 data points from API for 2025-07-24
Filled 80 data points for 2025-07-24
Gap filling completed. Filled 96 missing data points
Update Detection
Checking for updates on 2025-07-24
Update needed for P.1 at 2025-07-24 15:00:00
Updated 5 measurements for 2025-07-24
Data update completed. Updated 5 measurements
⚙️ Configuration Options
Environment Variables
# Database configuration
export DB_TYPE=sqlite
export WATER_DB_PATH=water_monitoring.db
# Gap filling settings (can be added to config.py)
export GAP_FILL_DAYS=7 # Days to check for gaps
export UPDATE_DAYS=2 # Days to check for updates
export API_DELAY=1 # Seconds between API calls
export MAX_RETRIES=3 # Database retry attempts
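One way to wire these variables into config.py, with defaults matching the values documented above (illustrative only):

```python
# config.py sketch: optional gap-filling settings read from the environment.
import os

GAP_FILL_DAYS = int(os.getenv("GAP_FILL_DAYS", "7"))    # days to check for gaps
UPDATE_DAYS = int(os.getenv("UPDATE_DAYS", "2"))         # days to check for updates
API_DELAY = float(os.getenv("API_DELAY", "1"))           # seconds between API calls
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))         # database retry attempts
```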
Customizable Parameters
- Gap Check Period: Default 7 days, configurable via command line
- Update Period: Default 2 days, configurable via command line
- API Rate Limiting: 1-second delay between calls (configurable)
- Retry Logic: 3 attempts with exponential backoff (configurable)
🛠️ Troubleshooting
Common Issues
Database Locked Errors
ERROR - Error saving to SQLITE: database is locked
Solution: The retry logic now handles this automatically with exponential backoff.
API Rate Limiting
WARNING - Too many requests to API
Solution: Increase delay between API calls or reduce the number of days processed at once.
Missing Data Still Present
Found X missing timestamps after gap filling
Possible Causes:
- Data not available on the Thai government server for those timestamps
- Network issues during API calls
- API returned empty data for those specific times
Debug Commands
# Enable debug logging
export LOG_LEVEL=DEBUG
python water_scraper_v3.py --check-gaps 1
# Test specific date range
python water_scraper_v3.py --fill-gaps 1
# Check database directly
sqlite3 water_monitoring.db "SELECT COUNT(*) FROM water_measurements;"
sqlite3 water_monitoring.db "SELECT timestamp, COUNT(*) FROM water_measurements GROUP BY timestamp ORDER BY timestamp DESC LIMIT 10;"
📈 Performance Metrics
Gap Filling Efficiency
- API Calls: Grouped by date to minimize requests
- Processing Speed: ~100-400 data points per API call
- Success Rate: 86% gap reduction in test case
- Resource Usage: Minimal memory footprint with selective processing
Database Performance
- SQLite Optimization: Connection pooling and timeout handling
- Transaction Efficiency: Batch inserts with proper transaction management
- Retry Success: Automatic recovery from temporary lock conditions
🎯 Best Practices
Regular Maintenance
- Daily: Run --check-gaps 1 to monitor recent data quality
- Weekly: Run --fill-gaps 7 to catch any missed data
- Monthly: Run --update-data 7 to ensure data accuracy
Production Deployment
- Automated Scheduling: Use cron or systemd timers for regular gap checks
- Monitoring: Set up alerts for excessive missing data
- Backup: Regular database backups before major gap-filling operations
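For the automated scheduling mentioned above, a hypothetical crontab could look like the following; the install path, schedule, and log locations are placeholders:

```bash
# Hypothetical crontab entries; adjust the install path, schedule, and Python environment.
15 * * * * cd /opt/water-monitor && python water_scraper_v3.py --check-gaps 1 >> logs/gap_check.log 2>&1
30 2 * * 0 cd /opt/water-monitor && python water_scraper_v3.py --fill-gaps 7 >> logs/gap_fill.log 2>&1
```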
Data Quality Assurance
- Validation: Check for reasonable value ranges after gap filling
- Comparison: Compare filled data with nearby timestamps for consistency
- Documentation: Log all gap-filling activities for audit trails
This enhanced gap-filling system ensures comprehensive and accurate water level monitoring with minimal data loss and automatic recovery capabilities.