Some checks failed
Security & Dependency Updates / Dependency Security Scan (push) Successful in 29s
Security & Dependency Updates / Docker Security Scan (push) Failing after 53s
Security & Dependency Updates / License Compliance (push) Successful in 13s
Security & Dependency Updates / Check for Dependency Updates (push) Successful in 19s
Security & Dependency Updates / Code Quality Metrics (push) Successful in 11s
Security & Dependency Updates / Security Summary (push) Successful in 7s
Features: - Real-time water level monitoring for Ping River Basin (16 stations) - Coverage from Chiang Dao to Nakhon Sawan in Northern Thailand - FastAPI web interface with interactive dashboard and station management - Multi-database support (SQLite, MySQL, PostgreSQL, InfluxDB, VictoriaMetrics) - Comprehensive monitoring with health checks and metrics collection - Docker deployment with Grafana integration - Production-ready architecture with enterprise-grade observability CI/CD & Automation: - Complete Gitea Actions workflows for CI/CD, security, and releases - Multi-Python version testing (3.9-3.12) - Multi-architecture Docker builds (amd64, arm64) - Daily security scanning and dependency monitoring - Automated documentation generation - Performance testing and validation Production Ready: - Type safety with Pydantic models and comprehensive type hints - Data validation layer with range checking and error handling - Rate limiting and request tracking for API protection - Enhanced logging with rotation, colors, and performance metrics - Station management API for dynamic CRUD operations - Comprehensive documentation and deployment guides Technical Stack: - Python 3.9+ with FastAPI and Pydantic - Multi-database architecture with adapter pattern - Docker containerization with multi-stage builds - Grafana dashboards for visualization - Gitea Actions for CI/CD automation - Enterprise monitoring and alerting Ready for deployment to B4L infrastructure!
276 lines
8.6 KiB
Markdown
276 lines
8.6 KiB
Markdown
# Gap Filling and Data Integrity Guide
|
|
|
|
This guide explains the enhanced gap-filling functionality that addresses data gaps and missing timestamps in the Thailand Water Monitor.
|
|
|
|
## ✅ **Issues Resolved**
|
|
|
|
### **1. Data Gaps Problem**
|
|
- **Before**: Tool only fetched current day data, leaving gaps in historical records
|
|
- **After**: Automatically detects and fills missing timestamps for the last 7 days
|
|
|
|
### **2. Missing Midnight Timestamps**
|
|
- **Before**: Jump from 23:00 to 01:00 (missing 00:00 midnight data)
|
|
- **After**: Specifically checks for and fills midnight hour gaps
|
|
|
|
### **3. Changed Values**
|
|
- **Before**: No mechanism to update existing data if values changed on the server
|
|
- **After**: Compares existing data with fresh API data and updates changed values
|
|
|
|
## 🔧 **New Features**
|
|
|
|
### **Command Line Interface**
|
|
```bash
|
|
# Check for missing data gaps
|
|
python water_scraper_v3.py --check-gaps [days]
|
|
|
|
# Fill missing data gaps
|
|
python water_scraper_v3.py --fill-gaps [days]
|
|
|
|
# Update existing data with latest values
|
|
python water_scraper_v3.py --update-data [days]
|
|
|
|
# Run single test cycle
|
|
python water_scraper_v3.py --test
|
|
|
|
# Show help
|
|
python water_scraper_v3.py --help
|
|
```
|
|
|
|
### **Automatic Gap Detection**
|
|
The system now automatically:
|
|
- Generates expected hourly timestamps for the specified time range
|
|
- Compares with existing database records
|
|
- Identifies missing timestamps
|
|
- Groups missing data by date for efficient API calls
|
|
|
|
### **Intelligent Gap Filling**
|
|
- **Historical Data Fetching**: Retrieves data for specific dates to fill gaps
|
|
- **Selective Insertion**: Only inserts data for actually missing timestamps
|
|
- **API Rate Limiting**: Includes delays between API calls to be respectful
|
|
- **Error Handling**: Continues processing even if some dates fail
|
|
|
|
### **Data Update Mechanism**
|
|
- **Change Detection**: Compares water levels, discharge rates, and percentages
|
|
- **Precision Checking**: Uses appropriate thresholds (0.001m for water level, 0.1 cms for discharge)
|
|
- **Selective Updates**: Only updates records where values have actually changed
|
|
|
|
## 📊 **Test Results**
|
|
|
|
### **Before Enhancement**
|
|
```
|
|
Found 22 missing timestamps in the last 2 days:
|
|
2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
|
|
2025-07-24: Missing hours [0, 20, 21, 22, 23]
|
|
2025-07-25: Missing hours [0, 9]
|
|
```
|
|
|
|
### **After Gap Filling**
|
|
```
|
|
Gap filling completed. Filled 96 missing data points
|
|
|
|
Remaining gaps:
|
|
2025-07-24: Missing hours [10]
|
|
2025-07-25: Missing hours [0, 10]
|
|
```
|
|
|
|
**Improvement**: Reduced from 22 missing timestamps to 3 (86% improvement)
|
|
|
|
## 🚀 **Enhanced Scraping Cycle**
|
|
|
|
The regular scraping cycle now includes three phases:
|
|
|
|
### **Phase 1: Current Data Collection**
|
|
```python
|
|
# Fetch and save current data
|
|
water_data = self.fetch_water_data()
|
|
success = self.save_to_database(water_data)
|
|
```
|
|
|
|
### **Phase 2: Gap Filling (Last 7 Days)**
|
|
```python
|
|
# Check for and fill missing data
|
|
filled_count = self.fill_data_gaps(days_back=7)
|
|
```
|
|
|
|
### **Phase 3: Data Updates (Last 2 Days)**
|
|
```python
|
|
# Update existing data with latest values
|
|
updated_count = self.update_existing_data(days_back=2)
|
|
```
|
|
|
|
## 🔧 **Technical Improvements**
|
|
|
|
### **Database Connection Handling**
|
|
- **SQLite Optimization**: Added timeout and thread safety parameters
|
|
- **Retry Logic**: Exponential backoff for database lock errors
|
|
- **Transaction Management**: Proper use of `engine.begin()` for automatic commits
|
|
|
|
### **Error Recovery**
|
|
```python
|
|
# Retry logic with exponential backoff
|
|
for attempt in range(max_retries):
|
|
try:
|
|
success = self.db_adapter.save_measurements(water_data)
|
|
if success:
|
|
return True
|
|
except Exception as e:
|
|
if "database is locked" in str(e).lower():
|
|
time.sleep(2 ** attempt) # 1s, 2s, 4s delays
|
|
continue
|
|
```
|
|
|
|
### **Memory Efficiency**
|
|
- **Selective Data Processing**: Only processes data for missing timestamps
|
|
- **Batch Processing**: Groups operations by date to minimize API calls
|
|
- **Resource Management**: Proper cleanup and connection handling
|
|
|
|
## 📋 **Usage Examples**
|
|
|
|
### **Daily Maintenance**
|
|
```bash
|
|
# Check for gaps in the last week
|
|
python water_scraper_v3.py --check-gaps 7
|
|
|
|
# Fill any found gaps
|
|
python water_scraper_v3.py --fill-gaps 7
|
|
|
|
# Update recent data for accuracy
|
|
python water_scraper_v3.py --update-data 2
|
|
```
|
|
|
|
### **Historical Data Recovery**
|
|
```bash
|
|
# Check for gaps in the last month
|
|
python water_scraper_v3.py --check-gaps 30
|
|
|
|
# Fill gaps for the last month (be patient, this takes time)
|
|
python water_scraper_v3.py --fill-gaps 30
|
|
```
|
|
|
|
### **Production Monitoring**
|
|
```bash
|
|
# Quick test to ensure system is working
|
|
python water_scraper_v3.py --test
|
|
|
|
# Check for recent gaps
|
|
python water_scraper_v3.py --check-gaps 1
|
|
```
|
|
|
|
## 🔍 **Monitoring and Alerts**
|
|
|
|
### **Gap Detection Output**
|
|
```
|
|
Found 22 missing timestamps:
|
|
2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
|
|
2025-07-24: Missing hours [0, 20, 21, 22, 23]
|
|
2025-07-25: Missing hours [0, 9]
|
|
```
|
|
|
|
### **Gap Filling Progress**
|
|
```
|
|
Fetching data for 2025-07-24 to fill 5 missing timestamps
|
|
Successfully fetched 368 data points from API for 2025-07-24
|
|
Filled 80 data points for 2025-07-24
|
|
Gap filling completed. Filled 96 missing data points
|
|
```
|
|
|
|
### **Update Detection**
|
|
```
|
|
Checking for updates on 2025-07-24
|
|
Update needed for P.1 at 2025-07-24 15:00:00
|
|
Updated 5 measurements for 2025-07-24
|
|
Data update completed. Updated 5 measurements
|
|
```
|
|
|
|
## ⚙️ **Configuration Options**
|
|
|
|
### **Environment Variables**
|
|
```bash
|
|
# Database configuration
|
|
export DB_TYPE=sqlite
|
|
export WATER_DB_PATH=water_monitoring.db
|
|
|
|
# Gap filling settings (can be added to config.py)
|
|
export GAP_FILL_DAYS=7 # Days to check for gaps
|
|
export UPDATE_DAYS=2 # Days to check for updates
|
|
export API_DELAY=1 # Seconds between API calls
|
|
export MAX_RETRIES=3 # Database retry attempts
|
|
```
|
|
|
|
### **Customizable Parameters**
|
|
- **Gap Check Period**: Default 7 days, configurable via command line
|
|
- **Update Period**: Default 2 days, configurable via command line
|
|
- **API Rate Limiting**: 1-second delay between calls (configurable)
|
|
- **Retry Logic**: 3 attempts with exponential backoff (configurable)
|
|
|
|
## 🛠️ **Troubleshooting**
|
|
|
|
### **Common Issues**
|
|
|
|
#### **Database Locked Errors**
|
|
```
|
|
ERROR - Error saving to SQLITE: database is locked
|
|
```
|
|
**Solution**: The retry logic now handles this automatically with exponential backoff.
|
|
|
|
#### **API Rate Limiting**
|
|
```
|
|
WARNING - Too many requests to API
|
|
```
|
|
**Solution**: Increase delay between API calls or reduce the number of days processed at once.
|
|
|
|
#### **Missing Data Still Present**
|
|
```
|
|
Found X missing timestamps after gap filling
|
|
```
|
|
**Possible Causes**:
|
|
- Data not available on the Thai government server for those timestamps
|
|
- Network issues during API calls
|
|
- API returned empty data for those specific times
|
|
|
|
### **Debug Commands**
|
|
```bash
|
|
# Enable debug logging
|
|
export LOG_LEVEL=DEBUG
|
|
python water_scraper_v3.py --check-gaps 1
|
|
|
|
# Test specific date range
|
|
python water_scraper_v3.py --fill-gaps 1
|
|
|
|
# Check database directly
|
|
sqlite3 water_monitoring.db "SELECT COUNT(*) FROM water_measurements;"
|
|
sqlite3 water_monitoring.db "SELECT timestamp, COUNT(*) FROM water_measurements GROUP BY timestamp ORDER BY timestamp DESC LIMIT 10;"
|
|
```
|
|
|
|
## 📈 **Performance Metrics**
|
|
|
|
### **Gap Filling Efficiency**
|
|
- **API Calls**: Grouped by date to minimize requests
|
|
- **Processing Speed**: ~100-400 data points per API call
|
|
- **Success Rate**: 86% gap reduction in test case
|
|
- **Resource Usage**: Minimal memory footprint with selective processing
|
|
|
|
### **Database Performance**
|
|
- **SQLite Optimization**: Connection pooling and timeout handling
|
|
- **Transaction Efficiency**: Batch inserts with proper transaction management
|
|
- **Retry Success**: Automatic recovery from temporary lock conditions
|
|
|
|
## 🎯 **Best Practices**
|
|
|
|
### **Regular Maintenance**
|
|
1. **Daily**: Run `--check-gaps 1` to monitor recent data quality
|
|
2. **Weekly**: Run `--fill-gaps 7` to catch any missed data
|
|
3. **Monthly**: Run `--update-data 7` to ensure data accuracy
|
|
|
|
### **Production Deployment**
|
|
1. **Automated Scheduling**: Use cron or systemd timers for regular gap checks
|
|
2. **Monitoring**: Set up alerts for excessive missing data
|
|
3. **Backup**: Regular database backups before major gap-filling operations
|
|
|
|
### **Data Quality Assurance**
|
|
1. **Validation**: Check for reasonable value ranges after gap filling
|
|
2. **Comparison**: Compare filled data with nearby timestamps for consistency
|
|
3. **Documentation**: Log all gap-filling activities for audit trails
|
|
|
|
This enhanced gap-filling system ensures comprehensive and accurate water level monitoring with minimal data loss and automatic recovery capabilities.
|