Initial commit: Northern Thailand Ping River Monitor v3.1.0
Some checks failed
Security & Dependency Updates / Dependency Security Scan (push) Successful in 29s
Security & Dependency Updates / Docker Security Scan (push) Failing after 53s
Security & Dependency Updates / License Compliance (push) Successful in 13s
Security & Dependency Updates / Check for Dependency Updates (push) Successful in 19s
Security & Dependency Updates / Code Quality Metrics (push) Successful in 11s
Security & Dependency Updates / Security Summary (push) Successful in 7s
Features:
- Real-time water level monitoring for Ping River Basin (16 stations)
- Coverage from Chiang Dao to Nakhon Sawan in Northern Thailand
- FastAPI web interface with interactive dashboard and station management
- Multi-database support (SQLite, MySQL, PostgreSQL, InfluxDB, VictoriaMetrics)
- Comprehensive monitoring with health checks and metrics collection
- Docker deployment with Grafana integration
- Production-ready architecture with enterprise-grade observability

CI/CD & Automation:
- Complete Gitea Actions workflows for CI/CD, security, and releases
- Multi-Python version testing (3.9-3.12)
- Multi-architecture Docker builds (amd64, arm64)
- Daily security scanning and dependency monitoring
- Automated documentation generation
- Performance testing and validation

Production Ready:
- Type safety with Pydantic models and comprehensive type hints
- Data validation layer with range checking and error handling
- Rate limiting and request tracking for API protection
- Enhanced logging with rotation, colors, and performance metrics
- Station management API for dynamic CRUD operations
- Comprehensive documentation and deployment guides

Technical Stack:
- Python 3.9+ with FastAPI and Pydantic
- Multi-database architecture with adapter pattern
- Docker containerization with multi-stage builds
- Grafana dashboards for visualization
- Gitea Actions for CI/CD automation
- Enterprise monitoring and alerting

Ready for deployment to B4L infrastructure!
275
docs/GAP_FILLING_GUIDE.md
Normal file
@@ -0,0 +1,275 @@
# Gap Filling and Data Integrity Guide

This guide explains the enhanced gap-filling functionality that addresses data gaps and missing timestamps in the Thailand Water Monitor.

## ✅ **Issues Resolved**

### **1. Data Gaps Problem**
- **Before**: The tool fetched only the current day's data, leaving gaps in historical records
- **After**: Automatically detects and fills missing timestamps for the last 7 days

### **2. Missing Midnight Timestamps**
- **Before**: Records jumped from 23:00 to 01:00, skipping the 00:00 (midnight) reading
- **After**: Specifically checks for and fills midnight-hour gaps

### **3. Changed Values**
- **Before**: No mechanism to update existing data if values changed on the server
- **After**: Compares existing data with fresh API data and updates changed values

## 🔧 **New Features**

### **Command Line Interface**
```bash
# Check for missing data gaps
python water_scraper_v3.py --check-gaps [days]

# Fill missing data gaps
python water_scraper_v3.py --fill-gaps [days]

# Update existing data with latest values
python water_scraper_v3.py --update-data [days]

# Run single test cycle
python water_scraper_v3.py --test

# Show help
python water_scraper_v3.py --help
```

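
The optional `[days]` value on each flag can be modeled with `argparse` and `nargs="?"`. The sketch below is an assumption about how such a parser might look, not the actual parser in `water_scraper_v3.py`; the `build_parser` name is hypothetical, and the fallback values reuse the 7-day and 2-day defaults mentioned elsewhere in this guide.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed above."""
    parser = argparse.ArgumentParser(prog="water_scraper_v3.py")
    # nargs="?" makes the value optional: "--check-gaps" alone uses `const`,
    # "--check-gaps 30" uses 30, and omitting the flag leaves it as None.
    parser.add_argument("--check-gaps", nargs="?", type=int, const=7, metavar="days")
    parser.add_argument("--fill-gaps", nargs="?", type=int, const=7, metavar="days")
    parser.add_argument("--update-data", nargs="?", type=int, const=2, metavar="days")
    parser.add_argument("--test", action="store_true", help="run a single test cycle")
    return parser


args = build_parser().parse_args(["--check-gaps", "30", "--update-data"])
print(args.check_gaps, args.fill_gaps, args.update_data, args.test)  # 30 None 2 False
```

Leaving unused flags as `None` lets the caller tell "flag not given" apart from "flag given without a value".
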
### **Automatic Gap Detection**
The system now automatically:
- Generates expected hourly timestamps for the specified time range
- Compares with existing database records
- Identifies missing timestamps
- Groups missing data by date for efficient API calls

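
The four detection steps above can be sketched in a few lines of Python. This is a minimal illustration under the assumption that timestamps are naive `datetime` values aligned to the hour; `find_missing_hours` is a hypothetical helper, not a function from the scraper.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_missing_hours(existing, start, end):
    """Return {date: [missing hours]} for the hourly range [start, end]."""
    existing = set(existing)
    missing = defaultdict(list)
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        if current not in existing:
            # Group by calendar date so each date needs only one API call.
            missing[current.date()].append(current.hour)
        current += timedelta(hours=1)
    return dict(missing)


# Records exist for 22:00, 23:00 and 01:00, but the midnight reading is missing.
have = [datetime(2025, 7, 23, 22), datetime(2025, 7, 23, 23), datetime(2025, 7, 24, 1)]
gaps = find_missing_hours(have, datetime(2025, 7, 23, 22), datetime(2025, 7, 24, 1))
print(gaps)  # {datetime.date(2025, 7, 24): [0]}
```
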
### **Intelligent Gap Filling**
- **Historical Data Fetching**: Retrieves data for specific dates to fill gaps
- **Selective Insertion**: Only inserts data for timestamps that are actually missing
- **API Rate Limiting**: Pauses between API calls to avoid overloading the upstream server
- **Error Handling**: Continues processing even if some dates fail

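
A rough sketch of how these four behaviors fit together in one loop is shown below. The `fetch_day` and `insert_rows` callables and the row layout (a dict with a `timestamp` key) are assumptions made for illustration; the real scraper's interfaces may differ.

```python
import time
from datetime import date, datetime


def fill_gaps(missing_by_date, fetch_day, insert_rows, api_delay=1.0):
    """Fill gaps date by date; `fetch_day` and `insert_rows` are hypothetical callables."""
    filled = 0
    for day, hours in sorted(missing_by_date.items()):
        try:
            rows = fetch_day(day)  # one API call per date
        except Exception as exc:
            print(f"Skipping {day}: {exc}")  # keep going if one date fails
            continue
        # Selective insertion: keep only rows whose hour is actually missing.
        wanted = [row for row in rows if row["timestamp"].hour in hours]
        insert_rows(wanted)
        filled += len(wanted)
        time.sleep(api_delay)  # pause before the next API call
    return filled


def fake_fetch(day):
    """Stand-in for the API: fails for one date, returns 24 hourly rows otherwise."""
    if day == date(2025, 7, 23):
        raise RuntimeError("API unavailable")
    return [{"timestamp": datetime(day.year, day.month, day.day, h)} for h in range(24)]


stored = []
gaps = {date(2025, 7, 23): [9], date(2025, 7, 24): [0, 20]}
print(fill_gaps(gaps, fake_fetch, stored.extend, api_delay=0))  # 2
```
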
### **Data Update Mechanism**
- **Change Detection**: Compares water levels, discharge rates, and percentages
- **Precision Checking**: Uses appropriate thresholds (0.001 m for water level, 0.1 cms for discharge)
- **Selective Updates**: Only updates records where values have actually changed

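
The threshold comparison might look like the sketch below. The field names (`water_level`, `discharge`, `discharge_percent`) and the percentage threshold are assumptions, since this guide only states the water-level and discharge tolerances.

```python
WATER_LEVEL_TOL = 0.001  # metres, as stated in this guide
DISCHARGE_TOL = 0.1      # cms, as stated in this guide
PERCENT_TOL = 0.1        # assumption: the guide gives no threshold for percentages

CHECKS = (
    ("water_level", WATER_LEVEL_TOL),
    ("discharge", DISCHARGE_TOL),
    ("discharge_percent", PERCENT_TOL),
)


def has_changed(old, new):
    """Return True if any tracked field differs by more than its tolerance."""
    for field, tol in CHECKS:
        a, b = old.get(field), new.get(field)
        if a is None or b is None:
            if a is not b:  # a value appeared or disappeared
                return True
            continue
        if abs(a - b) > tol:
            return True
    return False


stored = {"water_level": 1.25, "discharge": 10.0, "discharge_percent": 40.0}
fresh = {"water_level": 1.26, "discharge": 10.0, "discharge_percent": 40.0}
print(has_changed(stored, fresh))   # True
print(has_changed(stored, stored))  # False
```

Comparing against a tolerance rather than with `==` avoids rewriting rows whose values differ only by floating-point noise.
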
## 📊 **Test Results**

### **Before Enhancement**
```
Found 22 missing timestamps in the last 2 days:
2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
2025-07-24: Missing hours [0, 20, 21, 22, 23]
2025-07-25: Missing hours [0, 9]
```

### **After Gap Filling**
```
Gap filling completed. Filled 96 missing data points

Remaining gaps:
2025-07-24: Missing hours [10]
2025-07-25: Missing hours [0, 10]
```

**Improvement**: Reduced from 22 missing timestamps to 3 (86% improvement)

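
For reference, the 86% figure follows directly from the two counts in the test output, as this short check shows:

```python
before, after = 22, 3  # missing timestamps before and after gap filling
reduction = (before - after) / before
print(f"{reduction:.0%}")  # 86%
```
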
## 🚀 **Enhanced Scraping Cycle**

The regular scraping cycle now includes three phases:

### **Phase 1: Current Data Collection**
```python
# Fetch and save current data
water_data = self.fetch_water_data()
success = self.save_to_database(water_data)
```

### **Phase 2: Gap Filling (Last 7 Days)**
```python
# Check for and fill missing data
filled_count = self.fill_data_gaps(days_back=7)
```

### **Phase 3: Data Updates (Last 2 Days)**
```python
# Update existing data with latest values
updated_count = self.update_existing_data(days_back=2)
```

## 🔧 **Technical Improvements**

### **Database Connection Handling**
- **SQLite Optimization**: Added timeout and thread safety parameters
- **Retry Logic**: Exponential backoff for database lock errors
- **Transaction Management**: Proper use of `engine.begin()` for automatic commits

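
To illustrate what the timeout and thread-safety parameters do, here is a small sketch using Python's standard-library `sqlite3` module rather than the project's own database adapter. The 30-second timeout and the column layout are assumptions; only the `water_measurements` table name appears in this guide.

```python
import sqlite3

# timeout: seconds to wait for a lock before raising "database is locked".
# check_same_thread=False: allow the connection to be used from other threads
# (the caller is then responsible for serialising access).
conn = sqlite3.connect(":memory:", timeout=30, check_same_thread=False)

conn.execute(
    "CREATE TABLE water_measurements (station TEXT, timestamp TEXT, water_level REAL)"
)

# Used as a context manager, the connection commits on success and rolls back
# on an exception, similar in spirit to SQLAlchemy's `engine.begin()`.
with conn:
    conn.execute(
        "INSERT INTO water_measurements VALUES (?, ?, ?)",
        ("P.1", "2025-07-24 15:00:00", 1.25),
    )

print(conn.execute("SELECT COUNT(*) FROM water_measurements").fetchone()[0])  # 1
```
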
### **Error Recovery**
```python
# Retry logic with exponential backoff
for attempt in range(max_retries):
    try:
        success = self.db_adapter.save_measurements(water_data)
        if success:
            return True
    except Exception as e:
        if "database is locked" in str(e).lower():
            time.sleep(2 ** attempt)  # 1s, 2s, 4s delays
            continue
```

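
The fragment above is taken from inside a method. A self-contained variant that can be run and tested on its own might look like the following; `save_with_retry` and its parameters are hypothetical, and unlike the fragment it re-raises errors that are not lock errors, which is a design choice of this sketch rather than documented behavior.

```python
import time


def save_with_retry(save, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `save()` until it succeeds, backing off on SQLite lock errors."""
    for attempt in range(max_retries):
        try:
            if save():
                return True
        except Exception as exc:
            if "database is locked" not in str(exc).lower():
                raise  # unrelated errors are not retried in this sketch
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    return False


calls = {"n": 0}


def flaky_save():
    """Simulated save that hits a lock twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("database is locked")
    return True


delays = []  # record the requested delays instead of actually sleeping
print(save_with_retry(flaky_save, sleep=delays.append), delays)  # True [1.0, 2.0]
```

Injecting `sleep` keeps the example fast and makes the backoff schedule easy to verify.
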
### **Memory Efficiency**
- **Selective Data Processing**: Only processes data for missing timestamps
- **Batch Processing**: Groups operations by date to minimize API calls
- **Resource Management**: Proper cleanup and connection handling

## 📋 **Usage Examples**

### **Daily Maintenance**
```bash
# Check for gaps in the last week
python water_scraper_v3.py --check-gaps 7

# Fill any found gaps
python water_scraper_v3.py --fill-gaps 7

# Update recent data for accuracy
python water_scraper_v3.py --update-data 2
```

### **Historical Data Recovery**
```bash
# Check for gaps in the last month
python water_scraper_v3.py --check-gaps 30

# Fill gaps for the last month (be patient, this takes time)
python water_scraper_v3.py --fill-gaps 30
```

### **Production Monitoring**
```bash
# Quick test to ensure system is working
python water_scraper_v3.py --test

# Check for recent gaps
python water_scraper_v3.py --check-gaps 1
```

## 🔍 **Monitoring and Alerts**

### **Gap Detection Output**
```
Found 22 missing timestamps:
2025-07-23: Missing hours [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
2025-07-24: Missing hours [0, 20, 21, 22, 23]
2025-07-25: Missing hours [0, 9]
```

### **Gap Filling Progress**
```
Fetching data for 2025-07-24 to fill 5 missing timestamps
Successfully fetched 368 data points from API for 2025-07-24
Filled 80 data points for 2025-07-24
Gap filling completed. Filled 96 missing data points
```

### **Update Detection**
```
Checking for updates on 2025-07-24
Update needed for P.1 at 2025-07-24 15:00:00
Updated 5 measurements for 2025-07-24
Data update completed. Updated 5 measurements
```

## ⚙️ **Configuration Options**

### **Environment Variables**
```bash
# Database configuration
export DB_TYPE=sqlite
export WATER_DB_PATH=water_monitoring.db

# Gap filling settings (can be added to config.py)
export GAP_FILL_DAYS=7    # Days to check for gaps
export UPDATE_DAYS=2      # Days to check for updates
export API_DELAY=1        # Seconds between API calls
export MAX_RETRIES=3      # Database retry attempts
```

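
Since the gap-filling variables are described as something that "can be added to config.py", the sketch below shows one way they could be read with defaults. The `GapFillSettings` class and `load_settings` function are hypothetical and are not part of the existing `config.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GapFillSettings:
    gap_fill_days: int
    update_days: int
    api_delay: float
    max_retries: int


def load_settings(env=os.environ):
    """Read gap-filling settings, falling back to the defaults listed above."""
    return GapFillSettings(
        gap_fill_days=int(env.get("GAP_FILL_DAYS", "7")),
        update_days=int(env.get("UPDATE_DAYS", "2")),
        api_delay=float(env.get("API_DELAY", "1")),
        max_retries=int(env.get("MAX_RETRIES", "3")),
    )


# Passing a plain dict instead of os.environ makes the behavior easy to check.
print(load_settings({"API_DELAY": "2.5"}))
# GapFillSettings(gap_fill_days=7, update_days=2, api_delay=2.5, max_retries=3)
```
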
### **Customizable Parameters**
- **Gap Check Period**: Default 7 days, configurable via command line
- **Update Period**: Default 2 days, configurable via command line
- **API Rate Limiting**: 1-second delay between calls (configurable)
- **Retry Logic**: 3 attempts with exponential backoff (configurable)

## 🛠️ **Troubleshooting**

### **Common Issues**

#### **Database Locked Errors**
```
ERROR - Error saving to SQLITE: database is locked
```
**Solution**: The retry logic now handles this automatically with exponential backoff.

#### **API Rate Limiting**
```
WARNING - Too many requests to API
```
**Solution**: Increase delay between API calls or reduce the number of days processed at once.

#### **Missing Data Still Present**
```
Found X missing timestamps after gap filling
```
**Possible Causes**:
- Data not available on the Thai government server for those timestamps
- Network issues during API calls
- API returned empty data for those specific times

### **Debug Commands**
```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
python water_scraper_v3.py --check-gaps 1

# Fill gaps for the most recent day only
python water_scraper_v3.py --fill-gaps 1

# Check database directly
sqlite3 water_monitoring.db "SELECT COUNT(*) FROM water_measurements;"
sqlite3 water_monitoring.db "SELECT timestamp, COUNT(*) FROM water_measurements GROUP BY timestamp ORDER BY timestamp DESC LIMIT 10;"
```

## 📈 **Performance Metrics**

### **Gap Filling Efficiency**
- **API Calls**: Grouped by date to minimize requests
- **Processing Speed**: ~100-400 data points per API call
- **Success Rate**: 86% gap reduction in test case
- **Resource Usage**: Minimal memory footprint with selective processing

### **Database Performance**
- **SQLite Optimization**: Connection pooling and timeout handling
- **Transaction Efficiency**: Batch inserts with proper transaction management
- **Retry Success**: Automatic recovery from temporary lock conditions

## 🎯 **Best Practices**

### **Regular Maintenance**
1. **Daily**: Run `--check-gaps 1` to monitor recent data quality
2. **Weekly**: Run `--fill-gaps 7` to catch any missed data
3. **Monthly**: Run `--update-data 7` to ensure data accuracy

### **Production Deployment**
1. **Automated Scheduling**: Use cron or systemd timers for regular gap checks
2. **Monitoring**: Set up alerts for excessive missing data
3. **Backup**: Regular database backups before major gap-filling operations

### **Data Quality Assurance**
1. **Validation**: Check for reasonable value ranges after gap filling
2. **Comparison**: Compare filled data with nearby timestamps for consistency
3. **Documentation**: Log all gap-filling activities for audit trails

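
The validation and comparison checks can be sketched as two small helpers. The function names and the limits used here (0 to 5 m, a maximum step of 0.5 m between consecutive hours) are illustrative assumptions, not values taken from the project.

```python
def out_of_range(values, low, high):
    """Indexes of readings outside the plausible range [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def sudden_jumps(values, max_step):
    """Indexes of readings that differ from the previous one by more than max_step."""
    return [i for i in range(1, len(values)) if abs(values[i] - values[i - 1]) > max_step]


# Hourly water levels in metres; the third value is an obvious outlier.
levels = [1.20, 1.22, 9.80, 1.25]
print(out_of_range(levels, 0.0, 5.0))  # [2]
print(sudden_jumps(levels, 0.5))       # [2, 3]
```

Note that a single outlier shows up twice in the jump check, once on the way up and once on the way back down.
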
This enhanced gap-filling system ensures comprehensive and accurate water level monitoring with minimal data loss and automatic recovery capabilities.