Enhanced Scheduler Guide
This guide explains the new 15-minute scheduling system that runs continuously throughout each hour to ensure comprehensive data coverage.
✅ New Scheduling Behavior
15-Minute Schedule Pattern
- Timing: Runs every 15 minutes: 1:00, 1:15, 1:30, 1:45, 2:00, 2:15, 2:30, 2:45, etc.
- Hourly Full Checks: At :00 minutes (includes gap filling and data updates)
- Quarter-Hour Quick Checks: At :15, :30, :45 minutes (data fetch only)
- Continuous Coverage: Ensures no data is missed throughout each hour
Operation Types
- Full Operations (at :00): Data fetching + gap filling + data updates
- Quick Operations (at :15, :30, :45): Data fetching only for performance
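The split between full and quick operations can be sketched as follows. This is a minimal illustration of how a quick-check flag gates the expensive steps; `fetch_current_data`, `fill_data_gaps`, and `update_existing_data` are hypothetical stand-ins for the real scraper methods.

```python
class ScraperCycle:
    """Sketch of how retry_mode gates gap filling and data updates."""

    def __init__(self):
        self.retry_mode = False   # True on :15/:30/:45 quick checks
        self.operations_run = []  # Recorded here for illustration only

    def fetch_current_data(self):
        self.operations_run.append("fetch")

    def fill_data_gaps(self, days=7):
        self.operations_run.append("gap_fill")

    def update_existing_data(self, days=2):
        self.operations_run.append("update")

    def run_scraping_cycle(self):
        self.fetch_current_data()              # Always fetch current data
        if not self.retry_mode:                # Full hourly check only
            self.fill_data_gaps(days=7)        # Last 7 days
            self.update_existing_data(days=2)  # Last 2 days
```

A full check runs all three steps; setting `retry_mode = True` before the cycle reduces it to the fetch alone.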
🔧 Technical Implementation
Scheduler States
```python
# State tracking variables
self.last_successful_update = None  # Timestamp of last successful data update
self.retry_mode = False             # Quick-check mode (skip gap filling)
self.next_hourly_check = None       # Next scheduled hourly check
```
Quarter-Hour Check Process
```python
def quarter_hour_check(self):
    """15-minute check for new data."""
    current_time = datetime.datetime.now()
    minute = current_time.minute

    # Determine if this is a full hourly check (at :00) or a quarter-hour check
    if minute == 0:
        logging.info("=== HOURLY CHECK (00:00) ===")
        self.retry_mode = False  # Full check with gap filling and updates
    else:
        logging.info(f"=== 15-MINUTE CHECK ({minute:02d}:00) ===")
        self.retry_mode = True   # Skip gap filling and updates on 15-min checks

    new_data_found = self.run_scraping_cycle()

    if new_data_found:
        self.last_successful_update = datetime.datetime.now()
        if minute == 0:
            logging.info("New data found during hourly check")
        else:
            logging.info(f"New data found during 15-minute check at :{minute:02d}")
    else:
        if minute == 0:
            logging.info("No new data found during hourly check")
        else:
            logging.info(f"No new data found during 15-minute check at :{minute:02d}")
```
Scheduler Setup
```python
def start_scheduler(self):
    """Start enhanced scheduler with 15-minute checks."""
    # Schedule checks every 15 minutes (at :00, :15, :30, :45)
    schedule.every().hour.at(":00").do(self.quarter_hour_check)
    schedule.every().hour.at(":15").do(self.quarter_hour_check)
    schedule.every().hour.at(":30").do(self.quarter_hour_check)
    schedule.every().hour.at(":45").do(self.quarter_hour_check)

    while True:
        schedule.run_pending()
        time.sleep(30)  # Poll pending jobs every 30 seconds
```
📊 New Data Detection Logic
Smart Detection Algorithm
```python
def has_new_data(self) -> bool:
    """Check if new data should be available since the last successful update."""
    # Get the most recent timestamp from the database
    latest_data = self.get_latest_data(limit=1)
    if not latest_data:
        return True  # No stored data at all — fetch everything
    latest_timestamp = latest_data[0]["timestamp"]

    # Check if we should have newer data by now
    now = datetime.datetime.now()
    expected_latest = now.replace(minute=0, second=0, microsecond=0)

    # If current time is past 5 minutes after the hour, this hour's data should exist
    if now.minute >= 5 and latest_timestamp < expected_latest:
        return True  # New data expected

    # Check if we have data for the previous hour
    previous_hour = expected_latest - datetime.timedelta(hours=1)
    if latest_timestamp < previous_hour:
        return True  # Missing recent data

    return False  # Data is up to date
```
Actual Data Verification
```python
# Compare timestamps before and after scraping
initial_timestamp = get_latest_timestamp_before_scraping()

# ... perform scraping ...

latest_timestamp = get_latest_timestamp_after_scraping()
if initial_timestamp is None or latest_timestamp > initial_timestamp:
    new_data_found = True
    self.last_successful_update = datetime.datetime.now()
```
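The before/after timestamp lookup might be implemented as a small helper like the one below. `get_latest_timestamp` is a hypothetical name; it assumes a `water_measurements` table with ISO-format `timestamp` strings, matching the sqlite3 queries used elsewhere in this guide.

```python
import datetime
import sqlite3


def get_latest_timestamp(db_path="water_monitoring.db"):
    """Return the newest measurement timestamp, or None if the table is empty."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT MAX(timestamp) FROM water_measurements"
        ).fetchone()
    if row is None or row[0] is None:
        return None
    return datetime.datetime.fromisoformat(row[0])
```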
🚀 Operational Modes
Mode 1: Full Hourly Operation (at :00)
- Schedule: Every hour at :00 minutes (1:00, 2:00, 3:00, etc.)
- Operations:
- ✅ Fetch current data
- ✅ Fill data gaps (last 7 days)
- ✅ Update existing data (last 2 days)
- Purpose: Comprehensive data collection and maintenance
Mode 2: Quick 15-Minute Checks (at :15, :30, :45)
- Schedule: Every 15 minutes at quarter-hour marks
- Operations:
- ✅ Fetch current data only
- ❌ Skip gap filling (performance optimization)
- ❌ Skip data updates (performance optimization)
- Purpose: Ensure no new data is missed between hourly checks
📋 Logging Output Examples
Successful Hourly Check (at :00)
```
2025-07-26 01:00:00,123 - INFO - === HOURLY CHECK (00:00) ===
2025-07-26 01:00:00,124 - INFO - Starting scraping cycle...
2025-07-26 01:00:01,456 - INFO - Successfully fetched 384 data points from API
2025-07-26 01:00:02,789 - INFO - New data found: 2025-07-26 01:00:00
2025-07-26 01:00:03,012 - INFO - Filled 5 data gaps
2025-07-26 01:00:04,234 - INFO - Updated 2 existing measurements
2025-07-26 01:00:04,235 - INFO - New data found during hourly check
```
15-Minute Quick Check (at :15, :30, :45)
```
2025-07-26 01:15:00,123 - INFO - === 15-MINUTE CHECK (15:00) ===
2025-07-26 01:15:00,124 - INFO - Starting scraping cycle...
2025-07-26 01:15:01,456 - INFO - Successfully fetched 299 data points from API
2025-07-26 01:15:02,789 - INFO - New data found: 2025-07-26 01:00:00
2025-07-26 01:15:02,790 - INFO - New data found during 15-minute check at :15
```
Continuous 15-Minute Pattern
```
2025-07-26 01:00:00,123 - INFO - === HOURLY CHECK (00:00) ===
2025-07-26 01:00:04,235 - INFO - New data found during hourly check
2025-07-26 01:15:00,123 - INFO - === 15-MINUTE CHECK (15:00) ===
2025-07-26 01:15:02,790 - INFO - No new data found during 15-minute check at :15
2025-07-26 01:30:00,123 - INFO - === 15-MINUTE CHECK (30:00) ===
2025-07-26 01:30:02,790 - INFO - No new data found during 15-minute check at :30
2025-07-26 01:45:00,123 - INFO - === 15-MINUTE CHECK (45:00) ===
2025-07-26 01:45:02,790 - INFO - No new data found during 15-minute check at :45
2025-07-26 02:00:00,123 - INFO - === HOURLY CHECK (00:00) ===
2025-07-26 02:00:04,235 - INFO - New data found during hourly check
```
⚙️ Configuration Options
Environment Variables
```bash
# Retry interval (default: 5 minutes)
export RETRY_INTERVAL_MINUTES=5

# Data availability buffer (default: 5 minutes after hour)
export DATA_BUFFER_MINUTES=5

# Gap filling days (default: 7 days)
export GAP_FILL_DAYS=7

# Update check days (default: 2 days)
export UPDATE_DAYS=2
```
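Reading these variables with their documented defaults might look like the sketch below. `load_scheduler_config` is a hypothetical helper; the actual project may load configuration differently.

```python
import os


def load_scheduler_config(env=None):
    """Read scheduler tuning values, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "retry_interval_minutes": int(env.get("RETRY_INTERVAL_MINUTES", "5")),
        "data_buffer_minutes": int(env.get("DATA_BUFFER_MINUTES", "5")),
        "gap_fill_days": int(env.get("GAP_FILL_DAYS", "7")),
        "update_days": int(env.get("UPDATE_DAYS", "2")),
    }
```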
Scheduler Timing
```python
# Quarter-hour checks: full check at :00, quick checks at :15/:30/:45
for mark in (":00", ":15", ":30", ":45"):
    schedule.every().hour.at(mark).do(self.quarter_hour_check)

# Poll pending jobs every 30 seconds for responsive scheduling
time.sleep(30)
```
🔍 Performance Optimizations
Retry Mode Optimizations
- Skip Gap Filling: Avoids expensive historical data fetching during retries
- Skip Data Updates: Avoids comparison operations during retries
- Focused API Calls: Only fetches current day data during retries
- Reduced Database Queries: Minimal database operations during retries
Resource Management
- API Rate Limiting: 1-second delays between API calls
- Database Connection Pooling: Efficient connection reuse
- Memory Efficiency: Selective data processing
- Error Recovery: Automatic retry with exponential backoff
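The error-recovery pattern above can be sketched as a generic wrapper. This is an illustration of exponential backoff, not the project's exact code; `fetch` stands in for the real API call.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0):
    """Retry `fetch` with exponential backoff: base_delay, 2x, 4x, ... between tries."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts — surface the error
            time.sleep(base_delay * (2 ** attempt))
```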
🛠️ Troubleshooting
Common Scenarios
Stuck in Retry Mode
```bash
# Check if API is returning data
curl -X POST https://hyd-app-db.rid.go.th/webservice/getGroupHourlyWaterLevelReportAllHL.ashx

# Check database connectivity
python water_scraper_v3.py --check-gaps 1

# Manual data fetch test
python water_scraper_v3.py --test
```
Missing Hourly Triggers
```bash
# Check system time synchronization
timedatectl status

# Verify scheduler is running
ps aux | grep water_scraper

# Check logs for scheduler activity
tail -f water_monitor.log | grep "HOURLY CHECK"
```
False New Data Detection
```bash
# Check latest data in database
sqlite3 water_monitoring.db "SELECT MAX(timestamp) FROM water_measurements;"

# Verify timestamp parsing
python -c "
import datetime
print('Current hour:', datetime.datetime.now().replace(minute=0, second=0, microsecond=0))
"
```
📈 Monitoring and Alerts
Key Metrics to Monitor
- Hourly Success Rate: Percentage of hourly checks that find new data
- Retry Duration: How long system stays in retry mode
- Data Freshness: Time since last successful data update
- API Response Time: Performance of data fetching operations
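Data freshness, for example, reduces to a simple elapsed-time calculation. The helper below is a hypothetical sketch; feed it the `MAX(timestamp)` value from the database and export the result to whatever metrics system you use.

```python
import datetime


def data_freshness_minutes(latest_timestamp, now=None):
    """Minutes elapsed since the newest stored measurement."""
    if now is None:
        now = datetime.datetime.now()
    return (now - latest_timestamp).total_seconds() / 60.0
```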
Alert Conditions
- Extended Retry Mode: System in retry mode for > 30 minutes
- No Data for 2+ Hours: No new data found for extended period
- High Error Rate: Multiple consecutive API failures
- Database Issues: Connection or save failures
Health Check Script
```bash
#!/bin/bash
# Check if system is stuck in retry mode
RETRY_COUNT=$(tail -n 100 water_monitor.log | grep -c "RETRY CHECK")
if [ "$RETRY_COUNT" -gt 6 ]; then
    echo "WARNING: System may be stuck in retry mode ($RETRY_COUNT retries in last 100 log entries)"
fi

# Check data freshness
LATEST_DATA=$(sqlite3 water_monitoring.db "SELECT MAX(timestamp) FROM water_measurements;")
echo "Latest data timestamp: $LATEST_DATA"
```
🎯 Best Practices
Production Deployment
- Monitor Logs: Watch for retry mode patterns
- Set Alerts: Configure notifications for extended retry periods
- Regular Maintenance: Weekly gap filling and data validation
- Backup Strategy: Regular database backups before major operations
Performance Tuning
- Adjust Buffer Time: Modify data availability buffer based on API patterns
- Optimize Retry Interval: Balance between responsiveness and API load
- Database Indexing: Ensure proper indexes for timestamp queries
- Connection Pooling: Configure appropriate database connection limits
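For the indexing point, a one-off maintenance step along these lines would cover the timestamp queries. `ensure_timestamp_index` and the index name are assumptions; the table and column names follow the sqlite3 queries used elsewhere in this guide.

```python
import sqlite3


def ensure_timestamp_index(db_path="water_monitoring.db"):
    """Create the timestamp index recommended above if it is missing."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_measurements_timestamp "
            "ON water_measurements (timestamp)"
        )
```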
This enhanced scheduler ensures reliable, efficient, and intelligent water level monitoring with automatic adaptation to data availability patterns.