feat: add batch processing and DB offline recovery to improve reliability

- Add a BatchProcessor class for batched message inserts, improving database write throughput
- Disable autoCommit in the consumer and commit offsets manually to guarantee data consistency
- Add a database health check that pauses consumption while the database is offline and resumes it automatically
- Support the 0x0E command word, extending message-type recognition
- Add database connection retry logic to work around port conflicts on Windows
- Update environment variable configuration and tune Kafka consumer parameters
- Add unit tests covering batch processing and the reliability features
# Reliable Kafka Consumption & DB Offline Handling

- **Status**: Completed
- **Author**: AI Assistant
- **Created**: 2026-02-04
## Context

Currently, the Kafka consumer is configured with `autoCommit: true`. This means offsets are committed periodically regardless of whether the data was successfully processed and stored in the database. If the database insertion fails (e.g., due to a constraint violation or connection loss), the message is considered "consumed" by Kafka, leading to data loss.

Additionally, if the PostgreSQL database goes offline, the consumer continues to try processing messages, likely filling logs with errors and potentially losing data if retries aren't handled correctly. The user requires a mechanism to pause consumption during DB outages and resume only when the DB is back online.
## Proposal

We propose to enhance the reliability of the ingestion pipeline by:

1. **Disabling Auto-Commit**:
   - Set `autoCommit: false` in the Kafka `ConsumerGroup` configuration.
   - Implement manual offset committing only after the database insertion is confirmed successful.

2. **Implementing DB Offline Handling (Circuit Breaker)**:
   - Detect database connection errors during insertion.
   - If a connection error occurs:
     1. Pause the Kafka consumer immediately.
     2. Log a warning and enter a "Recovery Mode".
     3. Wait for 1 minute.
     4. Periodically check database connectivity (every 1 minute).
     5. Once the database is reachable, resume the Kafka consumer.
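The pause/retry loop described above can be sketched as follows. This is a minimal sketch, not the actual module: `consumer` is assumed to expose `pause()`/`resume()` (as kafka-node's `ConsumerGroup` does), `testConnection` is the connectivity probe proposed below for `databaseManager`, and the interval is a parameter so it can be shortened in tests.

```javascript
// Minimal sketch of the DB-offline circuit breaker. The consumer object
// and the testConnection probe are injected, so the loop runs against
// fakes as easily as against the real modules.
async function enterRecoveryMode(consumer, testConnection, intervalMs = 60_000) {
  consumer.pause(); // step 1: pause the Kafka consumer immediately
  console.warn('DB offline: consumer paused, entering Recovery Mode');
  for (;;) {
    // steps 3/4: wait, then re-check connectivity every interval
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const reachable = await testConnection().then(() => true, () => false);
    if (reachable) break;
  }
  consumer.resume(); // step 5: DB reachable again, resume consumption
  console.info('DB reachable: consumer resumed');
}
```

Injecting the interval also keeps the 60-second constant in one place, matching the "constant or config" note below.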
## Technical Details

### Configuration

- No new environment variables are strictly required, but `KAFKA_AUTO_COMMIT` could be forced to `false` or removed if we enforce this behavior.
- The retry interval (60 seconds) can be a constant or a config value.
### Implementation Steps

1. Modify `src/kafka/consumer.js`:
   - Change `autoCommit` to `false`.
   - Update the message processing flow to await the `onMessage` handler.
   - Call `consumer.commit()` explicitly after successful processing.
   - Add logic to handle errors from `onMessage`. If the error is a DB connection error, trigger the pause/retry loop.
2. Update `src/db/databaseManager.js` (optional but helpful):
   - Ensure it exposes a method to check connectivity (e.g., `testConnection()`) for the recovery loop.
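The per-message flow in step 1 can be sketched like this. All four callbacks are injected and their names (`onMessage`, `commitOffsets`, `isConnError`, `onDbOffline`) are assumptions for illustration, not the actual module API; injection also keeps the flow testable without a broker.

```javascript
// Sketch of the manual-commit flow: the offset is committed only after
// the database insert has succeeded, so a failed insert is never
// acknowledged to Kafka.
async function handleMessage(message, { onMessage, commitOffsets, isConnError, onDbOffline }) {
  try {
    await onMessage(message);     // insert into PostgreSQL first
    await commitOffsets(message); // only now is the offset committed
  } catch (err) {
    if (isConnError(err)) {
      await onDbOffline(err);     // DB connection error: pause/retry loop
    } else {
      // Other failures (e.g. constraint violations) are logged; the
      // offset is deliberately not committed for this message.
      console.error('processing failed; offset not committed', err);
    }
  }
}
```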
## Impact

- **Reliability**: Drastically improved. Offsets are committed only after a successful insert, so DB outages no longer cause data loss.
- **Performance**: Slight overhead due to manual commits (commits can be batched later, but per-message or small-batch commits are safer for now).
- **Operations**: The system will self-recover from DB maintenance windows or crashes.
# Phase 2: Optimization and Fixes

- **Status**: Completed
- **Author**: AI Assistant
- **Created**: 2026-02-04
## Context

Following the initial stabilization, several issues were identified:

1. **Missing Command Support**: The system did not recognize command word `0x0E`, which shares the same structure as `0x36`.
2. **Bootstrap Instability**: On Windows, restarting the service frequently caused `EADDRINUSE` errors when connecting to PostgreSQL, due to ephemeral port exhaustion.
3. **Performance Bottleneck**: The Kafka consumer could not keep up with the backlog using single-row inserts and low parallelism, and scaling out to more instances was not an option.
## Implemented Changes

### 1. 0x0E Command Support

- **Goal**: Enable processing of the `0x0E` command word.
- **Implementation**:
  - Updated `resolveMessageType` in `src/processor/index.js` to map `0x0E` to the same handler as `0x36`.
  - Added unit tests in `tests/processor.test.js` to verify `0x0E` parsing for status and fault reports.
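The shape of that mapping might look like the following sketch. The handler name and the lookup-table structure are hypothetical; the only fact taken from this change is that `0x0E` routes to the same handler as `0x36`.

```javascript
// Hypothetical stub standing in for the shared 0x36/0x0E handler.
function handleStatusOrFaultReport(payload) {
  return { type: 'status-or-fault', payload };
}

// Illustrative command-word lookup table.
const COMMAND_HANDLERS = {
  0x36: handleStatusOrFaultReport,
  0x0e: handleStatusOrFaultReport, // 0x0E shares the 0x36 structure
};

function resolveMessageType(commandWord) {
  const handler = COMMAND_HANDLERS[commandWord];
  if (!handler) {
    throw new Error(`unsupported command word 0x${commandWord.toString(16)}`);
  }
  return handler;
}
```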
### 2. Bootstrap Retry Logic

- **Goal**: Prevent service startup failure due to transient port conflicts.
- **Implementation**:
  - Modified `src/db/initializer.js` to catch `EADDRINUSE` errors during the initial database connection.
  - Added a retry mechanism: max 5 retries with 1-second backoff.
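A sketch of that retry, assuming the connection attempt is passed in as a `connect` callback (the real logic lives in `src/db/initializer.js`). Only `EADDRINUSE` is retried; any other error fails fast.

```javascript
// Retry transient EADDRINUSE port conflicts up to maxRetries times,
// sleeping backoffMs between attempts; rethrow everything else.
async function connectWithRetry(connect, maxRetries = 5, backoffMs = 1000) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await connect(); // initial attempt plus up to maxRetries retries
    } catch (err) {
      if (err.code !== 'EADDRINUSE' || attempt >= maxRetries) throw err;
      console.warn(`EADDRINUSE (retry ${attempt + 1}/${maxRetries}) in ${backoffMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```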
### 3. High Throughput Optimization (Batch Processing)

- **Goal**: Resolve the Kafka backlog without adding more service instances.
- **Implementation**:
  - **Batch Processor**: Created `src/db/batchProcessor.js` to buffer messages in memory.
  - **Strategy**: Messages are flushed to the DB when the buffer reaches 500 entries or every 1 second, whichever comes first.
  - **Config Update**: Increased the default `KAFKA_MAX_IN_FLIGHT` from 50 to 500 in `src/config/config.js` to align with the batch size.
  - **Integration**: Refactored `src/index.js` and `src/processor/index.js` to decouple parsing from insertion, allowing `BatchProcessor` to handle the write operations.
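A minimal sketch of the size-or-time flush strategy described above. `insertBatch` (e.g. a multi-row `INSERT` via the DB layer) is injected, and the thresholds default to the documented 500 rows / 1 second; this is an illustration of the strategy, not the actual `src/db/batchProcessor.js` code.

```javascript
// Buffer rows in memory; flush when maxSize rows are queued or when
// flushIntervalMs elapses since the first row of the current batch.
class BatchProcessor {
  constructor(insertBatch, { maxSize = 500, flushIntervalMs = 1000 } = {}) {
    this.insertBatch = insertBatch;
    this.maxSize = maxSize;
    this.flushIntervalMs = flushIntervalMs;
    this.buffer = [];
    this.timer = null;
  }

  add(row) {
    this.buffer.push(row);
    if (this.buffer.length >= this.maxSize) return this.flush();
    if (!this.timer) {
      // First row of a new batch: arm the time-based flush.
      this.timer = setTimeout(() => this.flush(), this.flushIntervalMs);
    }
  }

  async flush() {
    clearTimeout(this.timer);
    this.timer = null;
    if (this.buffer.length === 0) return;
    const rows = this.buffer;
    this.buffer = []; // swap before awaiting so new rows buffer separately
    await this.insertBatch(rows);
  }
}
```

Keeping `maxSize` equal to `KAFKA_MAX_IN_FLIGHT` (500) means one full batch can be in flight from Kafka while the previous one is being written.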
## Impact

- **Throughput**: Significantly increased database write throughput via batching.
- **Reliability**: The service is resilient to port conflicts on restart.
- **Functionality**: `0x0E` messages are now correctly processed and stored.