feat: add batch processing and database offline recovery to improve reliability

- Add a BatchProcessor class for batched message inserts, improving database write performance
- Disable autoCommit in the consumer and commit offsets manually to guarantee data consistency
- Add a database health check that pauses consumption while the database is offline and resumes it automatically
- Support the 0x0E command word, extending message type recognition
- Add database connection retry logic to work around port conflicts on Windows
- Update environment variable configuration and tune Kafka consumer parameters
- Add unit tests covering the batch processing and reliability features

# Reliable Kafka Consumption & DB Offline Handling
- **Status**: Completed
- **Author**: AI Assistant
- **Created**: 2026-02-04
## Context
Currently, the Kafka consumer is configured with `autoCommit: true`. This means offsets are committed periodically regardless of whether the data was successfully processed and stored in the database. If the database insertion fails (e.g., due to a constraint violation or connection loss), the message is considered "consumed" by Kafka, leading to data loss.
Additionally, if the PostgreSQL database goes offline, the consumer continues to try processing messages, likely filling logs with errors and potentially losing data if retries aren't handled correctly. The user requires a mechanism to pause consumption during DB outages and resume only when the DB is back online.
## Proposal
We propose to enhance the reliability of the ingestion pipeline by:
1. **Disabling Auto-Commit**:
- Set `autoCommit: false` in the Kafka `ConsumerGroup` configuration.
- Implement manual offset committing only after the database insertion is confirmed successful.
2. **Implementing DB Offline Handling (Circuit Breaker)**:
- Detect database connection errors during insertion.
- If a connection error occurs:
1. Pause the Kafka consumer immediately.
2. Log a warning and enter a "Recovery Mode".
3. Wait for 1 minute.
4. Periodically check database connectivity (every 1 minute).
5. Once the database is reachable, resume the Kafka consumer (see the sketch below).
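A minimal sketch of the combined flow, assuming the kafka-node `ConsumerGroup` API named in the implementation steps below. Topic and group names are illustrative, and `onMessage()`, `isDbConnectionError()`, and `db.testConnection()` stand in for the service's own handler and helpers:
```js
// Sketch only: kafka-node ConsumerGroup with manual commits and a DB circuit breaker.
const { ConsumerGroup } = require('kafka-node');

const RETRY_INTERVAL_MS = 60 * 1000; // recovery-mode poll interval

const consumer = new ConsumerGroup(
  { kafkaHost: process.env.KAFKA_BROKERS, groupId: 'ingestion', autoCommit: false },
  ['device-messages']
);

let recovering = false;

consumer.on('message', async (message) => {
  try {
    await onMessage(message); // parse + insert into PostgreSQL
    consumer.commit((err) => { // commit offsets only after a successful insert
      if (err) console.error('Offset commit failed', err);
    });
  } catch (err) {
    if (isDbConnectionError(err) && !recovering) {
      enterRecoveryMode(); // DB is down: stop fetching until it is back
    } else {
      console.error('Processing error', err); // e.g. constraint violation
    }
  }
});

function enterRecoveryMode() {
  recovering = true;
  consumer.pause();
  console.warn('Database offline; consumer paused, entering recovery mode');
  const timer = setInterval(async () => {
    if (await db.testConnection()) { // periodic connectivity probe
      clearInterval(timer);
      recovering = false;
      consumer.resume();
      console.info('Database reachable again; consumer resumed');
    }
  }, RETRY_INTERVAL_MS);
}
```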
## Technical Details
### Configuration
- No new environment variables are strictly required, but `KAFKA_AUTO_COMMIT` could be forced to `false` or removed if we enforce this behavior.
- Retry interval (60 seconds) can be a constant or a config.
### Implementation Steps
1. Modify `src/kafka/consumer.js`:
- Change `autoCommit` to `false`.
- Update the message processing flow to await the `onMessage` handler.
- Call `consumer.commit()` explicitly after successful processing.
- Add logic to handle errors from `onMessage`. If it's a DB connection error, trigger the pause/retry loop.
2. Update `src/db/databaseManager.js` (Optional but helpful):
- Ensure it exposes a method to check connectivity (e.g., `testConnection()`) for the recovery loop, as sketched below.
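A minimal sketch of that helper, assuming the `pg` connection pool (the library choice is an assumption; only the `testConnection()` name comes from this document):
```js
// src/db/databaseManager.js (excerpt, sketch): cheap connectivity probe used
// by the recovery loop. Assumes a node-postgres Pool configured elsewhere.
const { Pool } = require('pg');

const pool = new Pool(); // reads PG* environment variables by default

async function testConnection() {
  try {
    await pool.query('SELECT 1'); // cheapest round trip to the server
    return true;
  } catch (err) {
    return false; // refused, timed out, or mid-failover
  }
}

module.exports = { testConnection };
```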
## Impact
- **Reliability**: Drastically improved. Committing offsets only after successful insertion gives at-least-once delivery: a DB outage may cause reprocessing, but not data loss.
- **Performance**: Slight overhead due to manual commits (can be batched if needed, but per-message or small batch is safer for now).
- **Operations**: System will self-recover from DB maintenance or crashes.

# Phase 2: Optimization and Fixes
- **Status**: Completed
- **Author**: AI Assistant
- **Created**: 2026-02-04
## Context
Following the initial stabilization, several issues were identified:
1. **Missing Command Support**: The system did not recognize command word `0x0E`, which shares the same structure as `0x36`.
2. **Bootstrap Instability**: On Windows, restarting the service frequently caused `EADDRINUSE` errors when connecting to PostgreSQL due to ephemeral port exhaustion.
3. **Performance Bottleneck**: The Kafka consumer could not keep up with the backlog using single-row inserts and low parallelism, and scaling out with additional service instances was not an option.
## Implemented Changes
### 1. 0x0E Command Support
- **Goal**: Enable processing of `0x0E` command word.
- **Implementation**:
- Updated `resolveMessageType` in `src/processor/index.js` to map `0x0E` to the same handler as `0x36` (see the sketch below).
- Added unit tests in `tests/processor.test.js` to verify `0x0E` parsing for status and fault reports.
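A sketch of the mapping; only `resolveMessageType` and the `0x0E`/`0x36` equivalence come from this document, and the type names are hypothetical:
```js
// src/processor/index.js (excerpt, sketch): 0x0E shares the frame layout of
// 0x36, so both command words resolve to the same message type.
const COMMAND_TO_TYPE = {
  0x36: 'statusOrFaultReport', // hypothetical type name
  0x0E: 'statusOrFaultReport', // same structure as 0x36, reuse its parser
};

function resolveMessageType(commandWord) {
  return COMMAND_TO_TYPE[commandWord] ?? null; // null => unsupported command
}
```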
### 2. Bootstrap Retry Logic
- **Goal**: Prevent service startup failure due to transient port conflicts.
- **Implementation**:
- Modified `src/db/initializer.js` to catch `EADDRINUSE` errors during the initial database connection.
- Added a retry mechanism: max 5 retries with 1-second backoff (sketched below).
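A sketch of that retry loop; `connect` stands in for whatever `src/db/initializer.js` actually calls to open the first connection:
```js
// src/db/initializer.js (excerpt, sketch): retry only the Windows
// ephemeral-port conflict; every other error still fails fast.
const MAX_RETRIES = 5;
const BACKOFF_MS = 1000;

async function connectWithRetry(connect) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (err.code !== 'EADDRINUSE' || attempt === MAX_RETRIES) throw err;
      console.warn(`EADDRINUSE (attempt ${attempt}/${MAX_RETRIES}), retrying in ${BACKOFF_MS} ms`);
      await new Promise((resolve) => setTimeout(resolve, BACKOFF_MS));
    }
  }
}
```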
### 3. High Throughput Optimization (Batch Processing)
- **Goal**: Resolve Kafka backlog without adding more service instances.
- **Implementation**:
- **Batch Processor**: Created `src/db/batchProcessor.js` to buffer messages in memory (sketched after this list).
- **Strategy**: Messages are flushed to the DB when the buffer reaches 500 messages or after 1 second, whichever comes first.
- **Config Update**: Increased default `KAFKA_MAX_IN_FLIGHT` from 50 to 500 in `src/config/config.js` to align with batch size.
- **Integration**: Refactored `src/index.js` and `src/processor/index.js` to decouple parsing from insertion, allowing `BatchProcessor` to handle the write operations.
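A minimal sketch of the buffer-and-flush strategy; the `insertRows` dependency is a placeholder for the actual multi-row INSERT:
```js
// src/db/batchProcessor.js (sketch): buffer rows in memory, flush at 500 rows
// or once per second, whichever comes first.
class BatchProcessor {
  constructor(insertRows, { maxSize = 500, flushIntervalMs = 1000 } = {}) {
    this.insertRows = insertRows;
    this.maxSize = maxSize;
    this.buffer = [];
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
  }

  add(row) {
    this.buffer.push(row);
    if (this.buffer.length >= this.maxSize) return this.flush();
  }

  async flush() {
    if (this.buffer.length === 0) return;
    const rows = this.buffer;
    this.buffer = []; // swap first so new rows land in a fresh buffer
    await this.insertRows(rows); // one multi-row INSERT instead of N single inserts
    // error handling / re-buffering on failed flush omitted for brevity
  }

  async stop() {
    clearInterval(this.timer);
    await this.flush(); // drain remaining rows on shutdown
  }
}

module.exports = { BatchProcessor };
```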
## Impact
- **Throughput**: Significantly increased database write throughput via batching.
- **Reliability**: Service is resilient to port conflicts on restart.
- **Functionality**: `0x0E` messages are now correctly processed and stored.