feat: add batch processing and DB offline recovery to improve reliability

- Add a BatchProcessor class for batched message inserts, improving database write throughput
- Disable autoCommit in the consumer and commit offsets manually to guarantee data consistency
- Add a database health check that pauses consumption while the database is offline and resumes it automatically
- Support the 0x0E command word, extending message-type recognition
- Add database connection retry logic to work around port conflicts on Windows
- Update environment variable configuration and tune Kafka consumer parameters
- Add unit tests covering batch processing and the reliability features
# Reliable Kafka Consumption & DB Offline Handling

- **Status**: Completed
- **Author**: AI Assistant
- **Created**: 2026-02-04
## Context

Currently, the Kafka consumer is configured with `autoCommit: true`. This means offsets are committed periodically regardless of whether the data was successfully processed and stored in the database. If the database insertion fails (e.g., due to a constraint violation or connection loss), the message is considered "consumed" by Kafka, leading to data loss.

Additionally, if the PostgreSQL database goes offline, the consumer continues to try processing messages, likely filling logs with errors and potentially losing data if retries aren't handled correctly. The user requires a mechanism to pause consumption during DB outages and resume only when the DB is back online.
## Proposal

We propose to enhance the reliability of the ingestion pipeline by:

1. **Disabling Auto-Commit**:
   - Set `autoCommit: false` in the Kafka `ConsumerGroup` configuration.
   - Implement manual offset committing only after the database insertion is confirmed successful.

2. **Implementing DB Offline Handling (Circuit Breaker)**:
   - Detect database connection errors during insertion.
   - If a connection error occurs:
     1. Pause the Kafka consumer immediately.
     2. Log a warning and enter a "Recovery Mode".
     3. Wait for 1 minute.
     4. Periodically check database connectivity (every 1 minute).
     5. Once the database is reachable, resume the Kafka consumer.
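The pause/retry loop described above can be sketched as follows. This is a minimal sketch, not the actual module: `consumer` is assumed to expose `pause()`/`resume()` (as kafka-node's `ConsumerGroup` does), `testConnection` is the connectivity probe proposed below for `databaseManager`, and the interval is a parameter so it can be shortened in tests.

```javascript
// Minimal sketch of the DB-offline circuit breaker. The consumer object
// and the testConnection probe are injected, so the loop runs against
// fakes as easily as against the real modules.
async function enterRecoveryMode(consumer, testConnection, intervalMs = 60_000) {
  consumer.pause(); // step 1: pause the Kafka consumer immediately
  console.warn('DB offline: consumer paused, entering Recovery Mode');
  for (;;) {
    // steps 3/4: wait, then re-check connectivity every interval
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const reachable = await testConnection().then(() => true, () => false);
    if (reachable) break;
  }
  consumer.resume(); // step 5: DB reachable again, resume consumption
  console.info('DB reachable: consumer resumed');
}
```

Injecting the interval also keeps the 60-second constant in one place, matching the "constant or config" note below.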
## Technical Details

### Configuration

- No new environment variables are strictly required, but `KAFKA_AUTO_COMMIT` could be forced to `false` or removed if we enforce this behavior.
- The retry interval (60 seconds) can be a constant or a config value.
### Implementation Steps

1. Modify `src/kafka/consumer.js`:
   - Change `autoCommit` to `false`.
   - Update the message processing flow to await the `onMessage` handler.
   - Call `consumer.commit()` explicitly after successful processing.
   - Add logic to handle errors from `onMessage`. If the error is a DB connection error, trigger the pause/retry loop.
2. Update `src/db/databaseManager.js` (optional but helpful):
   - Ensure it exposes a method to check connectivity (e.g., `testConnection()`) for the recovery loop.
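The per-message flow in step 1 can be sketched like this. All four callbacks are injected and their names (`onMessage`, `commitOffsets`, `isConnError`, `onDbOffline`) are assumptions for illustration, not the actual module API; injection also keeps the flow testable without a broker.

```javascript
// Sketch of the manual-commit flow: the offset is committed only after
// the database insert has succeeded, so a failed insert is never
// acknowledged to Kafka.
async function handleMessage(message, { onMessage, commitOffsets, isConnError, onDbOffline }) {
  try {
    await onMessage(message);     // insert into PostgreSQL first
    await commitOffsets(message); // only now is the offset committed
  } catch (err) {
    if (isConnError(err)) {
      await onDbOffline(err);     // DB connection error: pause/retry loop
    } else {
      // Other failures (e.g. constraint violations) are logged; the
      // offset is deliberately not committed for this message.
      console.error('processing failed; offset not committed', err);
    }
  }
}
```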
## Impact

- **Reliability**: Drastically improved. Offsets are committed only after a successful insert, so DB outages no longer cause data loss.
- **Performance**: Slight overhead due to manual commits (commits can be batched later, but per-message or small-batch commits are safer for now).
- **Operations**: The system will self-recover from DB maintenance windows or crashes.
# Phase 2: Optimization and Fixes

- **Status**: Completed
- **Author**: AI Assistant
- **Created**: 2026-02-04
## Context

Following the initial stabilization, several issues were identified:

1. **Missing Command Support**: The system did not recognize command word `0x0E`, which shares the same structure as `0x36`.
2. **Bootstrap Instability**: On Windows, restarting the service frequently caused `EADDRINUSE` errors when connecting to PostgreSQL, due to ephemeral port exhaustion.
3. **Performance Bottleneck**: The Kafka consumer could not keep up with the backlog using single-row inserts and low parallelism, and scaling out to more instances was not an option.
## Implemented Changes

### 1. 0x0E Command Support

- **Goal**: Enable processing of the `0x0E` command word.
- **Implementation**:
  - Updated `resolveMessageType` in `src/processor/index.js` to map `0x0E` to the same handler as `0x36`.
  - Added unit tests in `tests/processor.test.js` to verify `0x0E` parsing for status and fault reports.
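The shape of that mapping might look like the following sketch. The handler name and the lookup-table structure are hypothetical; the only fact taken from this change is that `0x0E` routes to the same handler as `0x36`.

```javascript
// Hypothetical stub standing in for the shared 0x36/0x0E handler.
function handleStatusOrFaultReport(payload) {
  return { type: 'status-or-fault', payload };
}

// Illustrative command-word lookup table.
const COMMAND_HANDLERS = {
  0x36: handleStatusOrFaultReport,
  0x0e: handleStatusOrFaultReport, // 0x0E shares the 0x36 structure
};

function resolveMessageType(commandWord) {
  const handler = COMMAND_HANDLERS[commandWord];
  if (!handler) {
    throw new Error(`unsupported command word 0x${commandWord.toString(16)}`);
  }
  return handler;
}
```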
### 2. Bootstrap Retry Logic

- **Goal**: Prevent service startup failure due to transient port conflicts.
- **Implementation**:
  - Modified `src/db/initializer.js` to catch `EADDRINUSE` errors during the initial database connection.
  - Added a retry mechanism: max 5 retries with 1-second backoff.
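A sketch of that retry, assuming the connection attempt is passed in as a `connect` callback (the real logic lives in `src/db/initializer.js`). Only `EADDRINUSE` is retried; any other error fails fast.

```javascript
// Retry transient EADDRINUSE port conflicts up to maxRetries times,
// sleeping backoffMs between attempts; rethrow everything else.
async function connectWithRetry(connect, maxRetries = 5, backoffMs = 1000) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await connect(); // initial attempt plus up to maxRetries retries
    } catch (err) {
      if (err.code !== 'EADDRINUSE' || attempt >= maxRetries) throw err;
      console.warn(`EADDRINUSE (retry ${attempt + 1}/${maxRetries}) in ${backoffMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```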
### 3. High Throughput Optimization (Batch Processing)

- **Goal**: Resolve the Kafka backlog without adding more service instances.
- **Implementation**:
  - **Batch Processor**: Created `src/db/batchProcessor.js` to buffer messages in memory.
  - **Strategy**: Messages are flushed to the DB when the buffer reaches 500 entries or every 1 second, whichever comes first.
  - **Config Update**: Increased the default `KAFKA_MAX_IN_FLIGHT` from 50 to 500 in `src/config/config.js` to align with the batch size.
  - **Integration**: Refactored `src/index.js` and `src/processor/index.js` to decouple parsing from insertion, allowing `BatchProcessor` to handle the write operations.
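A minimal sketch of the size-or-time flush strategy described above. `insertBatch` (e.g. a multi-row `INSERT` via the DB layer) is injected, and the thresholds default to the documented 500 rows / 1 second; this is an illustration of the strategy, not the actual `src/db/batchProcessor.js` code.

```javascript
// Buffer rows in memory; flush when maxSize rows are queued or when
// flushIntervalMs elapses since the first row of the current batch.
class BatchProcessor {
  constructor(insertBatch, { maxSize = 500, flushIntervalMs = 1000 } = {}) {
    this.insertBatch = insertBatch;
    this.maxSize = maxSize;
    this.flushIntervalMs = flushIntervalMs;
    this.buffer = [];
    this.timer = null;
  }

  add(row) {
    this.buffer.push(row);
    if (this.buffer.length >= this.maxSize) return this.flush();
    if (!this.timer) {
      // First row of a new batch: arm the time-based flush.
      this.timer = setTimeout(() => this.flush(), this.flushIntervalMs);
    }
  }

  async flush() {
    clearTimeout(this.timer);
    this.timer = null;
    if (this.buffer.length === 0) return;
    const rows = this.buffer;
    this.buffer = []; // swap before awaiting so new rows buffer separately
    await this.insertBatch(rows);
  }
}
```

Keeping `maxSize` equal to `KAFKA_MAX_IN_FLIGHT` (500) means one full batch can be in flight from Kafka while the previous one is being written.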
## Impact

- **Throughput**: Significantly increased database write throughput via batching.
- **Reliability**: The service is resilient to port conflicts on restart.
- **Functionality**: `0x0E` messages are now correctly processed and stored.