feat(lightrag): Add document status tracking and checkpoint support

- Add DocStatus enum and DocProcessingStatus class for document processing state management
- Implement JsonDocStatusStorage for persistent status storage
- Add document-level deduplication in batch processing
- Add checkpoint support in ainsert method for resumable document processing
- Add status query methods for monitoring processing progress
- Update LightRAG initialization to support document status tracking
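
The deduplication and checkpoint behaviour described in this message is not part of the hunks shown on this page, so the following is only a minimal sketch of how such a flow might work, not the code from this commit. The names `doc_id`, `insert_with_checkpoint`, `status_store` and `process_doc` are hypothetical, and the MD5-of-content ID scheme is an assumption.

```python
import asyncio
import hashlib
from enum import Enum
from typing import Awaitable, Callable, Dict, Iterable


class DocStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    PROCESSED = "processed"
    FAILED = "failed"


def doc_id(content: str) -> str:
    # Assumed ID scheme: identical (stripped) content yields the same ID.
    return "doc-" + hashlib.md5(content.strip().encode("utf-8")).hexdigest()


async def insert_with_checkpoint(
    docs: Iterable[str],
    status_store: Dict[str, DocStatus],
    process_doc: Callable[[str], Awaitable[None]],
) -> Dict[str, DocStatus]:
    """Process each distinct document once, skipping finished ones."""
    # Document-level deduplication: duplicates collapse onto one key.
    unique = {doc_id(content): content for content in docs}
    for key, content in unique.items():
        if status_store.get(key) == DocStatus.PROCESSED:
            continue  # Checkpoint: already completed in an earlier run.
        status_store[key] = DocStatus.PROCESSING
        try:
            await process_doc(content)
            status_store[key] = DocStatus.PROCESSED
        except Exception:
            # Failed documents stay recorded so a later run can retry them.
            status_store[key] = DocStatus.FAILED
    return status_store
```

Calling `insert_with_checkpoint` a second time with the same `status_store` skips everything already marked `PROCESSED` and retries the rest, which is the resumable behaviour the message refers to. A persistent store would take the place of the plain dict used here.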
@@ -1,5 +1,6 @@
 from dataclasses import dataclass, field
-from typing import TypedDict, Union, Literal, Generic, TypeVar
+from typing import TypedDict, Union, Literal, Generic, TypeVar, Optional, Dict, Any
+from enum import Enum
 
 import numpy as np
 
@@ -129,3 +130,42 @@ class BaseGraphStorage(StorageNameSpace):
 
     async def embed_nodes(self, algorithm: str) -> tuple[np.ndarray, list[str]]:
         raise NotImplementedError("Node embedding is not used in lightrag.")
+
+
+class DocStatus(str, Enum):
+    """Document processing status enum"""
+
+    PENDING = "pending"
+    PROCESSING = "processing"
+    PROCESSED = "processed"
+    FAILED = "failed"
+
+
+@dataclass
+class DocProcessingStatus:
+    """Document processing status data structure"""
+
+    content_summary: str  # First 100 chars of document content
+    content_length: int  # Total length of document
+    status: DocStatus  # Current processing status
+    created_at: str  # ISO format timestamp
+    updated_at: str  # ISO format timestamp
+    chunks_count: Optional[int] = None  # Number of chunks after splitting
+    error: Optional[str] = None  # Error message if failed
+    metadata: Dict[str, Any] = field(default_factory=dict)  # Additional metadata
+
+
+class DocStatusStorage(BaseKVStorage):
+    """Base class for document status storage"""
+
+    async def get_status_counts(self) -> Dict[str, int]:
+        """Get counts of documents in each status"""
+        raise NotImplementedError
+
+    async def get_failed_docs(self) -> Dict[str, DocProcessingStatus]:
+        """Get all failed documents"""
+        raise NotImplementedError
+
+    async def get_pending_docs(self) -> Dict[str, DocProcessingStatus]:
+        """Get all pending documents"""
+        raise NotImplementedError
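
The three query methods above are left abstract. As an illustration only, here is a minimal in-memory sketch of how a concrete store might satisfy them. It is not the `JsonDocStatusStorage` from this commit, which is not shown here. `InMemoryDocStatusStorage`, `make_status` and the `upsert` signature are hypothetical, and the class deliberately does not subclass `BaseKVStorage` so that the example runs on its own.

```python
import asyncio
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional


class DocStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    PROCESSED = "processed"
    FAILED = "failed"


@dataclass
class DocProcessingStatus:
    content_summary: str
    content_length: int
    status: DocStatus
    created_at: str
    updated_at: str
    chunks_count: Optional[int] = None
    error: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_status(content: str, status: DocStatus, error: Optional[str] = None) -> DocProcessingStatus:
    # Hypothetical helper: fills the summary, length and timestamps.
    now = datetime.now(timezone.utc).isoformat()
    return DocProcessingStatus(
        content_summary=content[:100],
        content_length=len(content),
        status=status,
        created_at=now,
        updated_at=now,
        error=error,
    )


class InMemoryDocStatusStorage:
    """Hypothetical stand-in for a DocStatusStorage implementation."""

    def __init__(self) -> None:
        self._data: Dict[str, DocProcessingStatus] = {}

    async def upsert(self, data: Dict[str, DocProcessingStatus]) -> None:
        self._data.update(data)

    async def get_status_counts(self) -> Dict[str, int]:
        # Keys are the plain string values, e.g. "pending".
        return dict(Counter(doc.status.value for doc in self._data.values()))

    async def _by_status(self, status: DocStatus) -> Dict[str, DocProcessingStatus]:
        return {k: v for k, v in self._data.items() if v.status == status}

    async def get_failed_docs(self) -> Dict[str, DocProcessingStatus]:
        return await self._by_status(DocStatus.FAILED)

    async def get_pending_docs(self) -> Dict[str, DocProcessingStatus]:
        return await self._by_status(DocStatus.PENDING)


def to_json(doc: DocProcessingStatus) -> str:
    # DocStatus mixes in str, so json.dumps writes the member as its value.
    return json.dumps(asdict(doc))
```

Because `DocStatus` inherits from `str`, each member compares equal to its string value and serializes to JSON without a custom encoder, which keeps a JSON-backed status file readable.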