Package com.bytedesk.kbase.llm_file
Class FileChunkService
java.lang.Object
com.bytedesk.kbase.llm_file.FileChunkService
File chunk processing service.
Splits documents into chunks and creates the corresponding records.
-
Field Summary
Fields
Modifier and Type | Field | Description
private static final int | BATCH_SIZE |
private final ChunkRestService | chunkRestService |
private static final int | MAX_DOCUMENT_SIZE |
private final FileChunkMessageService | messageService |
Constructor Summary
Constructors -
Method Summary
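Modifier and Type | Method | Description
private String | cleanSpecialCharacters(String text) | Removes special characters that may cause encoding problems.
private String | cleanText | Cleans the text content.
private List<org.springframework.ai.document.Document> | fallbackSplitDocuments(List<org.springframework.ai.document.Document> documents) | Fallback document splitting: simple splitting by character count, used when TokenTextSplitter fails.
private int | findLastSentenceBoundary(String text, int start, int end) | Finds the last sentence boundary within the given range.
private List<org.springframework.ai.document.Document> | preprocessDocuments(List<org.springframework.ai.document.Document> documents) | Preprocesses documents: filters out empty documents and splits oversized ones.
private String | preprocessLargeText(String text) | Preprocesses large text by removing excess whitespace and repeated content.
private List<String> | processBatch(List<org.springframework.ai.document.Document> batch, FileResponse fileResponse, int batchNumber) | Processes a single batch.
private List<String> | processDocumentsIndividually(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse, int batchNumber) | Processes documents one at a time (fallback when batch processing fails).
List<String> | processFileChunks(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse) | Splits a file into chunks, synchronous version. Processes the documents directly, without checking their size.
CompletableFuture<List<String>> | processFileChunksAsync(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse) | Splits a file into chunks, asynchronous version (recommended for large files).
List<String> | processFileChunksInternal(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse) | Internal synchronous processing method, called by the asynchronous service. Focuses on the core document splitting and storage logic.
private List<org.springframework.ai.document.Document> | splitLargeDocument(String text, Map<String, Object> metadata) | Splits a large document.
private List<org.springframework.ai.document.Document> | splitSingleDocument(String text, Map<String, Object> metadata, int maxChunkSize, int overlap) | Splits a single document.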
-
Field Details
-
chunkRestService
-
messageService
-
BATCH_SIZE
private static final int BATCH_SIZE
-
MAX_DOCUMENT_SIZE
private static final int MAX_DOCUMENT_SIZE
-
-
Constructor Details
-
FileChunkService
public FileChunkService()
-
-
Method Details
-
processFileChunks
public List<String> processFileChunks(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse)
Splits a file into chunks, synchronous version. Processes the documents directly, without checking their size.
processFileChunksAsync
@Async("fileChunkTaskExecutor") public CompletableFuture<List<String>> processFileChunksAsync(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse)
Splits a file into chunks, asynchronous version (recommended for large files).
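As a rough illustration of how a caller might consume the returned `CompletableFuture<List<String>>`, the sketch below uses plain Java only. The `chunkAsync` method is a hypothetical stand-in for `processFileChunksAsync`: it does not use Spring's `@Async` or the `fileChunkTaskExecutor` executor, and the `chunk-N` identifiers are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncChunkCallerSketch {

    // Hypothetical stand-in for processFileChunksAsync: returns one made-up id per input.
    static CompletableFuture<List<String>> chunkAsync(List<String> docs, ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                ids.add("chunk-" + i);
            }
            return ids;
        }, executor);
    }

    // Waits for the asynchronous result and falls back to an empty list on failure.
    static List<String> run(List<String> docs) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            return chunkAsync(docs, executor)
                    .exceptionally(ex -> List.of())
                    .join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("first document", "second document")));
    }
}
```

In a real application the caller would more likely chain `thenAccept` or `whenComplete` instead of blocking with `join`; blocking is used here only to keep the example short.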
processFileChunksInternal
public List<String> processFileChunksInternal(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse)
Internal synchronous processing method, called by the asynchronous service. Focuses on the core document splitting and storage logic.
fallbackSplitDocuments
private List<org.springframework.ai.document.Document> fallbackSplitDocuments(List<org.springframework.ai.document.Document> documents)
Fallback document splitting: simple splitting by character count, used when TokenTextSplitter fails.
preprocessLargeText
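private String preprocessLargeText(String text)
Preprocesses large text by removing excess whitespace and repeated content.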
splitSingleDocument
private List<org.springframework.ai.document.Document> splitSingleDocument(String text, Map<String, Object> metadata, int maxChunkSize, int overlap)
Splits a single document.
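The `maxChunkSize` and `overlap` parameters suggest a sliding-window split. The following is a minimal sketch of that general technique on plain strings, not the actual implementation: the class name `OverlapSplitSketch`, the character-based window, and the argument validation are all assumptions, and metadata handling is left out.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapSplitSketch {

    // Cuts text into windows of at most maxChunkSize characters.
    // Each window starts (maxChunkSize - overlap) characters after the previous one,
    // so consecutive windows share `overlap` characters.
    static List<String> split(String text, int maxChunkSize, int overlap) {
        if (maxChunkSize <= 0 || overlap < 0 || overlap >= maxChunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < maxChunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxChunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split("abcdefghij", 4, 1)); // prints [abcd, defg, ghij]
    }
}
```

Requiring `overlap < maxChunkSize` keeps the step positive, which guarantees that the loop terminates.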
cleanSpecialCharacters
private String cleanSpecialCharacters(String text)
Removes special characters that may cause encoding problems.
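The page does not say which characters are removed. As one plausible reading, the sketch below strips the byte order mark, the Unicode replacement character, and ASCII control characters other than tab, line feed, and carriage return. That character set, the null handling, and the class name `CleanTextSketch` are assumptions made for illustration.

```java
class CleanTextSketch {

    // Removes characters that commonly cause trouble when text is stored or re-encoded.
    // The exact set handled here is an assumption, not taken from FileChunkService.
    static String cleanSpecialCharacters(String text) {
        if (text == null) {
            return "";
        }
        return text
                .replace("\uFEFF", "")   // byte order mark
                .replace("\uFFFD", "")   // replacement character left by a failed decode
                .replaceAll("[\\p{Cntrl}&&[^\\n\\r\\t]]", ""); // control chars except \n, \r, \t
    }

    public static void main(String[] args) {
        String dirty = "a" + (char) 7 + "b\tc\uFFFDd";
        System.out.println(cleanSpecialCharacters(dirty).equals("ab\tcd")); // prints true
    }
}
```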
findLastSentenceBoundary
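private int findLastSentenceBoundary(String text, int start, int end)
Finds the last sentence boundary within the given range.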
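A backward scan is one straightforward way to find such a boundary. In the sketch below, the terminator set (ASCII and full-width Chinese punctuation plus newline), the half-open range `[start, end)`, and the `-1` return value for "not found" are assumptions; the real method may differ on all three.

```java
class SentenceBoundarySketch {

    // Assumed sentence terminators: . ! ? newline, plus the full-width
    // ideographic full stop, exclamation mark, and question mark.
    private static final String TERMINATORS = ".!?\n\u3002\uFF01\uFF1F";

    // Returns the index just past the last terminator found in [start, end),
    // or -1 when the range contains no terminator.
    static int findLastSentenceBoundary(String text, int start, int end) {
        int from = Math.max(0, start);
        int to = Math.min(end, text.length());
        for (int i = to - 1; i >= from; i--) {
            if (TERMINATORS.indexOf(text.charAt(i)) >= 0) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "One. Two! Three";
        System.out.println(findLastSentenceBoundary(text, 0, text.length())); // prints 9
    }
}
```

A splitter can use the returned index as the cut point, so that a chunk ends on a complete sentence instead of in the middle of one.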
preprocessDocuments
private List<org.springframework.ai.document.Document> preprocessDocuments(List<org.springframework.ai.document.Document> documents)
Preprocesses documents: filters out empty documents and splits oversized ones.
cleanText
Cleans the text content.
splitLargeDocument
private List<org.springframework.ai.document.Document> splitLargeDocument(String text, Map<String, Object> metadata)
Splits a large document.
processBatch
private List<String> processBatch(List<org.springframework.ai.document.Document> batch, FileResponse fileResponse, int batchNumber)
Processes a single batch.
processDocumentsIndividually
private List<String> processDocumentsIndividually(List<org.springframework.ai.document.Document> documents, FileResponse fileResponse, int batchNumber)
Processes documents one at a time (fallback when batch processing fails).
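The relationship between `processBatch` and `processDocumentsIndividually` can be pictured as "try the whole batch, and retry item by item if it fails". The sketch below shows that pattern in generic form. The `batchSaver` and `singleSaver` functions, the `batchSize` parameter, and the choice to skip items that fail individually are hypothetical and not taken from the class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class BatchFallbackSketch {

    // Processes items in batches; when a batch fails, retries its items one by one
    // so that a single bad item does not discard the rest of the batch.
    static <T> List<String> processAll(List<T> items,
                                       int batchSize,
                                       Function<List<T>, List<String>> batchSaver,
                                       Function<T, String> singleSaver) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> ids = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            List<T> batch = items.subList(from, Math.min(from + batchSize, items.size()));
            try {
                ids.addAll(batchSaver.apply(batch));
            } catch (RuntimeException batchFailure) {
                for (T item : batch) {
                    try {
                        ids.add(singleSaver.apply(item));
                    } catch (RuntimeException itemFailure) {
                        // Assumed policy: skip the item that cannot be saved and continue.
                    }
                }
            }
        }
        return ids;
    }
}
```

Retrying individually costs extra calls when a batch fails, but it limits the loss to the documents that are actually problematic.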
-