java.lang.Object

com.bytedesk.kbase.llm_file.FileService

@Service public class FileService extends Object

Parses file content. Supported formats: https://tika.apache.org/3.2.1/formats.html

Field Summary

Fields

Modifier and Type

Field

Description

private final UploadRestService

uploadRestService
Constructor Summary

Constructors

Constructor

Description

FileService()
Method Summary

Modifier and Type

Method

Description

private boolean

allBlank(List<org.springframework.ai.document.Document> docs)

String

cleanDocumentText(String text)

Cleans document text by removing special characters that may cause problems.

String

extractContentSummary(List<org.springframework.ai.document.Document> documents, int maxLength)

Intelligently truncates document content into a summary.

private String

getFileExtension(String fileName)

private String

getOcrLanguage()

Returns the Tesseract language setting used for OCR; defaults to mixed Chinese and English.

private boolean

isImageExt(String ext)

private boolean

isTikaPreferred(String ext)

Determines from the file extension whether the file should preferably be parsed by Tika.

List<org.springframework.ai.document.Document>

parseFileContent(UploadEntity upload)

https://docs.spring.io/spring-ai/reference/api/etl-pipeline.html

List<org.springframework.ai.document.Document>

readByTika(org.springframework.core.io.Resource resource)

List<org.springframework.ai.document.Document>

readByTikaWithOcr(org.springframework.core.io.Resource resource, String ocrLanguage)

Performs OCR parsing with Apache Tika and Tesseract; suitable for scanned PDFs or images.

List<org.springframework.ai.document.Document>

readHtml(org.springframework.core.io.Resource resource)

List<org.springframework.ai.document.Document>

readJson(org.springframework.core.io.Resource resource)

List<org.springframework.ai.document.Document>

readMarkdown(org.springframework.core.io.Resource resource)

List<org.springframework.ai.document.Document>

readPdfPage(org.springframework.core.io.Resource resource)

List<org.springframework.ai.document.Document>

readPdfParagraph(org.springframework.core.io.Resource resource)

List<org.springframework.ai.document.Document>

readTxt(org.springframework.core.io.Resource resource)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- uploadRestService
  
  private final UploadRestService uploadRestService
Constructor Details
- FileService
  
  public FileService()
Method Details
- parseFileContent
  
  public List<org.springframework.ai.document.Document> parseFileContent(UploadEntity upload)
  
  https://docs.spring.io/spring-ai/reference/api/etl-pipeline.html
- readPdfPage
  
  public List<org.springframework.ai.document.Document> readPdfPage(org.springframework.core.io.Resource resource)
- readPdfParagraph
  
  public List<org.springframework.ai.document.Document> readPdfParagraph(org.springframework.core.io.Resource resource)
- readJson
  
  public List<org.springframework.ai.document.Document> readJson(org.springframework.core.io.Resource resource)
- readMarkdown
  
  public List<org.springframework.ai.document.Document> readMarkdown(org.springframework.core.io.Resource resource)
- readTxt
  
  public List<org.springframework.ai.document.Document> readTxt(org.springframework.core.io.Resource resource)
- readHtml
  
  public List<org.springframework.ai.document.Document> readHtml(org.springframework.core.io.Resource resource)
- readByTika
  
  public List<org.springframework.ai.document.Document> readByTika(org.springframework.core.io.Resource resource)
- readByTikaWithOcr
  
  public List<org.springframework.ai.document.Document> readByTikaWithOcr(org.springframework.core.io.Resource resource, String ocrLanguage)
  
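  Performs OCR parsing with Apache Tika and Tesseract; suitable for scanned PDFs or images. Requires the tesseract executable and the matching language data to be installed on the system. The default language is chi_sim+eng.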
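Because OCR depends on language data being installed, a preflight check can report missing languages before parsing starts. The sketch below is not part of `FileService`; the class `OcrLanguageCheck` and its methods are hypothetical. It relies only on two Tesseract conventions: multiple languages are joined with `+`, and each language ships as a `<lang>.traineddata` file in a tessdata directory whose location is an assumption you must supply.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Hypothetical preflight helper; not part of FileService. */
final class OcrLanguageCheck {

    private static final String SUFFIX = ".traineddata";

    /** Splits a Tesseract language spec such as "chi_sim+eng" into its parts. */
    static List<String> splitLanguages(String ocrLanguage) {
        List<String> parts = new ArrayList<>();
        if (ocrLanguage == null) {
            return parts;
        }
        for (String part : ocrLanguage.split("\\+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    /** Lists the language codes that have a .traineddata file in the given directory. */
    static Set<String> installedLanguages(Path tessdataDir) {
        Set<String> installed = new TreeSet<>();
        try (Stream<Path> files = Files.list(tessdataDir)) {
            files.map(p -> p.getFileName().toString())
                 .filter(name -> name.endsWith(SUFFIX))
                 .forEach(name -> installed.add(
                         name.substring(0, name.length() - SUFFIX.length())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return installed;
    }

    /** Returns the requested languages that are absent from the installed set. */
    static List<String> missingLanguages(String ocrLanguage, Set<String> installed) {
        List<String> missing = new ArrayList<>();
        for (String lang : splitLanguages(ocrLanguage)) {
            if (!installed.contains(lang)) {
                missing.add(lang);
            }
        }
        return missing;
    }
}
```

A caller might run `missingLanguages(ocrLanguage, installedLanguages(dir))` once at startup and log the result, rather than discovering a missing language on the first scanned upload.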
- extractContentSummary
  
  public String extractContentSummary(List<org.springframework.ai.document.Document> documents, int maxLength)
  
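  Intelligently truncates document content into a summary.
  
  Parameters:
  
  documents - the list of documents
  
  maxLength - the maximum length of the summary
  
  Returns:
  
  the truncated content summary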
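The Javadoc does not say what "intelligently" means here. One plausible reading, shown below as a sketch only, is to join the non-blank document texts and, when the result exceeds `maxLength`, cut at the last sentence terminator found in the second half of the allowed window. The class `SummarySketch`, the use of plain strings instead of Spring AI `Document` objects, and the terminator set are all assumptions for illustration.

```java
import java.util.List;

/** Hypothetical sketch of sentence-aware truncation; not the FileService implementation. */
final class SummarySketch {

    // Assumed sentence terminators: Latin and CJK punctuation plus newline.
    private static final String SENTENCE_ENDS = ".!?。！？\n";

    /** Joins non-blank texts and truncates to at most maxLength UTF-16 code units. */
    static String summarize(List<String> texts, int maxLength) {
        if (texts == null || maxLength <= 0) {
            return "";
        }
        StringBuilder joined = new StringBuilder();
        for (String text : texts) {
            if (text == null || text.isBlank()) {
                continue;
            }
            if (joined.length() > 0) {
                joined.append('\n');
            }
            joined.append(text.strip());
            if (joined.length() > maxLength) {
                break; // already have more than we can return
            }
        }
        if (joined.length() <= maxLength) {
            return joined.toString();
        }
        String head = joined.substring(0, maxLength);
        // Look backwards for a sentence end, but only in the second half of the
        // window so the summary does not shrink to a tiny fragment.
        int cut = -1;
        for (int i = head.length() - 1; i >= maxLength / 2; i--) {
            if (SENTENCE_ENDS.indexOf(head.charAt(i)) >= 0) {
                cut = i + 1;
                break;
            }
        }
        return (cut > 0 ? head.substring(0, cut) : head).strip();
    }
}
```

Limiting the backward search to the second half of the window is a design choice: it trades a possibly mid-sentence cut for a summary that stays close to the requested length.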
- cleanDocumentText
  
  public String cleanDocumentText(String text)
  
  Cleans document text by removing special characters that may cause problems.
  
  Parameters:
  
  text - the original text
  
  Returns:
  
  the cleaned text
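Which characters count as problematic is not specified. A common choice, assumed here, is to normalise line endings and drop control characters (keeping newline and tab), the byte-order mark, and the Unicode replacement character, since these often break downstream storage or embedding calls. `TextCleanSketch` is a hypothetical name and the rule set is an assumption, not the documented behaviour.

```java
/** Hypothetical sketch of text cleaning; the actual rules in FileService may differ. */
final class TextCleanSketch {

    /** Removes control characters, BOM and U+FFFD; normalises CR and CRLF to LF. */
    static String clean(String text) {
        if (text == null) {
            return "";
        }
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        StringBuilder out = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            if (c == '\n' || c == '\t') {
                out.append(c); // keep layout-relevant whitespace
            } else if (Character.isISOControl(c)) {
                // drops NUL and the other C0/C1 control characters
            } else if (c == '\uFEFF' || c == '\uFFFD') {
                // drops byte-order mark and replacement character
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```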
- getFileExtension
  
  private String getFileExtension(String fileName)
- allBlank
  
  private boolean allBlank(List<org.springframework.ai.document.Document> docs)
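`allBlank` carries no description. Judging by its name and the presence of an OCR reader, it presumably reports whether every parsed document is empty, which is the usual trigger for retrying a scanned file with OCR. The sketch below illustrates that pattern under those assumptions; `OcrFallbackSketch`, the use of plain strings in place of `Document`, and the supplier-based wiring are hypothetical.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of a "retry with OCR when nothing was extracted" pattern. */
final class OcrFallbackSketch {

    /** Returns true when the list is null, empty, or contains only blank texts. */
    static boolean allBlank(List<String> texts) {
        if (texts == null || texts.isEmpty()) {
            return true;
        }
        for (String text : texts) {
            if (text != null && !text.isBlank()) {
                return false;
            }
        }
        return true;
    }

    /** Runs the plain reader first and calls the OCR reader only if nothing was extracted. */
    static List<String> readWithFallback(Supplier<List<String>> plainReader,
                                         Supplier<List<String>> ocrReader) {
        List<String> result = plainReader.get();
        return allBlank(result) ? ocrReader.get() : result;
    }
}
```

Passing the readers as suppliers keeps the OCR call lazy, so the comparatively expensive Tesseract pass runs only when the plain pass yields no text.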
- isImageExt
  
  private boolean isImageExt(String ext)
- getOcrLanguage
  
  private String getOcrLanguage()
  
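  Returns the Tesseract language setting used for OCR; defaults to mixed Chinese and English. This could later be changed to read from configuration or an environment variable.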
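The note about reading the language from configuration could look roughly like the following sketch. The environment variable name `OCR_LANGUAGE` and the class `OcrLanguageConfig` are hypothetical; only the default `chi_sim+eng` comes from this page. The lookup takes a map so it can be exercised without touching the real environment.

```java
import java.util.Map;

/** Hypothetical sketch: resolve the OCR language from the environment with a default. */
final class OcrLanguageConfig {

    /** Default documented for FileService: simplified Chinese plus English. */
    static final String DEFAULT_LANGUAGE = "chi_sim+eng";

    /** Assumed variable name; the project may choose a different key. */
    static final String ENV_KEY = "OCR_LANGUAGE";

    /** Returns the configured language, or the default when unset or blank. */
    static String resolve(Map<String, String> env) {
        String value = (env == null) ? null : env.get(ENV_KEY);
        return (value == null || value.isBlank()) ? DEFAULT_LANGUAGE : value.trim();
    }

    /** Convenience overload that reads the process environment. */
    static String resolve() {
        return resolve(System.getenv());
    }
}
```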
- isTikaPreferred
  
  private boolean isTikaPreferred(String ext)
  
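  Determines from the file extension whether the file should preferably be parsed by Tika. Coverage is based on the Tika 2.9.0 supported-formats list (a sample set, not exhaustive).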
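The extension helpers (`getFileExtension`, `isImageExt`, `isTikaPreferred`) are private and their exact extension sets are not listed here. As a rough sketch of how such checks are commonly written, the example below uses small illustrative sets; the class `ExtensionRules` and both sets are assumptions, not the lists used by `FileService`.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of extension-based routing; the sets are illustrative only. */
final class ExtensionRules {

    // Assumed sample of image extensions that would be sent to OCR.
    private static final Set<String> IMAGE_EXTENSIONS =
            Set.of("png", "jpg", "jpeg", "gif", "bmp", "tif", "tiff", "webp");

    // Assumed sample of office-style formats that Tika handles well.
    private static final Set<String> TIKA_PREFERRED_EXTENSIONS =
            Set.of("doc", "docx", "ppt", "pptx", "xls", "xlsx", "rtf", "odt", "epub");

    /** Returns the lower-cased extension without the dot, or "" when there is none. */
    static String fileExtension(String fileName) {
        if (fileName == null) {
            return "";
        }
        // Ignore directory parts so a dot in a folder name is not mistaken for an extension.
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String baseName = fileName.substring(slash + 1);
        int dot = baseName.lastIndexOf('.');
        if (dot <= 0 || dot == baseName.length() - 1) {
            return ""; // no dot, hidden file such as ".gitignore", or trailing dot
        }
        return baseName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    static boolean isImageExt(String ext) {
        return ext != null && IMAGE_EXTENSIONS.contains(ext.toLowerCase(Locale.ROOT));
    }

    static boolean isTikaPreferred(String ext) {
        return ext != null && TIKA_PREFERRED_EXTENSIONS.contains(ext.toLowerCase(Locale.ROOT));
    }
}
```

Lower-casing with `Locale.ROOT` avoids locale-specific surprises (for example the Turkish dotless i) when comparing extensions.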
