Class WebpageCrawlerService
java.lang.Object
com.bytedesk.kbase.llm_webpage.service.WebpageCrawlerService
Web page content crawling service.
Fetches content from a web page URL and updates the content field of the WebpageEntity.
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
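crawlAndUpdateContent(WebpageEntity webpage): Crawls the web page content and updates the entity.
crawlContent(String url): Crawls the web page content only, without saving it to the database.
boolean isValidContent(String content): Checks whether the crawled content is valid.
boolean needsCrawling(WebpageEntity webpage): Checks whether the web page needs its content crawled.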
-
Field Details
-
webpageRestService
-
-
Constructor Details
-
WebpageCrawlerService
public WebpageCrawlerService()
-
-
Method Details
-
crawlAndUpdateContent
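Crawls the web page content and updates the entity.
Parameters:
webpage - the web page entity
Returns:
the updated web page entity, or the original entity if crawling fails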
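The fallback behaviour described above (return the updated entity on success, the original entity on failure) can be sketched roughly as follows. This is an illustration under stated assumptions, not the service's actual code: `Page` is a hypothetical stand-in for WebpageEntity, the `fetcher` function stands in for crawlContent, and any persistence step (presumably done through webpageRestService) is omitted.

```java
import java.util.function.Function;

final class CrawlAndUpdateSketch {

    // Hypothetical stand-in for WebpageEntity; the real entity's fields are not shown on this page.
    static final class Page {
        final String url;
        String content;

        Page(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    // fetcher is an assumed stand-in for crawlContent(String url):
    // it returns the page content, or null when the crawl fails.
    static Page crawlAndUpdate(Page page, Function<String, String> fetcher) {
        if (page == null || page.url == null) {
            return page; // nothing to crawl
        }
        String fetched;
        try {
            fetched = fetcher.apply(page.url);
        } catch (RuntimeException e) {
            return page; // crawl failed: hand back the entity unchanged
        }
        if (fetched == null || fetched.isBlank()) {
            return page; // no usable content: keep the existing content
        }
        page.content = fetched; // success: update the content field
        return page;
    }
}
```

Because a failed crawl returns the entity unchanged, a caller that needs to distinguish success from failure would have to compare the content before and after the call.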
-
crawlContent
Crawls the web page content only, without saving it to the database.
Parameters:
url - the web page URL
Returns:
the crawled content, or null if crawling fails
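A fetch that returns null on failure, as described above, might look like the sketch below. It uses the JDK's java.net.http.HttpClient (Java 11+) purely as an assumption; this page does not say which HTTP client or HTML parser the service uses, and the timeout values are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class CrawlContentSketch {

    // Shared client; the timeout and redirect policy are illustrative choices.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Returns the response body for a 2xx response, or null on any failure.
    static String crawlContent(String url) {
        if (url == null || url.isBlank()) {
            return null;
        }
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url.trim()))
                    .timeout(Duration.ofSeconds(15))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            int status = response.statusCode();
            return (status >= 200 && status < 300) ? response.body() : null;
        } catch (IllegalArgumentException | IOException e) {
            // Malformed URL, unsupported scheme, network error, or timeout.
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
    }
}
```

Nothing is written to a database here, which matches the description of crawlContent as a fetch-only operation.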
-
needsCrawling
Checks whether the web page needs its content crawled.
Parameters:
webpage - the web page entity
Returns:
true if crawling is needed, false otherwise
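The criteria behind this check are not stated on this page. As a minimal sketch, assume a page needs crawling when it has a URL but no content yet. `Page` and `pending` are hypothetical names introduced only for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

final class NeedsCrawlingSketch {

    // Hypothetical stand-in for WebpageEntity.
    static final class Page {
        final String url;
        final String content;

        Page(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    // Assumed rule: a URL is present and the content is still missing or blank.
    static boolean needsCrawling(Page page) {
        if (page == null || page.url == null || page.url.isBlank()) {
            return false;
        }
        return page.content == null || page.content.isBlank();
    }

    // Selects the pages that still need a crawl, e.g. before a batch run.
    static List<Page> pending(List<Page> pages) {
        return pages.stream()
                .filter(NeedsCrawlingSketch::needsCrawling)
                .collect(Collectors.toList());
    }
}
```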
-
isValidContent
Checks whether the crawled content is valid.
Parameters:
content - the content
Returns:
true if the content is valid, false otherwise
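What counts as valid is not specified on this page. One plausible sketch, under the assumption that valid content is non-null and has at least a minimum amount of non-whitespace text, is shown below. The `MIN_LENGTH` threshold is a hypothetical value, not one taken from the service.

```java
final class ValidContentSketch {

    // Hypothetical threshold; the real service may use a different rule entirely.
    static final int MIN_LENGTH = 10;

    // Assumed rule: content exists and is at least MIN_LENGTH characters once
    // leading and trailing whitespace is stripped.
    static boolean isValidContent(String content) {
        if (content == null) {
            return false;
        }
        return content.strip().length() >= MIN_LENGTH;
    }
}
```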
-