Class WebsiteCrawlerService
java.lang.Object
com.bytedesk.kbase.llm_website.service.WebsiteCrawlerService
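Full-site website crawling service.
Starting from the root domain of a specified website, it crawls the content of the entire site in bulk, according to the configured crawl depth and rules.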
-
Field Summary
Fields
Modifier and Type, Field
private final ConcurrentHashMap<String,WebsiteCrawlTask> activeTasks
private final ExecutorService crawlExecutor
private final WebsiteCrawlTaskRepository crawlTaskRepository
private final UidUtils uidUtils
private final WebpageRepository webpageRepository
private final WebsiteRepository websiteRepository
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method - Description
private boolean crawlSinglePage(String url, WebsiteEntity website, Set<String> visitedUrls, Set<String> urlsToVisit, WebsiteCrawlConfig config) - Crawls a single page.
private WebsiteCrawlTask createCrawlTask(WebsiteEntity website, WebsiteCrawlConfig config) - Creates a crawl task.
private void createOrUpdateWebpage(String url, String title, String description, String content, WebsiteEntity website) - Creates or updates a webpage entity.
private String extractDescription(org.jsoup.nodes.Document doc) - Extracts the page description.
private void extractLinks(org.jsoup.nodes.Document doc, String currentUrl, String baseUrl, Set<String> urlsToVisit, WebsiteCrawlConfig config) - Extracts the links on a page.
getCrawlTaskStatus(String taskId) - Gets the status of a crawl task.
private boolean isValidContent(String content, WebsiteCrawlConfig config) - Checks whether the content is valid.
parseSitemap(String sitemapUrl) - Parses a sitemap.
private WebsiteCrawlResult performCrawl(WebsiteEntity website, WebsiteCrawlTask task, WebsiteCrawlConfig config) - Runs the crawl task.
private String resolveUrl(URL base, String href) - Resolves a relative URL to an absolute URL.
private WebsiteCrawlTask saveCrawlTask - Saves a crawl task.
private boolean shouldCrawlUrl(String url, URL baseUrl, WebsiteCrawlConfig config) - Decides whether a URL should be crawled.
void shutdown() - Releases resources.
startCrawl(String websiteUid, WebsiteCrawlConfig config) - Starts a full-site crawl.
boolean stopCrawlTask(String taskId) - Stops a crawl task.
-
Field Details
-
websiteRepository
-
webpageRepository
-
crawlTaskRepository
-
uidUtils
-
crawlExecutor
-
activeTasks
-
-
Constructor Details
-
WebsiteCrawlerService
public WebsiteCrawlerService()
-
-
Method Details
-
startCrawl
@Async public CompletableFuture<WebsiteCrawlResult> startCrawl(String websiteUid, WebsiteCrawlConfig config)
Starts a full-site crawl.
Parameters:
websiteUid - the website UID
config - the crawl configuration
Returns:
the crawl task
-
performCrawl
private WebsiteCrawlResult performCrawl(WebsiteEntity website, WebsiteCrawlTask task, WebsiteCrawlConfig config)
Runs the crawl task.
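The page does not show how performCrawl walks the site. A minimal sketch of a breadth-first, depth-limited crawl loop follows, assuming a visited set and a queue of URLs still to visit (mirroring the visitedUrls and urlsToVisit parameters of crawlSinglePage). The class CrawlLoopSketch, the maxDepth and maxPages limits, and the fetchLinks function (a stand-in for fetching a page and extracting its links) are hypothetical and not part of WebsiteCrawlerService.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; not the actual performCrawl implementation.
class CrawlLoopSketch {
    /**
     * Visits pages breadth-first from startUrl and returns them in visit order.
     * fetchLinks stands in for "download the page and return its outgoing links".
     */
    static List<String> crawl(String startUrl, int maxDepth, int maxPages,
                              Function<String, List<String>> fetchLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<Map.Entry<String, Integer>> toVisit = new ArrayDeque<>();
        toVisit.add(Map.entry(startUrl, 0));

        while (!toVisit.isEmpty() && visited.size() < maxPages) {
            Map.Entry<String, Integer> next = toVisit.poll();
            String url = next.getKey();
            int depth = next.getValue();
            if (!visited.add(url)) {
                continue; // already crawled via another link
            }
            List<String> links = fetchLinks.apply(url);
            if (depth < maxDepth) {
                // Only pages above the depth limit contribute new URLs.
                for (String link : links) {
                    if (!visited.contains(link)) {
                        toVisit.add(Map.entry(link, depth + 1));
                    }
                }
            }
        }
        return new ArrayList<>(visited);
    }
}
```

In this sketch the depth limit controls which links are followed, while maxPages caps the total number of pages regardless of depth.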
crawlSinglePage
private boolean crawlSinglePage(String url, WebsiteEntity website, Set<String> visitedUrls, Set<String> urlsToVisit, WebsiteCrawlConfig config)
Crawls a single page.
extractDescription
private String extractDescription(org.jsoup.nodes.Document doc)
Extracts the page description.
extractLinks
private void extractLinks(org.jsoup.nodes.Document doc, String currentUrl, String baseUrl, Set<String> urlsToVisit, WebsiteCrawlConfig config)
Extracts the links on a page.
resolveUrl
private String resolveUrl(URL base, String href)
Resolves a relative URL to an absolute URL.
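As an illustration of relative-to-absolute resolution, here is a minimal sketch built on java.net.URI. It is not the service's resolveUrl: the class name UrlResolverSketch is hypothetical, the base is taken as a String rather than a URL to keep the example free of checked exceptions, and dropping the fragment and returning null for unparseable input are assumptions.

```java
import java.net.URI;

// Hypothetical sketch; not the actual resolveUrl implementation.
class UrlResolverSketch {
    /** Resolves href against base, or returns null if either cannot be parsed. */
    static String resolve(String base, String href) {
        if (base == null || href == null || href.isBlank()) {
            return null;
        }
        try {
            // URI.resolve applies RFC 2396 resolution, including "." and ".." removal.
            String resolved = URI.create(base).resolve(href.trim()).toString();
            // Assumption: the fragment is irrelevant for crawling, so drop it.
            int hash = resolved.indexOf('#');
            return hash >= 0 ? resolved.substring(0, hash) : resolved;
        } catch (IllegalArgumentException e) {
            return null; // malformed base or href
        }
    }
}
```

An href that is already absolute replaces the base entirely, so the same method covers both relative and absolute links.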
shouldCrawlUrl
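private boolean shouldCrawlUrl(String url, URL baseUrl, WebsiteCrawlConfig config)
Decides whether a URL should be crawled.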
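The rules that shouldCrawlUrl applies are not documented here. One plausible set of checks is sketched below: accept only http and https URLs, stay on the base host, and skip common static assets. The class CrawlFilterSketch, the baseHost parameter, and the extension list are all assumptions; the real method takes a WebsiteCrawlConfig whose fields are not shown on this page.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the real rules live in WebsiteCrawlConfig, which is not shown.
class CrawlFilterSketch {
    // Assumed list of non-HTML resources that are not worth crawling.
    private static final Set<String> SKIPPED_EXTENSIONS =
            Set.of(".jpg", ".png", ".gif", ".pdf", ".zip", ".css", ".js");

    static boolean shouldCrawl(String url, String baseHost) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return false; // rejects mailto:, javascript:, relative references, ...
        }
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(baseHost)) {
            return false; // stay on the site being crawled
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        for (String extension : SKIPPED_EXTENSIONS) {
            if (path.endsWith(extension)) {
                return false;
            }
        }
        return true;
    }
}
```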
isValidContent
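private boolean isValidContent(String content, WebsiteCrawlConfig config)
Checks whether the content is valid.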
createOrUpdateWebpage
private void createOrUpdateWebpage(String url, String title, String description, String content, WebsiteEntity website)
Creates or updates a webpage entity.
createCrawlTask
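private WebsiteCrawlTask createCrawlTask(WebsiteEntity website, WebsiteCrawlConfig config)
Creates a crawl task.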
saveCrawlTask
Saves a crawl task.
getCrawlTaskStatus
Gets the status of a crawl task.
stopCrawlTask
boolean stopCrawlTask(String taskId)
Stops a crawl task.
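The class keeps its running tasks in a ConcurrentHashMap (activeTasks), but how stopCrawlTask signals a running crawl is not shown. One common approach is a cooperative stop flag per task, sketched below. TaskRegistrySketch and all of its methods are hypothetical; the real map stores WebsiteCrawlTask objects, not flags.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of cooperative cancellation; not the service's actual code.
class TaskRegistrySketch {
    private final ConcurrentHashMap<String, AtomicBoolean> stopFlags = new ConcurrentHashMap<>();

    /** Registers a task as active with its stop flag cleared. */
    void register(String taskId) {
        stopFlags.putIfAbsent(taskId, new AtomicBoolean(false));
    }

    /** Requests a stop; returns false when the task is unknown or already finished. */
    boolean stop(String taskId) {
        AtomicBoolean flag = stopFlags.get(taskId);
        if (flag == null) {
            return false;
        }
        flag.set(true);
        return true;
    }

    /** The crawl loop would poll this between pages and exit when it returns true. */
    boolean isStopRequested(String taskId) {
        AtomicBoolean flag = stopFlags.get(taskId);
        return flag != null && flag.get();
    }

    /** Removes a finished task so the map does not grow without bound. */
    void complete(String taskId) {
        stopFlags.remove(taskId);
    }
}
```

With this design a stop request takes effect only at the next check in the crawl loop, so a page that is already being fetched finishes first.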
parseSitemap
Parses a sitemap.
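For illustration, the sketch below extracts the loc entries from a sitemap document using the JDK's built-in XML parser. It takes the XML as a String instead of downloading sitemapUrl, and it returns a List<String>; both are assumptions, since the return type of parseSitemap is not shown on this page. The class name SitemapSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not the actual parseSitemap implementation.
class SitemapSketch {
    /** Returns the text of every <loc> element, or an empty list if parsing fails. */
    static List<String> extractLocations(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject DOCTYPE declarations to guard against XXE in untrusted sitemaps.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));

            // Match <loc> by local name so the sitemap namespace prefix does not matter.
            NodeList locs = doc.getElementsByTagNameNS("*", "loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                String text = locs.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    urls.add(text);
                }
            }
            return urls;
        } catch (Exception e) {
            return List.of(); // malformed XML or parser configuration failure
        }
    }
}
```

A sitemap index file also uses loc elements, so the same extraction would return the child sitemap URLs, which a caller would then need to fetch in turn.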
shutdown
public void shutdown()
Releases resources.
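Since the class holds an ExecutorService (crawlExecutor), releasing resources presumably involves shutting it down. The sketch below shows the usual two-phase shutdown pattern for an executor; whether the service does exactly this is an assumption, and ExecutorShutdownSketch and timeoutSeconds are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a two-phase executor shutdown.
class ExecutorShutdownSketch {
    /** Returns true if the executor terminated within the allowed time. */
    static boolean shutdownGracefully(ExecutorService executor, long timeoutSeconds) {
        executor.shutdown(); // stop accepting new tasks, let running ones finish
        try {
            if (executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                return true;
            }
            executor.shutdownNow(); // interrupt tasks that are still running
            return executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```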
-