Class WebsiteCrawlConfig
java.lang.Object
com.bytedesk.kbase.llm_website.crawl.WebsiteCrawlConfig
Website crawl configuration.
Field Summary
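Fields
Modifier and Type   Field               Description
private int         concurrentThreads   Number of concurrent threads
private boolean     crawlImages         Whether to crawl images
private boolean     crawlPdfs           Whether to crawl PDF documents
private boolean     deduplication       Whether to deduplicate
private int         delay               Delay between requests (milliseconds)
                    excludePatterns     URL patterns to exclude (regular expressions)
private boolean     followLinks         Whether to follow links
                    includePatterns     URL patterns to include (regular expressions)
private int         maxDepth            Crawl depth (1-5 levels)
private int         maxPages            Maximum number of pages to crawl
private int         minContentLength    Minimum content length
private int         priority            Crawl priority (1-10; a larger number means higher priority)
private boolean     resumable           Whether an interrupted crawl can be resumed
private String      sitemapUrl          Sitemap URL (optional)
private int         timeout             Request timeout (milliseconds)
private String      userAgent           User agent
private boolean     useSitemap          Whether to use the sitemap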
Constructor Summary
Constructors
WebsiteCrawlConfig()
Method Summary
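Modifier and Type           Method         Description
static WebsiteCrawlConfig   getDeep()      Returns the deep configuration (greater depth and more pages)
static WebsiteCrawlConfig   getDefault()   Returns the default configuration
static WebsiteCrawlConfig   getFast()      Returns the fast configuration (smaller depth and fewer pages)
boolean                     isValid()      Validates the configuration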
Field Details
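maxDepth
private int maxDepth
Crawl depth (1-5 levels).

maxPages
private int maxPages
Maximum number of pages to crawl.

concurrentThreads
private int concurrentThreads
Number of concurrent threads.

timeout
private int timeout
Request timeout (milliseconds).

delay
private int delay
Delay between requests (milliseconds).

userAgent
private String userAgent
User agent.

followLinks
private boolean followLinks
Whether to follow links.

useSitemap
private boolean useSitemap
Whether to use the sitemap.

sitemapUrl
private String sitemapUrl
Sitemap URL (optional).

includePatterns
URL patterns to include (regular expressions).

excludePatterns
URL patterns to exclude (regular expressions).

minContentLength
private int minContentLength
Minimum content length.

deduplication
private boolean deduplication
Whether to deduplicate.

resumable
private boolean resumable
Whether an interrupted crawl can be resumed.

priority
private int priority
Crawl priority (1-10; a larger number means higher priority).

crawlImages
private boolean crawlImages
Whether to crawl images.

crawlPdfs
private boolean crawlPdfs
Whether to crawl PDF documents.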
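This page does not show how includePatterns and excludePatterns are evaluated. As a minimal sketch, assuming both fields hold regular-expression strings, that an exclude match always wins, and that an empty include list admits every URL, a filter might look like the following. UrlPatternFilterSketch and its accepts method are hypothetical names used for illustration; they are not part of com.bytedesk.kbase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating include/exclude URL filtering; not the library's code. */
final class UrlPatternFilterSketch {
    private final List<Pattern> include;
    private final List<Pattern> exclude;

    UrlPatternFilterSketch(List<String> includePatterns, List<String> excludePatterns) {
        this.include = compileAll(includePatterns);
        this.exclude = compileAll(excludePatterns);
    }

    // Compile each regular expression once so it can be reused for every URL.
    private static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> compiled = new ArrayList<>();
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex));
        }
        return compiled;
    }

    /** Assumed rule: an exclude match rejects the URL; an empty include list admits everything else. */
    boolean accepts(String url) {
        for (Pattern p : exclude) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        if (include.isEmpty()) {
            return true;
        }
        for (Pattern p : include) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlPatternFilterSketch filter = new UrlPatternFilterSketch(
                List.of("^https://example\\.com/docs/.*"),
                List.of(".*\\.(png|jpg)$"));
        System.out.println(filter.accepts("https://example.com/docs/intro"));    // true
        System.out.println(filter.accepts("https://example.com/docs/logo.png")); // false
        System.out.println(filter.accepts("https://example.com/blog/post"));     // false
    }
}
```

Checking the exclude list first means a URL that matches both lists is skipped, which is the more conservative choice for a crawler; the real class may resolve that conflict differently.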
Constructor Details
WebsiteCrawlConfig
public WebsiteCrawlConfig()
Method Details
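isValid
public boolean isValid()
Validates the configuration.

getDefault
Returns the default configuration.

getFast
Returns the fast configuration (smaller depth and fewer pages).

getDeep
Returns the deep configuration (greater depth and more pages).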
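The method descriptions above give the documented ranges (maxDepth 1-5, priority 1-10) and the relative size of the presets, but not the actual checks or values. The following is a minimal sketch under stated assumptions: CrawlConfigSketch is a hypothetical stand-in class, every preset number is a placeholder chosen only so that fast < default < deep, and the positivity checks on maxPages, concurrentThreads, timeout, and delay are guesses rather than documented behavior.

```java
/** Hypothetical stand-in for WebsiteCrawlConfig; field names follow this page, values do not. */
final class CrawlConfigSketch {
    int maxDepth;
    int maxPages;
    int concurrentThreads;
    int timeout;  // milliseconds
    int delay;    // milliseconds
    int priority;

    CrawlConfigSketch(int maxDepth, int maxPages, int concurrentThreads,
                      int timeout, int delay, int priority) {
        this.maxDepth = maxDepth;
        this.maxPages = maxPages;
        this.concurrentThreads = concurrentThreads;
        this.timeout = timeout;
        this.delay = delay;
        this.priority = priority;
    }

    /** Checks the two documented ranges plus assumed sanity constraints. */
    boolean isValid() {
        return maxDepth >= 1 && maxDepth <= 5     // documented: 1-5 levels
                && priority >= 1 && priority <= 10 // documented: 1-10
                && maxPages > 0                    // assumption
                && concurrentThreads > 0           // assumption
                && timeout > 0                     // assumption
                && delay >= 0;                     // assumption
    }

    // Placeholder values: only the ordering fast < default < deep reflects the descriptions.
    static CrawlConfigSketch getDefault() {
        return new CrawlConfigSketch(3, 100, 4, 30_000, 1_000, 5);
    }

    static CrawlConfigSketch getFast() {
        return new CrawlConfigSketch(1, 20, 4, 30_000, 1_000, 5);
    }

    static CrawlConfigSketch getDeep() {
        return new CrawlConfigSketch(5, 1_000, 4, 30_000, 1_000, 5);
    }

    public static void main(String[] args) {
        System.out.println(getDefault().isValid()); // true

        CrawlConfigSketch tooDeep = getDeep();
        tooDeep.maxDepth = 6;                       // outside the documented 1-5 range
        System.out.println(tooDeep.isValid());      // false
    }
}
```

Calling isValid() before starting a crawl is one way a caller could reject a preset that was modified into an out-of-range state.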