Nutch Development (Part 4)
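Development environment
1. Introduction to the Nutch plugin design
2. The plugin directory structure
3. build.xml
4. ivy.xml
5. plugin.xml
6. The parse-html plugin
    HtmlParser
        setConf(Configuration conf)
        parse(InputSource input)
        getParse(Content content)
7. The parse-metatags plugin
    metaTagsParser
        The filter method
        The addIndexedmetatags method
        Configuring the metadata plugin

Development environment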
Linux, Ubuntu 20.04 LTS
IDEA
Nutch 1.18
Solr 8.11
Please credit the source when reposting! By 鸭梨的药丸哥
1. Introduction to the Nutch plugin design
Nutch is highly extensible. Its plugin system is based on the Eclipse 2.x plugin system.
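Nutch exposes several extension points. Each extension point is an interface, and a plugin is developed by implementing that interface. Nutch provides the following extension points; to write a plugin, implement the interface of the one you need.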
IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls; however, if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing a page.

2. The plugin directory structure
Nutch plugin directories all look alike, so walking through the parse-html directory is enough.
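/src          # source directory
build.xml     # tells ant how to build this plugin (for example, where the compiled jar is placed)
ivy.xml       # the plugin's ivy configuration (dependency management, the same role as Maven's pom.xml)
plugin.xml    # Nutch's description of the plugin (for example, which extension points it implements and the names of the implementing classes)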
3. build.xml
build.xml tells ant how to build this plugin.
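Now look at the configuration and compare it with addIndexedmetatags. The comparison shows why, for this plugin, the entries in index.parse.md must carry the metatag. prefix.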
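
The prefix rule can be illustrated with a small self-contained sketch. It is not the plugin's source code: the class MetatagPrefixSketch, the Map-based stand-in for Nutch's parse metadata, and the helper selectForIndex are all hypothetical names made up for this illustration. The sketch assumes that the parser stores each configured meta tag under its lower-cased name with "metatag." in front, and that the indexing side looks keys up exactly as they are written in index.parse.md.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class MetatagPrefixSketch {

    // Assumed prefix under which parsed meta tags are stored.
    static final String PREFIX = "metatag.";

    // Sketch of the idea behind addIndexedmetatags: keep a tag only if it is
    // configured (or "*" is configured), and store it as "metatag." + lower-cased name.
    // The Map is a simplified stand-in for Nutch's parse metadata.
    static void addIndexedMetatags(Map<String, List<String>> metadata,
                                   Set<String> configuredNames,
                                   String metatag, String value) {
        String lc = metatag.toLowerCase(Locale.ROOT);
        if (configuredNames.contains("*") || configuredNames.contains(lc)) {
            metadata.computeIfAbsent(PREFIX + lc, k -> new ArrayList<>()).add(value);
        }
    }

    // Hypothetical stand-in for the indexing side: copy only the keys that are
    // listed in index.parse.md, matching them literally.
    static Map<String, List<String>> selectForIndex(Map<String, List<String>> metadata,
                                                    List<String> indexParseMd) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String key : indexParseMd) {
            List<String> values = metadata.get(key);
            if (values != null) {
                out.put(key, values);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tag names the parser is configured to extract (illustrative values).
        Set<String> names = new HashSet<>(Arrays.asList("description", "keywords"));

        Map<String, List<String>> parseMeta = new LinkedHashMap<>();
        addIndexedMetatags(parseMeta, names, "Description", "A sample page");
        addIndexedMetatags(parseMeta, names, "keywords", "nutch,plugin");
        addIndexedMetatags(parseMeta, names, "author", "not configured, so skipped");

        // With the prefix, both configured tags are found.
        System.out.println(selectForIndex(parseMeta,
                Arrays.asList("metatag.description", "metatag.keywords")));

        // Without the prefix, nothing matches and the result is empty.
        System.out.println(selectForIndex(parseMeta,
                Arrays.asList("description", "keywords")));
    }
}
```

In this sketch the first lookup returns both tags and the second returns an empty map, which is the behaviour the comparison above points at: a key listed in index.parse.md without the metatag. prefix never matches what the parser stored.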