Nutch Development

Date: 2023-05-04

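Nutch Development (Part 4)

Contents

Development environment
1. Introduction to Nutch plugin design
2. Plugin directory structure
3. build.xml
4. ivy.xml
5. plugin.xml
6. The parse-html plugin
    HtmlParser
        setConf(Configuration conf)
        parse(InputSource input)
        getParse(Content content)
7. The parse-metatags plugin
    metaTagsParser
        filter method
        addIndexedmetatags method
        metadata plugin configuration

Development environment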

Linux (Ubuntu 20.04 LTS), IDEA, Nutch 1.18, Solr 8.11

Please credit the source when reposting. By 鸭梨的药丸哥

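1. Introduction to Nutch plugin design

Nutch is highly extensible. Its plugin system is modeled on the Eclipse 2.x plugin system.

Nutch exposes several extension points. Each extension point is an interface, and a plugin is written by implementing that interface. Nutch provides the following extension points: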

IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).

IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).

Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.

HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).

Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.

URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls; however, if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.

URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.

ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.

SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing a page.
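To make the extension-point idea concrete, here is a minimal sketch of a URL filter. It deliberately does not depend on Nutch: `SimpleUrlFilter` is a hypothetical stand-in that only mirrors the shape of the URLFilter contract (return the URL to keep it, or `null` to drop it), and `HostSuffixUrlFilter` with its host suffix is invented for illustration. A real plugin would implement Nutch's own interface and declare the implementation in plugin.xml.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class UrlFilterDemo {

    // Hypothetical stand-in for the URLFilter extension point (not the Nutch interface).
    interface SimpleUrlFilter {
        // Returns the URL if it should be fetched, or null if it should be dropped.
        String filter(String urlString);
    }

    // Illustrative implementation: keep only URLs whose host is the given
    // domain or one of its subdomains.
    static class HostSuffixUrlFilter implements SimpleUrlFilter {
        private final String allowedDomain;

        HostSuffixUrlFilter(String allowedDomain) {
            this.allowedDomain = allowedDomain.toLowerCase(Locale.ROOT);
        }

        @Override
        public String filter(String urlString) {
            try {
                String host = new URI(urlString).getHost();
                if (host == null) {
                    return null; // no host component, nothing to match against
                }
                host = host.toLowerCase(Locale.ROOT);
                boolean allowed = host.equals(allowedDomain)
                        || host.endsWith("." + allowedDomain);
                return allowed ? urlString : null;
            } catch (URISyntaxException e) {
                return null; // unparseable URLs are dropped
            }
        }
    }

    public static void main(String[] args) {
        SimpleUrlFilter filter = new HostSuffixUrlFilter("example.org");
        System.out.println(filter.filter("https://docs.example.org/a"));
        System.out.println(filter.filter("https://other.test/"));
    }
}
```

The "return the input or `null`" convention is what lets several filters be applied one after another: a URL survives only if every filter hands it back.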

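2. Plugin directory structure

Nutch plugin directories all look alike, so the parse-html directory serves as the example here.

/src          # source directory
build.xml     # tells ant how to build this plugin (for example, where the compiled jar goes)
ivy.xml       # the plugin's ivy configuration (dependency management, comparable to Maven's pom.xml)
plugin.xml    # Nutch's description of the plugin (for example, which extension points it implements and the names of the implementing classes)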

3. build.xml

build.xml tells ant how to build this plugin.

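// Stores a meta tag in the parse metadata if it is configured for extraction.
private void addIndexedmetatags(Metadata metadata, String metatag, String value) {
    String lcmetatag = metatag.toLowerCase(Locale.ROOT);
    // "*" means every meta tag is extracted
    if (metatagset.contains("*") || metatagset.contains(lcmetatag)) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("Found meta tag: {}\t{}", lcmetatag, value);
        }
        // the key is stored with the "metatag." prefix
        metadata.add("metatag." + lcmetatag, value);
    }
}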
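The effect of this method is easier to see in isolation. The sketch below reproduces only its selection and prefixing logic, under some assumptions: `MetatagPrefixDemo` and `collect` are hypothetical names, a plain `Map` stands in for the parse metadata, and the meta tags are passed in as name/value pairs instead of coming from a parsed page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class MetatagPrefixDemo {

    // wanted: lower-case names to extract, or "*" for all (plays the role of metatagset).
    // tags:   {name, value} pairs as they might appear in a page's <meta> elements.
    // Returns a map standing in for the parse metadata (assumption: multi-valued keys).
    static Map<String, List<String>> collect(Set<String> wanted, String[][] tags) {
        Map<String, List<String>> parseMeta = new LinkedHashMap<>();
        for (String[] tag : tags) {
            String lcName = tag[0].toLowerCase(Locale.ROOT);
            if (wanted.contains("*") || wanted.contains(lcName)) {
                // same key shape as in the method above: "metatag." + lower-cased name
                parseMeta.computeIfAbsent("metatag." + lcName, k -> new ArrayList<>())
                         .add(tag[1]);
            }
        }
        return parseMeta;
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("description", "keywords"));
        String[][] tags = {
            {"Description", "A page about Nutch"},
            {"Author", "someone"},
            {"KEYWORDS", "nutch,plugin"}
        };
        System.out.println(collect(wanted, tags).keySet());
    }
}
```

With this input only `metatag.description` and `metatag.keywords` end up as keys. The `Author` tag is skipped because it is not in the configured set, and the mixed-case names are matched after lower-casing.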

metadata plugin configuration

Now look at the configuration and compare it with addIndexedmetatags. This shows why the entries in index.parse.md must carry the metatag. prefix.

<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description>Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>
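The role of the prefix on the indexing side can be sketched as a plain key lookup. This is a simplified model and not the index-metadata plugin's code: `IndexParseMdDemo` and `selectFields` are hypothetical names, and a single-valued `Map` stands in for the parse metadata. It only illustrates that the keys listed in index.parse.md have to match the stored keys exactly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IndexParseMdDemo {

    // indexParseMd: the comma-separated value of the 'index.parse.md' parameter.
    // parseMeta:    stand-in for the parse metadata produced by the parser.
    // Returns the fields that would be generated, keyed by the configured name.
    static Map<String, String> selectFields(String indexParseMd, Map<String, String> parseMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String rawKey : indexParseMd.split(",")) {
            String key = rawKey.trim();
            if (key.isEmpty()) {
                continue;
            }
            String value = parseMeta.get(key); // exact key match, prefix included
            if (value != null) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("metatag.description", "A page about Nutch");
        parseMeta.put("metatag.keywords", "nutch,plugin");

        // With the prefix, both keys are found.
        System.out.println(selectFields("metatag.description,metatag.keywords", parseMeta));
        // Without the prefix, nothing matches.
        System.out.println(selectFields("description,keywords", parseMeta));
    }
}
```

In this model, configuring `description,keywords` without the prefix yields no fields at all, because the parser stored the values under `metatag.description` and `metatag.keywords`.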
