Preface: Google and Baidu are more than good enough for everyday search. The search engine implemented here is just a small exercise to make some follow-up projects of my own easier.
Theoretical Architecture

To build a search engine, the first step is to think through the complete architecture.
The pipeline breaks down into five parts: page crawling, storage, analysis, search implementation, and presentation.

Page Crawling
First, for page crawling I plan to take the simplest route: plain HttpClient. Some will object that this misses the many Web 2.0 sites whose content is rendered by JavaScript. True, but at this stage I just want to validate that the architecture works, so I am deliberately skipping the complicated cases.
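For reference, this is roughly all the fetching amounts to; a minimal sketch with Apache HttpClient 4.x, independent of the wrapper classes that appear later in this post:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.nio.charset.StandardCharsets;

public class SimpleFetch {
    public static String fetch(String url) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse resp = client.execute(new HttpGet(url))) {
            // Only the raw HTML is returned; anything rendered by JavaScript
            // (the Web 2.0 case mentioned above) is invisible to this approach.
            return EntityUtils.toString(resp.getEntity(), StandardCharsets.UTF_8);
        }
    }
}
```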
Storage

Next, storage. I plan to persist everything directly on the file system and load all of it into memory when searching. Some will say that eats a lot of memory. True, but I can allocate a large swap space and trade performance for memory.
Analysis

For the analysis step, I run a word-segmentation algorithm over each page, compute term frequencies, and build an inverted index per article. I do not index every word of an article, though, since an unoptimized file system could not handle that volume of reads and writes. The plan here is to take the words whose frequency falls in the 20–50 range, plus the segmented words of the site title, as the site's keywords, and build the inverted index from those. To keep this from staying abstract, here is the resulting structure: [screenshot of the generated keyword files]
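The selection rule itself is easy to state in code. A minimal sketch, assuming the segmenter has already produced a word-to-frequency map (the project's own Nlp class is not shown in this post):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KeywordPicker {
    /** Keep the words whose frequency lies in [20, 50], per the rule above. */
    public static Set<String> pickKeywords(Map<String, Integer> termFreq) {
        Set<String> keywords = new HashSet<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            if (e.getValue() >= 20 && e.getValue() <= 50) {
                keywords.add(e.getKey());
            }
        }
        return keywords;
    }
}
```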
The file name is the segmented word itself, and the file's contents are the domains of all sites that have that word as a keyword.
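For illustration, reconstructing the format from the loadWord parser shown later (it keeps the lines beginning with "#### "), a keyword file such as 电影.md would contain something like the following, with made-up domains:

```
#### example-movie-site.com
#### another-review-site.org
#### some-forum.net
```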
This is loosely similar to how elasticsearch stores its indexes under the hood, except that I have not done any of the optimizations.
Search Implementation

For the search itself, I plan to load the files above into memory and keep them in a HashMap, which makes lookups trivial.
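A minimal sketch of that loading step, assuming one file per keyword in the format above (word-named .md files whose domain lines start with "#### "):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IndexLoader {
    /** Builds word -> set-of-domains from the per-keyword files. */
    public static Map<String, Set<String>> load(String dir) throws IOException {
        Map<String, Set<String>> index = new HashMap<>();
        File[] files = new File(dir).listFiles();
        if (files == null) return index;
        for (File file : files) {
            String word = file.getName().replace(".md", "");
            Set<String> domains = new HashSet<>();
            for (String line : Files.readAllLines(file.toPath())) {
                if (line.startsWith("#### ")) {
                    domains.add(line.substring(5).trim());
                }
            }
            index.put(word, domains);
        }
        return index;
    }
}
```

A query is then just a lookup: index.get("电影") returns the matching domains, and a multi-word query can intersect the sets.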
Presentation

To make it usable with a single click anywhere, I plan to deliver the front end as a Chrome extension.
That covers the theoretical architecture; time to start building.
Hands-on Implementation

Page Crawling
As mentioned above, pages are fetched directly with HttpClient. Besides fetching, this step also has to parse the outbound links on each page. Before getting to link parsing, though, let me describe the crawling strategy.
Picture the entire internet as one enormous web in which sites are linked to one another. Plenty of sites are unreachable islands, but that does not prevent us from covering the vast majority. So the approach here is a breadth-first traversal seeded from multiple primary nodes: for each site, fetch only the home page, extract every outbound link from it, and enqueue those links as new crawl targets.
The crawling code is as follows:
```java
import com.chaojilaji.auto.autocode.generatecode.GenerateFile;
import com.chaojilaji.auto.autocode.standartReq.SendReq;
import com.chaojilaji.moneyframework.model.OnePage;
import com.chaojilaji.moneyframework.model.Word;
import com.chaojilaji.moneyframework.service.Nlp;
import com.chaojilaji.moneyframework.utils.DomainUtils;
import com.chaojilaji.moneyframework.utils.HtmlUtil;
import com.chaojilaji.moneyframework.utils.MDUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.util.StringUtils;

import java.io.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

public class HttpClientCrawl {

    private static Log logger = LogFactory.getLog(HttpClientCrawl.class);

    // Generic type parameters restored; the original formatting had stripped
    // everything inside angle brackets.
    public Set<String> oldDomains = new ConcurrentSkipListSet<>();
    public Map<String, OnePage> onePageMap = new ConcurrentHashMap<>(400000);
    public Set<String> ignoreSet = new ConcurrentSkipListSet<>();
    public Map<String, Set<String>> siteMaps = new ConcurrentHashMap<>(50000);

    public String domain;

    public HttpClientCrawl(String domain) {
        this.domain = DomainUtils.getDomainWithCompleteDomain(domain);
        String[] ignores = {"gov.cn", "apac.cn", "org.cn", "twitter.com",
                "baidu.com", "google.com", "sina.com", "weibo.com",
                "github.com", "sina.com.cn", "sina.cn", "edu.cn", "wordpress.org", "sephora.com"};
        ignoreSet.addAll(Arrays.asList(ignores));
        loadIgnore();
        loadWord();
    }

    private Map<String, String> defaultHeaders() {
        Map<String, String> ans = new HashMap<>();
        ans.put("Accept", "application/json, text/plain, */*");
        ans.put("Content-Type", "application/json");
        ans.put("Connection", "keep-alive");
        ans.put("Accept-Language", "zh-CN,zh;q=0.9");
        ans.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36");
        return ans;
    }

    public SendReq.ResBody doRequest(String url, String method, Map<String, Object> params) {
        return SendReq.sendReq(url, method, params, defaultHeaders());
    }

    public void loadIgnore() {
        File directory = new File(".");
        try {
            String file = directory.getCanonicalPath() + "/moneyframework/generate/ignore/demo.txt";
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(file))));
            String line;
            while ((line = reader.readLine()) != null) {
                String x = line.replace("[", "").replace("]", "").replace(" ", "");
                String[] y = x.split(",");
                ignoreSet.addAll(Arrays.asList(y));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void loadDomains(String file) {
        File directory = new File(".");
        try {
            // Windows-style separator, as in the original
            File file1 = new File(directory.getCanonicalPath() + "\\" + file);
            logger.info(directory.getCanonicalPath() + "\\" + file);
            if (!file1.exists()) {
                file1.createNewFile();
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file1)));
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                OnePage onePage = new OnePage(line);
                if (!oldDomains.contains(onePage.getDomain())) {
                    onePageMap.put(onePage.getDomain(), onePage);
                    oldDomains.add(onePage.getDomain());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void handleWord(List<String> s, String domain, String title) {
        for (String a : s) {
            String x = a.split(" ")[0]; // the word
            String y = a.split(" ")[1]; // its frequency
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (Integer.parseInt(y) >= 10) {
                if (z.contains(domain)) continue;
                z.add(domain);
                siteMaps.put(x, z);
                GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                        MDUtils.getMdContent(domain, title, s.toString()));
            }
        }
        Set<Word> titleWords = Nlp.separateWordAndReturnUnit(title);
        for (Word word : titleWords) {
            String x = word.getWord();
            Set<String> z = siteMaps.getOrDefault(x, new ConcurrentSkipListSet<>());
            if (z.contains(domain)) continue;
            z.add(domain);
            siteMaps.put(x, z);
            GenerateFile.appendFileWithRelativePath("moneyframework/domain/markdown", x + ".md",
                    MDUtils.getMdContent(domain, title, s.toString()));
        }
    }

    public void loadWord() {
        File directory = new File(".");
        try {
            File file1 = new File(directory.getCanonicalPath() + "\\moneyframework/domain/markdown");
            if (file1.isDirectory()) {
                int fileCnt = 0;
                File[] files = file1.listFiles();
                for (File file : files) {
                    fileCnt++;
                    try {
                        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
                        String line;
                        siteMaps.put(file.getName().replace(".md", ""), new ConcurrentSkipListSet<>());
                        while ((line = reader.readLine()) != null) {
                            line = line.trim();
                            if (line.startsWith("####")) {
                                siteMaps.get(file.getName().replace(".md", "")).add(line.replace("#### ", "").trim());
                            }
                        }
                    } catch (Exception e) {
                        // skip unreadable files
                    }
                    if ((fileCnt % 1000) == 0) {
                        logger.info((fileCnt * 100.0) / files.length + "%");
                    }
                }
            }
            for (Map.Entry<String, Set<String>> entry : siteMaps.entrySet()) {
                oldDomains.addAll(entry.getValue());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void doTask() {
        String root = "http://" + this.domain + "/";
        Queue<String> urls = new LinkedList<>(); // fixed: was "new linkedList<>()"
        urls.add(root);
        Set<String> tmpDomains = new HashSet<>(oldDomains);
        tmpDomains.add(DomainUtils.getDomainWithCompleteDomain(root));
        int cnt = 0;
        while (!urls.isEmpty()) {
            String url = urls.poll();
            SendReq.ResBody html = doRequest(url, "GET", new HashMap<>());
            cnt++;
            if (html.getCode().equals(0)) {
                // code 0 means the site was unreachable; blacklist its domain
                ignoreSet.add(DomainUtils.getDomainWithCompleteDomain(url));
                try {
                    GenerateFile.createFile2("moneyframework/generate/ignore", "demo.txt", ignoreSet.toString());
                } catch (IOException e) {
                    e.printStackTrace();
                }
                continue;
            }
            OnePage onePage = new OnePage();
            onePage.setUrl(url);
            onePage.setDomain(DomainUtils.getDomainWithCompleteDomain(url));
            onePage.setCode(html.getCode());
            String title = HtmlUtil.getTitle(html.getResponce()).trim();
            if (!StringUtils.hasText(title) || title.length() > 100 || title.contains("�")) {
                title = "没有"; // placeholder title, meaning "none"
            }
            onePage.setTitle(title);
            String content = HtmlUtil.getContent(html.getResponce());
            Set<Word> words = Nlp.separateWordAndReturnUnit(content);
            List<String> wordStr = Nlp.print2List(new ArrayList<>(words), 10);
            handleWord(wordStr, DomainUtils.getDomainWithCompleteDomain(url), title);
            onePage.setContent(wordStr.toString());
            if (html.getCode().equals(200)) {
                List<String> domains = HtmlUtil.getUrls(html.getResponce());
                for (String domain : domains) {
                    int flag = 0;
                    // fixed: split takes a regex, so "." must be escaped
                    String[] parts = domain.split("\\.");
                    if (parts.length >= 4) {
                        continue;
                    }
                    for (String i : ignoreSet) {
                        if (domain.endsWith(i)) {
                            flag = 1;
                            break;
                        }
                    }
                    if (flag == 1) continue;
                    if (StringUtils.hasText(domain.trim())) {
                        if (!tmpDomains.contains(domain)) {
                            tmpDomains.add(domain);
                            urls.add("http://" + domain + "/");
                        }
                    }
                }
                logger.info(this.domain + " 队列的大小为 " + urls.size()); // logs the queue size
                if (cnt >= 2000) {
                    break;
                }
            } else {
                // non-200 over plain HTTP: retry once over HTTPS
                if (url.startsWith("http:")) {
                    urls.add(url.replace("http:", "https:"));
                }
            }
        }
    }
}
```
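Kicking off a crawl is then just a matter of seeding one crawler per entry site; a hypothetical entry point:

```java
public class Main {
    public static void main(String[] args) {
        // The seed domain is arbitrary; several crawlers can be started
        // in parallel to realize the multi-node BFS described above.
        HttpClientCrawl crawl = new HttpClientCrawl("http://example.com");
        crawl.doTask();
    }
}
```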
Here, _SendReq.sendReq_ is a page-download method of my own that calls into HttpClient. If you want to crawl Web 2.0 sites, you could wrap Playwright inside it instead.
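SendReq itself is not shown in this post. A minimal sketch of what such a wrapper might look like on top of Apache HttpClient, matching the getCode()/getResponce() accessors used by the crawler above (the "responce" spelling follows the calling code; the timeouts and the GET-only path are my assumptions):

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class SendReq {

    public static class ResBody {
        private Integer code;
        private String responce;
        public Integer getCode() { return code; }
        public String getResponce() { return responce; }
    }

    public static ResBody sendReq(String url, String method, Map<String, Object> params,
                                  Map<String, String> headers) {
        ResBody res = new ResBody();
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5000)
                .setSocketTimeout(5000)
                .build();
        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(config).build()) {
            // Only GET is sketched here; the real method presumably also handles POST params.
            HttpGet get = new HttpGet(url);
            headers.forEach(get::setHeader);
            try (CloseableHttpResponse resp = client.execute(get)) {
                res.code = resp.getStatusLine().getStatusCode();
                res.responce = EntityUtils.toString(resp.getEntity(), StandardCharsets.UTF_8);
            }
        } catch (Exception e) {
            // The crawler treats code 0 as "unreachable" and blacklists the domain.
            res.code = 0;
            res.responce = "";
        }
        return res;
    }
}
```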
Next comes HtmlUtil, the utility class that normalizes the HTML: it strips tags and the garbled characters that special-character escapes leave behind.
```java
import org.apache.commons.lang3.StringEscapeUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlUtil {

    public static String getContent(String html) {
        String ans = "";
        try {
            html = StringEscapeUtils.unescapeHtml4(html); // decode entities such as &amp;
            html = delHTMLTag(html);
            html = htmlTextFormat(html);
            return html;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return ans;
    }

    public static String delHTMLTag(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>";
        // The source is cut off after the line above; the remainder of this method is
        // reconstructed from the standard tag-stripping pattern the snippet is based on.
        String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>";
        String regEx_html = "<[^>]+>";
        htmlStr = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE).matcher(htmlStr).replaceAll("");
        htmlStr = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE).matcher(htmlStr).replaceAll("");
        htmlStr = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE).matcher(htmlStr).replaceAll("");
        return htmlStr.trim();
    }

    // (reconstructed) collapse the runs of whitespace left behind by tag removal
    public static String htmlTextFormat(String text) {
        return text.replaceAll("\\s{2,}", " ").trim();
    }

    // The original class goes on to define getTitle(...) and getUrls(...), which the
    // crawler above relies on, but the source is truncated here.
}
```