This article walks through how Apache Nutch handles the generate.max.count parameter when it builds fetch lists.
The parameter is processed in Selector, an inner class of org.apache.nutch.crawl.Generator.
The relevant field declarations in org.apache.nutch.crawl.Generator:
private HashMap<String, int[]> hostCounts = new HashMap<String, int[]>();
private int maxCount;
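Judging from how the reduce method uses it, each int[] value holds two numbers: index 0 is the 1-based segment currently assigned to that host or domain, and index 1 is the number of URLs counted for it in that segment. The sketch below isolates that bookkeeping. The class name HostCountsSketch and the method record are hypothetical and do not exist in Nutch. Because the map stores a reference to the array, incrementing an element updates the stored value in place without another put.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the hostCounts bookkeeping in Generator.Selector.
class HostCountsSketch {
    // Value layout: [0] = current segment number (1-based), [1] = URLs counted in that segment.
    private final Map<String, int[]> hostCounts = new HashMap<>();

    // Counts one URL for the given host or domain and returns its (shared) counter array.
    int[] record(String hostOrDomain) {
        int[] hostCount = hostCounts.get(hostOrDomain);
        if (hostCount == null) {
            hostCount = new int[] {1, 0}; // start in segment 1 with nothing counted yet
            hostCounts.put(hostOrDomain, hostCount);
        }
        hostCount[1]++; // mutates the array held by the map; no second put needed
        return hostCount;
    }
}
```

Calling record twice with the same key returns the same array, with the count at 2 after the second call.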
In the inner class Selector, the value is read during configuration:
maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
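The second argument to getInt is the fallback used when the property is not set, so maxCount is -1 unless generate.max.count is configured. The sketch below reproduces that lookup without a Hadoop dependency. It uses java.util.Properties as a stand-in for the job configuration, and the class MaxCountConfigSketch and its methods are hypothetical. The key string is assumed to match the parameter name in this article's title.

```java
import java.util.Properties;

// Hypothetical stand-in for reading generate.max.count from a job configuration.
class MaxCountConfigSketch {
    // Assumed to be the value behind Nutch's GENERATOR_MAX_COUNT constant.
    static final String GENERATOR_MAX_COUNT = "generate.max.count";

    // Returns the property parsed as an int, or defaultValue when it is absent.
    static int getInt(Properties conf, String name, int defaultValue) {
        String value = conf.getProperty(name);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }

    // Mirrors the lookup in the article: fall back to -1 when nothing is configured.
    static int readMaxCount(Properties conf) {
        return getInt(conf, GENERATOR_MAX_COUNT, -1);
    }
}
```

With an empty Properties object, readMaxCount returns -1. After setting generate.max.count to 50, it returns 50.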
The handling in the reduce method:
// 1. Look up the int[] for this host or domain. If there is none yet, create one
//    and put it in the map. Then increment the second element (the URL count).
int[] hostCount = hostCounts.get(hostordomain);
if (hostCount == null) {
  hostCount = new int[] { 1, 0 };
  hostCounts.put(hostordomain, hostCount);
}
hostCount[1]++; // increment hostCount

// 2. Check whether topN has been reached: while the segment in hostCount[0]
//    already holds limit URLs, select the next segment.
while (segCounts[hostCount[0] - 1] >= limit // segCounts: URLs assigned to each segment
    && hostCount[0] < maxNumSegments) {
  hostCount[0]++;
  hostCount[1] = 0;
}

// reached the limit of allowed URLs per host / domain
// see if we can put it in the next segment?
if (hostCount[1] >= maxCount) {
  if (hostCount[0] < maxNumSegments) {
    hostCount[0]++;
    hostCount[1] = 0;
  } else {
    if (hostCount[1] == maxCount + 1 && LOG.isInfoEnabled()) {
      LOG.info("Host or domain " + hostordomain + " has more than " + maxCount
          + " URLs for all " + maxNumSegments
          + " segments. Additional URLs won't be included in the fetchlist.");
    }
    // skip this entry
    continue;
  }
}
entry.segnum = new IntWritable(hostCount[0]);
segCounts[hostCount[0] - 1]++;
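To see what this logic does to a stream of URLs, here is a minimal, self-contained simulation. It copies the branching from the excerpt as quoted but replaces the Hadoop and Nutch types with plain Java. The class SegmentAssignSketch, the method assign, and the convention of returning 0 for a skipped entry are all assumptions made for illustration. Logging is omitted, and the surrounding reduce loop is reduced to a simple for loop over host keys.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the per-host segment assignment shown in the excerpt.
class SegmentAssignSketch {
    // Returns, for each key, the 1-based segment it lands in, or 0 if the entry is skipped.
    static int[] assign(List<String> keys, int maxCount, int limit, int maxNumSegments) {
        Map<String, int[]> hostCounts = new HashMap<>();
        int[] segCounts = new int[maxNumSegments]; // URLs assigned to each segment
        int[] result = new int[keys.size()];       // 0 means "skipped"

        for (int i = 0; i < keys.size(); i++) {
            String hostordomain = keys.get(i);
            int[] hostCount = hostCounts.get(hostordomain);
            if (hostCount == null) {
                hostCount = new int[] {1, 0};
                hostCounts.put(hostordomain, hostCount);
            }
            hostCount[1]++;

            // Current segment is full: advance while a later segment exists.
            while (segCounts[hostCount[0] - 1] >= limit && hostCount[0] < maxNumSegments) {
                hostCount[0]++;
                hostCount[1] = 0;
            }

            // Per-host limit reached: try the next segment, otherwise skip the entry.
            if (hostCount[1] >= maxCount) {
                if (hostCount[0] < maxNumSegments) {
                    hostCount[0]++;
                    hostCount[1] = 0;
                } else {
                    continue; // result[i] stays 0
                }
            }
            result[i] = hostCount[0];
            segCounts[hostCount[0] - 1]++;
        }
        return result;
    }
}
```

Calling assign with five URLs from one host, maxCount = 2, limit = 100 and maxNumSegments = 2 yields the segments 1, 2, 2, 0, 0. Because the comparison is `>=` and the count is reset to 0 when moving on, the first segment receives maxCount - 1 URLs, the last segment receives maxCount, and the rest are dropped. This reflects the excerpt as quoted here and may differ in other Nutch versions.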
That covers how the generate.max.count parameter is handled in the Nutch Generator.