Soufun_News
Published: 2019-06-27


using AnfleCrawler.Common;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;

namespace AnfleCrawler.DataAnalyzer
{
    internal class Soufun_News : AnalyzerBase
    {
        // News channels on news.sh.soufun.com; the numeric value is the channel ID used in the list URL.
        private enum Kind
        {
            [Description("市场")] // Market
            Market = 32,
            [Description("政策")] // Policy
            Policy = 35,
            [Description("公司")] // Company
            Company = 736,
        }

        // Tags stripped from the article body before it is saved.
        private static readonly string[] FilterTags = new string[] { "script", "iframe" };

        public override void Init(PageCrawler crawler)
        {
            // Seed pattern: every channel ID combined with list pages 1 to 50.
            // The generic argument of Cast was lost from the published listing; int is assumed here,
            // since the URL needs the numeric channel IDs rather than the enum names.
            string exp = string.Format("http://news.sh.soufun.com/more/[{0}]/[1-50].html",
                string.Join(",", Enum.GetValues(typeof(Kind)).Cast<int>()));
            crawler.PushUrl(new StringPatternGenerator(exp), 0);
            base.Init(crawler);
        }

        protected override void AnalyzeInternal(PageLandEntity current)
        {
            var lander = Crawler.Lander;
            dynamic repository = Repository;
            var pHandler = CreateContentHandler(current);
            switch (current.Depth)
            {
                case 0:
                    {
                        // Depth 0: a list page. Queue the single-page ("_all") version of each article.
                        var dom = lander.GetDocument(pHandler);
                        foreach (var node in QueryNodes(dom.DocumentNode, ".contenttext"))
                        {
                            var linkNode = QueryNode(node, "a.link_01");
                            string url = GetHref(linkNode, current.Url).OriginalString;
                            int i = url.LastIndexOf(".");
                            Crawler.PushUrl(new Uri(url.Insert(i, "_all")), 1);
                        }
                    }
                    break;
                case 1:
                    {
                        // Depth 1: an article page. Extract channel, title, body, date and source.
                        var dom = lander.GetDocument(pHandler);
                        var hackNode = QueryNode(dom.DocumentNode, "#newxq_B01_26");
                        // The last breadcrumb link carries the channel name.
                        string kind = QueryNodes(hackNode, "a").Last().InnerText;
                        string title = QueryNode(dom.DocumentNode, "h1").InnerText;
                        var contentNode = QueryNode(dom.DocumentNode, "#news_body");
                        foreach (string tag in FilterTags)
                        {
                            // ToArray() takes a snapshot so nodes can be removed while iterating.
                            foreach (var node in QueryNodes(contentNode, tag, false).ToArray())
                            {
                                node.Remove();
                            }
                        }
                        // First span: publish date. Second span (optional): source.
                        var set = QueryNodes(dom.DocumentNode, "#newxq_B01_27 span").Take(2).ToArray();
                        string source = null;
                        DateTime publishDate;
                        DateTime.TryParse(set[0].InnerText, out publishDate);
                        if (set.Length == 2)
                        {
                            source = set[1].InnerText;
                        }
                        repository.SaveNews(current.Url, kind, source, title, contentNode.InnerHtml, publishDate);
                        Crawler.OutWrite("保存新闻 {0}", title); // log message: "Saved news {0}"
                    }
                    break;
            }
        }
    }
}
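The seed pattern passed to StringPatternGenerator is not shown in the post, so the following is only a sketch of what it is assumed to produce: one list URL per channel ID and page number. The class name UrlPatternSketch and both method names are hypothetical and are not part of AnfleCrawler. The second method mirrors the "_all" rewrite done at depth 0, with an added guard for URLs that contain no dot.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlPatternSketch
{
    // Hypothetical stand-in for the assumed expansion of
    // "http://news.sh.soufun.com/more/[32,35,736]/[1-50].html".
    public static IEnumerable<string> Expand(int[] kinds, int firstPage, int lastPage)
    {
        foreach (int kind in kinds)
        {
            for (int page = firstPage; page <= lastPage; page++)
            {
                yield return string.Format(
                    "http://news.sh.soufun.com/more/{0}/{1}.html", kind, page);
            }
        }
    }

    // Inserts "_all" before the last dot, as AnalyzeInternal does for article links.
    // Unlike the original, a URL without a dot is returned unchanged instead of throwing.
    public static string ToFullArticleUrl(string url)
    {
        int i = url.LastIndexOf('.');
        return i < 0 ? url : url.Insert(i, "_all");
    }
}
```

With the three channel IDs from the Kind enum and pages 1 to 50, Expand yields 150 list URLs in total.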

public void SaveNews(Uri pageUrl, string kind, string source, string title, string content, DateTime publishDate)
{
    // The row ID is derived from the page URL, so saving the same page twice updates one row.
    Guid rowID = CryptoManaged.MD5Hash(pageUrl.OriginalString);
    using (var db = Create())
    {
        var q = from t in db.News
                where t.RowID == rowID
                select t;
        var news = q.SingleOrDefault();
        if (news == null)
        {
            // First time this URL is seen: insert a new row.
            db.News.Add(news = new News()
            {
                RowID = rowID,
                SiteID = pageUrl.Authority,
            });
        }
        // New or existing, overwrite the fields with the latest crawl result.
        news.Kind = kind;
        news.Source = source;
        news.Title = title;
        news.Content = content;
        news.PublishDate = publishDate;
        db._SaveChanges();
    }
}
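CryptoManaged.MD5Hash is not shown in the post. Since its result is assigned to a Guid, one plausible reading is that it packs the 16-byte MD5 digest of the URL into a Guid. The sketch below works under that assumption; the class name RowIdSketch, the method name FromUrl, and the choice of UTF-8 encoding are all hypothetical, so the actual helper may produce different values.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RowIdSketch
{
    // Assumed behavior: hash the URL text with MD5 and reuse the 16-byte digest as a Guid.
    // UTF-8 is an assumption; the real helper may use another encoding or byte order.
    public static Guid FromUrl(string url)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            return new Guid(hash); // an MD5 digest is exactly 16 bytes, which Guid(byte[]) requires
        }
    }
}
```

Because the ID depends only on the URL, re-crawling a page maps to the same row, which is what lets SaveNews act as an insert-or-update.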

Reposted from: https://www.cnblogs.com/Googler/p/4181664.html
