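We often need to collect information from other websites. On .NET the relevant techniques are all mature: WebRequest fetches the page, with support for a custom Referer header and for cookies, and the page is usually parsed with regular expressions. The drawback is that whenever the target site changes its structure, the code has to be modified, recompiled, and redeployed. With IronPython, the fetching and parsing logic can live in a Python script. If the target page changes, only the script needs editing and the application does not have to be recompiled. C# can then handle the interaction and UI, while Python encapsulates the parts expected to change often.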
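The division of labor described above can be illustrated without any .NET tooling. The following is a minimal sketch in plain CPython, not the IronPython hosting API: a host function loads a rule script from disk on every call, so editing the script changes behavior without touching the host. The names `load_parser`, `rules.py`, and `parse` are hypothetical and chosen only for this illustration.

```python
import os
import runpy
import tempfile


def load_parser(script_path):
    """Execute the script at script_path and return its top-level `parse` function."""
    namespace = runpy.run_path(script_path)
    return namespace["parse"]


def write_rules(script_path, field_index):
    """Write a tiny rule script that picks one comma-separated field (hypothetical rule)."""
    source = "def parse(text):\n    return text.split(',')[%d]\n" % field_index
    with open(script_path, "w", encoding="utf-8") as handle:
        handle.write(source)


def demo():
    """Load the rule script, change it on disk, and load it again."""
    with tempfile.TemporaryDirectory() as folder:
        script_path = os.path.join(folder, "rules.py")

        write_rules(script_path, 0)
        first = load_parser(script_path)("title,author")

        # The "site changed": only the script is edited, the host code stays the same.
        write_rules(script_path, 1)
        second = load_parser(script_path)("title,author")
        return first, second
```

Calling `demo()` loads the script twice, once before and once after it is rewritten, and the two results differ even though the host code never changed.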
How to move the fetching and parsing logic into a Python script with IronPython
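After installing IronPython and VS.NET 2010, you also need to download SGMLReader (see the reference links). This component converts loosely formatted HTML into well-formed XML, and can even add DTD validation.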
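To make the idea of turning loose HTML into well-formed XML concrete, here is a rough stand-in written with CPython's standard library. It is not SgmlReader and does not consult a DTD, so it will not infer implied end tags the way a DTD-driven reader can; it only closes elements that were left open and ignores stray end tags. The class name `LooseHtmlBuilder`, the synthetic `document` root, and the helper `html_to_xml` are assumptions for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LooseHtmlBuilder(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root so there is always one
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        attrib = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # an end tag with no matching open element is ignored.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    """Return the root element of a well-formed tree built from html."""
    builder = LooseHtmlBuilder()
    builder.feed(html)
    builder.close()  # flush buffered text and finish parsing
    return builder.root
```

For example, `ET.tostring(html_to_xml('<P>Hello <B>world<br>again</P>'), encoding="unicode")` yields markup that an XML parser accepts, with the unclosed `b` element closed and tag names lower-cased.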
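Taking a Baidu Tieba page as the example: create a new Console project and reference the IronPython, Microsoft.Dynamic, Microsoft.Scripting, and SgmlReaderDll assemblies. Copy Html.dtd from SGMLReader into the project directory; without it, SgmlReader looks up the DTD online according to the doctype. Then create a file named baidu.py. Finally, add the following commands to the build events in the project properties so that both files are copied to the output directory.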
    copy $(ProjectDir)\*.py $(TargetDir)
    copy $(ProjectDir)\*.dtd $(TargetDir)

In baidu.py, first reference the necessary .NET assemblies.
    import clr, sys
    clr.AddReference("SgmlReaderDll")   # the SgmlReader assembly referenced by the project
    clr.AddReference("System.Xml")
    from Sgml import *
    from System.Net import *
    from System.IO import TextReader, StreamReader
    from System.Xml import *
    from System.Text.UnicodeEncoding import UTF8

    def fromHtml(textReader):
        # Convert loosely formatted HTML into an XmlDocument via SgmlReader
        sgmlReader = SgmlReader()
        sgmlReader.SystemLiteral = "html.dtd"   # local DTD copied by the build event
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All
        sgmlReader.CaseFolding = CaseFolding.ToLower
        sgmlReader.InputStream = textReader
        doc = XmlDocument()
        doc.PreserveWhitespace = True
        doc.XmlResolver = None
        doc.Load(sgmlReader)
        return doc

    def getWebData(url, method, data = None, cookie = None, encoding = "UTF-8"):
        # Fetch url and return a StreamReader that decodes the response with `encoding`
        req = WebRequest.Create(url)
        req.Method = method
        if cookie is not None:
            req.CookieContainer = cookie
        if data is not None:
            # data is expected to be a byte array holding the request body
            stream = req.GetRequestStream()
            stream.Write(data, 0, data.Length)
            stream.Close()
        rsp = req.GetResponse()
        reader = StreamReader(rsp.GetResponseStream(),
                              UTF8.GetEncoding(encoding))
        return reader

    class Post:
        # One row of the thread list
        def __init__(self, hit, comments, title, link, author):
            self.hit = hit
            self.comments = comments
            self.title = title
            self.link = link
            self.author = author

    class BaiDu:
        def __init__(self, encoding):
            self.cc = CookieContainer()
            self.encoding = encoding
            self.posts = []

        def getPosts(self, url):
            # Download the page, convert it to XML, and collect the thread rows
            reader = getWebData(url, "GET", None, self.cc, self.encoding)
            doc = fromHtml(reader)
            trs = doc.SelectNodes("html//table[@id='thread_list_table']/tbody/tr")
            self.parsePosts(trs)

        def parsePosts(self, trs):
            for tr in trs:
                tds = tr.SelectNodes("td")
                hit = tds[0].InnerText
                comments = tds[1].InnerText
                # ChildNodes[1] assumes the link is the second child node of the title cell
                title = tds[2].ChildNodes[1].InnerText
                # This is an XmlAttribute; its .Value holds the URL string
                link = tds[2].ChildNodes[1].Attributes["href"]
                author = tds[3].InnerText
                post = Post(hit, comments, title, link, author)
                self.posts.append(post)
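The row-to-record mapping in `parsePosts` can be tried out offline. The sketch below uses CPython's `xml.etree.ElementTree` on a small hand-written sample rather than `XmlDocument` and a live page; the sample markup and its values are invented, and the real page structure may differ. It also looks up the link by element name instead of by child position, which is one way to avoid depending on how whitespace nodes are counted.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

Post = namedtuple("Post", "hit comments title link author")

# Invented sample that mimics the table layout the XPath above expects.
SAMPLE = """
<html><body>
<table id="thread_list_table"><tbody>
  <tr><td>120</td><td>8</td><td> <a href="/p/1001">First thread</a> </td><td>alice</td></tr>
  <tr><td>45</td><td>2</td><td> <a href="/p/1002">Second thread</a> </td><td>bob</td></tr>
</tbody></table>
</body></html>
""".strip()


def parse_posts(xml_text):
    """Extract one Post per table row from well-formed XML text."""
    root = ET.fromstring(xml_text)
    rows = root.findall(".//table[@id='thread_list_table']/tbody/tr")
    posts = []
    for row in rows:
        cells = row.findall("td")
        anchor = cells[2].find("a")  # select the link by name, not by position
        posts.append(Post(
            hit=cells[0].text.strip(),
            comments=cells[1].text.strip(),
            title=anchor.text.strip(),
            link=anchor.get("href"),
            author=cells[3].text.strip(),
        ))
    return posts
```

If the target page later changes its column order, only the cell indexes in `parse_posts` would need to be adjusted.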

    // Requires: using System; using System.Collections.Generic;
    //           using IronPython.Hosting; using Microsoft.Scripting.Hosting;
    Dictionary<string, object> options = new Dictionary<string, object>();
    options["Debug"] = true;
    ScriptEngine engine = Python.CreateEngine(options);
    // Run baidu.py and keep its top-level scope
    ScriptScope scope = engine.ExecuteFile("baidu.py");
    // Instantiate the Python class BaiDu, passing the page encoding
    dynamic baidu = engine.Operations.Invoke(scope.GetVariable("BaiDu"), "GBK");
    baidu.getPosts("http://tieba.baidu.com/f?kw=seo");
    dynamic posts = baidu.posts;
    foreach (dynamic post in posts)
    {
        Console.WriteLine("{0} (replies: {1}) (views: {2}) [author: {3}]",
            post.title,
            post.comments,
            post.hit,
            post.author);
    }

References:
Build Python scripts and call methods from C#
IronPython and C# interop
Using SGMLParser With IronPython
SGMLReader - Convert any HTML to valid XML
Calling IronPython functions from .NET
IronPython .NET Integration
XPath usage in the XML SelectSingleNode method
Dive Into Python
Python基础教程 (第2版) (Beginning Python, 2nd edition)