A Simple Tutorial on Crawling NetEase News with Scrapy

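I've been looking into web crawlers lately. After the round of research and comparison described in my previous post, I settled on Scrapy, an open-source Python framework. See that post for its advantages and a short introduction. For installing Scrapy and getting started, see the tutorial in the official documentation. A complete Chinese translation of the Scrapy documentation is also hosted on readthedocs, and several Chinese-language blogs have translated it as well. These are all very good introductory material. If you want to fully understand a crawler's more advanced features, such as distributed crawling, caching, and concurrency, you will have to study the documentation carefully.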

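Here we go straight to a simple example: using Scrapy to crawl NetEase tech news. Before writing the crawler, let's get the overall approach straight. Our target is the NetEase tech news site. Its pages carry a lot of information: navigation, columns, news, and some ads. Our goal is clear: crawl the news, which breaks down mainly into each article's title, body, source, and time. To keep things simple, we won't handle images, videos, or other multimedia elements in the articles. We also need to think about how to store what we crawl in a database. OK, let's begin.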

Creating the project

Create the project with this command:

scrapy startproject tech163

The new project has the following directory layout:

tech163/
├── scrapy.cfg
└── tech163
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

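scrapy.cfg is the project's configuration file. It normally just points to settings.py, which is where we actually configure the project, including the pipelines, downloader middleware, and user agent.

Defining what to scrape

In Scrapy, the content to scrape is defined in items.py. The definition is very simple: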

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class Tech163Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    news_thread = scrapy.Field()
    news_title = scrapy.Field()
    news_url = scrapy.Field()
    news_time = scrapy.Field()
    news_from = scrapy.Field()
    from_url = scrapy.Field()
    news_body = scrapy.Field()

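Here, news_thread is a string unique to each article, extracted from its URL. Take, for example, an article whose address is http://tech.163.com/14/0813/10/A3H72TD4000915BF.html: its thread is A3H72TD4000915BF. My guess is that this is a randomly generated identifier for the article. 14 and 0813 are the year and the date, and 10 is a two-digit number. news_title, news_url, news_time, news_from, from_url, and news_body are the article attributes mentioned earlier.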
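
To make the URL layout concrete, here is a minimal sketch that pulls the thread and the numeric path segments out of the example URL using plain string operations. The helper name thread_id is made up for this illustration; it is not part of Scrapy or of the project code.

```python
def thread_id(url):
    """Return the last path segment of the URL without its '.html' suffix."""
    last_segment = url.strip().split('/')[-1]
    return last_segment[:-5]  # '.html' is 5 characters long


url = "http://tech.163.com/14/0813/10/A3H72TD4000915BF.html"

# Splitting on '/' gives ['http:', '', 'tech.163.com', '14', '0813', '10', ...],
# so indices 3 to 5 hold the year, the date, and the two-digit number.
year, date, number = url.split('/')[3:6]

print(thread_id(url))      # A3H72TD4000915BF
print(year, date, number)  # 14 0813 10
```

Slicing off a fixed five characters only works because every article URL ends in .html; a URL with a different suffix would need a different approach.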

Writing the spider

With the items defined, we can start writing our spider. In the directory tech163/spiders/, we create a spider file called news_spider.py.

# encoding: utf-8
# Older Scrapy releases exposed these two modules under scrapy.contrib.*
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tech163.items import Tech163Item


class ExampleSpider(CrawlSpider):
    name = "news"
    allowed_domains = ["tech.163.com"]
    start_urls = ['http://tech.163.com/']
    # Follow links whose URL matches the pattern and parse each one as news.
    rules = (
        Rule(LinkExtractor(allow=r"/14/08\d+/\d+/*"),
             callback="parse_news", follow=True),
    )

    def printcn(self, uni):
        # Debug helper: print every string in a list of extracted values.
        for i in uni:
            print(i)

    def parse_news(self, response):
        item = Tech163Item()
        # Last URL segment without the 5-character '.html' suffix.
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        self.get_source(response, item)
        self.get_url(response, item)
        self.get_news_from(response, item)
        self.get_from_url(response, item)
        self.get_text(response, item)
        # Remember to return the item after parsing.
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            # Drop the last 5 characters of the page title.
            item['news_title'] = title[0][:-5]

    def get_source(self, response, item):
        # The text of the 'left' div holds the publication time.
        source = response.xpath("//div[@class='left']/text()").extract()
        if source:
            item['news_time'] = source[0][:-5]

    def get_news_from(self, response, item):
        news_from = response.xpath("//div[@class='left']/a/text()").extract()
        if news_from:
            item['news_from'] = news_from[0]

    def get_from_url(self, response, item):
        from_url = response.xpath("//div[@class='left']/a/@href").extract()
        if from_url:
            item['from_url'] = from_url[0]

    def get_text(self, response, item):
        # Stored as a list with one entry per paragraph.
        news_body = response.xpath("//div[@id='endText']/p/text()").extract()
        if news_body:
            item['news_body'] = news_body

    def get_url(self, response, item):
        news_url = response.url
        if news_url:
            item['news_url'] = news_url

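The code is very simple. Most of it parses the page, mostly by using XPath to extract content; for that part you will still need to consult the documentation. These parsing methods only handle a single news page, so what makes the spider keep crawling the tech channel's news? This is where Scrapy is convenient. The whole mechanism takes just one statement: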

rules=(
Rule(LinkExtractor(allow=r"/14/08\d+/\d+/*"),
callback="parse_news",follow=True),
)

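This statement sets the rules for the whole crawl. LinkExtractor extracts all the links from the response, and allow specifies which of them should be crawled recursively. Here a regular expression constrains the news URLs. Recall that a normal URL looks like http://tech.163.com/14/0813/10/A3H72TD4000915BF.html . The regex /14/08\d+/\d+/* in the code roughly means: crawl URLs that contain /14/08 followed by digits, a slash, and more digits. The trailing /* matches zero or more slashes, and because the pattern is not anchored, whatever follows in the URL is accepted. In other words, it selects news from August 2014. This regex lets us filter the recursive crawl nicely. follow=True defines whether to keep crawling from the pages reached this way.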
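
The filtering behaviour of this regex can be checked outside Scrapy with the standard re module. This is only a sketch: the helper is_wanted is invented for the illustration, and the second and third URLs are made-up examples, not real article addresses. It uses re.search because the allow pattern may match anywhere in the URL.

```python
import re

# The same pattern that is passed to LinkExtractor(allow=...).
pattern = re.compile(r"/14/08\d+/\d+/*")


def is_wanted(url):
    """Return True if the pattern matches anywhere in the URL."""
    return pattern.search(url) is not None


urls = [
    "http://tech.163.com/14/0813/10/A3H72TD4000915BF.html",   # August 2014
    "http://tech.163.com/14/0712/09/EXAMPLE0000000000.html",  # hypothetical, July 2014
    "http://tech.163.com/special/",                           # hypothetical, not an article
]

wanted = [u for u in urls if is_wanted(u)]
print(wanted)  # only the first URL remains
```

Changing 08 in the pattern to another month, or to \d\d, is enough to widen or shift the crawl.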

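Defining the item pipeline

Scrapy uses pipelines to process every item it scrapes. Here we store the crawled data in MongoDB, in a new database named NewsDB. First, in the tech163 directory, we create a separate file, store.py, to configure database storage. This requires MongoDB itself as well as the pymongo package.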

# encoding: utf-8
import pymongo

# Address of the local MongoDB server.
HOST = "127.0.0.1"
PORT = 27017
client = pymongo.MongoClient(HOST, PORT)
# MongoDB creates the NewsDB database on first write.
NewsDB = client.NewsDB

Then we write the pipeline file like this:

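# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from tech163.store import NewsDB


class Tech163Pipeline(object):
    def process_item(self, item, spider):
        # Only store items produced by the "news" spider.
        if spider.name != "news":
            return item
        # Without a thread id there is nothing to key the document on.
        if item.get("news_thread", None) is None:
            return item
        # Upsert on news_thread so a re-crawled article overwrites its
        # stored document instead of creating a duplicate.
        spec = {"news_thread": item["news_thread"]}
        NewsDB.news.update_one(spec, {'$set': dict(item)}, upsert=True)
        # Return the item so that any later pipeline still receives it.
        return item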
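
The pipeline above keys every document on news_thread and writes with upsert=True, so crawling the same article twice updates one document rather than storing two. The sketch below imitates that behaviour with a plain dictionary so it can run without MongoDB. The names collection and upsert are invented for this illustration and are not pymongo APIs.

```python
# In-memory stand-in for the MongoDB collection (illustration only).
collection = {}


def upsert(item):
    """Insert the item, or update the stored fields if its thread already exists."""
    key = item.get("news_thread")
    if key is None:
        return False  # nothing to key on, so the item is not stored
    collection.setdefault(key, {}).update(item)
    return True


upsert({"news_thread": "A3H72TD4000915BF", "news_title": "first crawl"})
upsert({"news_thread": "A3H72TD4000915BF", "news_title": "second crawl"})

print(len(collection))                                 # 1
print(collection["A3H72TD4000915BF"]["news_title"])    # second crawl
```

This is why the thread extracted from the URL matters: it is what lets the crawler be rerun without piling up duplicate records.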

Finally, we make a few settings in settings.py, mainly to register the pipeline:

# -*- coding: utf-8 -*-
# Scrapy settings for tech163 project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'tech163'
SPIDER_MODULES = ['tech163.spiders']
NEWSPIDER_MODULE = 'tech163.spiders'
# Maps each pipeline to an order value (0-1000); lower values run first.
ITEM_PIPELINES = {
    'tech163.pipelines.Tech163Pipeline': 300,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tech163 (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'
DOWNLOAD_TIMEOUT = 15
# DOWNLOAD_DELAY = 0.1
# LOG_LEVEL = "INFO"
# LOG_STDOUT = True
# LOG_FILE = "log/newsSpider.log"

That completes the project configuration. We can now type the following on the command line:

scrapy crawl news

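to start crawling the news. The full source code of the project is on GitHub for reference. After finishing this, you can move on to image crawling, distributed processing, caching, and so on.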