
Setting up distributed crawling with pyspider

I wrote a crawler and want to use pyspider to distribute the crawl across two machines, 1 and 2. However, a full crawl round takes almost exactly as long as on a single machine, about 5 minutes 16 seconds in both cases. What should I change so that the distributed setup actually beats the single machine?

Machine layout:
Machine 1: 1 scheduler, 1 fetcher, 1 processor, 1 result_worker, 1 webui
Machine 2: 1 fetcher, 1 processor, 1 result_worker
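
For machine 2's workers to actually share the load, every component on both machines must point at the same message queue and databases; otherwise the second fetcher/processor/result_worker just sits on its own empty in-process queues. A sketch of such wiring, assuming a Redis queue and MySQL databases on hosts I've named queue-host and db-host (hypothetical; substitute your own URLs and credentials):

```shell
# shared config.json, identical on both machines
cat > config.json <<'EOF'
{
  "taskdb": "mysql+taskdb://user:pass@db-host:3306/taskdb",
  "projectdb": "mysql+projectdb://user:pass@db-host:3306/projectdb",
  "resultdb": "mysql+resultdb://user:pass@db-host:3306/resultdb",
  "message_queue": "redis://queue-host:6379/0"
}
EOF

# machine 1: full stack
pyspider -c config.json scheduler &
pyspider -c config.json fetcher &
pyspider -c config.json processor &
pyspider -c config.json result_worker &
pyspider -c config.json webui &

# machine 2: extra workers only; they attach to the same Redis
# queues, so tasks are shared rather than duplicated
pyspider -c config.json fetcher &
pyspider -c config.json processor &
pyspider -c config.json result_worker &
```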

Crawler characteristics:
rate/burst is 20.0/3.0 (I also tried 100.0/3.0; the result was almost identical);
on_start is decorated with @every(seconds=60*60) and issues roughly 75 requests, each with callback 1 as its callback;
callback 1 is @config(age=1) and issues one request with callback 2 as its callback;
callback 2 is @config(age=1) and writes to a file once;

Crawler code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-08-14 11:04:00
# Project: pyspider

import codecs
import random
from pyspider.libs.base_handler import *
from projects.produtils import *
from time import gmtime, strftime

class Handler(DistTestHandler):
    crawl_config = {
    }

    def __init__(self):
        # note: instance attributes live per-process; with a processor on
        # each machine, every processor keeps its own independent counter
        self.round = 0

    @every(seconds=60*60)
    def on_start(self):
        self.round = self.round+1
        cities = ['110106','120116','130108','131003','130204','130304','130602','130703','130402','130804','130503','440305','440111','441900','440608']
        cats = ['fruit', 'food', 'drink', 'snack', 'milk']
        for city in cities:
            for cat in cats:
                cat_url = 'http://www.company.com/product/category/jsd-hb-{cat}?platform=android&access_token=&android_channel_value=wandoujia&version=3.1.0&prodcrgen={prodcrgen}'.format(cat=cat, prodcrgen=random.randrange(1,10000))
                self.crawl(cat_url, save={'round': self.round, 'cat': cat, 'region': city}, callback=self.cat_start)

    @config(age=1)
    def cat_start(self, response):
        jsonresp = response.json

        products = jsonresp['products']
        for product in products:
            if 'code' in product:
                print('has code, skipped')
            else:
                prod_url = 'http://www.company.com/product/{sku}?platform=android&access_token=&android_channel_value=wandoujia&version=3.1.0&prodcrgen={prodcrgen}'.format(sku=product['sku'], prodcrgen=random.randrange(1,10000))
                self.crawl(prod_url, save={'round': response.save['round'], 'cat': response.save['cat'], 'region': response.save['region']}, callback=self.prod_start)

    @config(age=1)
    def prod_start(self, response):
        jsonresp = response.json
        s = u'{sku}{_separator}{name}{_separator}{cat}{_separator}{region}{_separator}{stock}{_separator}{price}{_separator}{round}{_separator}{create_time}'\
            .format(sku=jsonresp['sku'], name=jsonresp['name'], cat=response.save['cat'], region=response.save['region'], stock=jsonresp['stock'], \
                    price=jsonresp['price'], round=str(response.save['round']), create_time=strftime("%Y-%m-%d %H:%M:%S", gmtime()), _separator=' | ')

        with codecs.open('/home/ubuntu/pyspider/disttest1/history', 'a', 'utf-8') as f:
            f.write(s + '\n')

Answer:

  1. Increase burst to at least rate. When burst is too small, the token bucket holds so little that one scheduler loop can only hand out 3 request slots, no matter how high rate is set.

  2. Watch the queue states on the dashboard, and add more instances of the module downstream of whichever queue is backing up.
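
Point 1 can be seen with a minimal token-bucket sketch (my own simplified model of a rate/burst limiter, not pyspider's actual implementation): the bucket refills at `rate` tokens per second but never holds more than `burst`, so a scheduler pass that follows close on the previous one can hand out at most `burst` tasks regardless of `rate`.

```python
import time

class TokenBucket:
    """Simplified token bucket: refills at `rate` tokens/s, capped at `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst      # start full
        self.last = time.time()

    def _refill(self):
        now = time.time()
        # add rate * elapsed tokens, but never exceed burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def take(self, n=1):
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# rate=20.0, burst=3.0: back-to-back scheduler passes can only
# dispatch ~3 tasks each, no matter how high rate is raised
bucket = TokenBucket(rate=20.0, burst=3.0)
dispatched = 0
while bucket.take():
    dispatched += 1
print(dispatched)  # burst, not rate, caps the batch (typically 3)
```

Raising rate to 100.0 while leaving burst at 3.0 changes nothing here, which matches the observation in the question that 100.0/3.0 performed the same as 20.0/3.0.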
