
How to add middlewares to a Scrapy project

I've recently been learning to build crawlers with Scrapy, but the framework can't render the JavaScript parts of a page by itself, so I need to add my own page-fetching middleware. I added one following the reference here.

I first created a middlewares directory at the same level as settings.py and put the source file there, but running the crawler reported that the module could not be found. I then moved the source file into the same directory as settings.py itself, and now running it produces the following error:

2014-06-26 00:30:43+0800 [scrapy] INFO: Scrapy 0.20.0 started (bot: shiyifang)
2014-06-26 00:30:43+0800 [scrapy] DEBUG: Optional features available: ssl, http11
2014-06-26 00:30:43+0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'shiyifang.spiders', 'SPIDER_MODULES': ['shiyifang.spiders'], 'BOT_NAME': 'shiyifang'}
2014-06-26 00:30:43+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 50, in run
    self.crawler_process.start()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 92, in start
    if self.start_crawling():
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 124, in start_crawling
    return self._start_crawler() is not None
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 139, in _start_crawler
    crawler.configure()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 47, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 63, in __init__
    self.downloader = Downloader(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 77, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 50, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 35, in from_settings
    mw = mwcls()
TypeError: 'module' object is not callable

Any advice would be much appreciated!


If you're writing your own middleware, putting it in the same directory as items.py and settings.py is fine. After writing it, you also need to register it in settings.py under DOWNLOADER_MIDDLEWARES (or SPIDER_MIDDLEWARES for a spider middleware). Note that the dictionary key must be the dotted path to the middleware class, not just the module: your traceback fails at mw = mwcls() with TypeError: 'module' object is not callable, which means Scrapy resolved the path to a module where it expected a class. Also note that a custom middleware has to implement the methods Scrapy defines for middlewares; for the details, see the official documentation:
Writing your own downloader middleware
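
As a concrete illustration, here is a minimal sketch. The file name middlewares.py and the class name JsRenderMiddleware are made up for the example (use whatever names your project actually has); the project name shiyifang is taken from the log above.

# shiyifang/middlewares.py -- minimal downloader middleware skeleton.
class JsRenderMiddleware(object):

    def process_request(self, request, spider):
        # Returning None tells Scrapy to keep processing the request
        # normally; returning a Response here would short-circuit the
        # download and hand that response straight to the spider.
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (or a Request to be rescheduled).
        return response

# settings.py -- the key is the dotted path to the CLASS. Pointing it
# at the module ('shiyifang.middlewares') is exactly what produces the
# "TypeError: 'module' object is not callable" shown in the question.
DOWNLOADER_MIDDLEWARES = {
    'shiyifang.middlewares.JsRenderMiddleware': 543,
}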
One more thing: for rendering JS with Scrapy you can use scrapy-splash, in which case you don't need to write a middleware yourself at all; you only add a few settings to settings.py. For more details, see the official documentation:
scrapy-splash
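
As a sketch of what that wiring looks like, the settings below follow the scrapy-splash README and assume a Splash instance is running at localhost:8050; check the README for the exact values matching your versions.

# settings.py -- typical scrapy-splash wiring, per its README.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# In the spider, issue requests through SplashRequest so Splash renders
# the page (including its JavaScript) before parse() is called.
import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        yield SplashRequest('http://example.com', self.parse,
                            args={'wait': 0.5})

    def parse(self, response):
        self.log(response.url)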
