I've recently been learning to build crawlers with Scrapy, but the framework itself cannot render the JavaScript parts of a page, so I need to add a middleware for that, which I did following a reference.
I first created a middlewares directory alongside settings.py and put the source file there, but running the spider reported that the module could not be found. I then moved the source file into the same directory as settings.py, and now it fails with the following error:
2014-06-26 00:30:43+0800 [scrapy] INFO: Scrapy 0.20.0 started (bot: shiyifang)
2014-06-26 00:30:43+0800 [scrapy] DEBUG: Optional features available: ssl, http11
2014-06-26 00:30:43+0800 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'shiyifang.spiders', 'SPIDER_MODULES': ['shiyifang.spiders'], 'BOT_NAME': 'shiyifang'}
2014-06-26 00:30:43+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 50, in run
    self.crawler_process.start()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 92, in start
    if self.start_crawling():
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 124, in start_crawling
    return self._start_crawler() is not None
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 139, in _start_crawler
    crawler.configure()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 47, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 63, in __init__
    self.downloader = Downloader(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 77, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 50, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 35, in from_settings
    mw = mwcls()
TypeError: 'module' object is not callable
Any advice would be much appreciated!
If you write your own middleware, putting it in the same directory as items.py and settings.py is enough. After writing it, you also need to register it in settings.py under DOWNLOADER_MIDDLEWARES (or SPIDER_MIDDLEWARES for a spider middleware). Note that a custom middleware must implement the methods Scrapy defines for that hook; for details, see the official documentation:
Writing your own downloader middleware
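Incidentally, the `TypeError: 'module' object is not callable` in the traceback above typically means the path registered in DOWNLOADER_MIDDLEWARES names the module rather than the class inside it, so Scrapy tries to call the module object. A minimal sketch (the file name `myjs.py` and class name `JavaScriptMiddleware` are hypothetical, not from the original post):

```python
# shiyifang/myjs.py -- a minimal downloader middleware skeleton
# (file and class names are examples, not the poster's actual code)

class JavaScriptMiddleware(object):
    """Implements the hook methods Scrapy calls on a downloader middleware."""

    def process_request(self, request, spider):
        # Return None to let Scrapy continue processing this request
        # normally; returning a Response here would short-circuit the
        # download (e.g. with JS-rendered HTML).
        return None

    def process_response(self, request, response, spider):
        # Must return a Response object (or a new Request).
        return response


# settings.py -- the dotted path must end with the CLASS, not the module:
#
# DOWNLOADER_MIDDLEWARES = {
#     'shiyifang.myjs.JavaScriptMiddleware': 543,   # correct
#     # 'shiyifang.myjs': 543,  # wrong -> TypeError: 'module' object
#     #                         # is not callable (as in the traceback)
# }
```

With the class-level path in place, Scrapy's middleware manager can instantiate the middleware instead of trying to call a module.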
One more note:
To have Scrapy render JavaScript you can use scrapy-splash; that way you don't need to write a middleware yourself, just add the settings to settings.py.
For more details, see the official documentation:
scrapy-splash
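For reference, the settings.py additions for scrapy-splash look roughly like this (a sketch following the scrapy-splash README; the Splash URL assumes a local Splash instance, e.g. run via Docker):

```python
# settings.py additions for scrapy-splash (configuration sketch)

# Address of the running Splash rendering service (assumed local instance)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Use a dupefilter that understands Splash request arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

In the spider you would then yield `SplashRequest` objects (`from scrapy_splash import SplashRequest`) instead of plain `Request`s to have pages rendered by Splash before parsing.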