2018/10/20 11:32:55, Author: Huang Bing
Python crawler: throttling downloads
If we crawl a website too quickly, we risk being banned or overloading the server. To reduce these risks, we can add a delay between consecutive downloads and thereby slow the crawler down.
import datetime
import time
from urllib.parse import urlparse

class Throttle:
    """Add a delay between downloads to the same domain."""

    def __init__(self, delay):
        # amount of delay (in seconds) between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                # domain has been accessed recently,
                # so we need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()
Usage (assuming the class is saved as throttle.py):
import throttle

throttle = throttle.Throttle(5)
throttle.wait('https://pdf-lib.org')
Here, the argument to throttle.Throttle(5) is the delay in seconds, and the argument to throttle.wait('https://pdf-lib.org') is the URL being crawled.
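To see the delay in action, here is a self-contained sketch that times two consecutive wait() calls to the same domain (the class body repeats the one above so the snippet runs standalone; the 2-second delay and test URL are arbitrary choices for the demo):

```python
import datetime
import time
from urllib.parse import urlparse

class Throttle:
    """Add a delay between downloads to the same domain."""

    def __init__(self, delay):
        self.delay = delay      # delay in seconds per domain
        self.domains = {}       # domain -> last access time

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.datetime.now()

throttle = Throttle(2)
start = time.time()
throttle.wait('https://pdf-lib.org')  # first access: returns immediately
throttle.wait('https://pdf-lib.org')  # second access: sleeps ~2 seconds
elapsed = time.time() - start
print(elapsed >= 2)
```

Note that wait() only sleeps; the actual download (e.g. with urllib.request or requests) happens right after the call returns, so each domain is hit at most once per delay interval while different domains are not slowed down by each other.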