2018/10/20 11:32:55, Author: 黄兵
Python crawler: throttling downloads
If a crawler hits a website too quickly, it risks being banned or overloading the server. To reduce these risks, we add a delay between consecutive downloads to the same domain, slowing the crawler down.
from urllib.parse import urlparse
import datetime
import time

class Throttle:
    """Add a delay between downloads to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            # total_seconds() keeps sub-second precision
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                # domain was accessed recently, so sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()

Usage:
import throttle

throttle = throttle.Throttle(5)
throttle.wait('https://pdf-lib.org')

Here Throttle(5) sets the delay in seconds, and throttle.wait('https://pdf-lib.org') is called with the URL about to be downloaded.
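To see the pacing in action, here is a minimal, self-contained sketch (it repeats the Throttle class from above so it runs on its own; the URLs are placeholders, and wait() only sleeps, so no network request is made). Two back-to-back calls for the same domain are spaced about one second apart, while a different domain is not delayed:

```python
import datetime
import time
from urllib.parse import urlparse

class Throttle:
    """Add a delay between downloads to the same domain (as defined above)."""
    def __init__(self, delay):
        self.delay = delay
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.datetime.now()

throttle = Throttle(1)

start = time.monotonic()
throttle.wait('https://pdf-lib.org/page1')   # first visit to this domain: no delay
throttle.wait('https://pdf-lib.org/page2')   # same domain again: sleeps ~1 second
same_domain_elapsed = time.monotonic() - start

start = time.monotonic()
throttle.wait('https://example.com/')        # a new domain: no delay at all
other_domain_elapsed = time.monotonic() - start

print(same_domain_elapsed, other_domain_elapsed)
```

Because the last-accessed timestamp is stored per domain (keyed by netloc), throttling one site never slows down requests to another.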