2018/10/20 11:32:55, Author: Huang Bing

Python crawler: throttling download speed

If we crawl a website too quickly, we risk being banned or overloading the server. To reduce these risks, we can add a delay between consecutive downloads, slowing the crawler down.

from urllib.parse import urlparse
import datetime
import time


class Throttle:
    """Add a delay between downloads to the same domain"""

    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
        sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                # domain was accessed recently,
                # so wait before downloading again
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()

Usage:

import throttle

throttle = throttle.Throttle(5)

throttle.wait('https://pdf-lib.org')

Here throttle.Throttle(5) sets the delay between downloads to the same domain, in seconds, and throttle.wait('https://pdf-lib.org') is called with the URL about to be fetched. This assumes the class above is saved in a file named throttle.py.
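As a sketch of the behavior, the self-contained example below (the Throttle class inlined, with a hypothetical 1-second delay) times three consecutive wait calls: the first access to a domain returns immediately, a repeat access to the same domain sleeps until the delay has elapsed, and an access to a different domain is again immediate.

```python
import datetime
import time
from urllib.parse import urlparse


class Throttle:
    """Add a delay between downloads to the same domain."""

    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                # domain was accessed recently, so wait
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()


throttle = Throttle(1)
start = time.time()
throttle.wait('https://pdf-lib.org')   # first access: no sleep
throttle.wait('https://pdf-lib.org')   # same domain: sleeps ~1 second
throttle.wait('https://example.com')   # different domain: no sleep
elapsed = time.time() - start
```

Note that the delay is tracked per domain (the netloc part of the URL), so interleaving requests to several sites does not slow each other down.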
