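Python's requests library is probably the most widely used and most convenient HTTP library in Python. Compared with older libraries such as urllib and httplib, requests offers a friendlier API and is easier to pick up.
show me the code
Making an HTTP request with requests is very simple: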
import requests
url = "https://www.baidu.com"
resp = requests.get(url)
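Alongside its high-level API, requests also provides some useful request hooks, which help us handle problems that come up during a request.
request hooks
When calling a third-party API, we usually need to check whether the response is valid, for example whether a 4xx or 5xx error occurred. We can do it like this: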
import requests
url = "https://www.baidu.com"
resp = requests.get(url)
if resp.status_code >= 400:
    # handle the error here, e.g. log it or raise
    print(f"request failed with status {resp.status_code}")
We can also use the raise_for_status method that requests provides:
import requests
url = "https://www.baidu.com"
resp = requests.get(url)
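resp.raise_for_status()
In the example above, raise_for_status() raises an exception when the status code is 4xx or 5xx. Compared with the first example, raise_for_status() is more elegant, but it brings another problem: we do not want to repeat resp.raise_for_status() after every single request. Fortunately, requests provides a hook interface that solves this once and for all: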
import requests
http = requests.Session()
assert_status_hook = lambda response, *args, **kwargs: response.raise_for_status()
http.hooks["response"] = [assert_status_hook]
url = "https://www.baidu.com"
http.get(url)
This keeps the code much tidier.
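To see what such a hook does without touching the network, the sketch below calls it on hand-built Response objects. The status codes, the example.com URL, and the describe helper are made up for illustration; in real code the session invokes the hook for you and the caller only needs the try/except.

```python
import requests


def assert_status_hook(response, *args, **kwargs):
    """Response hook: raise requests.HTTPError on 4xx/5xx status codes."""
    response.raise_for_status()


def describe(response):
    """Run the hook on a response and report whether it raised (hypothetical helper)."""
    try:
        assert_status_hook(response)
    except requests.HTTPError as exc:
        # The failing response is attached to the exception.
        return f"failed: {exc.response.status_code}"
    return "ok"


# Hand-built responses with illustrative values; no request is sent.
not_found = requests.Response()
not_found.status_code = 404
not_found.url = "https://api.example.com/api/v1/missing"

success = requests.Response()
success.status_code = 200

print(describe(not_found))
print(describe(success))
```

With the hook installed on a session, the same except clause can wrap any http.get(...) call.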
Setting a base_url for requests
Suppose we are calling an API service such as api.example.com. Every call has to spell out the domain plus the URI, for example:
requests.get("https://api.example.com/api/v1/user")
requests.get("https://api.example.com/api/v1/goods")
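If you don't want to write that every time, you can use BaseUrlSession:
from requests_toolbelt import sessions
# Keep the trailing slash on base_url and pass relative paths without a leading slash:
# under standard URL-joining rules, "/user" would drop the /api/v1 prefix.
http = sessions.BaseUrlSession(base_url="https://api.example.com/api/v1/")
http.get("user")
http.get("goods")
Note: requests_toolbelt is not part of requests, so it has to be installed separately with pip (the package name is requests-toolbelt).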
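Slashes matter when a base URL is combined with a path. As far as I know, BaseUrlSession resolves paths with the standard library's urljoin, so the sketch below uses urljoin directly to show the three cases. The resolve helper is a hypothetical name used only here.

```python
from urllib.parse import urljoin


def resolve(base_url, path):
    """Join a base URL and a path the way urljoin does (hypothetical helper)."""
    return urljoin(base_url, path)


# Trailing slash on the base, relative path: the prefix is kept.
print(resolve("https://api.example.com/api/v1/", "user"))

# Leading slash on the path: it is resolved from the host root.
print(resolve("https://api.example.com/api/v1/", "/user"))

# No trailing slash on the base: the last segment ("v1") is replaced.
print(resolve("https://api.example.com/api/v1", "user"))
```

Only the first combination yields https://api.example.com/api/v1/user.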
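Setting timeouts
The official requests documentation recommends setting a timeout in all production code that uses requests. Without one, if the remote server stalls, our program blocks too and waits for a response indefinitely. That can hang the whole system and seriously hurt availability.
Adding a timeout in requests is simple: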
requests.get('https://github.com/', timeout=0.001)
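The timeout argument also accepts a (connect, read) tuple, and a request that exceeds either limit raises a subclass of requests.exceptions.Timeout. The sketch below shows one way to handle that. The fetch_or_none function, its get parameter, the always_slow stand-in, and the 3.05/10 values are all assumptions made so the example can run offline.

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10), get=requests.get):
    """Return the response, or None if the request timed out.

    timeout is (connect seconds, read seconds). The get parameter is a
    hypothetical injection point so the function can be tried offline.
    """
    try:
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None


def always_slow(url, timeout):
    """Stand-in for requests.get that simulates a server that never answers."""
    raise requests.exceptions.ReadTimeout(f"no data from {url} within {timeout}")


print(fetch_or_none("https://api.example.com/slow", get=always_slow))
```

In real use the get argument is simply left at its default.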
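But this leaves a pitfall: what if a new teammate forgets to set the timeout somewhere?
Transport Adapters
Fortunately, requests provides Transport Adapters, which let us apply a default timeout to every request made through a session: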
from requests.adapters import HTTPAdapter
DEFAULT_TIMEOUT = 5  # seconds
class TimeoutHTTPAdapter(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        # Take an optional timeout argument and remove it before calling HTTPAdapter
        self.timeout = DEFAULT_TIMEOUT
        if "timeout" in kwargs:
            self.timeout = kwargs["timeout"]
            del kwargs["timeout"]
        super().__init__(*args, **kwargs)
    def send(self, request, **kwargs):
        # Fall back to the adapter's timeout when the caller did not pass one
        timeout = kwargs.get("timeout")
        if timeout is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)
We can then use it like this:
import requests
http = requests.Session()
# Mount the adapter for both http and https requests
adapter = TimeoutHTTPAdapter(timeout=2.5)
http.mount("https://", adapter)
http.mount("http://", adapter)
# Uses the adapter's default 2.5 s timeout
response = http.get("https://www.baidu.com/")
# Overrides it with a 10 s timeout for this request
response = http.get("https://www.baidu.com/", timeout=10)
Problem solved!
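Retries
When writing a web crawler, individual requests often fail because the network is unreliable, so we need a retry mechanism. We can attach a retry strategy to every request by configuring an HTTPAdapter:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504],
    # allowed_methods was named method_whitelist before urllib3 1.26
    allowed_methods=["HEAD", "GET", "OPTIONS"],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)
response = http.get("https://www.baidu.com")
Retry takes three arguments here:
1. total
total is the maximum number of retries. If the request still fails after that many retries, urllib3 raises a ```urllib3.exceptions.MaxRetryError```, which requests re-raises as one of its own exceptions (for example requests.exceptions.RetryError when the retries were triggered by status codes).
2. status_forcelist
status_forcelist lists the status codes that trigger a retry. Responses with any other status code are returned without retrying.
3. allowed_methods
allowed_methods (named method_whitelist in urllib3 releases before 1.26) lists the HTTP methods that may be retried. Other methods, such as POST, bypass the retry strategy. This is because POST is not idempotent, so blindly retrying it could produce unpredictable results.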
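How the three arguments interact can be sketched as a single decision function. This is a simplified model for status-code retries only, not urllib3's actual code; should_retry and retries_used are hypothetical names, and connection errors and Retry-After handling are left out.

```python
def should_retry(method, status, retries_used, *, total=3,
                 status_forcelist=(429, 500, 502, 503, 504),
                 allowed_methods=("HEAD", "GET", "OPTIONS")):
    """Simplified model of the retry decision for a received response."""
    if retries_used >= total:
        return False  # retry budget exhausted
    if method.upper() not in allowed_methods:
        return False  # e.g. POST is not retried
    return status in status_forcelist  # only the listed codes trigger a retry


print(should_retry("GET", 503, retries_used=0))   # listed status, allowed method
print(should_retry("POST", 503, retries_used=0))  # method not allowed
print(should_retry("GET", 404, retries_used=0))   # status not listed
print(should_retry("GET", 503, retries_used=3))   # budget used up
```

Only the first call returns True; each of the others fails one of the three checks.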
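Another argument is backoff_factor, which controls how long to wait before each retry. The delay grows exponentially with the retry number:
```
{backoff factor} * (2 ** ({number of total retries} - 1))

# Delays in seconds given by this formula for retries 1 through 10:
backoff_factor=1:  1, 2, 4, 8, 16, 32, 64, 128, 256, 512
backoff_factor=2:  2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
backoff_factor=10: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120
```
In practice urllib3 does not sleep before the first retry and caps each delay at a maximum (120 seconds by default), so the largest values above are never actually reached.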
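The backoff formula is easy to evaluate by hand. The sketch below computes it literally for retries 1 through n, with an optional cap. backoff_delays and its backoff_max parameter are hypothetical names; as far as I know urllib3 applies a similar cap of 120 seconds by default and skips the sleep before the first retry, which this sketch does not model.

```python
def backoff_delays(backoff_factor, retries, backoff_max=None):
    """Evaluate backoff_factor * 2 ** (n - 1) for n = 1..retries.

    If backoff_max is given, each delay is capped at that value.
    """
    delays = []
    for n in range(1, retries + 1):
        delay = backoff_factor * 2 ** (n - 1)
        if backoff_max is not None:
            delay = min(backoff_max, delay)
        delays.append(delay)
    return delays


print(backoff_delays(1, 5))                    # [1, 2, 4, 8, 16]
print(backoff_delays(10, 5, backoff_max=120))  # [10, 20, 40, 80, 120]
```

A small backoff_factor such as 0.5 or 1 keeps the total wait short while still spacing out the retries.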
Putting it together
Combining the retry adapter with the timeout adapter is also simple:
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
http.mount("https://", TimeoutHTTPAdapter(max_retries=retries))
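Debugging HTTP requests
We can print more debugging information by setting debuglevel on http.client.HTTPConnection, for example:
import requests
import http.client
http.client.HTTPConnection.debuglevel = 1
requests.get("https://www.baidu.com/")
This produces output like the following: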
send: b'GET / HTTP/1.1\r\nHost: www.baidu.com\r\nUser-Agent: python-requests/2.25.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
header: Connection: keep-alive
header: Content-Encoding: gzip
header: Content-Type: text/html
header: Date: Wed, 13 Jan 2021 09:26:44 GMT
header: Last-Modified: Mon, 23 Jan 2017 13:23:46 GMT
header: Pragma: no-cache
header: Server: bfe/1.0.8.18
header: Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
header: Transfer-Encoding: chunked
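Because debuglevel is a class attribute, setting it affects every connection in the process until it is reset. One possible way to limit the noise is a small context manager that restores the previous value afterwards. http_debug is a hypothetical helper name, not part of requests or the standard library.

```python
import contextlib
import http.client


@contextlib.contextmanager
def http_debug(level=1):
    """Temporarily set HTTPConnection.debuglevel, then restore the old value."""
    previous = http.client.HTTPConnection.debuglevel
    http.client.HTTPConnection.debuglevel = level
    try:
        yield
    finally:
        http.client.HTTPConnection.debuglevel = previous


with http_debug():
    # Any request made inside this block would print its wire-level trace.
    print(http.client.HTTPConnection.debuglevel)

# Outside the block the previous value is back in place.
print(http.client.HTTPConnection.debuglevel)
```

Wrapping only the suspicious call in the with block keeps the rest of the program's output clean.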
If you also want to print the full request and response, including parameters and bodies, you can do this:
```
import requests
from requests_toolbelt.utils import dump

def logging_hook(response, *args, **kwargs):
    data = dump.dump_all(response)
    print(data.decode('utf-8'))

http = requests.Session()
http.hooks["response"] = [logging_hook]
http.get("https://api.openaq.org/v1/cities", params={"country": "BA"})
```
This produces the following output:
< GET /v1/cities?country=BA HTTP/1.1
< Host: api.openaq.org
< User-Agent: python-requests/2.25.0
< Accept-Encoding: gzip, deflate
< Accept: */*
< Connection: keep-alive
<
> HTTP/1.1 200 OK
> Content-Type: application/json; charset=utf-8
> Transfer-Encoding: chunked
> Connection: keep-alive
> access-control-allow-credentials: true
> access-control-allow-headers: Authorization, Content-Type, If-None-Match
> access-control-allow-methods: GET, HEAD, POST, PUT, PATCH, DELETE, OPTIONS
> access-control-allow-origin: *
> access-control-expose-headers: WWW-Authenticate, Server-Authorization
> access-control-max-age: 86400
> Cache-Control: no-cache
> Content-Encoding: gzip
> Date: Wed, 13 Jan 2021 09:27:53 GMT
> Vary: origin,accept-encoding
> X-Cache: Miss from cloudfront
> Via: 1.1 65866bb6c20ad09669a6cfc294087ec0.cloudfront.net (CloudFront)
> X-Amz-Cf-Pop: NRT57-C2
> X-Amz-Cf-Id: pSZOGLVzQyAh_muhd6MFg7YXzjZhwsc6HOc0cQsgmxwF6hdCX3usSA==
>
{"meta":{"name":"openaq-api","license":"CC BY 4.0","website":"https://docs.openaq.org/","page":1,"limit":100,"found":10},"results":[{"country":"BA","name":"Goražde","city":"Goražde","count":70797,"locations":1},{"country":"BA","name":"Ilijaš","city":"Ilijaš","count":2912,"locations":1},{"country":"BA","name":"Jajce","city":"Jajce","count":62562,"locations":1},{"country":"BA","name":"Kakanj","city":"Kakanj","count":5637,"locations":1},{"country":"BA","name":"Lukavac","city":"Lukavac","count":149534,"locations":1},{"country":"BA","name":"N/A","city":"N/A","count":17428,"locations":1},{"country":"BA","name":"Sarajevo","city":"Sarajevo","count":493627,"locations":8},{"country":"BA","name":"Tuzla","city":"Tuzla","count":413909,"locations":3},{"country":"BA","name":"Zenica","city":"Zenica","count":233517,"locations":4},{"country":"BA","name":"Živinice","city":"Živinice","count":136137,"locations":1}]}
For more usage, see: https://toolbelt.readthedocs.io/en/latest/dumputils.html
Testing
Using the responses library together with requests, we can easily mock the response to an HTTP request, for example:
import unittest
import requests
import responses
class TestAPI(unittest.TestCase):
    @responses.activate  # intercept HTTP calls within this method
    def test_simple(self):
        response_data = {
            "id": "ch_1GH8so2eZvKYlo2CSMeAfRqt",
            "object": "charge",
            "customer": {"id": "cu_1GGwoc2eZvKYlo2CL2m31GRn", "object": "customer"},
        }
        # mock the Stripe API
        responses.add(
            responses.GET,
            "https://api.stripe.com/v1/charges",
            json=response_data,
        )
        response = requests.get("https://api.stripe.com/v1/charges")
        self.assertEqual(response.json(), response_data)
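If a request does not match any registered mock, a ConnectionError is raised, for example:
import unittest
import requests
import responses
class TestAPI(unittest.TestCase):
    @responses.activate
    def test_simple(self):
        responses.add(responses.GET, "https://api.stripe.com/v1/charges")
        # This URL was never registered, so the call below raises ConnectionError
        response = requests.get("https://invalid-request.com")
The following exception is raised: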
requests.exceptions.ConnectionError: Connection refused by Responses - the call doesn't match any registered mock.
Request:
- GET https://invalid-request.com/
Available matches:
- GET https://api.stripe.com/v1/charges
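Changing the User-Agent
Servers often use the User-Agent (UA) of a request to decide whether the client is a crawler, so we can change the UA sent with our requests like this: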
import requests
http = requests.Session()
http.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
})
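To check which User-Agent a session would actually send, we can prepare a request without sending it and inspect the merged headers. The sketch below also shows that a header passed for a single request takes precedence over the session default. The "my-crawler/1.0" value is a made-up example.

```python
import requests

FIREFOX_UA = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"

http = requests.Session()
http.headers.update({"User-Agent": FIREFOX_UA})

# Build the request without sending it, then inspect the merged headers.
request = requests.Request("GET", "https://www.baidu.com/")
prepared = http.prepare_request(request)
print(prepared.headers["User-Agent"])

# A per-request header overrides the session-level default.
override = requests.Request(
    "GET",
    "https://www.baidu.com/",
    headers={"User-Agent": "my-crawler/1.0"},  # illustrative value
)
prepared_override = http.prepare_request(override)
print(prepared_override.headers["User-Agent"])
```

The same override works on a live call by passing headers=... to http.get.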