I. Token Parsing and Generation
1. Overview
When scraping Meituan restaurant data, many people run into the same situation: an XPath that returns an HTML element's value in the browser Console returns nothing in the scraper itself. That is because Meituan puts a token check in front of its restaurant data. For example, open the Meituan homepage in Chrome, open DevTools (right-click → Inspect), click any food category, and you will see a call to the getPoiList API like the one in the screenshot below:
The screenshot shows that getPoiList is called with a number of parameters, the key one being token. To keep scrapers out, the Meituan backend generates this token in an encoded form, and each token is only valid for a very short time, so correctly reconstructing the token is what makes a Meituan scraper difficult for most people.
2. Parsing the Token
To parse the token you first need to know how it is produced. Meituan's token is not really encrypted at all: it is just binary (zlib) compression followed by Base64 encoding. Decoding it therefore takes two steps: Base64-decode, then zlib-decompress, using Python's base64 and zlib modules. Let's verify this with a real example.
First, define the decoding function:
import base64
import zlib

def decode_token(token):
    # Base64-decode the token string
    token_decode = base64.b64decode(token.encode())
    # zlib-decompress the resulting binary data
    token_string = zlib.decompress(token_decode)
    return token_string
Then take several tokens captured from Meituan requests and run them through decode_token:
if __name__ == '__main__':
    token = [
        'eJxVjstuqzAURf/F06LYxkAgUgeQ0MvzkoQ8QFUHbngnJgGcpKG6/35dqR1UOtLeZ501OJ+gdzMwwwgZCEnglvdgBvAETTQgAT6Ii6qqsqxPsUIMIoHDb6bIhgTe+90CzF4xwkiaqujti6wFeMWGjCSMdIF+uiK6rIj5slwhgYrzyzCDsBwnLK/5lbaTw5lB0YeqhgeMofgECJ1thC7y+J30O/nPHorXhTvUZSta7t2z5sij+2iuqiuMqyT3lDH961/cpPO5/7IZojDYtlraKOfij7JtjiFG8yGyya3cO0TLCiiXZtMG9+xkLi1rSM9r4sEqXch6Qcan5WXbMs9edilVt3ubIXYKrHUXxXSJu8bmL5auGLt8nXgqbntVM6N459ZGjGwSnIp4rGoe1h+Qre5Dn+3plG4e88ZtF0fM/KvR3iKHXuerfSf3FtRPtMvIIXmi2Q2N2chI+95somyc15phQmdlOlH0cGgRBszmflI+P4N//wEWi44a',
        'eJxVjstuozAUht/F26LYBkxDpC4gocN1SEIuoGoWbsw1MQngJC1V372u1C5GOtJ/Od/i/wC9x8AMI2QipIBb3oMZwBM0MYACxCA/hBBVQwgjYmIFHP7vDGIq4LXfLcDsBcusPBL077tZy+IFmypSMJrK6tfr0qu6vG/KkxCohLgMMwjLccLzWlxpOzmcOZR+qGp4wBjKJUDifCNxqccfpT8qfnMkp0t2qMtWuty/s+Yo4vtoraorTKo09/Ux+xtcvLQLRPC8GeIo3LZG1ujn4o++bY4RRvMhdrRbuXc1gxVQLa2mDe/sZC1te8jOa82HVbZQp4U2Piwv25b7zrLLKNnuHY74KbTXXZzQJe4aRzzbU93c5evUJ7jtiWHFyc6rzQQ5WngqkrGqRVS/Qb66Dz3b00e6eZ83Xrs4Yh5czfYWu/Q6X+07tbfh9EQ7ph3SB8puaGQj19rXZhOzcV4bpgXdleXG8btLiyjkjgjS8ukJfH4B4qqN+w==',
        'eJxdjktvozAURv+Lt0WxjYFCpC4gocNzSEIeoGoWbswzMQngJFNG89/HldrNSFf6vnvuWdw/YPAZmGOELIQUcC8GMAd4hmYGUIAY5UXXdZUg3USEmAo4/scsSwHvw34J5m8YYaQ86+jXJ9lI8IYtFSkYmRJ9d012VZPzaflSArUQ13EOYTXNeNGIG+1mxwuHso91A48YQ/kJkDrfSl3m6SvpV4rvPZavS3dsqk62Iniw9iSSx2Sv6xtM66wItCn/GV79rA9F+LodkzjadUbeapfyh7ZrTzFGizFxyb06eMRgJVQru+2iBzvbK8cZ88uGBLDOl6pZkulpdd11PHBXfU713cHliJ8jZ9MnKV3hvnXFq2Nq1r7YZIGOu0E37CTd+42VIpdE5zKd6kbEzW/I149xYAf6TLcfi9bvlifMw5vV3ROP3hbrQ68ODjTPtGfkmD1RdkcTmzjp3tttwqZFY1g29Na2lyQfHi3jiLsizKqXF/D3Hwp7jhM=',
        'eJxdjktvozAURv+Lt0WxjYGESF1AQofnkIQ8QFUXbngndgI4yZTR/PdxpXZT6Urfd889i/sX9F4O5hghEyEF3IsezAGeoIkBFCAGedF1XSUIY4OougKOP5gxVcB7v1+C+StGGClTHb19ko0Er9hUkYLRTKLvrsmuanI+LU9KoBbiOswhrMYJKxpxo3xyvDAo+1A38IgxlJ8AqbOt1GWevpJ+pfjeI/m6dIem4rIV/iNvTyJ+jNa6vsGkTgtfG7PfwdVLu0AEL9shjsIdN7JWu5S/tF17ijBaDLFD7tXBJUZeQrWyWh4+8rO1su0hu2yID+tsqc5KMj6trjvOfGfVZVTfHRyG2Dm0N12c0BXuWke82DPN3Beb1Ncx73XDipO915gJckh4LpOxbkTU/IFs/Rj6/ECndPuxaD2+PGEW3Ex+j116W6wPndrbcHamXU6O6RPN72jMR0b4e7uN83HRGKYF3bXlxvGHS8soZI4I0ur5Gfz7D+r3jgA='
    ]
    for t in token:
        print(decode_token(t))
Decoding them produces the following results:
4b'{"rId":100900,"ver":"1.0.6","ts":1555228714393,"cts":1555228714429,"brVD":[1010,750],"brR":[[1920,1080],[1920,1040],24,24],"bI":["https://gz.meituan.com/meishi/c11/",""],"mT":[],"kT":[],"aT":[],"tT":[],"aM":"","sign":"eJwdjktOwzAQhu/ShXeJ4zYNKpIXqKtKFTsOMLUn6Yj4ofG4UjkM10CsOE3vgWH36df/2gAjnLwdlAPBBsYoR3J/hYD28f3z+PpUnmJEPqYa5UWEm0mlLBRqOSaP1qjEtFB849VeRXJ51nr56AOSVIi9S0E3LlfSzhitMix/mQwsrdWa7aTyCjInDk1mKu9nvOHauCQWq2rB/8laqd3cX+adv0zdzm3nbjTOdzCi69A/HQAHOOyHafMLmEtKXg=="}'
b'{"rId":100900,"ver":"1.0.6","ts":1555230010591,"cts":1555230010659,"brVD":[1010,750],"brR":[[1920,1080],[1920,1040],24,24],"bI":["https://gz.meituan.com/meishi/c11/",""],"mT":[],"kT":[],"aT":[],"tT":[],"aM":"","sign":"eJwdjktOwzAQhu/ShXeJ4zYNKpIXqKtKFTsOMLUn6Yj4ofG4UjkM10CsOE3vgWH36df/2gAjnLwdlAPBBsYoR3J/hYD28f3z+PpUnmJEPqYa5UWEm0mlLBRqOSaP1qjEtFB849VeRXJ51nr56AOSVIi9S0E3LlfSzhitMix/mQwsrdWa7aTyCjInDk1mKu9nvOHauCQWq2rB/8laqd3cX+adv0zdzm3nbjTOdzCi69A/HQAHOOyHafMLmEtKXg=="}'
b'{"rId":100900,"ver":"1.0.6","ts":1555230580338,"cts":1555230580399,"brVD":[1010,750],"brR":[[1920,1080],[1920,1040],24,24],"bI":["https://gz.meituan.com/meishi/c11/",""],"mT":[],"kT":[],"aT":[],"tT":[],"aM":"","sign":"eJwdjktOwzAQhu/ShXeJ4zYNKpIXqKtKFTsOMLUn6Yj4ofG4UjkM10CsOE3vgWH36df/2gAjnLwdlAPBBsYoR3J/hYD28f3z+PpUnmJEPqYa5UWEm0mlLBRqOSaP1qjEtFB849VeRXJ51nr56AOSVIi9S0E3LlfSzhitMix/mQwsrdWa7aTyCjInDk1mKu9nvOHauCQWq2rB/8laqd3cX+adv0zdzm3nbjTOdzCi69A/HQAHOOyHafMLmEtKXg=="}'
b'{"rId":100900,"ver":"1.0.6","ts":1555230116325,"cts":1555230116367,"brVD":[1010,750],"brR":[[1920,1080],[1920,1040],24,24],"bI":["https://gz.meituan.com/meishi/c11/",""],"mT":[],"kT":[],"aT":[],"tT":[],"aM":"","sign":"eJwdjktOwzAQhu/ShXeJ4zYNKpIXqKtKFTsOMLUn6Yj4ofG4UjkM10CsOE3vgWH36df/2gAjnLwdlAPBBsYoR3J/hYD28f3z+PpUnmJEPqYa5UWEm0mlLBRqOSaP1qjEtFB849VeRXJ51nr56AOSVIi9S0E3LlfSzhitMix/mQwsrdWa7aTyCjInDk1mKu9nvOHauCQWq2rB/8laqd3cX+adv0zdzm3nbjTOdzCi69A/HQAHOOyHafMLmEtKXg=="}'
3. Generating a Token
The decoded results above show that the token carries the fields listed below, and every value except ts and cts is fixed.
13"rId":100900,
"ver":"1.0.6",
"ts":1555228714393,
"cts":1555228714429,
"brVD":[1010,750],
"brR":[[1920,1080],[1920,1040],24,24],
"bI":["https://gz.meituan.com/meishi/c11/",""],
"mT":[],
"kT":[],
"aT":[],
"tT":[],
"aM":"",
"sign":"eJwdjktOwzAQhu/ShXeJ4zYNKpIXqKtKFTsOMLUn6Yj4ofG4UjkM10CsOE3vgWH36df/2gAjnLwdlAPBBsYoR3J/hYD28f3z+PpUnmJEPqYa5UWEm0mlLBRqOSaP1qjEtFB849VeRXJ51nr56AOSVIi9S0E3LlfSzhitMix/mQwsrdWa7aTyCjInDk1mKu9nvOHauCQWq2rB/8laqd3cX+adv0zdzm3nbjTOdzCi69A/HQAHOOyHafMLmEtKXg=="
Both ts and cts are timestamps that define the token's validity window, so when generating a token we only need to refresh these two fields and keep everything else unchanged. Note that ts and cts are millisecond timestamps, i.e. 1000 times the usual second-based Unix timestamp, so don't forget the factor of 1000 when generating them. Here is the code:
import base64
import json
import zlib
from datetime import datetime

# Generate a token
def encode_token():
    ts = int(datetime.now().timestamp() * 1000)
    token_dict = {
        'rId': 100900,
        'ver': '1.0.6',
        'ts': ts,
        'cts': ts + 100 * 1000,
        'brVD': [1010, 750],
        'brR': [[1920, 1080], [1920, 1040], 24, 24],
        'bI': ['https://gz.meituan.com/meishi/c11/', ''],
        'mT': [],
        'kT': [],
        'aT': [],
        'tT': [],
        'aM': '',
        'sign': 'eJwdjktOwzAQhu/ShXeJ4zYNKpIXqKtKFTsOMLUn6Yj4ofG4UjkM10CsOE3vgWH36df/2gAjnLwdlAPBBsYoR3J/hYD28f3z+PpUnmJEPqYa5UWEm0mlLBRqOSaP1qjEtFB849VeRXJ51nr56AOSVIi9S0E3LlfSzhitMix/mQwsrdWa7aTyCjInDk1mKu9nvOHauCQWq2rB/8laqd3cX+adv0zdzm3nbjTOdzCi69A/HQAHOOyHafMLmEtKXg=='
    }
    # serialize to a JSON string (the decoded tokens above are JSON) and encode to bytes
    encode = json.dumps(token_dict).encode()
    # zlib (binary) compression
    compress = zlib.compress(encode)
    # Base64 encoding
    b_encode = base64.b64encode(compress)
    # convert bytes to str
    token = str(b_encode, encoding='utf-8')
    return token
Generating a token is simply the reverse of decoding one: compress the payload with zlib, Base64-encode it, and convert the result to a string. Below are four tokens generated this way.
eJxNUMtuo0AQ/BXfOCTyzABDIFIOYJPwDLbxAxTlMDFve8YGxnbCav99mV1tdg8tVXVXV0n1QzqspcfJ2/v9RProVgK+IUOG9xMEdThuv5kqmKyKEeLOzUYxgtCA41na815QjLFsaBiruqb9dtzOhSWCaBQ9YGEhkX+BrkBSxfm5fwSgHKY0r/mFsOn+RMGI+6oGe4SANIol8cG/f//PU//mkXBcSkLd1yUTOPduWXPg0W0wl9UFxFWSe+qQvvpnN2l97j+v+ygMNkxLG/VUvKib5hAiOOsjW7mWO0fRsgLIpdmw4JYdzYVl9elppXigSueyXijD3eK8YdSzF21K8GZnU0iPgbVqo5gsUNvY/NnSVWObrxIPI9ZhzYzirVsbMbSV4FjEQ1XzsP4EdHnru2xHHsj6a9a4bH5A1L8Y7Bo55DJb7lq5s4B+JG2m7JM7kl3hkA1UYR/NOsqGWa0ZJnCWphNFXw4pwoDa3E/KpyfRxTXvRBVoCqea4PRPiT9/AQ+1kuU=
eJxNkUlv2zAQhf9KbjqkMEltsQLkINlKtZa25UVC0ANtarVJWxJtNyr63yu2qNHDAPPezHyHNz+Vg+iV1ydkGIZqmZquITj98qTc8m50FTSBE1MZNVmP8uO77GI5kJ54eJ1PJQNCC8JRssfgf7b6j93XJZeMPLjT5ijwfbCX1RUkVZoH+pB9Cy9+2oYifF/3OI423Mwa/Vx81TfNMUZw1mNXu5U7TzNpAdTSbnh0pyd74Th9dl5pAaiyuTottOF5cdlwFriLNiPGZucyyE6Rs2pxQhaobVzx7kx1a5uv0sBAvDNMGydbv7YS6GrRqUiGqhZx/QOw5b3v6I68kPXnrPH5/IhYeLX4DXvkOlvuWrVzwPREWqod0mdCb3CgA9P4vlljOsxq07KBt7Q9jD89UsQRc0WYlm9vMsN9t53LsBBEY3IvBpSp7X1pKZUQl/4VgHKYsLwWV8InhzMDY99XNTggBCRB+XPRreTJB7LUETPGLDkPpUul6rLk8vHve379BuQ8ksc=
eJxNkEmP2kAQhf/K3HyYiF68DB5pDjZ44jUNmMXWKIeG9gptsN1AxlH+e9yJgnIo6b1XqlLV91Oha+X16eP7lyelr4pm1Erm31l9FOQ+WMvyCuIyyXxtSL8FFy9pAxG8r3sShZvGSGvtnH/VNvUxQnDWE0e9FTtXNVgOcGHVTXhnJ2th2316Xqk+KNM5nubq8Ly4bBruO4s2pfpm53DIT6G9aklMF6itHfFuTzVzm60SX0dNpxsWibdeZcbQUcNTHg9lJaLqB+DLe9+xHX2h689Z7TXzI+LB1WxuxKXX2XLX4s4G0xNtmXpInim7wYENXG329ZqwYVYZpgXcpeUS8unSPAq5I4KkeHtTRhYH0Y8okK7r2DSwiZGBx5RGko/sdx6TfQhNCEe79yREpRTi0r8CUAwTnlXiSpvJ4czBqPuyAgeEgJxVJOzjA7t4KP5Q+247lxpBNK5/0eHfcCWzD2TiMURwKtOH06TDmqw/W/+7H/27/5Z18gE0gRND+fUbPhuSzQ==
eJxNkElv4kAQhf9Kbj5kRHd7AyPlYIMzXtOAWWxFc2hor9AG2w0kHs1/jzujQXMo6b1XqlLV91sia2n69P7rx5PUlXk9aCn17rQ6cnzvzWVxBVERp57aJ2/+xY0bn/uv6w6HwabWk0o9Zz/VTXUMEZx12FZu+c5RdJoBOTerOrjTk7mwrC45rxQPFMlcnmRK/7y4bGrm2YsmIdpmZzPIToG1anBEFqipbP5qTVRjm65iT0N1q+kmjrZuaUTQVoJTFvVFycPyA7DlvWvpjozJ+nNWufX8iJh/Neobdsh1ttw1cmuByYk0VDnEz4TeYE97ptT7ao1pPyt1wwTO0nQw/nRIFgbM5n6cv7xIA4sD7wYUSNM02dBlQ0ZjeUhJKPiIfutS0YfQgHCwe1dAlArOL90UgLwfsbTkV1KPDmcGBt0VJTggBMSsJGAfH9j5Q7GH2rfbudAIomH9WIN/w5XI3pEhDyGCE5E+nCqcrIr63vrf/ejf/be0FQ+gERzp0p8vPwGSzw==
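As a quick round-trip check (a minimal sketch assuming decode_token and encode_token live in the same module), a freshly generated token should decode straight back to the payload:
new_token = encode_token()
print(decode_token(new_token))   # shows the token_dict payload with fresh ts/cts values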
II. Calling the getPoiList API
1. Overview
With token parsing and generation covered, this part shows how to call Meituan's getPoiList API to get the restaurant data back as JSON. The data returned by getPoiList looks like the screenshot below:
2. Simulating a Browser Request
Meituan's data gets scraped a lot, so its anti-scraping defences are strong. When calling the API you must therefore make the request look like it comes from a browser. The headers used are listed below; the most important point is to vary the User-Agent:
simulateBrowserHeader = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'gz.meituan.com',
    'Referer': 'https://gz.meituan.com/meishi/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
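The header above pins a single User-Agent. If you want to vary the User-Agent across requests, as suggested above, one possible sketch looks like this (the extra UA strings here are only illustrative examples, not values taken from the original project):
import random

# a small, illustrative pool of desktop User-Agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
]

def build_headers():
    # copy the base headers and swap in a random User-Agent
    headers = dict(simulateBrowserHeader)
    headers['User-Agent'] = random.choice(USER_AGENTS)
    return headers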
3. URL Encoding
When the request URL is sent, it gets URL-encoded. If you look closely at the request headers in the browser, you will see that special characters in the URL (such as '/', ':', '+', '=') are turned into percent-encoded sequences; in the screenshot below, ':' becomes '%3A', '/' becomes '%2F', and so on.
So we need a small helper that applies the same replacements to the special characters in our URL (comparing the _token in the screenshot with the raw token shows exactly how each character is encoded):
# Replace the special characters '/', '+', '=', ':' for use in URL requests
def str_replace(string):
    return string.replace('/', '%2F') \
        .replace('+', '%2B') \
        .replace('=', '%3D') \
        .replace(':', '%3A')
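For reference, Python's standard urllib.parse.quote produces the same percent-encoding when nothing is marked as safe, so the manual replacements can be cross-checked like this (a sketch; equivalent for the characters that actually appear in the token and originUrl):
from urllib.parse import quote

original = 'https://gz.meituan.com/meishi/c11/'
# quote() with safe='' percent-encodes ':', '/', '+' and '=' just like str_replace
assert quote(original, safe='') == str_replace(original)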
4. Calling the getPoiList API
With all the groundwork in place we can call the API. It takes quite a few parameters, but their names are self-explanatory, so let's go straight to the calling code.
import json
import requests

if __name__ == '__main__':
    cityName = '广州'
    originUrl = str_replace('https://gz.meituan.com/meishi/c11/')
    # 生成token
    token_encode = encode_token()
    token = str_replace(token_encode)
    url = 'https://gz.meituan.com/meishi/api/poi/getPoiList?' \
          'cityName=%s' \
          '&cateId=11' \
          '&areaId=0' \
          '&sort=' \
          '&dinnerCountAttrId=' \
          '&page=1' \
          '&userId=' \
          '&uuid=05bf3db6-3c2f-41cd-a4ec-ed79ae0a9506' \
          '&platform=1' \
          '&partner=126' \
          '&originUrl=%s' \
          '&riskLevel=1' \
          '&optimusCode=1' \
          '&_token=%s' % (cityName, originUrl, token)
    response = requests.get(url, headers=simulateBrowserHeader)
    if response.status_code == 200:
        data = response.json()['data']
        with open('data.json', 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False)
            print('Save data into json file successfully!')
    if response.status_code == 403:
        print('Access is denied by server!')
In the URL above, besides cityName, originUrl and token, the cateId and page parameters can also be changed; adjusting them dynamically is covered in the next part. For now the JSON returned by getPoiList is simply saved to a JSON file (part of it is shown in the screenshot below); the next part explains how to fetch the paginated data and store it in a database.
III. Fetching Paginated Data and Storing It in MySQL
1. Overview
The previous two parts covered reconstructing the Meituan token and calling the getPoiList API. This part covers two things: fetching the paginated restaurant data, and writing that data into a MySQL database.
2. Fetching the Paginated Restaurant Data
Building on the API call from part two, only a few URL parameters need to change to get paginated restaurant data: cateId (the cuisine category), page (the current page number) and originUrl (the listing URL the pages belong to). The code below extracts the cuisine ID and adjusts the URL parameters:
# get category id
def get_cateId(link_url):
    splits = link_url.split('/')
    return splits[-2].replace('c', '')
# call interface to get json data
def call_interface(page, originUrl):
    cityName = '广州'
    cate_id = get_cateId(originUrl)
    originUrl = str_replace(originUrl)
    token = str_replace(encode_token())
    url = 'https://gz.meituan.com/meishi/api/poi/getPoiList?' \
          'cityName=%s' \
          '&cateId=%s' \
          '&areaId=0' \
          '&sort=' \
          '&dinnerCountAttrId=' \
          '&page=%s' \
          '&userId=' \
          '&uuid=05bf3db6-3c2f-41cd-a4ec-ed79ae0a9506' \
          '&platform=1' \
          '&partner=126' \
          '&originUrl=%s' \
          '&riskLevel=1' \
          '&optimusCode=1' \
          '&_token=%s' % (cityName, cate_id, page, originUrl, token)
    response = requests.get(url, headers=simulateBrowserHeader)
    if response.status_code == 200:
        data = response.json()['data']
        return data
    if response.status_code == 403:
        print('Access is denied by server!')
        return {}
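A quick usage sketch (using the same c11 listing URL as earlier; whether the call actually returns data depends on Meituan accepting the generated token):
if __name__ == '__main__':
    origin_url = 'https://gz.meituan.com/meishi/c11/'
    print(get_cateId(origin_url))           # -> '11'
    data = call_interface(1, origin_url)    # first page of category 11
    if data:
        print(data['totalCounts'], len(data['poiInfos']))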
The next step is to scrape the list of cuisine URLs (dish_url_list) from the Meituan homepage, then iterate over it and fetch every page of each cuisine. The getPoiList response includes totalCounts, the total number of restaurants in that cuisine, from which the number of pages can be computed (15 restaurants per page); call_interface() is then invoked once per page. The implementation:
import scrapy
from scrapy import Request


class FoodSpider(scrapy.Spider):
    name = 'food'
    allowed_domains = ['gz.meituan.com']
    start_urls = ['https://gz.meituan.com/meishi/']
    def parse(self, response):
        # URLs of all cuisine categories
        dish_url_list = response.xpath('//*[@id="app"]//*[@data-reactid="20"]/li/a/@href').extract()
        # print(dish_url_list)
        # traverse each dish_url to get food data
        for dish_url in dish_url_list:
            yield Request(dish_url.replace('http', 'https'), callback=self.parse_food)
    def parse_food(self, response):
        origin_url = response.url
        print('crawl food from ' + origin_url)
        dish_type = response.xpath('//*[@id="app"]//*[@class="hasSelect"]/span[2]/text()').extract()[0]
        re_data = call_interface(1, origin_url)
        data_list = get_food_list(dish_type, re_data['poiInfos'])
        # calculate how many pages
        if re_data['totalCounts'] % 15 == 0:
            page_num = re_data['totalCounts'] // 15
        else:
            page_num = re_data['totalCounts'] // 15 + 1
        for page in range(2, page_num + 1):
            re_data = call_interface(page, origin_url)
            data_list.extend(get_food_list(dish_type, re_data['poiInfos']))
        write_to_db(data_list)
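If this spider lives in a normal Scrapy project you would start it with the scrapy crawl food command; to try it as a standalone script instead, a minimal sketch using CrawlerProcess:
from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(FoodSpider)
    process.start()   # blocks until the crawl finishes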
3. Storing the Paginated Data in MySQL
First create the database table, choosing whichever fields of the restaurant data you need; if you would rather not design it yourself, just use the SQL below.
create table tb_restaurants
(
    pk_id           char(36)     not null comment 'primary key',
    dish_type       varchar(20)  not null comment 'cuisine type',
    restaurant_name varchar(50)  not null comment 'restaurant name',
    location        varchar(100) not null comment 'restaurant address',
    price           int          not null comment 'average price per person',
    star            float        not null comment 'rating',
    img_url         varchar(200) not null comment 'restaurant image URL',
    comment_num     int          not null comment 'number of reviews',
    primary key (pk_id)
);
Next, create an entity class matching the table, with every field initialized to None, as shown below. The to_json() method is there to make it easy to dump the scraped data to a JSON file.
class MeituanItem:
    def __init__(self):
        self.pk_id = None
        self.dish_type = None
        self.restaurant_name = None
        self.location = None
        self.price = None
        self.star = None
        self.img_url = None
        self.comment_num = None
    # convert to a dict, convenient for saving as a JSON file
    def to_json(self):
        return {
            'pk_id': self.pk_id,
            'dish_type': self.dish_type,
            'restaurant_name': self.restaurant_name,
            'location': self.location,
            'price': self.price,
            'star': self.star,
            'img_url': self.img_url,
            'comment_num': self.comment_num
        }
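As mentioned, to_json() is there so the scraped items can also be dumped to a JSON file. A minimal sketch (the helper name and file name are arbitrary):
import json

def write_to_json(item_list, path='restaurants.json'):
    with open(path, 'w', encoding='utf-8') as f:
        json.dump([item.to_json() for item in item_list], f, ensure_ascii=False)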
Next, the returned data has to be converted into a list of entity objects. In the code below, the category parameter is the cuisine type, and poiInfos is the list of restaurant dicts taken from the getPoiList JSON response; the exact fields it contains are shown in the figure below.
import uuid

# get food detail from poiInfos
def get_food_list(category, poiInfos):
    item_list = []
    for i in range(0, len(poiInfos)):
        item = MeituanItem()
        item.pk_id = str(uuid.uuid1())
        item.dish_type = category
        item.restaurant_name = poiInfos[i]['title']
        item.location = poiInfos[i]['address']
        item.price = 0 if poiInfos[i]['avgPrice'] is None else int(poiInfos[i]['avgPrice'])
        item.star = float(poiInfos[i]['avgScore'])
        item.img_url = poiInfos[i]['frontImg']
        item.comment_num = int(poiInfos[i]['allCommentNum'])
        item_list.append(item)
    return item_list
Finally, open a database connection with pymysql, filling in host, port, user, password and database, and insert the rows with plain SQL statements; there is nothing special here, so let's look at the code.
import pymysql

# write data into mysql database
def write_to_db(item_list):
    # mysql connection information
    conn = pymysql.Connect(
        host='127.0.0.1',
        port=3306,
        user='root',
        password='123456',
        database='meituan',
        charset='utf8')
    cursor = conn.cursor()
    # insert into database one by one
    for item in item_list:
        sql = 'INSERT INTO TB_RESTAURANTS(pk_id, dish_type, restaurant_name, location, price, star, img_url,' \
              ' comment_num) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'
        params = (item.pk_id, item.dish_type, item.restaurant_name, item.location, item.price, item.star, item.img_url,
                  item.comment_num)
        # execute sql
        cursor.execute(sql, params)
        # commit
        conn.commit()
    cursor.close()
    # close connection
    conn.close()
    print('Write data into MySQL database successfully!')
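Committing after every single insert works but is slow once you are writing tens of thousands of rows. A possible alternative (a sketch using the same table and connection settings as above) batches the inserts with executemany and commits once:
def write_to_db_batch(item_list):
    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                           password='123456', database='meituan', charset='utf8')
    sql = 'INSERT INTO TB_RESTAURANTS(pk_id, dish_type, restaurant_name, location, price, star, img_url,' \
          ' comment_num) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'
    params = [(i.pk_id, i.dish_type, i.restaurant_name, i.location, i.price, i.star, i.img_url, i.comment_num)
              for i in item_list]
    with conn.cursor() as cursor:
        # one round of executemany, then a single commit
        cursor.executemany(sql, params)
    conn.commit()
    conn.close()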
Once the write finishes, you can query tens of thousands of restaurant rows in the database, as shown below:
IV. Acknowledgements
That wraps up the whole workflow of scraping Meituan restaurant data. If you want the full project source, you can download it from the GitHub link. If you found this write-up helpful, please give the project a star. Thanks!
