Docker in Practice: Python with Docker - scraping Douyin web data (19)
Published: 2019-03-04


Original article; reposting is welcome. When reposting, please credit the source. Thanks!
Original link:

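Douyin scraping in practice: why scrape this data at all? Consider an internet-based fresh-food e-commerce company whose owner wants to buy advertising on high-traffic platforms, raising product exposure as a marketing tactic. While choosing where to advertise he notices Douyin, which has enormous traffic, and wants to try advertising there to see whether the returns and results justify it. To do that, the company analyzes Douyin's data and builds user profiles to judge how well the audience matches the company; this requires each account's follower count, like count, following count, and nickname. Knowing what users like, the company can work its products into videos and promote them more effectively. PR agencies can also use this data to spot up-and-coming influencers and package them for marketing. Source code: (douyin)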

The Douyin share page

  • Introduction

  • Install the XPath Helper extension for Chrome

Get the crx file from the source code.

In Chrome, open: chrome://extensions/

Drag xpath-helper.crx directly onto the chrome://extensions/ page.

Once it is installed:

Press ctrl+shift+x to launch XPath Helper. It is usually used together with Chrome's F12 developer tools.

Scraping the Douyin share page with Python

Analyze the share page https://www.douyin.com/share/user/76055758243

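1. Douyin has an anti-scraping mechanism: the digits in the Douyin ID are rendered as special strings instead of plain numbers, so each string has to be replaced with the digit it stands for.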

{'name': ['  ', '  ', '  '], 'value': 0},
{'name': ['  ', '  ', '  '], 'value': 1},
{'name': ['  ', '  ', '  '], 'value': 2},
{'name': ['  ', '  ', '  '], 'value': 3},
{'name': ['  ', '  ', '  '], 'value': 4},
{'name': ['  ', '  ', '  '], 'value': 5},
{'name': ['  ', '  ', '  '], 'value': 6},
{'name': ['  ', '  ', '  '], 'value': 7},
{'name': ['  ', '  ', '  '], 'value': 8},
{'name': ['  ', '  ', '  '], 'value': 9},
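The replacement step can be sketched as a single-pass substitution. This is only an illustration: the tokens in `DIGIT_TABLE` are hypothetical placeholders, not Douyin's real codes, and `build_decoder` is a helper name made up for this example. The table has the same shape as the list above.

```python
import re

# Hypothetical tokens for illustration only; substitute the strings
# actually observed in the page source.
DIGIT_TABLE = [
    {"name": ["&#xa000;", "&#xb000;"], "value": 0},
    {"name": ["&#xa001;", "&#xb001;"], "value": 1},
    {"name": ["&#xa002;", "&#xb002;"], "value": 2},
]


def build_decoder(table):
    """Return a function that swaps every known token for its digit."""
    token_to_digit = {
        token: str(entry["value"])
        for entry in table
        for token in entry["name"]
    }
    # re.escape keeps characters such as '#' literal; joining the tokens
    # into one alternation lets the page be rewritten in a single pass.
    pattern = re.compile("|".join(re.escape(t) for t in token_to_digit))
    return lambda html: pattern.sub(lambda m: token_to_digit[m.group(0)], html)


decode = build_decoder(DIGIT_TABLE)
sample = "<i> &#xa001; </i><i> &#xb002; </i><i> &#xa000; </i>"
print(decode(sample))
```

Compared with calling `re.sub` once per token, one compiled alternation scans the page only once and avoids treating a token as a regular expression by accident.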

2. Get the XPath of each node we need.

# Nickname
//div[@class='personal-card']/div[@class='info1']//p[@class='nickname']/text()
# Douyin ID (digits)
//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()
# Job
//div[@class='personal-card']/div[@class='info2']/div[@class='verify-info']/span[@class='info']/text()
# Description
//div[@class='personal-card']/div[@class='info2']/p[@class='signature']/text()
# Location
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[1]/text()
# Zodiac sign
//div[@class='personal-card']/div[@class='info2']/p[@class='extra-info']/span[2]/text()
# Following count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='focus block']//i[@class='icon iconfont follow-num']/text()
# Follower count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()
# Follower count unit (for example w)
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='follower block']/span[@class='num']/text()
# Like count
//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()
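To see how such paths pick fields out of a page, here is a small sketch that runs a few of them against a made-up fragment. It uses the standard library's `xml.etree.ElementTree` as a stand-in so it runs without extra packages; ElementTree supports only a subset of XPath (no `text()`), so the text is read from `.text` instead. The markup in `SAMPLE` and the helper `first_text` are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that reuses the class names above.
SAMPLE = """
<html><body>
<div class="personal-card">
  <div class="info1"><p class="nickname">demo_user</p></div>
  <div class="info2">
    <p class="signature">hello</p>
    <p class="extra-info"><span>Beijing</span><span>Leo</span></p>
  </div>
</div>
</body></html>
"""


def first_text(root, path, default=""):
    """Return the stripped text of the first match, or default if none."""
    node = root.find(path)
    if node is None or node.text is None:
        return default
    return node.text.strip()


root = ET.fromstring(SAMPLE.strip())
card = ".//div[@class='personal-card']"
info2 = card + "/div[@class='info2']"
profile = {
    "nick_name": first_text(root, card + "/div[@class='info1']//p[@class='nickname']"),
    "describe": first_text(root, info2 + "/p[@class='signature']"),
    "location": first_text(root, info2 + "/p[@class='extra-info']/span[1]"),
    "xingzuo": first_text(root, info2 + "/p[@class='extra-info']/span[2]"),
    # No verify-info block in the sample, so this falls back to "".
    "job": first_text(root, info2 + "/div[@class='verify-info']/span[@class='info']"),
}
print(profile)
```

Returning a default for a missing node avoids the `IndexError` that indexing an empty result list with `[0]` would raise for optional fields such as the job title.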

  • Full code
import re
import time

import requests
from lxml import etree

# Shared XPath prefixes for the profile card
INFO1 = "//div[@class='personal-card']/div[@class='info1']"
INFO2 = "//div[@class='personal-card']/div[@class='info2']"
FOLLOW = INFO2 + "/p[@class='follow-info']"
NUM_ICON = "//i[@class='icon iconfont follow-num']/text()"


def handle_decode(input_data, share_web_url, task):
    # Prefix shown before the ID on the page; kept in Chinese to match the HTML
    search_douyin_str = re.compile(r'抖音ID:')
    # 'name' holds the strings the page uses in place of each digit
    regex_list = [
        {'name': ['  ', '  ', '  '], 'value': 0},
        {'name': ['  ', '  ', '  '], 'value': 1},
        {'name': ['  ', '  ', '  '], 'value': 2},
        {'name': ['  ', '  ', '  '], 'value': 3},
        {'name': ['  ', '  ', '  '], 'value': 4},
        {'name': ['  ', '  ', '  '], 'value': 5},
        {'name': ['  ', '  ', '  '], 'value': 6},
        {'name': ['  ', '  ', '  '], 'value': 7},
        {'name': ['  ', '  ', '  '], 'value': 8},
        {'name': ['  ', '  ', '  '], 'value': 9},
    ]
    # Swap every obfuscated string for its real digit before parsing
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath(INFO1 + "//p[@class='nickname']/text()")[0]
    # The ID digits sit in <i> tags under p.shortid; the label is the p's own text
    douyin_id = ''.join(share_web_html.xpath(INFO1 + "/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath(INFO1 + "/p[@class='shortid']/text()")[0]).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath(INFO2 + "/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        # Not every account has a verified job title
        pass
    douyin_info['describe'] = share_web_html.xpath(INFO2 + "/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath(INFO2 + "/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath(INFO2 + "/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath(FOLLOW + "//span[@class='focus block']" + NUM_ICON)[0].strip()
    fans_value = ''.join(share_web_html.xpath(FOLLOW + "//span[@class='follower block']" + NUM_ICON))
    unit = share_web_html.xpath(FOLLOW + "//span[@class='follower block']/span[@class='num']/text()")
    if unit and unit[-1].strip() == 'w':
        # With the 'w' unit the last digit is a decimal place
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    else:
        douyin_info['fans'] = fans_value
    like = ''.join(share_web_html.xpath(FOLLOW + "//span[@class='liked-num block']" + NUM_ICON))
    unit = share_web_html.xpath(FOLLOW + "//span[@class='liked-num block']/span[@class='num']/text()")
    if unit and unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    else:
        douyin_info['like'] = like
    douyin_info['from_url'] = share_web_url
    print(douyin_info)


def handle_douyin_web_share(share_id):
    share_web_url = 'https://www.douyin.com/share/user/' + share_id
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, share_id)


if __name__ == '__main__':
    while True:
        share_id = "76055758243"
        if share_id is None:
            print('Current task: %s' % share_id)
            break
        else:
            print('Current task: %s' % share_id)
            handle_douyin_web_share(share_id)
        time.sleep(2)
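The follower and like counts arrive as a run of digits plus an optional `w` unit, and the code above divides by 10 when the unit is present. A minimal sketch of turning that pair into a plain integer follows. It assumes, as the division by 10 suggests, that the last digit is one decimal place and that `w` stands for 10,000 (万); `parse_count` is a hypothetical helper, not part of the original script.

```python
def parse_count(digits, unit=""):
    """Convert joined digit text plus an optional unit suffix to an int.

    Assumption: with unit 'w' the digits carry one implied decimal place,
    so "4306" + "w" means 430.6w, and 1w = 10,000.
    """
    value = int(digits.strip())  # raises ValueError on non-numeric text
    if unit.strip() == "w":
        # digits / 10 * 10_000 simplifies to digits * 1_000
        return value * 1000
    return value


print(parse_count("4306", "w"))
print(parse_count("128"))
```

Storing an integer rather than a string such as `430.6w` makes the values easy to sort and aggregate later, for example in MongoDB queries.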

MongoDB

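Create MongoDB inside a virtual machine generated with vagrant. For details, see
Docker in Practice: Python Docker crawler techniques - scraping an app with a Python script (13)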

su -
# password: vagrant

docker
> https://hub.docker.com/r/bitnami/mongodb
> Default port: 27017

```bash
docker pull bitnami/mongodb:latest
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest
# Stop the firewall
systemctl stop firewalld
```

  • Working with MongoDB

Read the txt file to get the user IDs.
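As a rough sketch of this step, the lines of the file can be cleaned before they become task documents. `build_tasks` is a hypothetical helper and the second ID in the sample is made up; skipping blank and repeated lines is an addition for illustration, not something the original script does.

```python
def build_tasks(lines):
    """Yield one task document per unique, non-empty ID line."""
    seen = set()
    for line in lines:
        share_id = line.strip()
        if not share_id or share_id in seen:
            continue  # skip blank lines and repeated IDs
        seen.add(share_id)
        yield {"share_id": share_id}


# Stand-in for the lines read from the txt file; the second ID is invented.
sample_lines = ["76055758243\n", "\n", "76055758243\n", "  123456  \n"]
tasks = list(build_tasks(sample_lines))
print(tasks)
```

The resulting list could then be written in one call with PyMongo's `insert_many` instead of one insert per line.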

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/30 19:35
# @Author  : Aries
# @Site    : 
# @File    : handle_mongo.py.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    # Load every ID from the txt file into the task_id collection
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
        for i in f_read:
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            # insert() is deprecated in PyMongo 3 and removed in 4
            task_id_collections.insert_one(task_info)


def handle_get_task():
    # Fetch one task and remove it so it is not processed twice
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})

#handle_init_task()
  • Modify the Python program to use it
import re
import time

import requests
from lxml import etree

from handle_mongo import handle_get_task
from handle_mongo import handle_insert_douyin

# Shared XPath prefixes for the profile card
INFO1 = "//div[@class='personal-card']/div[@class='info1']"
INFO2 = "//div[@class='personal-card']/div[@class='info2']"
FOLLOW = INFO2 + "/p[@class='follow-info']"
NUM_ICON = "//i[@class='icon iconfont follow-num']/text()"


def handle_decode(input_data, share_web_url, task):
    # Prefix shown before the ID on the page; kept in Chinese to match the HTML
    search_douyin_str = re.compile(r'抖音ID:')
    # 'name' holds the strings the page uses in place of each digit
    regex_list = [
        {'name': ['  ', '  ', '  '], 'value': 0},
        {'name': ['  ', '  ', '  '], 'value': 1},
        {'name': ['  ', '  ', '  '], 'value': 2},
        {'name': ['  ', '  ', '  '], 'value': 3},
        {'name': ['  ', '  ', '  '], 'value': 4},
        {'name': ['  ', '  ', '  '], 'value': 5},
        {'name': ['  ', '  ', '  '], 'value': 6},
        {'name': ['  ', '  ', '  '], 'value': 7},
        {'name': ['  ', '  ', '  '], 'value': 8},
        {'name': ['  ', '  ', '  '], 'value': 9},
    ]
    # Swap every obfuscated string for its real digit before parsing
    for i1 in regex_list:
        for i2 in i1['name']:
            input_data = re.sub(i2, str(i1['value']), input_data)
    share_web_html = etree.HTML(input_data)
    douyin_info = {}
    douyin_info['nick_name'] = share_web_html.xpath(INFO1 + "//p[@class='nickname']/text()")[0]
    # The ID digits sit in <i> tags under p.shortid; the label is the p's own text
    douyin_id = ''.join(share_web_html.xpath(INFO1 + "/p[@class='shortid']/i/text()"))
    douyin_info['douyin_id'] = re.sub(search_douyin_str, '', share_web_html.xpath(INFO1 + "/p[@class='shortid']/text()")[0]).strip() + douyin_id
    try:
        douyin_info['job'] = share_web_html.xpath(INFO2 + "/div[@class='verify-info']/span[@class='info']/text()")[0].strip()
    except IndexError:
        # Not every account has a verified job title
        pass
    douyin_info['describe'] = share_web_html.xpath(INFO2 + "/p[@class='signature']/text()")[0].replace('\n', ',')
    douyin_info['location'] = share_web_html.xpath(INFO2 + "/p[@class='extra-info']/span[1]/text()")[0]
    douyin_info['xingzuo'] = share_web_html.xpath(INFO2 + "/p[@class='extra-info']/span[2]/text()")[0]
    douyin_info['follow_count'] = share_web_html.xpath(FOLLOW + "//span[@class='focus block']" + NUM_ICON)[0].strip()
    fans_value = ''.join(share_web_html.xpath(FOLLOW + "//span[@class='follower block']" + NUM_ICON))
    unit = share_web_html.xpath(FOLLOW + "//span[@class='follower block']/span[@class='num']/text()")
    if unit and unit[-1].strip() == 'w':
        # With the 'w' unit the last digit is a decimal place
        douyin_info['fans'] = str(int(fans_value) / 10) + 'w'
    else:
        douyin_info['fans'] = fans_value
    like = ''.join(share_web_html.xpath(FOLLOW + "//span[@class='liked-num block']" + NUM_ICON))
    unit = share_web_html.xpath(FOLLOW + "//span[@class='liked-num block']/span[@class='num']/text()")
    if unit and unit[-1].strip() == 'w':
        douyin_info['like'] = str(int(like) / 10) + 'w'
    else:
        douyin_info['like'] = like
    douyin_info['from_url'] = share_web_url
    print(douyin_info)
    # Persist the parsed profile to MongoDB
    handle_insert_douyin(douyin_info)


def handle_douyin_web_share(task):
    share_web_url = 'https://www.douyin.com/share/user/' + task["share_id"]
    print(share_web_url)
    share_web_header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }
    share_web_response = requests.get(url=share_web_url, headers=share_web_header)
    handle_decode(share_web_response.text, share_web_url, task["share_id"])


if __name__ == '__main__':
    while True:
        task = handle_get_task()
        if task is None:
            # find_one_and_delete returns None once the queue is empty
            break
        handle_douyin_web_share(task)
        time.sleep(2)
  • MongoDB fields

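handle_init_task stores the contents of the txt file in MongoDB.
handle_get_task fetches one record and deletes it. Because the IDs still exist in the txt file, deleting them from the collection does no harm.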
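The fetch-and-delete pattern makes the collection behave like a queue, so the crawl loop needs a way to stop when the queue runs dry (PyMongo's `find_one_and_delete` returns `None` when nothing matches). Below is a sketch of such a loop with the task source passed in as a function, so it can be tried without a database. `drain` and `fake_get_task` are hypothetical names, and the in-memory deque only stands in for the `task_id` collection.

```python
import time
from collections import deque


def drain(get_task, handle, delay=0.0):
    """Call handle(task) until get_task() returns None; return the count."""
    processed = 0
    while True:
        task = get_task()
        if task is None:
            break  # queue is empty, stop instead of crashing on task["share_id"]
        handle(task)
        processed += 1
        time.sleep(delay)
    return processed


# In-memory stand-in for the task_id collection (made-up IDs).
queue = deque([{"share_id": "1"}, {"share_id": "2"}])


def fake_get_task():
    return queue.popleft() if queue else None


handled = []
count = drain(fake_get_task, handled.append)
print(count, handled)
```

In the real script, `handle_get_task` and `handle_douyin_web_share` would take the places of `fake_get_task` and `handled.append`, with `delay=2` to keep the pause between requests.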

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/30 19:35
# @Author  : Aries
# @Site    : 
# @File    : handle_mongo.py.py
# @Software: PyCharm
import pymongo
from pymongo.collection import Collection

client = pymongo.MongoClient(host='192.168.66.100', port=27017)
db = client['douyin']


def handle_init_task():
    # Load every ID from the txt file into the task_id collection
    task_id_collections = Collection(db, 'task_id')
    with open('douyin_hot_id.txt', 'r') as f:
        f_read = f.readlines()
        for i in f_read:
            task_info = {}
            task_info['share_id'] = i.replace('\n', '')
            # insert() is deprecated in PyMongo 3 and removed in 4
            task_id_collections.insert_one(task_info)


def handle_insert_douyin(douyin_info):
    # Store one parsed profile in the douyin_info collection
    task_id_collections = Collection(db, 'douyin_info')
    task_id_collections.insert_one(douyin_info)


def handle_get_task():
    # Fetch one task and remove it so it is not processed twice
    task_id_collections = Collection(db, 'task_id')
    # return task_id_collections.find_one({})
    return task_id_collections.find_one_and_delete({})


if __name__ == '__main__':
    # Run this file directly to load the tasks; guarding the call keeps
    # "import handle_mongo" in the crawler from reloading them every time
    handle_init_task()

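PS: The 1,000 entries in the text file are far too few to crawl. In practice the app side and the PC side work together: the PC side handles the initial data, then the follower list is fetched for each userID and the process loops continuously, which makes it possible to collect a very large amount of data.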
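The loop described in the PS is essentially a breadth-first walk over the follower graph. A minimal sketch under stated assumptions: `get_fans` is a hypothetical callable that returns the follower IDs of a user (in practice it would come from the app side), and `FAKE_GRAPH` is made-up data used only to show the traversal.

```python
from collections import deque


def crawl(seed_ids, get_fans, limit):
    """Breadth-first walk over user IDs; returns the IDs in visit order."""
    seeds = list(dict.fromkeys(seed_ids))  # drop duplicate seeds, keep order
    seen = set(seeds)
    pending = deque(seeds)
    visited = []
    while pending and len(visited) < limit:
        user_id = pending.popleft()
        visited.append(user_id)
        for fan_id in get_fans(user_id):
            if fan_id not in seen:  # never queue the same user twice
                seen.add(fan_id)
                pending.append(fan_id)
    return visited


# Made-up follower graph standing in for real follower lists.
FAKE_GRAPH = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
order = crawl(["a"], lambda uid: FAKE_GRAPH.get(uid, []), limit=10)
print(order)
```

The `seen` set is what keeps the loop from running forever when users follow each other, and `limit` caps how many profiles one run will visit.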
