Crawling Douyin with Selenium
A friend of mine received an outsourcing request to crawl Douyin video comments. I finished the job and decided to write up a summary for later study. Discussion and exchange are also welcome. WeChat: mastercy1
First I searched Baidu to see how others approach it and to borrow some ideas; most solutions use a mobile phone emulator. While preparing to download an emulator, I found that Douyin now has a PC web version: https://www.douyin.com/
So let’s analyze it. First, look at the structure of the site. The home page shows the same feed recommendations as the app.
You can click through to a user’s home page, which lists all of that user’s videos.
You can also search to reach a user’s home page, or search for videos related to a keyword.
Clicking a search result, or a video on a user’s home page, opens a single-video page.
The pages load via infinite scroll (waterfall flow). Crawling comments and crawling video download links work the same way, so rather than one-sidedly dumping crawler code, I’ll first analyze the request itself. Open F12 and find the request.
Request analysis
We can see that this request fires when scrolling down. Its important parameters are max_cursor, count, and the three below them that are generated by JavaScript and need to be reverse-engineered. Set a breakpoint and trace it; the comment data ultimately comes from /aweme/v1/web/comment/list/, and the two related js files can be reversed.
I won’t repeat how to locate page elements here; just copy the selectors directly. Both XPath and regular expressions work.
Set the request headers as usual, send the request, then parse and store the returned data.
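As a sketch of that request flow: the endpoint name and the max_cursor/count parameters come from the analysis above, but `fetch_page` and `fake_fetch` here are placeholders I made up to stand in for a real signed request (the three js-generated signature parameters would have to be reversed first), not actual API calls.

```python
def crawl_comments(fetch_page, count=20):
    # Walk /aweme/v1/web/comment/list/ style pagination until has_more is 0
    comments, cursor = [], 0
    while True:
        page = fetch_page({"max_cursor": cursor, "count": count})
        comments.extend(page["comments"])
        if not page["has_more"]:
            break
        cursor = page["cursor"]  # server hands back the next cursor
    return comments


def fake_fetch(params):
    # stand-in for a real signed request, serving 50 fake comment ids
    start = params["max_cursor"]
    end = min(start + params["count"], 50)
    return {"comments": list(range(start, end)),
            "cursor": end,
            "has_more": 1 if end < 50 else 0}


print(len(crawl_comments(fake_fetch)))  # drains all 50 fake comments
```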
Selenium method
Fortunately the plain-requests approach does not hit a captcha at first, but as soon as Selenium opens a browser and visits Douyin, a slider captcha pops up. Following methods found online, I wrote a function to solve it. The general idea: save the popup images, compare them pixel by pixel to find where they differ, take that position as the drag distance, and then drag with an accelerate-then-decelerate motion.
import cv2
import time
from selenium.webdriver import ActionChains
import requests
import random


def canny(filepath, cell=7):
    # Edge-detect the captcha image so template matching works on shapes, not colors
    img = cv2.imread(filepath, 0)
    blurred = cv2.GaussianBlur(img, (cell, cell), 0)
    return cv2.Canny(blurred, 240, 250)


def getPosition(img_file1, img_file2):
    # Locate the slider gap by template-matching the puzzle piece against the background
    img = canny(img_file1)
    template = canny(img_file2, cell=5)
    w, h = template.shape[::-1]
    method = cv2.TM_CCOEFF_NORMED
    res = cv2.matchTemplate(img, template, method)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)
    if method in [cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED]:
        top_left = min_loc
    else:
        top_left = max_loc
    bottom_right = (top_left[0] + w, top_left[1] + h)
    cv2.rectangle(img, top_left, bottom_right, 255, 2)
    return top_left


def get_track(distance):
    # Build a human-like drag track: accelerate for the first 7/8 of the
    # distance, then decelerate; overshoot the target by 5 px
    v = 0
    t = 0.4
    tracks = []
    current = 0
    mid = distance * 7 / 8
    distance += 5
    while current < distance:
        if current < mid:
            a = random.randint(2, 4)   # accelerate
        else:
            a = -random.randint(1, 3)  # decelerate
        v0 = v
        s = v0 * t + 0.6 * a * (t ** 2)
        current += s
        tracks.append(round(s))
        v = v0 + a * t
    random.shuffle(tracks)  # shuffle the step order for extra randomness
    return tracks


def checkCode(b, img_file1, img_file2):
    # The rendered captcha is scaled down; divide the matched pixel position by this factor
    scale = 1.7
    try:
        while 1:
            # Download the background image and the puzzle-piece image
            t = b.find_element_by_xpath('//*[@id="captcha-verify-image"]').get_attribute("src")
            img = requests.get(t)
            with open(img_file1, "wb") as f:
                f.write(img.content)
            t = b.find_element_by_xpath('//*[@id="captcha_container"]/div/div[2]/img[2]').get_attribute("src")
            img = requests.get(t)
            with open(img_file2, "wb") as f:
                f.write(img.content)
            p = int(getPosition(img_file1, img_file2)[0] / scale)
            # print(p)
            button = b.find_element_by_xpath('//*[@id="secsdk-captcha-drag-wrapper"]/div[2]')
            tracks = get_track(p)
            ActionChains(b).click_and_hold(button).perform()
            for x in tracks:
                ActionChains(b).move_by_offset(xoffset=x, yoffset=0).perform()
            ActionChains(b).release(button).perform()
            time.sleep(1)
    except Exception:
        # The captcha element disappears once the slide succeeds, so the
        # find_element call raises and we fall through here
        print("ok")
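The drag-track generator can be sanity-checked in isolation: regardless of the random acceleration, the rounded offsets should sum to somewhere near the target distance plus the deliberate 5 px overshoot. A quick standalone check (the function is repeated here so the snippet runs on its own):

```python
import random


def get_track(distance):
    # Same track generator as above: accelerate for 7/8 of the way, then decelerate
    v, t, tracks, current = 0, 0.4, [], 0
    mid = distance * 7 / 8
    distance += 5
    while current < distance:
        a = random.randint(2, 4) if current < mid else -random.randint(1, 3)
        s = v * t + 0.6 * a * (t ** 2)
        current += s
        tracks.append(round(s))
        v = v + a * t
    random.shuffle(tracks)
    return tracks


total = sum(get_track(100))
# rounding scatters the exact value, but it should land near the 105 px target
assert 85 <= total <= 130, total
```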
So the plan comes down to two approaches:
- Search for a keyword, then crawl all the comments of the first x videos under that keyword
- Go straight to someone’s home page, then crawl the comments of every video that person has posted

(Crawl only four or five pages at first: there is no proxy ip, and frequent visits trigger a text captcha that interrupts the run.)
# @FILE : tt_getComment.py
# @Time : 2022/1/13 12:08
from selenium import webdriver
import re
from lxml import etree
import time
import os
import json
import datetime
import uuid
from tiktok_spider.check_Code import checkCode

# This script can crawl a single page, or run as one thread among many.
# For a single page, edit the url below and run this file directly;
# as a thread, call get_comment(url, id) with the arguments.


def get_review_number(b):
    # num = b.find_element_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[1]/div[1]/div[3]/div/div[2]/div[1]/div[2]/span').text
    # print(int(num))
    # Scroll down repeatedly so the lazy-loaded comments render
    for x in range(1, 15, 5):
        time.sleep(1)
        j = x * 12
        js = 'document.documentElement.scrollTop=document.documentElement.scrollHeight* %f' % j
        b.execute_script(js)


def get_comment(url, id):
    chrome_d = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
    option = webdriver.ChromeOptions()
    option.add_argument('headless')  # headless mode
    b = webdriver.Chrome(executable_path=chrome_d, options=option)
    b.get(url)
    b.maximize_window()
    time.sleep(2)
    # uuid-based names so concurrent threads don't overwrite each other's captcha images
    img1 = str(uuid.uuid1()) + '.jpeg'
    img2 = str(uuid.uuid1()) + '.png'
    checkCode(b, img1, img2)  # get past the slider captcha
    if os.path.exists(img1):
        os.remove(img1)
    if os.path.exists(img2):
        os.remove(img2)
    time.sleep(2)
    get_review_number(b)
    b.implicitly_wait(3)
    Review_list = b.find_element_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[1]/div[3]/div/div').get_attribute("outerHTML")
    b.close()
    html = etree.HTML(Review_list)
    Review_list = html.xpath('//div[4]/div/div[@class="qolG5qEO"]')
    review_infos = []  # collected comment records
    for i in Review_list:
        review_html = etree.HTML(etree.tostring(i).decode())
        # username and comment content
        review = review_html.xpath('//span[@class="mzZanXbP"]/span/span/span[1]/span/text()')
        try:
            if len(review) == 1:
                # emoji-only comment: only the username text node is present
                review.append('[Facial expression]')
            if len(review[2]) != 0:
                review[1] = review[1] + review[2]
        except:
            pass
        result_like = review_html.xpath('//div[2]/div[2]/div/p/span/text()')  # likes
        content = etree.tostring(i).decode()
        result_time = re.findall(r'<p class="bVGzXCUK"> (.*?)</p> ', content)  # comment time
        result = re.findall(r'a href="//(.*?)" class="yqT9PfJg"', content)  # user home page address
        if len(result_time) == 0:
            result_time.append(0)  # fall back to 0 when no timestamp is found
        review_info = {"User name": review[0], "Review content": review[1], "Review time": result_time[0], 'Number of likes': result_like[0], "User home page link": result[0]}
        review_infos.append(review_info)
    this = os.getcwd()  # current path
    this = os.path.join(this, "tiktok_review_info")
    ti = 'review_info%s-%s.txt' % (str(datetime.datetime.now().date()), id)  # date + id as the file name
    path = os.path.join(this, ti)  # join the two parts into a storage path
    fp = open(path, 'w', encoding='utf-8')
    fp.write('[\n')
    for i in review_infos:
        print(i)
        data = json.dumps(i, ensure_ascii=False)
        fp.write(data + ',\n')
    fp.write(']')
    fp.close()
Note that, due to limited time and experience, many places lack error handling, and the filtering of illegal characters in comments is only rudimentary.
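Since the character filtering is admittedly basic, one possible improvement (a sketch, not part of the original script) is to strip Unicode control and format characters from each field before json.dumps, so an odd comment cannot corrupt the stored file:

```python
import json
import unicodedata


def clean_text(s):
    # drop control (Cc) and format (Cf) characters that could break the stored file
    return ''.join(ch for ch in s if unicodedata.category(ch) not in ('Cc', 'Cf'))


# hypothetical record in the review_info shape used above
record = {"User name": "a\u200bbc", "Review content": "hi\x00there"}
cleaned = {k: clean_text(v) for k, v in record.items()}
print(json.dumps(cleaned, ensure_ascii=False))
```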
Next is the main function. Given a user’s home page url, it collects all of that user’s videos, writes the video addresses into a file as a list, and feeds them to multiple threads (x videos per run).
One thing to watch here is file naming. When running a single instance, storing the captcha images as 1.jpg and 1.png is fine, but with multiple threads every thread would write to 1.jpg and they would clobber each other. Use a unique field such as a uuid for the image names, and be equally careful when naming files for deletion and data storage.
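The uuid naming can be illustrated in isolation: every call yields a distinct name (uuid1 embeds the host id and a timestamp), so concurrent threads never collide. The `captcha_names` helper below is mine, for illustration; the scripts above generate the two names inline.

```python
import uuid


def captcha_names():
    # one unique background/slider image pair per thread
    stem = str(uuid.uuid1())
    return stem + '.jpeg', stem + '.png'


a = captcha_names()
b = captcha_names()
print(a, b)  # two distinct file-name pairs
```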
import random
import uuid
from selenium import webdriver
import time
import json
from threading import Thread
import requests
import re
from lxml import etree
import datetime
import os
from tiktok_spider.check_Code import checkCode
from tiktok_spider.tt_getComment import get_comment


def drop_down(b, img_file1, img_file2):
    checkCode(b, img_file1, img_file2)
    # total number of videos, shown on the user's home page
    num = b.find_element_by_xpath(
        '//*[@id="root"]/div/div[2]/div/div/div[4]/div[1]/div[1]/div[1]/span').text
    for x in range(1, int(int(num) / 9), 1):
        time.sleep(2)
        j = x * 3
        js = 'document.documentElement.scrollTop=document.documentElement.scrollHeight* %f' % j
        b.execute_script(js)


# First analyze the page to find the user's home page.
# Opening Douyin with selenium triggers a slider captcha on the first request.
def getUserPage(b, url, path):
    b.get(url)
    time.sleep(3)
    img1 = str(uuid.uuid1()) + '.jpeg'
    img2 = str(uuid.uuid1()) + '.png'
    drop_down(b, img1, img2)
    time.sleep(1)
    if os.path.exists(img1):
        os.remove(img1)
    if os.path.exists(img2):
        os.remove(img2)
    lis = b.find_elements_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[4]/div[1]/div[2]/ul/li/a')
    fp = open(path, 'w', encoding='utf-8')
    for li in lis:
        href = li.get_attribute('href')
        fp.write(href + ',\n')
    fp.close()
    print('Video address collection is complete! The data is stored in %s' % path)
    b.quit()


if __name__ == '__main__':
    scale = 1.8
    chrome_d = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
    b = webdriver.Chrome(executable_path=chrome_d)
    b.maximize_window()
    url = 'https://www.douyin.com/user/MS4wLjABAAAAIkUGvJjhqY2IV6W_Tkht31LnogAFWBF2MBkEEbvAtnQ'
    this = os.getcwd()  # current path
    ti = 'tiktok_video_url-%s.txt' % str(datetime.datetime.now().date())  # date string as the file name
    path = os.path.join(this, ti)  # join the two parts into a storage path
    print("Get all video urls:")
    # getUserPage(b, url, path)  # uncomment to re-collect the video urls
    result = []
    with open(path, 'r') as f:
        for line in f:
            result.append(str(line.strip(',\n').split(',')[0]))
    T = []
    for i in range(1, 3):  # use len(result) to crawl everything
        print("Join thread:", result[i])
        # TODO: set up proxy ips; on an ip ban, call out for a fresh ip and keep running
        # TODO: log in with an account, grab the cookie, and hand it to selenium
        t = Thread(target=get_comment, args=(result[i], i,))
        t.start()
        T.append(t)
        # break
    for t in T:
        t.join()  # wait for every thread to finish
Finally, let’s look at the crawled data, which is stored per video, in order. If you want to crawl more than one person in a day, you can load the data from these files straight into a database, keyed as keyword-i or username-i.
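The stored files are almost-JSON (a `[` line, one object per line each ending in a comma, then `]`), so loading them back for a database import can be done line by line. A sketch of a loader, assuming the review_info format written above:

```python
import json


def load_reviews(text):
    # the writer emits '[', then 'record,' per line, then ']':
    # parse each object line individually and skip the bracket lines
    records = []
    for line in text.splitlines():
        line = line.strip().rstrip(',')
        if line in ('[', ']', ''):
            continue
        records.append(json.loads(line))
    return records


sample = '[\n{"User name": "a", "Review content": "hi"},\n]'
print(load_reviews(sample))  # one parsed record
```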