Crawling Douyin with Selenium
A friend of mine received an outsourcing request to crawl Douyin video comments. I finished the job and decided to write up a summary for later study. Discussion and exchange are also welcome. WeChat: mastercy1
First I searched Baidu to see how others approach it and to borrow some ideas; most solutions use a mobile phone emulator. While preparing to download an emulator, I found that Douyin now has a PC web version: https://www.douyin.com/
So let’s analyze it. First, look at the structure of the site. The home page shows the same feed recommendations as the app.
You can click through to a user’s home page, which lists all of that user’s videos.
You can also search to reach a user’s home page, or search for videos related to a keyword.
Clicking a search result, or a video on a user’s home page, opens a single-video page.
The pages load via infinite scroll (waterfall flow). Crawling comments and crawling video download links work the same way, so rather than one-sidedly dumping crawler code, I’ll first analyze the request itself. Open F12 and find the request.
Request analysis
We can see that this request fires when scrolling down. Its important parameters are max_cursor, count, and the three below them that are generated by JavaScript and need to be reverse-engineered. Set a breakpoint and trace it; the comment data ultimately comes from /aweme/v1/web/comment/list/, and the two related js files can be reversed.
I won’t repeat how to locate page elements here; just copy the selectors directly. Both XPath and regular expressions work.
Set the request headers as usual, send the request, then parse and store the returned data.
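As a sketch of that request flow: the endpoint name and the max_cursor/count parameters come from the analysis above, but `fetch_page` and `fake_fetch` here are placeholders I made up to stand in for a real signed request (the three js-generated signature parameters would have to be reversed first), not actual API calls.

```python
def crawl_comments(fetch_page, count=20):
    # Walk /aweme/v1/web/comment/list/ style pagination until has_more is 0
    comments, cursor = [], 0
    while True:
        page = fetch_page({"max_cursor": cursor, "count": count})
        comments.extend(page["comments"])
        if not page["has_more"]:
            break
        cursor = page["cursor"]  # server hands back the next cursor
    return comments


def fake_fetch(params):
    # stand-in for a real signed request, serving 50 fake comment ids
    start = params["max_cursor"]
    end = min(start + params["count"], 50)
    return {"comments": list(range(start, end)),
            "cursor": end,
            "has_more": 1 if end < 50 else 0}


print(len(crawl_comments(fake_fetch)))  # drains all 50 fake comments
```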
Selenium method
Fortunately the plain-requests approach does not hit a captcha at first, but as soon as Selenium opens a browser and visits Douyin, a slider captcha pops up. Following methods found online, I wrote a function to solve it. The general idea: save the popup images, compare them pixel by pixel to find where they differ, take that position as the drag distance, and then drag with an accelerate-then-decelerate motion.
import cv2
import time
from selenium.webdriver import ActionChains
import requests
import random


def canny(filepath, cell=7):
    # Edge-detect the captcha image so template matching works on shapes, not colors
    img = cv2.imread(filepath, 0)
    blurred = cv2.GaussianBlur(img, (cell, cell), 0)
    return cv2.Canny(blurred, 240, 250)


def getPosition(img_file1, img_file2):
    # Locate the slider gap by template-matching the puzzle piece against the background
    img = canny(img_file1)
    template = canny(img_file2, cell=5)
    w, h = template.shape[::-1]
    method = cv2.TM_CCOEFF_NORMED
    res = cv2.matchTemplate(img, template, method)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)
    if method in [cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED]:
        top_left = min_loc
    else:
        top_left = max_loc
    bottom_right = (top_left[0] + w, top_left[1] + h)
    cv2.rectangle(img, top_left, bottom_right, 255, 2)
    return top_left


def get_track(distance):
    # Build a human-like drag track: accelerate for the first 7/8 of the
    # distance, then decelerate; overshoot the target by 5 px
    v = 0
    t = 0.4
    tracks = []
    current = 0
    mid = distance * 7 / 8
    distance += 5
    while current < distance:
        if current < mid:
            a = random.randint(2, 4)   # accelerate
        else:
            a = -random.randint(1, 3)  # decelerate
        v0 = v
        s = v0 * t + 0.6 * a * (t ** 2)
        current += s
        tracks.append(round(s))
        v = v0 + a * t
    random.shuffle(tracks)  # shuffle the step order for extra randomness
    return tracks


def checkCode(b, img_file1, img_file2):
    # The rendered captcha is scaled down; divide the matched pixel position by this factor
    scale = 1.7
    try:
        while 1:
            # Download the background image and the puzzle-piece image
            t = b.find_element_by_xpath('//*[@id="captcha-verify-image"]').get_attribute("src")
            img = requests.get(t)
            with open(img_file1, "wb") as f:
                f.write(img.content)
            t = b.find_element_by_xpath('//*[@id="captcha_container"]/div/div[2]/img[2]').get_attribute("src")
            img = requests.get(t)
            with open(img_file2, "wb") as f:
                f.write(img.content)
            p = int(getPosition(img_file1, img_file2)[0] / scale)
            # print(p)
            button = b.find_element_by_xpath('//*[@id="secsdk-captcha-drag-wrapper"]/div[2]')
            tracks = get_track(p)
            ActionChains(b).click_and_hold(button).perform()
            for x in tracks:
                ActionChains(b).move_by_offset(xoffset=x, yoffset=0).perform()
            ActionChains(b).release(button).perform()
            time.sleep(1)
    except Exception:
        # The captcha element disappears once the slide succeeds, so the
        # find_element call raises and we fall through here
        print("ok")
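The drag-track generator can be sanity-checked in isolation: regardless of the random acceleration, the rounded offsets should sum to somewhere near the target distance plus the deliberate 5 px overshoot. A quick standalone check (the function is repeated here so the snippet runs on its own):

```python
import random


def get_track(distance):
    # Same track generator as above: accelerate for 7/8 of the way, then decelerate
    v, t, tracks, current = 0, 0.4, [], 0
    mid = distance * 7 / 8
    distance += 5
    while current < distance:
        a = random.randint(2, 4) if current < mid else -random.randint(1, 3)
        s = v * t + 0.6 * a * (t ** 2)
        current += s
        tracks.append(round(s))
        v = v + a * t
    random.shuffle(tracks)
    return tracks


total = sum(get_track(100))
# rounding scatters the exact value, but it should land near the 105 px target
assert 85 <= total <= 130, total
```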
So the plan comes down to two approaches:
- Search for a keyword, then crawl all the comments of the first x videos under that keyword
- Go straight to someone’s home page, then crawl the comments of every video that person has posted

(Crawl only four or five pages at first: there is no proxy ip, and frequent visits trigger a text captcha that interrupts the run.)
# @FILE : tt_getComment.py
# @Time : 2022/1/13 12:08
from selenium import webdriver
import re
from lxml import etree
import time
import os
import json
import datetime
import uuid
from tiktok_spider.check_Code import checkCode

# This script can crawl a single page, or run as one thread among many.
# For a single page, edit the url below and run this file directly;
# as a thread, call get_comment(url, id) with the arguments.


def get_review_number(b):
    # num = b.find_element_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[1]/div[1]/div[3]/div/div[2]/div[1]/div[2]/span').text
    # print(int(num))
    # Scroll down repeatedly so the lazy-loaded comments render
    for x in range(1, 15, 5):
        time.sleep(1)
        j = x * 12
        js = 'document.documentElement.scrollTop=document.documentElement.scrollHeight* %f' % j
        b.execute_script(js)


def get_comment(url, id):
    chrome_d = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
    option = webdriver.ChromeOptions()
    option.add_argument('headless')  # headless mode
    b = webdriver.Chrome(executable_path=chrome_d, options=option)
    b.get(url)
    b.maximize_window()
    time.sleep(2)
    # uuid-based names so concurrent threads don't overwrite each other's captcha images
    img1 = str(uuid.uuid1()) + '.jpeg'
    img2 = str(uuid.uuid1()) + '.png'
    checkCode(b, img1, img2)  # get past the slider captcha
    if os.path.exists(img1):
        os.remove(img1)
    if os.path.exists(img2):
        os.remove(img2)
    time.sleep(2)
    get_review_number(b)
    b.implicitly_wait(3)
    Review_list = b.find_element_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[1]/div[3]/div/div').get_attribute("outerHTML")
    b.close()
    html = etree.HTML(Review_list)
    Review_list = html.xpath('//div[4]/div/div[@class="qolG5qEO"]')
    review_infos = []  # collected comment records
    for i in Review_list:
        review_html = etree.HTML(etree.tostring(i).decode())
        # username and comment content
        review = review_html.xpath('//span[@class="mzZanXbP"]/span/span/span[1]/span/text()')
        try:
            if len(review) == 1:
                # emoji-only comment: only the username text node is present
                review.append('[Facial expression]')
            if len(review[2]) != 0:
                review[1] = review[1] + review[2]
        except:
            pass
        result_like = review_html.xpath('//div[2]/div[2]/div/p/span/text()')  # likes
        content = etree.tostring(i).decode()
        result_time = re.findall(r'<p class="bVGzXCUK"> (.*?)</p> ', content)  # comment time
        result = re.findall(r'a href="//(.*?)" class="yqT9PfJg"', content)  # user home page address
        if len(result_time) == 0:
            result_time.append(0)  # fall back to 0 when no timestamp is found
        review_info = {"User name": review[0], "Review content": review[1], "Review time": result_time[0], 'Number of likes': result_like[0], "User home page link": result[0]}
        review_infos.append(review_info)
    this = os.getcwd()  # current path
    this = os.path.join(this, "tiktok_review_info")
    ti = 'review_info%s-%s.txt' % (str(datetime.datetime.now().date()), id)  # date + id as the file name
    path = os.path.join(this, ti)  # join the two parts into a storage path
    fp = open(path, 'w', encoding='utf-8')
    fp.write('[\n')
    for i in review_infos:
        print(i)
        data = json.dumps(i, ensure_ascii=False)
        fp.write(data + ',\n')
    fp.write(']')
    fp.close()
Note that, due to limited time and experience, many places lack error handling, and the filtering of illegal characters in comments is only rudimentary.
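Since the character filtering is admittedly basic, one possible improvement (a sketch, not part of the original script) is to strip Unicode control and format characters from each field before json.dumps, so an odd comment cannot corrupt the stored file:

```python
import json
import unicodedata


def clean_text(s):
    # drop control (Cc) and format (Cf) characters that could break the stored file
    return ''.join(ch for ch in s if unicodedata.category(ch) not in ('Cc', 'Cf'))


# hypothetical record in the review_info shape used above
record = {"User name": "a\u200bbc", "Review content": "hi\x00there"}
cleaned = {k: clean_text(v) for k, v in record.items()}
print(json.dumps(cleaned, ensure_ascii=False))
```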
Next is the main function. Given a user’s home page url, it collects all of that user’s videos, writes the video addresses into a file as a list, and feeds them to multiple threads (x videos per run).
One thing to watch here is file naming. When running a single instance, storing the captcha images as 1.jpg and 1.png is fine, but with multiple threads every thread would write to 1.jpg and they would clobber each other. Use a unique field such as a uuid for the image names, and be equally careful when naming files for deletion and data storage.
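The uuid naming can be illustrated in isolation: every call yields a distinct name (uuid1 embeds the host id and a timestamp), so concurrent threads never collide. The `captcha_names` helper below is mine, for illustration; the scripts above generate the two names inline.

```python
import uuid


def captcha_names():
    # one unique background/slider image pair per thread
    stem = str(uuid.uuid1())
    return stem + '.jpeg', stem + '.png'


a = captcha_names()
b = captcha_names()
print(a, b)  # two distinct file-name pairs
```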
import random
import uuid
from selenium import webdriver
import time
import json
from threading import Thread
import requests
import re
from lxml import etree
import datetime
import os
from tiktok_spider.check_Code import checkCode
from tiktok_spider.tt_getComment import get_comment


def drop_down(b, img_file1, img_file2):
    checkCode(b, img_file1, img_file2)
    # total number of videos, shown on the user's home page
    num = b.find_element_by_xpath(
        '//*[@id="root"]/div/div[2]/div/div/div[4]/div[1]/div[1]/div[1]/span').text
    for x in range(1, int(int(num) / 9), 1):
        time.sleep(2)
        j = x * 3
        js = 'document.documentElement.scrollTop=document.documentElement.scrollHeight* %f' % j
        b.execute_script(js)


# First analyze the page to find the user's home page.
# Opening Douyin with selenium triggers a slider captcha on the first request.
def getUserPage(b, url, path):
    b.get(url)
    time.sleep(3)
    img1 = str(uuid.uuid1()) + '.jpeg'
    img2 = str(uuid.uuid1()) + '.png'
    drop_down(b, img1, img2)
    time.sleep(1)
    if os.path.exists(img1):
        os.remove(img1)
    if os.path.exists(img2):
        os.remove(img2)
    lis = b.find_elements_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[4]/div[1]/div[2]/ul/li/a')
    fp = open(path, 'w', encoding='utf-8')
    for li in lis:
        href = li.get_attribute('href')
        fp.write(href + ',\n')
    fp.close()
    print('Video address collection is complete! The data is stored in %s' % path)
    b.quit()


if __name__ == '__main__':
    scale = 1.8
    chrome_d = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
    b = webdriver.Chrome(executable_path=chrome_d)
    b.maximize_window()
    url = 'https://www.douyin.com/user/MS4wLjABAAAAIkUGvJjhqY2IV6W_Tkht31LnogAFWBF2MBkEEbvAtnQ'
    this = os.getcwd()  # current path
    ti = 'tiktok_video_url-%s.txt' % str(datetime.datetime.now().date())  # date string as the file name
    path = os.path.join(this, ti)  # join the two parts into a storage path
    print("Get all video urls:")
    # getUserPage(b, url, path)  # uncomment to re-collect the video urls
    result = []
    with open(path, 'r') as f:
        for line in f:
            result.append(str(line.strip(',\n').split(',')[0]))
    T = []
    for i in range(1, 3):  # use len(result) to crawl everything
        print("Join thread:", result[i])
        # TODO: set up proxy ips; on an ip ban, call out for a fresh ip and keep running
        # TODO: log in with an account, grab the cookie, and hand it to selenium
        t = Thread(target=get_comment, args=(result[i], i,))
        t.start()
        T.append(t)
        # break
    for t in T:
        t.join()  # wait for every thread to finish
Finally, let’s look at the crawled data, which is stored per video, in order. If you want to crawl more than one person in a day, you can load the data from these files straight into a database, keyed as keyword-i or username-i.
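The stored files are almost-JSON (a `[` line, one object per line each ending in a comma, then `]`), so loading them back for a database import can be done line by line. A sketch of a loader, assuming the review_info format written above:

```python
import json


def load_reviews(text):
    # the writer emits '[', then 'record,' per line, then ']':
    # parse each object line individually and skip the bracket lines
    records = []
    for line in text.splitlines():
        line = line.strip().rstrip(',')
        if line in ('[', ']', ''):
            continue
        records.append(json.loads(line))
    return records


sample = '[\n{"User name": "a", "Review content": "hi"},\n]'
print(load_reviews(sample))  # one parsed record
```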