Crawling Douyin with Selenium

Table of contents

  Request analysis
  Selenium method


A friend of mine received an outsourced job to crawl Douyin video comments. I finished it and decided to write up a summary for later study. Discussion and exchange are welcome. WeChat: mastercy1

First I searched Baidu to see how others do it and borrowed some ideas. Most approaches use an emulator, but while preparing to download a phone emulator I found that Douyin already has a PC web version: https://www.douyin.com/

So let's analyze it. First, look at the structure of the webpage. The homepage shows the same recommendation feed as the app.

You can click through to a user's homepage, which lists all of that user's videos.

You can also search your way to a user's homepage, or search for videos related to a keyword.

Clicking a video from the search results or from a user's homepage opens a single-video page.

The pages load via infinite scroll (waterfall flow). Crawling comments and crawling video download links work the same way, so rather than just dumping crawler code, I will first analyze the plain-request method. Open F12 and find the request.

Request analysis

Scrolling down triggers this request. The important parameters are max_cursor and count; the three parameters below them are generated by JS and need to be reverse-engineered. Set a breakpoint and trace through, and the endpoint turns out to be /aweme/v1/web/comment/list/.

These two related JS files can be reverse-engineered.

I won't go into detail on locating the page elements; just copy the selectors directly. Both XPath and regular expressions work.

Set the request headers as usual, send the request, then parse and store the returned data.
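
For illustration, here is a minimal sketch of the plain-request approach against the endpoint found above. The parameter names other than max_cursor and count are assumptions, and the three JS-generated signature parameters are stubbed out, so this will not work against the live site until the reverse-engineered signing code fills them in:

import requests

# Minimal sketch, not the full crawler: the JS-generated signature
# parameters are stubbed out and must come from the reversed JS
def fetch_comments(aweme_id, cursor=0, count=20):
    url = 'https://www.douyin.com/aweme/v1/web/comment/list/'
    params = {
        'aweme_id': aweme_id, # video id (assumed parameter name)
        'max_cursor': cursor, # paging cursor, advanced on each call
        'count': count,       # comments per page
        # ... plus the three JS-generated signature parameters seen at the breakpoint
    }
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.douyin.com/',
    }
    return requests.get(url, params=params, headers=headers, timeout=10).json()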


Selenium method

Fortunately, the plain-request method does not hit a captcha at first, but as soon as Selenium opens a browser and visits Douyin, a slider captcha appears. Following approaches found online, I wrote a function to solve it. The general idea: save the pop-up images, compare them pixel by pixel to find where they differ (the gap position), turn that into a drag distance, and then drag the slider with an accelerate-then-decelerate motion.

import cv2
import time
from selenium.webdriver import ActionChains
import requests
import random

def canny(filepath, cell=7):
    # Read the image as grayscale, blur it, and extract the edges
    img = cv2.imread(filepath, 0)
    blurred = cv2.GaussianBlur(img, (cell, cell), 0)
    return cv2.Canny(blurred, 240, 250)

def getPosition(img_file1, img_file2):
    # Find where the slider piece (img_file2) fits in the background
    # image (img_file1) by template matching on the edge maps
    img = canny(img_file1)
    template = canny(img_file2, cell=5)
    w, h = template.shape[::-1]
    method = cv2.TM_CCOEFF_NORMED

    res = cv2.matchTemplate(img, template, method)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)

    # For the SQDIFF methods the best match is the minimum, otherwise the maximum
    if method in [cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED]:
        top_left = min_loc
    else:
        top_left = max_loc
    bottom_right = (top_left[0] + w, top_left[1] + h)

    cv2.rectangle(img, top_left, bottom_right, 255, 2) # only for debugging the match
    return top_left


def get_track(distance):
    # Build a human-like drag track: accelerate over the first 7/8 of the
    # distance, then decelerate; returns the list of per-step offsets
    v = 0
    t = 0.4
    tracks = []
    current = 0
    mid = distance * 7 / 8
    distance += 5 # overshoot slightly so the target is always reached
    while current < distance:
        if current < mid:
            a = random.randint(2, 4) # accelerate
        else:
            a = -random.randint(1, 3) # decelerate
        v0 = v
        s = v0 * t + 0.6 * a * (t ** 2)
        current += s
        tracks.append(round(s))
        v = v0 + a * t
    random.shuffle(tracks) # same total distance, but a less regular rhythm
    return tracks


def checkCode(b, img_file1, img_file2):
    # Solve slider captchas in a loop until the captcha element disappears
    scale = 1.7 # ratio of the downloaded image size to the displayed size
    try:
        while 1:
            # Save the background image
            t = b.find_element_by_xpath('//*[@id="captcha-verify-image"]')
            t = t.get_attribute("src")
            img = requests.get(t)
            with open(img_file1, "wb") as f:
                f.write(img.content)
            # Save the slider-piece image
            t = b.find_element_by_xpath('//*[@id="captcha_container"]/div/div[2]/img[2]').get_attribute("src")
            img = requests.get(t)
            with open(img_file2, "wb") as f:
                f.write(img.content)
            # Match the piece against the background and scale to page pixels
            p = int(getPosition(img_file1, img_file2)[0] / scale)
            button = b.find_element_by_xpath('//*[@id="secsdk-captcha-drag-wrapper"]/div[2]')
            tracks = get_track(p)
            ActionChains(b).click_and_hold(button).perform()
            for x in tracks:
                ActionChains(b).move_by_offset(xoffset=x, yoffset=0).perform()
            ActionChains(b).release(button).perform()
            time.sleep(1)
    except Exception:
        # The captcha element is gone, so the captcha has been passed
        print("ok")

Well, then the thinking goes like this. There are two approaches:

  • Search for a keyword directly, then crawl all the comments of the first x videos under that keyword
  • Go to someone's homepage directly, then crawl all the comments of every video that person has posted

(Only crawl four or five pages at first: without proxy IPs, frequent visits trigger a text captcha that this script cannot handle.)

# @FILE     : tt_getComment.py
# @Time     : 2022/1/13 12:08
from selenium import webdriver
import re
from lxml import etree
import time
import os
import json
import datetime
import uuid
from tiktok_spider.check_Code import checkCode
# This script can crawl a single page, or run as one thread among many
# For a single page, set the URL to crawl and run this file directly
# As a thread, call get_comment(url, id) and pass in the arguments

def get_review_number(b):
    # Scroll to the bottom a few times so the lazily loaded comments render
    # (reading the comment count from the page is left commented out)
    # num=b.find_element_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[1]/div[1]/div[3]/div/div[2]/div[1]/div[2]/span').text
    # print(int(num))
    for x in range(1, 15, 5):
        time.sleep(1)
        j = x * 12
        # A factor larger than 1 simply clamps scrollTop to the page bottom
        js = 'document.documentElement.scrollTop=document.documentElement.scrollHeight* %f' % j
        b.execute_script(js)
def get_comment(url,id):
    chrome_d = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe" # raw string so the backslashes are not escapes
    option = webdriver.ChromeOptions()
    option.add_argument('--headless') # headless mode
    b = webdriver.Chrome(executable_path=chrome_d,options=option)
    b.get(url)
    b.maximize_window()
    time.sleep(2)
    img1 = str(uuid.uuid1()) + '.jpeg'
    img2 = str(uuid.uuid1()) + '.png'
    checkCode(b,img1, img2) #over verification code
    if os.path.exists(img1):
        os.remove(img1)
    if os.path.exists(img2):
        os.remove(img2)
    time.sleep(2)
    get_review_number(b)
    b.implicitly_wait(3)
    # Review_list = b.find_elements_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[1]/div[3]/div/div/div[4]/div/div')
    Review_list = b.find_element_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[1]/div[3]/div/div').get_attribute("outerHTML")
    b.close()
    html = etree.HTML(Review_list)
    Review_list = html.xpath('//div[4]/div/div[@class="qolG5qEO"]')

    review_infos = [] # content list
    for i in Review_list:
        # print('1',i)
        review_html = etree.HTML(etree.tostring(i).decode())
        # print(review_html)
        review = review_html.xpath('//span[@class="mzZanXbP"]/span/span/span[1]/span/text()') #username and comment content
        # print(review)
        try:
            if len(review)==1:
                # Emoji-only comment: no text node, so insert a placeholder
                review.append('[Facial expression]')
            if len(review[2])!=0:
                review[1]=review[1]+review[2] # rejoin comment text split across spans
        except:
            pass # IndexError when there is no third span; nothing to rejoin
        result_like = review_html.xpath('//div[2]/div[2]/div/p/span/text()') #likes
        content = etree.tostring(i).decode()
        result_time = re.findall(r'<p class="bVGzXCUK"> (.*?)</p> ', content) #comment time
        result = re.findall(r'a href="//(.*?)" class="yqT9PfJg"', content) #User home page address
        if len(result_time)==0:
            result_time.append(0) # fall back to 0 when no timestamp is matched
        # print(review[0], ':', review[1])
        # print(result_time[0],'   ',result[0])
        review_info = {"User name": review[0], "Review content": review[1], "Review time": result_time[0], 'Number of likes': result_like[0],"User home page link": result[0]}
        review_infos.append(review_info)
    this = os.getcwd() # current working directory
    this = os.path.join(this, "tiktok_review_info")
    ti = 'review_info%s-%s.txt' % (str(datetime.datetime.now().date()), id) # date and id as the file name
    path = os.path.join(this, ti) # full storage path
    fp = open(path,'w',encoding='utf-8')
    fp.write('[\n')
    for i in review_infos:
        print(i)
        data = json.dumps(i, ensure_ascii=False)
        fp.write(data + ',\n')
    fp.write(']')
    fp.close()
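
To crawl a single page, as the header comments describe, call get_comment directly with a video URL (the URL below is a hypothetical placeholder):

if __name__ == '__main__':
    # Hypothetical single-page run: substitute a real Douyin video URL
    get_comment('https://www.douyin.com/video/xxxxxxxxxxxxxxxxxxx', 0)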

Note that, owing to limited experience and time, many places lack exception handling, and the screening of invalid characters in comments is fairly rudimentary.

Then there is the main function. Given a user homepage URL, it crawls all of that user's videos, collects the video addresses into a list, stores them in a file, and feeds them into multiple threads (x videos at a time).

One thing to pay attention to here is file naming. When running a single instance, saving the captcha images as 1.jpg and 1.png is fine, but with multiple threads every thread would write to the same 1.jpg and they would clobber each other. So use a unique value such as a UUID for the image names, as the script below does, and take the same care with the names used for deletion and for data storage.

import random
import uuid

from selenium import webdriver
import time
import json
from threading import Thread
import requests
import re
from lxml import etree
import datetime
import os
from tiktok_spider.check_Code import checkCode

from tiktok_spider.tt_getComment import get_comment


def drop_down(b, img_file1, img_file2):
    # Pass the slider captcha, then scroll until all the videos have loaded
    checkCode(b, img_file1, img_file2)
    num = b.find_element_by_xpath(
        '//*[@id="root"]/div/div[2]/div/div/div[4]/div[1]/div[1]/div[1]/span').text # total number of videos
    for x in range(1, int(int(num) / 9), 1):
        time.sleep(2)
        j = x * 3
        js = 'document.documentElement.scrollTop=document.documentElement.scrollHeight* %f' % j
        b.execute_script(js)
# First analyze the page to locate the user homepage elements
# Opening the page with Selenium triggers a slider captcha, which checkCode handles
def getUserPage(b,url,path):
    b.get(url)
    time.sleep(3)
    img1 = str(uuid.uuid1()) + '.jpeg'
    img2 = str(uuid.uuid1()) + '.png'
    drop_down(b, img1, img2)
    time.sleep(1)
    if os.path.exists(img1):
        os.remove(img1)
    if os.path.exists(img2):
        os.remove(img2)
    lis = b.find_elements_by_xpath('//*[@id="root"]/div/div[2]/div/div/div[4]/div[1]/div[2]/ul/li/a')
    fp = open(path, 'w', encoding='utf-8')
    # url_list=[]
    for li in lis:
        href = li.get_attribute('href')
        # print(href)
        fp.write(href + ',\n')
        # url_list.append(href)
    fp.close()
    print('Video address collection is complete! The data is stored in %s' % path)
    b.quit()
if __name__ == '__main__':
    scale = 1.8
    chrome_d = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
    b = webdriver.Chrome(executable_path=chrome_d)
    b.maximize_window()
    # url = 'https://www.douyin.com/user/MS4wLjABAAAAJ_nEAirFKdd5UrdKdQstsksaV_JNJBxu0_qq2R4QpBY'
    url='https://www.douyin.com/user/MS4wLjABAAAAIkUGvJjhqY2IV6W_Tkht31LnogAFWBF2MBkEEbvAtnQ'

    this = os.getcwd() # Get the current path
    ti = 'tiktok_video_url-%s.txt' % str(datetime.datetime.now().date()) # Get the time concatenation string as the file name
    path = os.path.join(this, ti) # Combine the two sections into a file storage path
    print("Get all video urls:")
    # getUserPage(b, url, path) # uncomment on the first run to collect the video URLs

    result = []
    with open(path, 'r') as f:
        for line in f:
            result.append(str(line.strip(',\n').split(',')[0]))
    T = []
    for i in range(1, 3):  # use range(len(result)) to crawl every video
        print("Join thread:", result[i])
        # TODO: set up proxy IPs; when a thread hits an IP problem, fetch a new one and continue
        # TODO: log in with an account via a request to get the cookie, then load it into Selenium
        t = Thread(target=get_comment, args=(result[i],i,))
        t.start()
        T.append(t)
        # break
    for t in T:
        # join waits for the thread to end
        t.join()

Finally, take a look at the crawled data, which is stored in video order. If you want to crawl more than one person per day, you can load the data from these files straight into a database, stored under keyword-i or username-i.
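
As a rough sketch of that loading step (the database file name and table schema here are assumptions; the parser matches the trailing-comma format that get_comment writes):

import json
import sqlite3

def load_reviews(path, key):
    # Parse one review_info file written by get_comment: a '[' line, one JSON
    # object per line with a trailing comma, and a closing ']'
    rows = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().rstrip(',')
            if not line or line in ('[', ']'):
                continue
            rows.append(json.loads(line))
    # Store everything under one key such as 'username-1' or 'keyword-1'
    conn = sqlite3.connect('tiktok.db')
    conn.execute('CREATE TABLE IF NOT EXISTS reviews (key TEXT, data TEXT)')
    conn.executemany('INSERT INTO reviews VALUES (?, ?)',
                     [(key, json.dumps(r, ensure_ascii=False)) for r in rows])
    conn.commit()
    conn.close()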
