[Daily] Advanced crawler skills: Selenium loading extensions (extension) and configuring user data (user-data)


preamble

This article briefly explains an advanced Python technique for crawlers: working with the configuration object (options) of the Selenium browser driver:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_extension(...)
# chrome_options.add_argument(...)
# chrome_options.add_experimental_option(...)

The most complete configuration reference is the official Chrome documentation; if you cannot get over the wall to reach it, you can also find fairly comprehensive blog posts (such as @Kosmoo). The author will not go through a long list of configuration options here, and will only cover loading extensions and configuring user data, illustrated with examples.
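For a sense of what add_argument accepts, here is a minimal sketch of a few commonly used Chromium command-line switches; these are generic examples for illustration, not options this article's crawler needs:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')                 # run without opening a visible browser window
chrome_options.add_argument('--disable-gpu')              # often paired with headless mode on Windows
chrome_options.add_argument('--window-size=1920,1080')    # fix the viewport size
chrome_options.add_argument('--user-agent=Mozilla/5.0')   # override the User-Agent string
driver = webdriver.Chrome(chrome_options=chrome_options)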


When writing crawler-related posts, the author usually walks through a complete crawling task from start to finish, and rarely builds a whole post around a single technical issue like this one. However, there is no shortcut to learning crawling skills: only by repeatedly running into obstacles in practice and working through them will your experience grow, and only then will you be able to clear similar bottlenecks quickly in the future.

Perhaps you, the reader, are already proficient with Selenium and JS reverse-engineering and believe that anything you can see in a browser can be crawled with ease. The author is nevertheless certain that you will still run into puzzling problems along the way, because there will always be something you have not met before. For example, under exactly the same sequence of operations, the page served to the Selenium driver may not match the page you see when operating the browser by hand. When that happens the author is actually quite pleased, because every problem that surfaces means an improvement in crawling skill; if nothing ever goes wrong, it only means the hidden problems keep piling up, and the crawling itself loses its fun.

To close the preface, here is the crawling task behind this article. The group has recently been preparing for a 2021 international conference, and the boss asked us to search Engineering Village for the names of various conferences and collect the resulting paper links. From each paper link we can extract the authors' names and email addresses, which are then used to send out the conference call-for-papers email:

  • Figure 1: Engineering Village search page

Since the author's university does not subscribe to the Engineering Village database, fortunately the boss provided a Xueba Library VIP account (which essentially borrows the unified identity authentication of other universities to access Engineering Village, much like a VPN). However, accessing Engineering Village over the campus network is far too slow and the pages crash frequently, which makes collecting names and email addresses by hand very difficult.

In fact, the author had already written the crawler logic when organizing the conference last year. The Engineering Village pages have not changed at all this year, so last year's crawler logic still works; however, the university channels provided by Xueba Library have been updated, and you now need to download a plug-in before you can borrow another university's unified identity authentication to access Engineering Village:

  • Figure 2: Plug-in installation instructions. Note that this plug-in can only be installed in a browser with the Chrome core; it cannot be installed in Firefox or Edge. The Selenium driver in this article therefore has to run on Chrome, and you need to install Chromedriver first; for details see https://www.cnblogs.com/lfri/p/10542797.html


If you are not interested in the specific crawling task of this article, you can skip the rest of the preamble and go straight to the main text.

If you are interested in the crawling task described here, you can register a new account at the Xueba Library, log in, and visit the welcome page to receive a free 24-hour VIP trial, which makes it easier to run the crawler code in this article. After logging back into the account, enter the Database Encyclopedia, and under the Foreign Language Database Encyclopedia find the Engineering Village EI Engineering link to enter:

  • Figure 3: Find the Engineering Village EI engineering database link


There you can find the following two channels. Note that these two channels may no longer exist after a while: last year the author used the channel of Shenyang University of Technology, while this year the two channels belong to the University of Science and Technology of China and Beihang University. Although the channels may change again in the future, the crawler logic for Engineering Village itself rarely changes, so maintaining the crawler in this article should mostly come down to updating the access channel.

  • Figure 4: Find the university channel link


Click any of the channel links in the figure above and you will find the plug-in mentioned earlier:

  • Figure 5: Content updated in 2021: the access channel now requires installing a plug-in. Installation is not complicated: following the tutorial, choose the second option (the zip package) and install it manually in Chrome as an unpacked extension; it can be removed completely once it is no longer needed.


The problems addressed in this article start here.


1 Selenium loads an extension (Chromedriver)

Point 1 is relatively obvious.

Even if you have installed the plug-in in Chrome as shown in Figure 2, when you access the channel through Chrome's Selenium driver you will still get stuck on the prompt page of Figure 5, which is a rather tricky thing.

The author cannot say for certain that Selenium simply does not load installed extensions by default. Figure 2 actually offers two installation methods, but the first one (the crx file) does not seem to install into Chrome directly, so only the second (the zip file) can be used for manual installation. Nevertheless, lines 81-84 of the source code in Part 3 of this article can still configure the crx extension for the browser driver:

chrome_options = webdriver.ChromeOptions() # Initialize Chrome options
chrome_options.add_extension(self.extension_path) # Install Xueba library channel plug-in
chrome_options.add_argument(r'user-data-dir=C:\Users\lenovo\AppData\Local\Google\Chrome\User Data')
chrome_options.add_experimental_option('useAutomationExtension', False)

This is quite confusing, because one of the Selenium driver options is --disable-plugins, i.e. disable plugins; one would expect that without --disable-plugins, the plug-ins already installed in the browser would take effect by default. In practice that is not the case: even without --disable-plugins, the plug-in is not enabled. Perhaps plugins and extensions simply do not refer to the same thing (in Chrome's own terminology, "plugins" are built-in components such as the PDF viewer, while "extensions" are the add-ons installed from crx or zip packages, which would explain this behaviour).

Nor can the add_extension method install the plug-in in zip format, even though the unpacked zip file can be installed from Chrome's extension page once developer mode is enabled.
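If you only have the unpacked folder from the zip package, one possible workaround (which the author has not verified against this particular plug-in, and the directory path below is purely hypothetical) is Chromium's --load-extension switch, which loads an unpacked extension directory directly:

chrome_options = webdriver.ChromeOptions()
# Load an unpacked (unzipped) extension directory instead of a packed crx file
chrome_options.add_argument(r'--load-extension=D:\xuebalib_unpacked')  # hypothetical path to the unzipped plug-in
driver = webdriver.Chrome(chrome_options=chrome_options)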

In short, if line 82 of the source code in Part 3 is omitted, it is impossible to get past the Figure 5 page; the question of how the browser treats plug-ins versus extensions still deserves further investigation.


2 Selenium configures User Data: method and purpose (Chromedriver)

This is the most puzzling point of this crawler.

In fact, there is another way to reach Engineering Village without going through the channel shown in Figure 4: after logging in to Xueba Library, simply type http://www.engineeringvillage.com into the address bar and you will land on the page of Figure 1 (at this point you already have permission to access the Engineering Village database and can search and query). This route is actually more stable.

However, when the author drives the browser with Selenium, typing http://www.engineeringvillage.com into the address bar after logging in to Xueba Library only gets as far as the Welcome page; in other words, the login verification is still required.

  • Figure 6: Unable to enter the page shown in Figure 1 (stuck on the welcome page)


This problem troubled the author for a long time, and in the end I could only fall back on the channel of Figure 4. But the BUAA channel shown in Figure 4 crashes frequently, and it still had not recovered from last night through this morning. I was sure the plug-in was not the cause, so this morning I started testing different options parameters to find out which one accounted for the difference between the two behaviours.

It finally turned out to be line 83 of the source code in Part 3, namely:

chrome_options.add_argument(r'user-data-dir=C:\Users\lenovo\AppData\Local\Google\Chrome\User Data')

After adding the user-data-dir parameter, the problem was solved.

In fact, this parameter is very useful. A great deal of history, cookie data, and website login state is stored in the User Data folder. For example, some websites let the browser remember your login; without the User Data parameter, that remembered state does not carry over into the Selenium-driven browser. With the parameter added, you can easily skip the login verification of many websites and avoid tedious captcha handling.
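A minimal sketch of this configuration is shown below; the path is the author's own (replace lenovo with your Windows user name), and the profile-directory argument is only needed if you use a profile other than Default. Note that Chrome will usually refuse to start if an ordinary Chrome window is already holding the same User Data directory open, so close the regular browser before running the script:

chrome_options = webdriver.ChromeOptions()
# Reuse the local Chrome profile so that cookies and remembered logins carry over into the Selenium-driven browser
chrome_options.add_argument(r'user-data-dir=C:\Users\lenovo\AppData\Local\Google\Chrome\User Data')
chrome_options.add_argument('profile-directory=Default')  # optional: pick a specific profile inside User Data
driver = webdriver.Chrome(chrome_options=chrome_options)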


3 Source code and operating instructions

The author has uploaded the source code, the crx plug-in, and a sample keyword file to:

Link: https://pan.baidu.com/s/1F2AGagKI89Lqi2_leeo5gw 
Extract code: hm4q

Finally, the author provides the complete code of this crawler.

Just modify the username, password, and crx plug-in path on lines 261, 262, and 265 of xuebalib.py and the script is ready to run.

In addition, check whether the User Data path on line 83 matches the path on your own machine. Generally speaking, on Windows Chrome's user data lives at C:\Users\<your username>\AppData\Local\Google\Chrome\User Data, but you had better double-check.
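If you prefer not to hard-code the user name, a small sketch (assuming Windows, where Chrome keeps its profile under %LOCALAPPDATA%) can build the path at run time:

import os
from selenium import webdriver

# Build the Chrome User Data path for the current Windows user instead of hard-coding it
user_data_dir = os.path.join(os.environ['LOCALAPPDATA'], 'Google', 'Chrome', 'User Data')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'user-data-dir={user_data_dir}')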

The code format has been optimized, the comments are detailed and the console output information is more readable.

# -*- coding: UTF-8 -*-
# @author: caoyang
# @email: [email protected]
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# Read the file that stores the search keywords
# Each line of the file records exactly one keyword
# Lines starting with '#' are ignored automatically; try not to leave blank lines
def load_keywords(filepath: str) -> list:
    with open(filepath, 'r', encoding='utf8') as f:
        keywords = f.read().splitlines()
    return list(filter(lambda x: not x.startswith('#'), keywords))
class Xuebalib(object):

    def __init__(self,
                 keywords: list,
                 username: str,
                 password: str,
                 host: str,
                 extension_path: str=None,
                 mode: str='expert') -> None:
        self.keywords = keywords[:]
        self.username = username
        self.password = password
        self.host = host
        self.extension_path = extension_path  # Major difference from 2020: in 2021 a plug-in is required to enter the Xueba Library channel, so download the crx plug-in and record its path here
        self.mode = mode.strip().lower()
        assert self.mode in ['quick', 'expert'], f'Unknown mode: {self.mode}'
        self.max_trial_time = 8
    @staticmethod
    def get_detailurl(soup: BeautifulSoup) -> list:
        """Get the links to all paper detail pages on a search results page
        :param soup: the page source code parsed by BeautifulSoup;
        """
        detaillinks = soup.find_all('a', class_='detaillink')
        detailurls = []
        for detaillink in detaillinks:
            detailurl = detaillink.attrs['href']
            detailurls.append(detailurl)
        return detailurls

    @staticmethod
    def get_author_and_email(soup: BeautifulSoup, ignore: bool=False) -> list:
        """Get the authors and their email addresses on a paper detail page
        :param soup: the page source code parsed by BeautifulSoup;
        :param ignore: whether to skip authors without an email address, default is not to skip;
        :return author_and_email: a list of (author, email) 2-tuples;
        """
        author_and_email = []
        if ignore:
            emaillinks = soup.find_all('a', class_='emaillink')
            for emaillink in emaillinks:
                email = emaillink.attrs['href']
                authorlink = emaillink.find_previous_sibling('a', class_='authorSearchLink')
                author = str(authorlink.string)
                author_and_email.append((author, email))
        else:
            ul = soup.find('ul', class_='abs_authors')
            if ul is not None:
                for li in ul.find_all('li'):
                    authorlink = li.find('a', class_='authorSearchLink')
                    emaillink = li.find('a', class_='emaillink')
                    author = str(authorlink.string)
                    email = None if emaillink is None else emaillink.attrs['href']
                    author_and_email.append((author, email))
        return author_and_email
    def run(self):
        # Initialize the browser driver
        print('Initiate driver ...')
        if self.extension_path is not None:
            chrome_options = webdriver.ChromeOptions()  # Initialize Chrome options
            chrome_options.add_extension(self.extension_path)  # Install the Xueba Library channel plug-in
            chrome_options.add_argument(r'user-data-dir=C:\Users\lenovo\AppData\Local\Google\Chrome\User Data')
            chrome_options.add_experimental_option('useAutomationExtension', False)
            driver = webdriver.Chrome(chrome_options=chrome_options)  # Configure Chrome options
        else:  # Currently this branch is usually not feasible, but if the plug-in becomes unnecessary in the future the default Firefox driver can be restored as the more stable choice
            driver = webdriver.Firefox()
        driver.set_page_load_timeout(15)  # Set the maximum page load time, otherwise the driver gets stuck on paper detail pages
        driver.maximize_window()  # Maximize the window: new in 2021, if the window is not maximized the search page layout changes and cannot be switched to expert mode
        print('  - Complete !')
        # Log in to Xueba Library
        print('Login ...')
        driver.get('http://www.xuebalib.com')
        print('  - Waiting for textinput ...')
        WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//input[@name="username"]').is_displayed())
        print('    + OK !')
        print('  - Input username and password ...')
        driver.find_element_by_xpath('//input[@name="username"]').send_keys(self.username)
        driver.find_element_by_xpath('//input[@name="password"]').send_keys(self.password)
        driver.find_element_by_xpath('//input[@value="login"]').click()
        print('    + OK !')
        time.sleep(3)
        print('  - Complete !')
        # Obtain channel access permission: the BUAA (Beijing University of Aeronautics and Astronautics) channel is used in 2021
        print('Get through Passageway ... (It is extremely slow and may be failed for several times)')
        driver.get('http://www.xuebalib.com/db.php/EI')
        print('  - Waiting for Passageway link ...')
        WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//a[contains(text(), "BUAA")]').is_displayed())
        print('    + OK !')
        print('  - Enter Passageway ...')
        driver.find_element_by_xpath('//a[contains(text(),"BUAA")]').click()
        print('    + OK !')
        print('  - Switch to new window ...')
        windows = driver.window_handles  # Collect the window handles
        print(f'    + Totally {len(windows)} windows !')
        driver.switch_to.window(windows[1])  # Switch to the newly opened window
        print('    + OK !')
        print('  - Access to Engineering Village ... (It is the most difficult step and always failed)')
        WebDriverWait(driver, 60).until(lambda driver: driver.find_element_by_xpath('//a[@href="https://www-engineeringvillage-com.e1.buaa.edu.cn"]').is_displayed())
        # Start searching
        count = 0
        for index, keyword in enumerate(self.keywords):
            flag = 1
            while flag <= self.max_trial_time:
                try:
                    print(f'    + No.{flag} Trial ...')
                    driver.get('http://www.engineeringvillage.com')  # 20211017 update: this link did not work before because User Data was not configured, so the driver stayed on the Welcome page without login permission; adding chrome_options.add_argument(r'user-data-dir=C:\Users\lenovo\AppData\Local\Google\Chrome\User Data') this morning makes this channel more stable
                    # driver.get('https://www-engineeringvillage-com-443.e1.buaa.edu.cn/search/quick.url')
                    # driver.find_element_by_xpath('//a[@href="https://www-engineeringvillage-com.e1.buaa.edu.cn"]').click()
                    print('    + Waiting for search textinput ...')
                    WebDriverWait(driver, 60).until(lambda driver: driver.find_element_by_xpath('//input[@class="search-word"]').is_displayed())
                    print('    + OK !')
                    break
                except Exception as e:
                    flag += 1
                    print(f'      * Fail: {e}')
                    continue
            print('  - Complete !')
            # Reconfirm that Engineering Village has been entered
            print('Waiting for search textinput again ...')
            WebDriverWait(driver, 60).until(lambda driver: driver.find_element_by_xpath('//input[@class="search-word"]').is_displayed())
            print('  - Complete !')
            if self.mode == 'expert':  # Switch to expert search in expert mode
                print('Switch to expert mode ...')
                driver.find_element_by_xpath('//span[@class="button-link-text" and contains(text(),"Search")]').click()
                time.sleep(1)
                driver.find_element_by_xpath('//span[@class="button-link-text" and contains(text(),"Expert")]').click()
                print('  - Complete !')
            print(f'Search keyword: {keyword}')
            with open(f'keyword_{index}.txt', 'w', encoding='utf8') as f:
                pass
            # Reset the search box
            print('  - Reset textinput ...')
            xpath = '//a[@id="reset-form-link-quick"]' if self.mode == 'quick' else '//a[@id="reset-form-link-expert"]'
            driver.find_element_by_xpath(xpath).click()
            time.sleep(2)
            print('    + OK !')
            # Enter the keyword
            print('  - Input keyword ...')
            xpath = '//input[@class="search-word"]' if self.mode == 'quick' else '//textarea[@class="search-word text-area-lg"]'
            driver.find_element_by_xpath(xpath).send_keys(keyword)
            time.sleep(2)
            print('    + OK !')
            # Click to search
            print('  - Click search engine ...')
            xpath = '//a[@id="searchBtn"]' if self.mode == 'quick' else '//a[@id="expertSearchBtn"]'
            driver.find_element_by_xpath(xpath).click()
            print('    + OK !')
            # Wait for the search results
            print('  - Waiting for search results ...')
            WebDriverWait(driver, 60).until(lambda driver: driver.find_element_by_xpath('//a[@class="detaillink"]').is_displayed())
            html = driver.page_source
            soup = BeautifulSoup(html, 'lxml')
            h2 = soup.find('h2', id='results-count')
            for child in h2.children:
                results_count = int(str(child).strip())
                break
            print(f'    + Totally {results_count} results !')
            current_url = driver.current_url  # Record the current URL so that this page can be reached again later
            time.sleep(3)
            print('    + OK !')
            # *** Tried to change the per-page drop-down box to 100 results so that fewer pages need to be visited, but for some reason the Select approach does not work
            # driver.find_element_by_xpath('//span[@class="select2-selection__arrow"]').click()
            # time.sleep(2)
            # select_el = Select(driver.find_element_by_xpath('//select[@id="results-per-page-select"]'))
            # select_el.select_by_visible_text('100')
            current_index = 1  # Record the serial number of the current search result
            page_number = 0  # Record the pagination value
            while True:  # Traverse each page of search results
                page_number += 1
                print(f'  - Page {page_number} ...')
                html = driver.page_source
                soup = BeautifulSoup(html, 'lxml')
                detailurls = Xuebalib.get_detailurl(soup)
                for detailurl in detailurls:
                    count += 1
                    print(f'    + Processing No.{count} paper ...')
                    try:  # May fail because of driver.set_page_load_timeout(15) above: the page may exceed the maximum load time, and errors may also be raised elsewhere
                        driver.get(self.host + detailurl)
                        WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//ul[@class="abs_authors"]').is_displayed())
                    except:  # Even if loading is incomplete it does not matter; the required information has most likely already been loaded
                        print('    + Load incompletely ! (Do not care about this)')
                    # Parse the authors and email addresses
                    html = driver.page_source
                    soup = BeautifulSoup(html, 'lxml')
                    author_email_pairs = Xuebalib.get_author_and_email(soup, ignore=False)
                    # Write the authors and emails to the file
                    print('    + Write to file ...')
                    for author, email in author_email_pairs:
                        with open(f'keyword_{index}.txt', 'a', encoding='utf8') as f:
                            f.write(f'{author}\t{email}\n')
                    print('    + OK !')
                # Going back to the search results page and clicking the next-page button is not stable:
                # driver.get(current_url)
                # try:  # On the last page there is no next-page button
                #     WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//a[@id="next-page-top"]').is_displayed())
                # except:  # Exit the loop in that case
                #     break
                # driver.find_element_by_xpath('//a[@id="next-page-top"]').click()
                # WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//a[@class="detaillink"]').is_displayed())
                # current_url = driver.current_url
                # A better way: go to the next page directly by modifying the query string
                current_index += 25  # 25 results per page by default; since changing the drop-down box to 100 above always fails, remember to adjust this number if the page size changes
                if current_index > results_count:  # Once the recorded number of results is exceeded, this keyword has been fully crawled
                    break
                index1 = current_url.find('COUNT=')
                index2 = current_url.find('&', index1)
                next_url = current_url[: index1 + 6] + str(current_index) + current_url[index2: ]
                while True:
                    try:
                        print(f'  - Switch to next page (next page is {page_number + 1}) ...')
                        driver.get(next_url)
                        print('    + Waiting for search results ...')
                        WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//a[@class="detaillink"]').is_displayed())
                        print('    + OK !')
                        break
                    except Exception as e:
                        print(f'    + Fail: {e} ...')
                        continue
if __name__ == '__main__':
    keywords = load_keywords('kw.txt')
    print(keywords)
    username = ''  # Xueba Library username
    password = ''  # Xueba Library password
    # host = 'https://www-engineeringvillage-com-443.e1.buaa.edu.cn'  # The university host used by Xueba Library to access Engineering Village: in 2020 it was the host of Shenyang University of Technology (http://202.199.103.219), the latest one used in 2021 is the host of Beihang University
    host = 'http://www.engineeringvillage.com'  # 20211017 update: you can skip the channel and visit directly
    extension_path = 'D:/xuebalib.crx'  # The Xueba Library plug-in used for channel access: the browser driver does not load extensions by default, so even if the plug-in is already installed in the browser it is not enabled under Selenium and must be loaded each time
    mode = 'expert'  # The expert mode is recommended over the default quick mode, because the former is relatively less error-prone
    xuebalib = Xuebalib(keywords=keywords,
                        username=username,
                        password=password,
                        host=host,
                        extension_path=extension_path,
                        mode=mode)
    xuebalib.run()
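As a usage note, the kw.txt file read by load_keywords holds one search keyword per line, and lines starting with '#' are skipped; the conference names below are placeholders for illustration only:

# kw.txt: one search keyword per line, '#' lines are ignored
International Conference on Example Topics 2021
# Example Workshop (commented out, will be skipped)
Example Symposium on Sample Engineering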

postscript

Everything is well, don’t worry about it.

Long time, long time, gone.
