Hi, my name is Dr. Ho.
Today we are going to talk about crawling in Python.
Modules to install
pip install requests
pip install pandas
pip install BeautifulSoup4
pip install xlrd
pip install openpyxl
Imports
from bs4 import BeautifulSoup
from urllib.request import urlopen
or import requests
urllib is built into Python, while requests must be installed separately.
They also differ in use: urllib sends data as bytes, whereas requests has explicit get and post functions and accepts parameters and form data as dictionaries.
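To see the difference without touching the network, the sketch below builds (but never sends) the same form submission both ways. The URL and the form fields are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

import requests

url = "http://example.com/login"                 # placeholder URL, nothing is sent
form = {"name": "alice", "password": "secret"}   # made-up form fields

# urllib: the body must be bytes, so the dictionary is encoded by hand.
body = urlencode(form).encode("utf-8")
urllib_req = Request(url, data=body)             # supplying data makes it a POST
print(urllib_req.get_method(), urllib_req.data)

# requests: name the method and pass the dictionary as it is.
prepared = requests.Request("POST", url, data=form).prepare()
print(prepared.method, prepared.body)
```

Both requests end up carrying the same form-encoded body; requests simply does the encoding step for you.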
requests.get(url,allow_redirects=False)
params1 = {'param1': 'value1', 'param2': 'value2', 'param3': 'value3'}
res = requests.get(URL, params=params1)
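As a rough illustration of what params does, the following sketch prepares a GET request without sending it and prints the final URL. The address and parameter names are assumptions made up for the example.

```python
import requests

params1 = {"query": "weather", "page": 2}   # hypothetical parameters

# Prepare the request without sending it, so the final URL can be inspected.
prepared = requests.Request(
    "GET", "https://example.com/search", params=params1
).prepare()

print(prepared.url)   # https://example.com/search?query=weather&page=2
```

The dictionary is turned into the query string after the question mark, so there is no need to build it by hand.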
userdata = {"name": "lsrank", "nickname": "lsrank", "password": "lsrankdotcom"}
resp = requests.post('http://www.lsrank.com/login', data=userdata)
import requests, json
data = {'outer': {'inner': 'value'}}
res = requests.post(URL, data=json.dumps(data))
headers = {'Content-Type': 'application/json; charset=utf-8'}
cookies = {'session_id': 'sorryidontcare'}
res = requests.get(URL, headers=headers, cookies=cookies)
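When posting JSON, one detail worth checking is the Content-Type header. The sketch below compares data=json.dumps(...) with the json= keyword of requests; nothing is sent, and the URL is a placeholder.

```python
import json

import requests

URL = "https://example.com/api"          # placeholder, nothing is sent
data = {"outer": {"inner": "value"}}

# data=json.dumps(...) sends the JSON text but does not set Content-Type.
manual = requests.Request("POST", URL, data=json.dumps(data)).prepare()

# json=... serializes the dictionary and sets Content-Type for you.
auto = requests.Request("POST", URL, json=data).prepare()

print(manual.headers.get("Content-Type"))   # None
print(auto.headers.get("Content-Type"))     # application/json
```

So with data=json.dumps(...) you would normally pass a headers dictionary with Content-Type as well, while json= covers both steps.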
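Response object and built-in JSON decoder
res.request : the request that produced this response
res.status_code : the response status code
res.raise_for_status() : raises an error if the status code is a 4xx or 5xx error code
res.json() : converts a JSON response to a dictionary
res.text : the response body as text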
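One possible way to combine these calls is sketched below. The helper name summarize is made up for this example, and the Response objects are built by hand to stand in for real server replies; setting _content directly is a testing shortcut, not normal usage.

```python
import requests


def summarize(res):
    """Return the parsed JSON body, or None when the status is an error."""
    try:
        res.raise_for_status()        # raises requests.HTTPError for 4xx/5xx
    except requests.HTTPError:
        return None
    return res.json()                 # JSON body -> dictionary


# Hand-built responses (assumption: used here only to avoid a network call).
ok = requests.Response()
ok.status_code = 200
ok._content = b'{"name": "lsrank"}'

missing = requests.Response()
missing.status_code = 404
missing._content = b"not found"

print(summarize(ok))        # {'name': 'lsrank'}
print(summarize(missing))   # None
```

In real code the argument would simply be the result of requests.get(...) or requests.post(...).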
Parser types
Parser | Declaration | Advantages and disadvantages
html.parser | BeautifulSoup(get_html.content, 'html.parser') | medium speed
lxml HTML parser | BeautifulSoup(get_html.content, 'lxml') | fast; requires lxml installation
lxml XML parser | BeautifulSoup(get_html.content, 'lxml-xml') | fast; requires lxml installation
html5lib | BeautifulSoup(get_html.content, 'html5lib') | HTML5 parsing; requires html5lib installation; slow
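Since lxml is an optional install, a small fallback can be handy. The sketch below tries lxml first and falls back to html.parser; make_soup is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>crawler</b></p>")
print(soup.p.get_text())   # Hello, crawler
```

Keep in mind that different parsers may repair badly broken HTML in different ways, so it is usually safer to stick to one parser per project.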
Request
html = requests.get('https://search.naver.com/search.naver?query=weather')
Adding a header
from urllib.request import Request, urlopen
req = Request(url)
req.add_header('User-Agent', 'Mozilla/5.0')
Sending the request to the server
html = urlopen(req).read()
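To check that the header was really attached before anything is sent, the Request object can be inspected, as in this small sketch. The URL is a placeholder.

```python
from urllib.request import Request

req = Request("https://example.com/")        # placeholder URL, nothing is sent
req.add_header("User-Agent", "Mozilla/5.0")

# urllib stores header names capitalized, so the key becomes "User-agent".
print(req.header_items())                    # [('User-agent', 'Mozilla/5.0')]
print(req.get_header("User-agent"))          # Mozilla/5.0
```

Some sites reject the default Python user agent, which is why a browser-like value is often set this way.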
Finding elements in HTML code
To find all matching tags, use find_all or select.
How to use the select method
soup.select("parent tag or id or class > child tag or id or class")
#idname for an id
.classname for a class
tagname for a tag
soup.select("p > #idname > .classname")
To find <a> tags that have a given class: soup.select("a.classname")
To collect only <a> tags that have an href attribute into a list: res = soup.select('a[href]')
To collect only <a> tags that have an href attribute and a specific class: res = soup.select("a[href].classname")
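Here is a small sketch of these selectors run against a made-up HTML snippet. Note that a.item (no space) matches an <a> tag that itself has the class, whereas a .item (with a space) would match elements with that class nested inside an <a> tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this example.
html = """
<div id="menu">
  <p class="intro">Intro</p>
  <a class="item" href="/a">A</a>
  <a class="item">B</a>
  <a href="/c">C</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_child = [t.get_text() for t in soup.select("#menu > p.intro")]  # child of an id
by_class = [t.get_text() for t in soup.select("a.item")]           # tag + class
by_attr = [t.get_text() for t in soup.select("a[href]")]           # tag + attribute
by_both = [t.get_text() for t in soup.select("a[href].item")]      # attribute + class

print(by_child)   # ['Intro']
print(by_class)   # ['A', 'B']
print(by_attr)    # ['A', 'C']
print(by_both)    # ['A']
```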
find_all finds all tags matching the condition.
find_all(name, attrs, recursive, string, limit, **kwargs)
Find all <a> tags: res_list = bs.find_all('a')
Find all <a> tags with a matching class: res_list = bs.find_all('a', {'class': 'test_class'})
Find at most two <a> tags with a matching class: res_list = bs.find_all('a', {'class': 'test_class'}, limit=2)
The second element found above: res_list[1]
Search by string value: soup.find_all(string="searchzz")
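The find_all calls above can be tried on a tiny document like the one in this sketch. The HTML is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML: three links share a class, one does not.
html = (
    '<a class="test_class" href="/1">one</a>'
    '<a class="test_class" href="/2">two</a>'
    '<a class="test_class" href="/3">three</a>'
    '<a href="/4">four</a>'
    '<p>searchzz</p>'
)
bs = BeautifulSoup(html, "html.parser")

all_links = bs.find_all("a")                                    # every <a> tag
first_two = bs.find_all("a", {"class": "test_class"}, limit=2)  # stop after two
texts = bs.find_all(string="searchzz")                          # exact text match

print(len(all_links))             # 4
print(first_two[1].get_text())    # two
print(texts)                      # ['searchzz']
```

find_all always returns a list, so an empty result is an empty list rather than an error.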
If you only need the first match for a condition, use find.
find finds the first tag that matches the condition.
find(name, attrs, recursive, string, **kwargs)
Find an <a> tag by class name: res = bs.find('a', class_='test')
or res = bs.find('a', {'class': 'test'})
class is a reserved word in Python, so write class_ instead.
Find a tag by id: res = bs.find(id='test_id')
or res = bs.find(attrs={'id': 'test_id'})
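A short sketch of find on invented HTML follows. One thing to keep in mind is that find returns None when nothing matches, so it is worth checking the result before using it.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = (
    '<div>'
    '<a class="test" href="/first">first</a>'
    '<a class="test" href="/second">second</a>'
    '<span id="test_id">target</span>'
    '</div>'
)
bs = BeautifulSoup(html, "html.parser")

by_class = bs.find("a", class_="test")      # only the first matching <a>
by_id = bs.find(id="test_id")               # any tag with this id
nothing = bs.find("a", class_="missing")    # no match

print(by_class.get_text())   # first
print(by_id.get_text())      # target
print(nothing)               # None
```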
When you only need a specific value from a tag
Use get when you want only the value of a specific attribute.
res.get('href')
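For example, get can be used to collect the href of every link. In this sketch the HTML is made up, and the second link deliberately has no href to show what get returns in that case.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second <a> tag has no href attribute.
html = '<a href="/home" title="Home">Home</a><a>No link</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
hrefs = [a.get("href") for a in links]             # None when the attribute is missing
with_default = [a.get("href", "#") for a in links]  # fall back to a default value

print(hrefs)          # ['/home', None]
print(with_default)   # ['/home', '#']
```

Unlike res['href'], which raises a KeyError for a missing attribute, get returns None or the default you supply.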
To get the <title> tag itself, including its markup:
soup.title
soup.find('title')
Use string or get_text() when you want only the contents inside the <title> tag.
soup.title.string
soup.find('title').get_text()
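The sketch below shows the tag versus its text on a made-up page. It also shows a case where string and get_text() differ: string is None when a tag has more than one child, while get_text() joins all the text inside.

```python
from bs4 import BeautifulSoup

# Made-up page for illustration.
html = (
    "<html><head><title>Naver Weather</title></head>"
    "<body><p>Sunny, <b>25</b> degrees</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.title)                      # <title>Naver Weather</title>
print(soup.title.string)               # Naver Weather
print(soup.find("title").get_text())   # Naver Weather

# <p> contains text plus a <b> tag, so .string cannot pick a single child.
print(soup.p.string)                   # None
print(soup.p.get_text())               # Sunny, 25 degrees
```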
BeautifulSoup is used to parse the web page source that was fetched.
soup = BeautifulSoup(html, 'html.parser')
If characters are garbled after parsing, specify the encoding:
soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
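from_encoding only has an effect when the markup is passed as bytes, for example res.content or the result of urlopen(req).read(); for an already decoded string it is ignored. A minimal sketch, assuming a page served in EUC-KR (the sample bytes are made up):

```python
from bs4 import BeautifulSoup

# Bytes as a server using EUC-KR might send them (made-up sample).
raw = "<p>날씨</p>".encode("euc-kr")

# Tell BeautifulSoup which encoding to use when decoding the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="euc-kr")
print(soup.p.string)   # 날씨
```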