Hi, my name is Dr. Ho.

Today we are going to talk about crawling in Python.



 

Modules to install

pip install requests

pip install pandas

pip install beautifulsoup4

pip install xlrd

pip install openpyxl


Imports

from bs4 import BeautifulSoup

from urllib.request import urlopen

or import requests


urllib is built into Python, while requests must be installed separately.

They also differ in how data is sent. urllib expects the request body as encoded bytes, whereas requests provides explicit get and post methods and accepts the data as a dictionary.
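
The difference can be seen without sending anything over the network. The following is a minimal sketch, assuming a placeholder URL (https://example.com/login) and made-up form fields. It only builds the two request objects and inspects them.

```python
from urllib.parse import urlencode
from urllib.request import Request

import requests

# Hypothetical endpoint and form fields, used only for illustration.
URL = "https://example.com/login"
form = {"name": "lsrank", "nickname": "lsrank"}

# urllib: the body must be encoded to bytes by hand.
# Supplying data makes urllib treat the request as a POST.
urllib_req = Request(URL, data=urlencode(form).encode("utf-8"))
print(urllib_req.get_method(), urllib_req.data)

# requests: the HTTP method is named explicitly and the dict is encoded for us.
prepared = requests.Request("POST", URL, data=form).prepare()
print(prepared.method, prepared.body)
```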


requests.get(url,allow_redirects=False)


params1 = {'param1': 'value1', 'param2': 'value2', 'param3': 'value3'}

res = requests.get( URL, params = params1 )
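
To see what params does, the request can be prepared without being sent. This is a small sketch, assuming a placeholder URL and parameter names chosen only for illustration.

```python
import requests

# Hypothetical URL and parameter names, used only for illustration.
URL = "https://example.com/search"
params1 = {"param1": "value1", "param2": "value2"}

# Prepare the request without sending it, then inspect the final URL.
prepared = requests.Request("GET", URL, params=params1).prepare()
print(prepared.url)
```

The dictionary ends up as the query string appended to the URL.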


userdata = {'name': 'lsrank', 'nickname': 'lsrank', 'password': 'lsrankdotcom'}

resp = requests.post('http://www.lsrank.com/login', data=userdata)


import requests, json

data = {'outer': {'inner': 'value'}}

res = requests.post(URL, data=json.dumps(data))
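
requests can also serialize the dictionary itself through the json argument. The sketch below compares the two approaches on prepared (unsent) requests; the URL is a placeholder and not from the original.

```python
import json

import requests

URL = "https://example.com/api"  # hypothetical endpoint
data = {"outer": {"inner": "value"}}

# data=json.dumps(...) sends the JSON text; the Content-Type header is left to the caller.
manual = requests.Request("POST", URL, data=json.dumps(data)).prepare()

# json=... serializes the dict and sets Content-Type: application/json.
automatic = requests.Request("POST", URL, json=data).prepare()

print(manual.headers.get("Content-Type"))
print(automatic.headers.get("Content-Type"))
```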



 

headers = {'Content-Type': 'application/json; charset=utf-8'}

cookies = {'session_id': 'sorryidontcare'}

res = requests.get(URL, headers=headers, cookies=cookies)


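Response attributes and the built-in JSON decoder

res.request : the request object that produced this response

res.status_code : the HTTP response code

res.raise_for_status() : raises an HTTPError if the status code is 4xx or 5xx

res.json() : converts a JSON response body to a Python object, usually a dictionary


res.text : the response body as a decoded string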
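
These pieces are often combined into one small helper. The following is a sketch only; read_json is a hypothetical name, not part of requests.

```python
import requests


def read_json(res):
    """Return the decoded JSON body, or None if the status is an error."""
    try:
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        res.raise_for_status()
    except requests.HTTPError as err:
        print("request failed:", err)
        return None
    # Decode the JSON body into Python objects (dict, list, ...).
    return res.json()
```

It could be called as read_json(requests.get(URL, timeout=10)), where URL is your own endpoint and the timeout value is arbitrary.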


Parser types

Parser | Declaration | Advantages | Disadvantages

html.parser | BeautifulSoup(get_html.content, 'html.parser') | medium speed | -

lxml HTML parser | BeautifulSoup(get_html.content, 'lxml') | fast | requires lxml installation

lxml XML parser | BeautifulSoup(get_html.content, 'lxml-xml') or BeautifulSoup(get_html.content, 'xml') | fast | requires lxml installation

html5lib | BeautifulSoup(get_html.content, 'html5lib') | produces valid HTML5 | requires html5lib installation, slow
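
The parsers also differ in how they repair invalid markup. The sketch below feeds the same malformed snippet to each one and skips any parser that is not installed. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: each parser repairs it in its own way.
broken_html = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # lxml and html5lib are optional and must be installed separately.
        results[parser] = "not installed"

for parser, output in results.items():
    print(f"{parser:12} {output}")
```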

Request

html = requests.get('https://search.naver.com/search.naver?query=weather')


Add a header

from urllib.request import Request, urlopen

req = Request(url)

req.add_header('User-Agent', 'Mozilla/5.0')

Send the request to the server

html = urlopen(req).read()
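
The urllib steps can be wrapped into two small functions. This is a sketch only; build_request and fetch are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
from urllib.request import Request, urlopen


def build_request(url):
    """Create a urllib request that sends a browser-like User-Agent."""
    req = Request(url)
    req.add_header("User-Agent", "Mozilla/5.0")
    return req


def fetch(url, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```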


Finding elements in HTML code

To find all matching tags, use find_all or select.


How to use the select method


soup.select('parent tag, id, or class > child tag, id, or class')

#idname for an id

.classname for a class

tagname for a tag

soup.select('p > #idname > .classname')


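To find elements with a given class nested inside an a tag: soup.select('a .classname') (write 'a.classname' without the space to match a tags that themselves have the class)

To collect only a tags that have an href attribute into a list: res = soup.select('a[href]')

To collect elements with a specific class that come immediately after an a tag with an href attribute: res = soup.select('a[href] + .classname') (write 'a[href].classname' to match a tags that have both the href attribute and the class)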
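
The selector forms above are easy to mix up, so here is a small sketch that runs each of them against a made-up HTML snippet. The markup, class names, and the texts helper are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<div id="menu">
  <a href="/home" class="nav">Home</a>
  <a class="nav">No link</a>
  <a href="/about"><span class="label">About</span></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


print(texts("#menu > a"))       # a tags that are direct children of id="menu"
print(texts("a .label"))        # .label elements nested inside an a tag
print(texts("a.nav"))           # a tags that themselves have class "nav"
print(texts("a[href]"))         # a tags that have an href attribute
print(texts("a[href].nav"))     # a tags with both href and class "nav"
print(texts("a[href] + .nav"))  # .nav element right after an a[href] sibling
```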


find_all finds all tags matching the condition


find_all(name, attrs, recursive, string, limit, **kwargs)

Find all a tags: res_list = bs.find_all('a')

Find all a tags with a matching class: res_list = bs.find_all('a', {'class':'test_class'})

Find at most two a tags with a matching class: res_list = bs.find_all('a', {'class':'test_class'}, limit=2)

The second element found above: res_list[1]

Search by string value: soup.find_all(string='searchzz')
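
The find_all calls can be tried on a small inline document. This is a sketch under the assumption of made-up markup; only the class name and search string come from the text.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<a class="test_class" href="/1">one</a>
<a class="test_class" href="/2">two</a>
<a class="test_class" href="/3">three</a>
<a href="/4">searchzz</a>
"""
bs = BeautifulSoup(html, "html.parser")

all_links = bs.find_all("a")
matching = bs.find_all("a", {"class": "test_class"})
first_two = bs.find_all("a", {"class": "test_class"}, limit=2)
by_text = bs.find_all(string="searchzz")  # returns matching strings, not tags

print(len(all_links), len(matching), len(first_two))
print(first_two[1].get_text())
print(by_text)
```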


If you only need the first tag that matches a condition, use find.

find finds the first tag that matches the condition

find(name, attrs, recursive, string, **kwargs)


Find an a tag with a matching class name: res = bs.find('a', class_='test')

or res = bs.find('a', {'class':'test'})

class is a reserved word in Python, so the keyword argument is written class_.


Find a matching tag by id: res = bs.find(id='test_id')

or res = bs.find(attrs={'id': 'test_id'})
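
A short sketch of find on made-up markup follows. It shows that only the first match is returned and that None comes back when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = (
    '<a class="test" href="/first">first</a>'
    '<a class="test" id="test_id" href="/second">second</a>'
)
bs = BeautifulSoup(html, "html.parser")

by_class = bs.find("a", class_="test")          # first matching tag only
by_attrs = bs.find("a", {"class": "test"})      # same search, dict form
by_id = bs.find(id="test_id")                   # search by id attribute
missing = bs.find("a", class_="no_such_class")  # no match gives None

print(by_class.get_text(), by_attrs.get_text(), by_id.get_text(), missing)
```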


When you only need a specific value from a tag

Use get when you want only the value of a specific attribute

res.get('href')
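
As a small sketch on made-up markup, get behaves like a dictionary lookup on the tag's attributes:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
bs = BeautifulSoup('<a href="/docs" class="nav main">Docs</a>', "html.parser")
res = bs.find("a")

print(res.get("href"))       # attribute value as a string
print(res.get("class"))      # class is multi-valued, so a list is returned
print(res.get("title"))      # None, because the attribute is absent
print(res.get("title", ""))  # an optional default can be supplied
```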


Get the whole <title> tag, including the tag itself

soup.title

soup.find('title')


Use string or get_text() when you want only the contents inside the <title> tag.

soup.title.string

soup.find('title').get_text()
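
string and get_text() give the same result for a tag that contains only text, but they differ once a tag has nested children. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = (
    "<html><head><title>Weather</title></head>"
    "<body><h1>Today: <b>sunny</b></h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.title)             # the whole tag, including <title>...</title>
print(soup.title.string)      # only the text inside the tag
print(soup.title.get_text())  # same text for a tag with a single string
print(soup.h1.string)         # None, because h1 has more than one child
print(soup.h1.get_text())     # all nested text joined together
```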


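BeautifulSoup parses the web page source once it has been fetched. Here html is the page source as bytes or text, for example urlopen(req).read() or the .content of a requests response.

soup = BeautifulSoup(html, 'html.parser')

If characters are garbled after parsing, specify the encoding (from_encoding applies when html is bytes)

soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')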
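
The effect of from_encoding is easiest to see with bytes that are not UTF-8. The sketch below assumes a page encoded as EUC-KR; the sample text and encoding are chosen for illustration and are not from the original.

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding, standing in for a downloaded page.
html_bytes = "<p>날씨</p>".encode("euc-kr")

# Tell BeautifulSoup how to decode the bytes instead of letting it guess.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="euc-kr")
print(soup.p.string)
```

If html is already a str, it has been decoded beforehand and from_encoding has no effect.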