Python crawling code (litenews)

Hi, my name is Dr. Ho.

Today we are going to talk about crawling in Python.

Modules to install

pip install requests

pip install pandas

pip install BeautifulSoup4

pip install xlrd

pip install openpyxl

Imports

from bs4 import BeautifulSoup

from urllib.request import urlopen

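or

import requests

urllib is part of the Python standard library, while requests has to be installed separately.

They also differ in how data is sent. urllib expects request data as bytes, so a dictionary must be URL-encoded and converted to bytes first. requests provides explicit get and post functions and accepts query parameters and form data as dictionaries.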
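The difference can be seen without sending anything over the network. The following is a minimal sketch; the URL and the form fields are hypothetical values chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

import requests

# Hypothetical endpoint and form fields, used only for illustration.
URL = "https://example.com/login"
form = {"name": "lsrank", "nickname": "lsrank"}

# urllib: the dictionary has to be URL-encoded and then converted to bytes.
urllib_body = urlencode(form).encode("utf-8")
urllib_req = Request(URL, data=urllib_body)
# Supplying data makes urllib treat the request as a POST.
urllib_method = urllib_req.get_method()

# requests: the dictionary is passed as-is and the method is explicit.
# prepare() only builds the request; nothing is sent.
prepared = requests.Request("POST", URL, data=form).prepare()

print(urllib_method, urllib_body)
print(prepared.method, prepared.body)
```

Both libraries end up with the same form-encoded body; requests simply does the encoding step for you.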

requests.get(url, allow_redirects=False)

params1 = {'param1': 'value1', 'param2': 'value2', 'param3': 'value3'}

res = requests.get(URL, params=params1)
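To see what the params dictionary turns into, the request can be prepared without being sent. This is a small sketch; the URL, the parameter names, and the fetch helper are assumptions made for the example.

```python
import requests

URL = "https://example.com/search"  # hypothetical URL
params1 = {"param1": "value1", "param2": "value2"}

# prepare() builds the final URL but does not contact the server.
prepared = requests.Request("GET", URL, params=params1).prepare()
final_url = prepared.url
print(final_url)


def fetch(url, params):
    """Hypothetical helper: send the GET request without following redirects."""
    return requests.get(url, params=params, allow_redirects=False, timeout=10)
```

The dictionary is appended to the URL as a query string, so there is no need to build the string by hand.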

userdata = {"name": "lsrank", "nickname": "lsrank", "password": "lsrankdotcom"}

resp = requests.post('http://www.lsrank.com/login', data=userdata)

import requests, json

data = {'outer': {'inner': 'value'}}

res = requests.post(URL, data=json.dumps(data))

headers = {'Content-Type': 'application/json; charset=utf-8'}

cookies = {'session_id': 'sorryidontcare'}

res = requests.get(URL, headers=headers, cookies=cookies)
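As an alternative to calling json.dumps and writing the Content-Type header yourself, requests accepts a json= argument. The sketch below prepares such a request offline; the URL is a hypothetical placeholder.

```python
import json

import requests

URL = "https://example.com/api"  # hypothetical URL
data = {"outer": {"inner": "value"}}
cookies = {"session_id": "sorryidontcare"}

# json= serializes the dictionary and sets the Content-Type header.
# prepare() only builds the request; nothing is sent.
prepared = requests.Request("POST", URL, json=data, cookies=cookies).prepare()

content_type = prepared.headers["Content-Type"]
cookie_header = prepared.headers["Cookie"]
decoded_body = json.loads(prepared.body)

print(content_type)
print(cookie_header)
print(decoded_body)
```

When sending for real, the same arguments work directly: requests.post(URL, json=data, cookies=cookies).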

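Response object and built-in JSON decoder

res.request : the request that produced this response

res.status_code : the HTTP response code

res.raise_for_status() : raises an HTTPError if the status code is a 4xx or 5xx error

res.json() : decodes a JSON response body into a Python object (a dictionary for a JSON object)

res.text : the response body as a string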
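The attributes above can be tried without a server by filling in a Response object by hand. This is only a demonstration trick: build_fake_response is a hypothetical helper, and it sets the private _content attribute, which requests normally fills in itself.

```python
import requests


def build_fake_response(status_code, body):
    """Build a Response by hand, for offline demonstration only."""
    res = requests.Response()
    res.status_code = status_code
    res.encoding = "utf-8"
    res._content = body  # private attribute; normally set by requests
    return res


ok = build_fake_response(200, b'{"outer": {"inner": "value"}}')
ok.raise_for_status()   # no exception for a 200 response
payload = ok.json()     # JSON body decoded into a dictionary
raw_text = ok.text      # the same body as a str

missing = build_fake_response(404, b"not found")
try:
    missing.raise_for_status()
    error_raised = False
except requests.exceptions.HTTPError:
    error_raised = True

print(payload, raw_text, error_raised)
```

In real code the response comes from requests.get or requests.post, and raise_for_status() is usually called before json() or text is read.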

Parser types

Each entry lists the parser, its declaration, and its advantages and disadvantages.

html.parser : BeautifulSoup(get_html.content, 'html.parser') : medium speed, no extra installation

lxml HTML parser : BeautifulSoup(get_html.content, 'lxml') : fast, requires lxml installation

lxml XML parser : BeautifulSoup(get_html.content, 'lxml-xml') or BeautifulSoup(get_html.content, 'xml') : fast, requires lxml installation

html5lib : BeautifulSoup(get_html.content, 'html5lib') : parses HTML5 the way a browser does, requires html5lib installation, slow
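One possible way to act on this comparison is to prefer lxml when it is installed and fall back to the built-in parser otherwise. The pick_parser helper below is a hypothetical sketch, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def pick_parser():
    """Return 'lxml' when it is installed, otherwise the built-in parser."""
    try:
        import lxml  # noqa: F401  (imported only to check availability)
        return "lxml"
    except ImportError:
        return "html.parser"


parser = pick_parser()
soup = BeautifulSoup("<html><body><p>hi</p></body></html>", parser)
text = soup.p.get_text()
print(parser, text)
```

Different parsers can build slightly different trees from broken HTML, so it is worth fixing the parser name explicitly once a script works.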

Request

html = requests.get('https://search.naver.com/search.naver?query=weather')

Add a header (with urllib)

from urllib.request import Request, urlopen

req = Request(url)

req.add_header('User-Agent', 'Mozilla/5.0')

Send the request to the server

html = urlopen(req).read()
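The urllib steps above can be wrapped in two small functions. This is a sketch; build_request and fetch are hypothetical names, and only the request object is built here, so nothing is downloaded until fetch is called.

```python
from urllib.request import Request, urlopen


def build_request(url):
    """Create a Request that carries a browser-like User-Agent header."""
    req = Request(url)
    req.add_header("User-Agent", "Mozilla/5.0")
    return req


def fetch(url):
    """Send the request and return the raw response body as bytes."""
    with urlopen(build_request(url), timeout=10) as resp:
        return resp.read()


req = build_request("https://search.naver.com/search.naver?query=weather")
# urllib stores header names capitalized as 'User-agent'.
user_agent = req.get_header("User-agent")
print(user_agent)
```

Calling fetch(url) returns bytes, which can be passed straight to BeautifulSoup.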

Finding elements in HTML code

To find all matching tags, use find_all or select.

How to use the select method

soup.select("parent tag or id or class > child tag or id or class")

#idname for an id

.classname for a class

tagname for a tag

soup.select("p > #idname > .classname")

To find a tags with a given class: soup.select("a.classname") (with a space, "a .classname" matches elements with that class nested inside an a tag)

To collect only a tags that have an href attribute into a list: res = soup.select('a[href]')

To collect only a tags that have an href attribute and a specific class: res = soup.select("a[href].classname") (the + in "a[href] + .classname" would instead select the sibling that follows the a tag)
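The selectors can be checked against a small inline document. The HTML below is made up for this example.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to exercise the selectors.
html = """
<div id="menu">
  <p class="intro">Intro <a class="ext" href="https://example.com">Example</a></p>
  <a class="ext">No link</a>
  <a href="/home">Home</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct children of the element with id="menu" that are a tags.
children = soup.select("#menu > a")
# Every a tag that has an href attribute.
with_href = soup.select("a[href]")
# Every a tag with class "ext", with or without href.
ext_links = soup.select("a.ext")
# a tags that have both an href attribute and class "ext".
ext_with_href = soup.select("a[href].ext")

print([a.get_text() for a in children])
print([a.get_text() for a in with_href])
print([a.get_text() for a in ext_links])
print([a.get_text() for a in ext_with_href])
```

select always returns a list, so an empty list (not None) means nothing matched.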

find_all finds all tags matching the condition

find_all(name, attrs, recursive, string, limit, **kwargs)

Find all a tags: res_list = bs.find_all('a')

Find all a tags with a matching class: res_list = bs.find_all('a', {'class': 'test_class'})

Find at most two a tags with a matching class: res_list = bs.find_all('a', {'class': 'test_class'}, limit=2)

The second element found above: res_list[1]

Search by string value: soup.find_all(string="searchzz")
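The find_all calls above can be tried on a short inline document. The HTML is invented for this sketch.

```python
from bs4 import BeautifulSoup

# Invented HTML with three links of the same class and one without.
html = """
<a class="test_class" href="/1">one</a>
<a class="test_class" href="/2">two</a>
<a class="test_class" href="/3">three</a>
<a href="/4">four</a>
<span>searchzz</span>
"""
bs = BeautifulSoup(html, "html.parser")

all_links = bs.find_all("a")
by_class = bs.find_all("a", {"class": "test_class"})
first_two = bs.find_all("a", {"class": "test_class"}, limit=2)
second = first_two[1]
# string= matches text nodes whose content is exactly "searchzz".
strings = bs.find_all(string="searchzz")

print(len(all_links), len(by_class), len(first_two))
print(second.get_text(), strings)
```

Note that string= returns the matching text nodes themselves rather than the tags that contain them.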

If you only need the first tag that matches a certain condition, use find.

find finds the first tag that matches the condition

find(name, attrs, recursive, string, **kwargs)

Find a tag matching a class name: res = bs.find('a', class_='test')

or res = bs.find('a', {'class': 'test'})

class is a reserved word in Python, so the keyword argument is written class_.

Find a matching tag by id: res = bs.find(id='test_id')

or res = bs.find(attrs={'id': 'test_id'})
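A short sketch of find on made-up HTML, including what happens when nothing matches:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this example.
html = """
<a class="test" href="/first">first</a>
<a class="test" href="/second">second</a>
<div id="test_id">box</div>
"""
bs = BeautifulSoup(html, "html.parser")

# Both forms return the first a tag with class "test".
first = bs.find("a", class_="test")
same = bs.find("a", {"class": "test"})

# Both forms return the tag whose id is "test_id".
by_id = bs.find(id="test_id")
by_attrs = bs.find(attrs={"id": "test_id"})

# find returns None when no tag matches.
missing = bs.find("table")

print(first.get_text(), by_id.get_text(), missing)
```

Because find can return None, it is safer to check the result before calling get or get_text on it.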

When you only need a specific value from a tag

Use get when you want only a specific attribute value

res.get('href')
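A minimal sketch of get on an invented link, showing the result for an attribute that exists, one that is missing, and the class attribute:

```python
from bs4 import BeautifulSoup

# Invented HTML used only for this example.
soup = BeautifulSoup('<a href="/home" class="nav">Home</a>', "html.parser")
res = soup.find("a")

href = res.get("href")      # value of an attribute that exists
title = res.get("title")    # missing attribute: get returns None
classes = res.get("class")  # class is multi-valued, so a list is returned

print(href, title, classes)
```

Indexing with res['title'] would raise a KeyError for the missing attribute, which is why get is the safer choice when an attribute may be absent.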

To get the <title> element including its tags

soup.title

soup.find('title')

Use string or get_text() when you want only the contents inside the <title> tag.

soup.title.string

soup.find('title').get_text()
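The difference between the tag, string, and get_text() can be seen on a small invented page:

```python
from bs4 import BeautifulSoup

# Invented page used only for this example.
html = (
    "<html><head><title>Weather today</title></head>"
    "<body><p>Sunny, <b>22</b> degrees</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_tag = str(soup.title)                  # the element with its tags
title_string = soup.title.string             # only the text inside
title_text = soup.find("title").get_text()   # same text via get_text()

# string is None when a tag has more than one child;
# get_text() joins all nested text instead.
p_string = soup.p.string
p_text = soup.p.get_text()

print(title_tag, title_string, title_text)
print(p_string, p_text)
```

For tags that may contain nested markup, get_text() is usually the more predictable of the two.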

BeautifulSoup is used to parse the source of a web page after it has been fetched.

soup = BeautifulSoup(html, 'html.parser')

Add the encoding if characters are garbled when parsing. from_encoding applies when html is passed as bytes.

soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
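A small sketch of from_encoding with bytes in a non-UTF-8 encoding. The EUC-KR sample is an assumption chosen for illustration; a real page should be parsed with whatever encoding it actually uses.

```python
from bs4 import BeautifulSoup

# Sample bytes encoded as EUC-KR, standing in for a downloaded page.
html_bytes = "<p>날씨</p>".encode("euc-kr")

# Telling BeautifulSoup the encoding avoids a wrong guess on short input.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="euc-kr")
text = soup.p.get_text()
print(text)
```

With requests, passing res.content (bytes) together with from_encoding gives the same control over decoding.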


Posted by Trip_man
