Encoding Output Problem
Problem Description
Someone ran into garbled output (mojibake) while scraping http://www.ikanchai.com/, as shown below:
import requests
from bs4 import BeautifulSoup as bs

web_data = requests.get('http://www.ikanchai.com/')
# web_data.text is decoded with whatever encoding requests picked for the response
soup = bs(web_data.text, 'lxml')
dates = soup.select('div.sort.channel.clearfix > ul > li > a')
for data in dates:
    print(data)
-----Output-----
<a href="http://www.ikanchai.com/view/">观点</a>
<a href="http://www.ikanchai.com/start/">创投</a>
<a href="http://www.ikanchai.com/evaluation/">评测</a>
<a href="http://www.ikanchai.com/vr/">VR</a>
<a href="http://www.ikanchai.com/push/">专æ </a>
<a href="http://app.ikanchai.com/roll.php">动æ€</a>
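This is an encoding problem. The page is UTF-8, but requests decoded it with a different codec, most likely because the response's Content-Type header names no charset, in which case requests falls back to ISO-8859-1 for text responses. Adding one line, web_data.encoding = 'utf-8', before reading web_data.text fixes it. The correct encoding for a page is declared near the top of its HTML document, typically in a <meta> charset tag.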
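To see why the wrong codec yields this kind of output, here is a minimal offline sketch. It assumes the page bytes are UTF-8 and that they were decoded as ISO-8859-1; the sample string is taken from the output above, and no network request is made.

```python
# Simulate decoding UTF-8 bytes with the wrong codec (assumed: ISO-8859-1).
raw = "动态".encode("utf-8")        # the bytes a UTF-8 page would contain

# Wrong codec: each of the 6 bytes is turned into its own character.
garbled = raw.decode("iso-8859-1")

# Undo the wrong step, then decode the original bytes correctly.
fixed = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(fixed)
```

Setting web_data.encoding before reading web_data.text avoids the wrong decode in the first place, so no round trip like this is needed in the scraper itself.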
Solution
The corrected code:
import requests
from bs4 import BeautifulSoup as bs

web_data = requests.get('http://www.ikanchai.com/')
# Set the encoding explicitly before accessing web_data.text
web_data.encoding = 'utf-8'
soup = bs(web_data.text, 'lxml')
dates = soup.select('div.sort.channel.clearfix > ul > li > a')
for data in dates:
    print(data)
-----Output-----
<a href="http://www.ikanchai.com/view/">观点</a>
<a href="http://www.ikanchai.com/start/">创投</a>
<a href="http://www.ikanchai.com/evaluation/">评测</a>
<a href="http://www.ikanchai.com/vr/">VR</a>
<a href="http://www.ikanchai.com/push/">专栏</a>
<a href="http://app.ikanchai.com/roll.php">动态</a>
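The fix hard-codes 'utf-8', which is right for this page but not for every site. As a rough sketch, the charset could instead be read from the HTML's own declaration. The helper `declared_charset` and the `sample` bytes below are hypothetical illustrations, not part of requests or BeautifulSoup, and the regular expression only covers the common `<meta ... charset=...>` forms.

```python
import re

# Hypothetical helper: find a charset declaration in a <meta> tag.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def declared_charset(html_bytes, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # The declaration is expected near the top, so only the first bytes are scanned.
    match = _META_CHARSET.search(html_bytes[:2048])
    return match.group(1).decode("ascii").lower() if match else default

# Stand-in for web_data.content (raw bytes); the real page is not fetched here.
sample = '<html><head><meta charset="utf-8"><title>专栏</title></head></html>'.encode("utf-8")

encoding = declared_charset(sample)
print(encoding)
print(sample.decode(encoding))
```

In the scraper this might be used as web_data.encoding = declared_charset(web_data.content) before reading web_data.text. requests also provides web_data.apparent_encoding, which guesses the encoding from the response body and can serve the same purpose.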