## What Is a Web Crawler?

A web crawler is a program that fetches web page content automatically. It can help us:

- Collect publicly available data
- Monitor price changes
- Aggregate content from many sources
- Build search engines
## Getting Ready

### Environment Setup

```bash
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install the core libraries
pip install requests beautifulsoup4 lxml
```
### Basic Tools

| Library | Purpose |
| --- | --- |
| requests | Send HTTP requests |
| BeautifulSoup | Parse HTML |
| lxml | High-performance parser |
| Selenium | Scrape dynamic pages |
| Scrapy | Crawling framework |
## Your First Crawler

### Fetching a Page with requests

```python
import requests

url = 'https://httpbin.org/get'
response = requests.get(url)

print(response.status_code)  # HTTP status code
print(response.text)         # raw response body
print(response.json())       # parsed JSON (this endpoint returns JSON)
```
### Setting Request Headers

```python
# Browser-like headers make the request less likely to be rejected
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}

response = requests.get(url, headers=headers)
```
### Query Parameters and Form Data

```python
# GET with query parameters (appended as ?key=value&page=1)
params = {'key': 'value', 'page': 1}
response = requests.get(url, params=params)

# POST with form-encoded data
data = {'username': 'user', 'password': 'pass'}
response = requests.post(url, data=data)

# POST with a JSON body (requests serializes the dict for you)
response = requests.post(url, json={'key': 'value'})
```
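When a site identifies you through cookies, for example after a form login, a `requests.Session` keeps those cookies (and any default headers) across requests. Below is a minimal sketch; the `/login` and `/dashboard` paths on example.com are placeholders, not a real API:

```python
import requests

# A Session persists cookies and default headers across requests
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)'})

# Placeholder login endpoint: any cookies it sets are stored on the session
session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

# Later requests automatically carry those cookies
response = session.get('https://example.com/dashboard')
print(response.status_code)
```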
## Parsing HTML

### BeautifulSoup Basics

```python
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>Sample Page</title></head>
<body>
    <h1 class="title">Welcome to BitMosaic Lab</h1>
    <div id="content">
        <p>This is the first paragraph</p>
        <p>This is the second paragraph</p>
        <a href="https://example.com">A link</a>
    </div>
    <ul class="list">
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# The <title> tag
print(soup.title.string)

# Find one element by tag name and class
h1 = soup.find('h1', class_='title')
print(h1.text)

# Find all matching elements
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

# Read a link's attributes and text
link = soup.find('a')
print(link['href'])
print(link.text)
```
### CSS Selectors

```python
# select_one returns the first match, select returns a list of matches
title = soup.select_one('h1.title')
items = soup.select('ul.list li')

for item in items:
    print(item.text)
```
## Hands-On: Scraping a News List

```python
import requests
from bs4 import BeautifulSoup


def fetch_news():
    """Scrape the first ten titles from the Hacker News front page."""
    url = 'https://news.ycombinator.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    news_list = []
    items = soup.select('.titleline > a')

    for item in items[:10]:
        news = {
            'title': item.text,
            'link': item['href']
        }
        news_list.append(news)

    return news_list


if __name__ == '__main__':
    news = fetch_news()
    for i, item in enumerate(news, 1):
        print(f"{i}. {item['title']}")
        print(f"   {item['link']}\n")
```
## Handling Dynamic Pages

### Selenium Basics

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

try:
    driver.get('https://example.com')

    # Wait up to 10 seconds for the element to be present
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "content"))
    )

    # Read text from an element
    content = driver.find_element(By.ID, "content")
    print(content.text)

    # Click a button
    button = driver.find_element(By.CLASS_NAME, "btn")
    button.click()

    # Type into an input field
    input_field = driver.find_element(By.NAME, "search")
    input_field.send_keys("Python")
finally:
    driver.quit()
```
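Pages that load more items as you scroll can often be handled by scrolling to the bottom in a loop until the page height stops growing. This is a generic sketch, assuming the target page appends content on scroll; the URL is a placeholder:

```python
import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder for an infinite-scroll page

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom, then give the page time to load new items
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # height unchanged: no more content was loaded
    last_height = new_height

driver.quit()
```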
### Headless Mode

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')     # no visible browser window
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')

driver = webdriver.Chrome(options=options)
```
## Storing the Data

### Saving as JSON

```python
import json

data = [{'title': 'News 1'}, {'title': 'News 2'}]

# Write (ensure_ascii=False keeps non-ASCII text readable in the file)
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Read it back
with open('data.json', 'r', encoding='utf-8') as f:
    loaded_data = json.load(f)
```
### Saving as CSV

```python
import csv

data = [
    {'title': 'News 1', 'link': 'https://...'},
    {'title': 'News 2', 'link': 'https://...'},
]

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(data)
```
### Saving to a Database

```python
import sqlite3

conn = sqlite3.connect('news.db')
cursor = conn.cursor()

# Create the table on first run
cursor.execute('''
    CREATE TABLE IF NOT EXISTS news (
        id INTEGER PRIMARY KEY,
        title TEXT,
        link TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
''')

# Insert a single row using parameter binding
cursor.execute(
    'INSERT INTO news (title, link) VALUES (?, ?)',
    ('News Title', 'https://...')
)

conn.commit()
conn.close()
```
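To store a whole batch of scraped items, such as the list of dicts returned by `fetch_news` above, `executemany` with named placeholders inserts them in one call. A small sketch, assuming the `news` table from the previous block already exists and using placeholder rows:

```python
import sqlite3

news_list = [
    {'title': 'News 1', 'link': 'https://example.com/1'},
    {'title': 'News 2', 'link': 'https://example.com/2'},
]

conn = sqlite3.connect('news.db')
with conn:  # commits on success, rolls back on error
    # Named placeholders (:title, :link) are filled from each dict's keys
    conn.executemany(
        'INSERT INTO news (title, link) VALUES (:title, :link)',
        news_list
    )
conn.close()
```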
## Dealing with Anti-Scraping Measures

### Common Measures

- User-Agent detection
- IP rate limiting
- CAPTCHAs
- Required logins
- Dynamically loaded content
### Countermeasures

```python
import random
import time

# Rotate User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
headers = {
    'User-Agent': random.choice(user_agents)
}

# Add a random delay between requests
time.sleep(random.uniform(1, 3))

# Route requests through a proxy
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080',
}
response = requests.get(url, proxies=proxies)
```
## A Crawling Framework: Scrapy

```python
import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['https://example.com/news']

    def parse(self, response):
        # Extract every article on the page
        for article in response.css('.article'):
            yield {
                'title': article.css('h2::text').get(),
                'link': article.css('a::attr(href)').get(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
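If the spider lives in a single file (say `news_spider.py`, a placeholder name), it can be run without creating a full project via `scrapy runspider news_spider.py -o news.json`, which writes the yielded items to `news.json`; inside a Scrapy project the equivalent is `scrapy crawl news -o news.json`.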
## Best Practices

### 1. Respect the Rules

```python
import urllib.robotparser

# Check robots.txt before crawling a path
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

can_fetch = rp.can_fetch('*', '/path/to/page')
```
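One way to wire that check into a crawler is a small wrapper that refuses to fetch disallowed URLs. This is only a sketch; `polite_get` and the `MyCrawler/1.0` user agent are hypothetical names:

```python
import urllib.robotparser
from urllib.parse import urlparse

import requests


def polite_get(url, user_agent='MyCrawler/1.0'):
    """Hypothetical helper: fetch url only if robots.txt allows it."""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f'{parsed.scheme}://{parsed.netloc}/robots.txt')
    rp.read()

    if not rp.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt

    return requests.get(url, headers={'User-Agent': user_agent}, timeout=10)
```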
### 2. Error Handling

```python
import time

import requests
from requests.exceptions import RequestException


def safe_request(url, retries=3):
    """Retry a GET request with exponential backoff."""
    for i in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Request failed ({i+1}/{retries}): {e}")
            time.sleep(2 ** i)
    return None
```
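As an alternative to hand-rolled retries, requests can delegate retrying to urllib3 by mounting an HTTPAdapter on a Session. A sketch, with the retry counts and status codes chosen arbitrarily:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed GETs up to 3 times with exponential backoff
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get('https://httpbin.org/get', timeout=10)
```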
### 3. Logging

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='crawler.log'
)

logging.info('Starting crawl...')
logging.error('Request failed')
```
## Legal and Ethical Considerations

- Obey robots.txt
- Throttle your request rate
- Do not scrape sensitive data
- Respect each site's terms of service
- Use the data responsibly
## Summary

Key points for building crawlers in Python:

- Use requests to send HTTP requests
- Use BeautifulSoup to parse HTML
- Use Selenium for dynamic pages
- Store the data in a sensible format
- Stay within the law
## Recommended Resources
Questions are welcome in the comments!