Published: 2022-01-14 15:21:17 | Author: 小新
Source: 亿速云 | Views: 387

# How to Scrape Wallpaper Website Data with Python

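Web crawlers are an effective tool for collecting publicly available data from the internet. Using a wallpaper site as the example, this article walks through the complete workflow for scraping image data with Python. The project uses the `requests` and `BeautifulSoup` libraries together with the standard-library `os` module.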

## 1. Preparation

### 1.1 Installing the required libraries
```bash
pip install requests beautifulsoup4
```

### 1.2 Analyzing the target site

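This article uses Wallhaven.cc as the example (when running a crawler for real, follow the site's robots.txt rules):

- Page structure: wallpapers are displayed as thumbnails.
- Image URL pattern: clicking a thumbnail opens a detail page, where the original image URL can be found.
- Pagination: the URL carries a `page` parameter.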
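
The robots.txt check mentioned above can be automated with the standard-library `urllib.robotparser` module. The following is a minimal sketch, not part of the original article: the sample rules, the `my-crawler` user agent, and the `example.com` URLs are made up for illustration, and `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Illustrative rules only; a real site publishes its own robots.txt
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(SAMPLE_RULES)
# A path under /private/ is disallowed, any other path is allowed
print(parser.can_fetch("my-crawler", "https://example.com/private/page"))
print(parser.can_fetch("my-crawler", "https://example.com/latest?page=2"))
```

Against a live site, the same parser can load the rules itself by calling `parser.set_url(...)` with the site's robots.txt address and then `parser.read()`, before any page is requested.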

## 2. Basic crawler implementation

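### 2.1 Fetching the page content

```python
import requests
from bs4 import BeautifulSoup

def get_html(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }
    try:
        # A timeout keeps a stalled connection from hanging the crawler
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Treat 4xx/5xx responses as errors
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
```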

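### 2.2 Parsing the image links

```python
def parse_links(html):
    """Return candidate full-size image URLs found on a listing page."""
    soup = BeautifulSoup(html, 'html.parser')
    # Select thumbnails that carry their URL in a data-src attribute
    thumbnails = soup.select('img.preview[data-src]')
    # Naive thumbnail-to-full rewrite; verify it against the live site's URLs
    return [img['data-src'].replace('small', 'full') for img in thumbnails]
```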
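
Attribute values pulled out of HTML are not always absolute URLs. As a hedged addition to the parsing step, the sketch below resolves each extracted link against the page it came from using the standard-library `urljoin`, and drops duplicates while keeping order. The helper name `absolutize` and the `example.com` links are hypothetical.

```python
from urllib.parse import urljoin

def absolutize(page_url, links):
    """Resolve links against the page URL and remove duplicates in order."""
    resolved = (urljoin(page_url, link) for link in links)
    # dict.fromkeys keeps the first occurrence of each URL in order
    return list(dict.fromkeys(resolved))

# Illustrative inputs: root-relative, protocol-relative, and absolute links
links = [
    "/small/ab/abc123.jpg",
    "//cdn.example.com/small/cd/def456.jpg",
    "https://example.com/small/ab/abc123.jpg",
]
print(absolutize("https://example.com/latest?page=2", links))
```

Running the links through such a helper before downloading means the download code only ever sees complete URLs.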

## 3. Advanced features

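### 3.1 Automatic pagination

```python
def crawl_multiple_pages(base_url, max_pages=5):
    """Crawl pages 1..max_pages and download the images found on each."""
    for page in range(1, max_pages + 1):
        # Pagination uses the `page` query parameter noted in section 1.2
        url = f"{base_url}?page={page}"
        html = get_html(url)
        if html:
            image_links = parse_links(html)
            download_images(image_links)  # Defined in section 3.2
```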
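
String formatting works when the base URL has no query string of its own. When it already carries parameters (a search term, for instance), the page number has to be merged into the existing query. The sketch below shows one way to do that with `urllib.parse`; the helper name `page_url` and the `example.com` URLs are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def page_url(base_url, page):
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Keep every existing parameter except a stale `page` value
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(page_url("https://example.com/latest", 3))
print(page_url("https://example.com/search?q=nature&page=1", 2))
```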

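### 3.2 Downloading and storing images

```python
import os

def download_images(urls, folder='wallpapers'):
    """Download each URL into `folder`, streaming the body to disk."""
    os.makedirs(folder, exist_ok=True)

    for url in urls:
        try:
            filename = os.path.join(folder, url.split('/')[-1])
            # Request first so a failed request does not leave an empty file
            response = requests.get(url, stream=True, timeout=10)
            response.raise_for_status()
            with open(filename, 'wb') as f:
                for chunk in response.iter_content(1024):
                    f.write(chunk)
            print(f"Downloaded: {filename}")
        except Exception as e:
            print(f"Download failed {url}: {e}")
```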
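
Taking the last path segment with `url.split('/')[-1]` keeps any query string in the file name, and a repeated run downloads everything again. The following sketch, which is an illustration rather than the article's method, derives the file name from the URL path only and filters out files that already exist. Both helper names and the default file name are hypothetical.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit

def filename_from_url(url, default="wallpaper.jpg"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = unquote(urlsplit(url).path)
    return posixpath.basename(path) or default

def pending_downloads(urls, folder):
    """Return only the URLs whose target file is not in `folder` yet."""
    return [
        url for url in urls
        if not os.path.exists(os.path.join(folder, filename_from_url(url)))
    ]

print(filename_from_url("https://example.com/full/ab/pic-abc123.jpg?token=1"))
```

Passing the result of `pending_downloads` to the download function lets an interrupted crawl resume without fetching the same images twice.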

## 4. Handling anti-scraping measures

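### 4.1 Setting request headers

```python
headers = {
    'User-Agent': 'Mozilla/5.0',         # Identify as a regular browser
    'Referer': 'https://wallhaven.cc/',  # Page the request claims to come from
    'Accept-Language': 'en-US,en;q=0.9'
}
```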

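### 4.2 Throttling request intervals

```python
import time
import random

import requests

def delayed_request(url):
    """Sleep for a random 1 to 3 seconds, then send the request."""
    time.sleep(random.uniform(1, 3))
    return requests.get(url, timeout=10)
```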
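
A fixed sleep before every request also waits when the previous request was already long ago. An alternative, sketched here as an assumption rather than something the article prescribes, is a small throttle that sleeps only for the time still missing from a minimum interval. The `Throttle` class is hypothetical; the `clock` and `sleep` parameters exist so the behaviour can be checked without real waiting.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a crawler, one `Throttle(2.0)` instance could be shared and `wait()` called immediately before each request.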

## 5. Complete code example

```python
import os
import time
import random
import requests
from bs4 import BeautifulSoup

class WallpaperCrawler:
    def __init__(self, folder='wallpapers'):
        self.headers = {
            'User-Agent': 'Mozilla/5.0',
            'Referer': 'https://wallhaven.cc/'
        }
        self.folder = folder
        # Create the output folder up front so save_image can write into it
        os.makedirs(self.folder, exist_ok=True)

    def crawl(self, category='general', pages=3):
        """Walk the listing pages of a category, pausing between pages."""
        # Example listing path; check it against the live site's URL scheme
        base_url = f"https://wallhaven.cc/{category}"
        for page in range(1, pages + 1):
            self.download_page(f"{base_url}?page={page}")
            time.sleep(random.uniform(2, 5))

    def download_page(self, url):
        """Follow every thumbnail on a listing page to its detail page."""
        html = self.get_html(url)
        if html:
            soup = BeautifulSoup(html, 'html.parser')
            wallpapers = soup.select('figure.thumb')
            for wp in wallpapers:
                link = wp.find('a', href=True)
                if link:  # Skip thumbnails without a detail link
                    self.download_wallpaper(link['href'])

    def download_wallpaper(self, url):
        """Read the full-size image URL from a detail page and save it."""
        html = self.get_html(url)
        if html:
            soup = BeautifulSoup(html, 'html.parser')
            img = soup.select_one('img#wallpaper')
            if img and img.get('src'):
                self.save_image(img['src'])

    def save_image(self, url):
        try:
            filename = os.path.join(self.folder, url.split('/')[-1])
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            with open(filename, 'wb') as f:
                f.write(response.content)
            print(f"Saved: {filename}")
        except Exception as e:
            print(f"Error saving {url}: {e}")

    def get_html(self, url):
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            return None

if __name__ == "__main__":
    crawler = WallpaperCrawler()
    crawler.crawl(pages=2)
```

## 6. Important considerations

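- Respect robots.txt: check the target site's crawling policy.
- Limit request frequency: avoid putting excessive load on the server.
- Copyright: download only images that are licensed for free use.
- Error handling: network requests need thorough error handling.
- Data storage: for large numbers of images, consider cloud storage.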
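
For the error-handling point, one common pattern is to retry a failed call a few times with a growing delay instead of giving up on the first error. The sketch below is a generic illustration under that assumption; the `retry` helper, its parameters, and the doubling delay are choices made here, not something the article specifies.

```python
import time

def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,),
          sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

With `requests`, the call to wrap would typically be a small function that performs the request and calls `raise_for_status()`, and `exceptions` could be narrowed to `requests.exceptions.RequestException` so that programming errors are not retried.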

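With the approach above you can build a working wallpaper crawler. To extend it, consider:

- Adding resolution filtering
- Implementing multithreaded downloads
- Building a GUI
- Adding automatic wallpaper switching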
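
Of the extensions listed, multithreaded downloading is the most direct to sketch. The example below uses the standard-library `concurrent.futures.ThreadPoolExecutor`. It is an illustration under stated assumptions: `download_all` is a hypothetical helper, and `download_one` stands for any function that downloads a single URL and raises on failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, download_one, max_workers=4):
    """Run download_one(url) for every URL on a small thread pool.

    Returns (succeeded, failed): a list of URLs that completed and a dict
    mapping each failed URL to the exception it raised.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # Re-raises any exception from the worker
                succeeded.append(url)
            except Exception as exc:
                failed[url] = exc
    return succeeded, failed
```

A small `max_workers` value keeps the crawler in line with the advice to limit request frequency, since every worker thread sends requests to the same server.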