# How to Use a Python Web Crawler to Get Video Download Links from Dytt (电影天堂)
## Preface
In the internet era, there are many ways to obtain film and TV resources. Dytt (电影天堂, www.dytt8.net) is a well-known Chinese movie resource site offering download links for a large number of HD movies and TV series. This article explains in detail how to use Python web-crawling techniques to collect video download links from the site automatically.
---
## 1. Preparation
### 1.1 Choosing the Tech Stack
- **Python 3.x**: the base language
- **Requests**: sending HTTP requests
- **BeautifulSoup4**: HTML parsing
- **Regular expressions**: cleaning the extracted data
- **fake-useragent**: randomizing the browser User-Agent header
### 1.2 Installing Dependencies
```bash
pip install requests beautifulsoup4 fake-useragent
```

## 2. Site Structure Analysis
A typical layout of the Dytt site:
- Home page: category navigation (latest movies, domestic movies, etc.)
- Detail pages: contain magnet links / Thunder (迅雷) download addresses
- Anti-crawling measures: simple IP rate limiting, no CAPTCHA
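Before fetching anything at scale, it is worth checking what the site's robots.txt allows. A minimal sketch using only the standard library; the path checked here is just an example:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.dytt8.net/robots.txt')
rp.read()
# False means the site asks crawlers not to fetch this path
print(rp.can_fetch('*', 'https://www.dytt8.net/html/gndy/dyzz/'))
```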
## 3. Fetching the Movie List
The list pages follow a predictable URL pattern, so a single function can fetch any page and extract the title and detail-page URL of every movie:

```python
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

def get_movie_list(page=1):
    url = f"https://www.dytt8.net/html/gndy/dyzz/list_23_{page}.html"
    headers = {"User-Agent": UserAgent().random}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        # the site declares gb2312; gbk is a superset and decodes more safely
        response.encoding = 'gbk'
        soup = BeautifulSoup(response.text, 'html.parser')
        movie_links = []
        for a in soup.select('.co_content8 ul a'):
            if 'href' in a.attrs:
                movie_links.append({
                    'title': a.text,
                    'url': 'https://www.dytt8.net' + a['href']
                })
        return movie_links
    except Exception as e:
        print(f"Failed to fetch list page: {e}")
        return []
```
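A quick way to sanity-check the function above (network access required; output depends on the live site):

```python
# print the first few entries from page 1
for movie in get_movie_list(page=1)[:5]:
    print(movie['title'], '->', movie['url'])
```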
## 4. Parsing Detail Pages for Download Links
Each detail page embeds the actual download addresses, which can be pulled out with regular expressions:

```python
import re

def parse_download_url(detail_url):
    try:
        response = requests.get(detail_url,
                                headers={'User-Agent': UserAgent().random},
                                timeout=10)
        response.encoding = 'gbk'
        html = response.text
        # extract magnet links with a regex
        magnet_pattern = r'magnet:\?xt=urn:btih:[a-zA-Z0-9]{40}'
        magnets = re.findall(magnet_pattern, html)
        # extract Thunder (thunder://) links
        thunder_pattern = r'thunder://[A-Za-z0-9+/=]+'
        thunders = re.findall(thunder_pattern, html)
        return {
            'magnets': list(set(magnets)),  # de-duplicate
            'thunders': list(set(thunders))
        }
    except Exception as e:
        print(f"Failed to parse detail page: {e}")
        return {}
```
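Chaining the two functions gives a minimal end-to-end run (again, results depend on the live site):

```python
movies = get_movie_list(page=1)
if movies:
    links = parse_download_url(movies[0]['url'])
    print(movies[0]['title'], links)
```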
## 5. Anti-Crawling Strategies
### 5.1 Using Proxy IPs
The site applies simple IP rate limiting, so routing requests through a proxy helps avoid bans (the address below is a placeholder):

```python
proxies = {
    'http': 'http://123.456.789.012:8080',   # placeholder, not a real proxy
    'https': 'https://123.456.789.012:8080'
}
response = requests.get(url, proxies=proxies)
```
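A single proxy can itself get blocked, so rotating over a small pool is a common extension. A sketch with placeholder addresses and a hypothetical fetch_with_proxy helper:

```python
import random
import requests

# placeholder proxy pool: replace with verified, working proxies
PROXY_POOL = [
    'http://111.111.111.111:8080',
    'http://222.222.222.222:3128',
]

def fetch_with_proxy(url, headers=None):
    proxy = random.choice(PROXY_POOL)
    # route both schemes through the randomly chosen proxy
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy},
                        timeout=10)
```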
### 5.2 Random Delays
Spacing requests out with a random delay makes the crawler look less like a bot:

```python
import random
import time

time.sleep(random.uniform(1, 3))  # random 1-3 second delay
```
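Delays combine naturally with retries. A sketch of exponential backoff around requests.get; the retry count and base delay are arbitrary choices, and get_with_retry is a hypothetical helper:

```python
import time
import requests

def get_with_retry(url, retries=3, backoff=2, **kwargs):
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=10, **kwargs)
            resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
            return resp
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(backoff ** attempt)  # wait 1s, 2s, 4s, ...
    return None
```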
## 6. Data Storage
### 6.1 Saving to CSV

```python
import csv
import os

def save_to_csv(data, filename='movies.csv'):
    new_file = not os.path.isfile(filename)
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'url', 'magnets', 'thunders'])
        if new_file:
            writer.writeheader()  # write the header row only once
        # join the link lists so each CSV cell is a plain string
        writer.writerow({**data, 'magnets': '|'.join(data.get('magnets', [])),
                         'thunders': '|'.join(data.get('thunders', []))})
```
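For example, a record in the shape produced by the spider could be appended like this (the values are purely illustrative):

```python
save_to_csv({'title': 'Example Movie (2024)',
             'url': 'https://www.dytt8.net/html/gndy/dyzz/example.html',
             'magnets': ['magnet:?xt=urn:btih:0000000000000000000000000000000000000000'],
             'thunders': []})
```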
### 6.2 Saving to MongoDB

```python
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['movie_db']
collection = db['dytt']

def save_to_mongo(data):
    collection.insert_one(data)
```
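insert_one will happily store the same movie twice across runs. One way to de-duplicate (an addition, not from the original) is a unique index on url plus an upsert:

```python
# reuses the `collection` object defined above
collection.create_index('url', unique=True)

def upsert_movie(data):
    # update an existing document or insert a new one, keyed by url
    collection.update_one({'url': data['url']}, {'$set': data}, upsert=True)
```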
## 7. Complete Spider Code
Putting everything together into a reusable class:

```python
import requests
from bs4 import BeautifulSoup
import re
import time
import random
from fake_useragent import UserAgent

class DyttSpider:
    def __init__(self):
        self.base_url = "https://www.dytt8.net"
        self.ua = UserAgent()

    def get_list_page(self, page=1):
        url = f"{self.base_url}/html/gndy/dyzz/list_23_{page}.html"
        try:
            response = requests.get(url, headers={'User-Agent': self.ua.random},
                                    timeout=10)
            response.encoding = 'gbk'
            return response.text
        except Exception as e:
            print(f"Error fetching list page: {e}")
            return None

    def parse_list_page(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        movies = []
        for a in soup.select('.co_content8 ul a'):
            if a.get('href'):
                movies.append({
                    'title': a.get_text(),
                    'url': self.base_url + a['href']
                })
        return movies

    def get_download_links(self, detail_url):
        try:
            time.sleep(random.uniform(1, 2))
            response = requests.get(detail_url,
                                    headers={'User-Agent': self.ua.random},
                                    timeout=10)
            response.encoding = 'gbk'
            # combine two extraction methods for better coverage
            magnets = re.findall(r'magnet:\?xt=urn:btih:[a-zA-Z0-9]{40}', response.text)
            thunders = re.findall(r'thunder://[A-Za-z0-9+/=]+', response.text)
            # fallback: walk the #Zoom download area with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            download_div = soup.find('div', {'id': 'Zoom'})
            if download_div:
                for a in download_div.find_all('a'):
                    href = a.get('href', '')
                    if href.startswith('magnet'):
                        magnets.append(href)
                    elif href.startswith('thunder'):
                        thunders.append(href)
            return {
                'magnets': list(set(magnets)),
                'thunders': list(set(thunders))
            }
        except Exception as e:
            print(f"Error parsing detail page: {e}")
            # return empty lists so callers can index the keys safely
            return {'magnets': [], 'thunders': []}

    def run(self, start_page=1, end_page=3):
        all_movies = []
        for page in range(start_page, end_page + 1):
            print(f"Crawling page {page}...")
            html = self.get_list_page(page)
            if html:
                movies = self.parse_list_page(html)
                for movie in movies:
                    download_links = self.get_download_links(movie['url'])
                    movie.update(download_links)
                    all_movies.append(movie)
                    print(f"Got: {movie['title']} - magnet links: {len(movie['magnets'])}")
                time.sleep(2)
        return all_movies

if __name__ == "__main__":
    spider = DyttSpider()
    results = spider.run(start_page=1, end_page=2)
    print(f"Collected {len(results)} movies in total")
```
## 8. Notes and Extensions
- **Copyright**: the resources linked from the site may be copyrighted; only download content you are legally permitted to.
- **Crawler ethics**: respect robots.txt, rate-limit your requests, and avoid placing unnecessary load on the site.
- **Visualization**: chart the collected data, for example resource counts per category.
- **Feature enhancements**: add keyword search, incremental crawling, or scheduled runs.
- **Multi-site support**: generalize the spider so other resource sites can be plugged in.
## Conclusion
With the approach described in this article, you can quickly build a crawler for Dytt resources. In real-world development, pay attention to:
- adapting the code when the site structure changes
- keeping up with evolving anti-crawling measures
- optimizing the data storage scheme

Maintain the code regularly, and always comply with the laws and regulations that govern web crawling.