# How to Use a Python Crawler to Scrape English Documents, Save Them as PDF, and Automatically Translate the Extracted Text
## Table of Contents
1. [Solution Overview](#solution-overview)
2. [Environment Setup and Dependencies](#environment-setup-and-dependencies)
3. [Web Crawling and PDF Generation](#web-crawling-and-pdf-generation)
4. [PDF Text Extraction](#pdf-text-extraction)
5. [Automatic Translation](#automatic-translation)
6. [Complete Code Implementation](#complete-code-implementation)
7. [Common Issues and Optimization](#common-issues-and-optimization)
8. [Use Cases and Extensions](#use-cases-and-extensions)
## Solution Overview

This article walks through building an automated workflow in Python that will:

1. Crawl the content of a specified English web page
2. Convert the content to PDF and save it
3. Extract the text from the PDF
4. Call a translation API to translate the text automatically
5. Output the translated document

Technology stack:

- Crawling: Requests/Scrapy
- HTML to PDF: pdfkit/wkhtmltopdf
- PDF parsing: PyPDF2/pdfminer
- Translation service: Google Translate API/DeepL
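To see how the pieces fit together at a glance, here is a minimal driver that chains the functions defined in the sections below (a sketch only; the URL is a placeholder and error handling is kept to a minimum):

```python
# Minimal end-to-end sketch: crawl -> PDF -> extract -> preprocess -> translate -> save.
# All helper functions referenced here are defined in the sections that follow.
def run_pipeline(url, pdf_path="document.pdf", out_path="translated.txt"):
    if not crawl_and_save_pdf(url, pdf_path):             # steps 1 and 2
        return
    raw_text = extract_text_pypdf2(pdf_path)              # step 3
    cleaned = preprocess_text(raw_text)
    translated = translate_text(cleaned)                  # step 4
    if translated:
        with open(out_path, "w", encoding="utf-8") as f:  # step 5
            f.write(translated)
```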
## Environment Setup and Dependencies

### Basic Requirements

- Python 3.7+
- The pip package manager
- wkhtmltopdf (installed separately)

### Installing the Required Libraries
```bash
pip install requests beautifulsoup4 pdfkit PyPDF2 pdfminer.six googletrans==4.0.0-rc1
```

### Installing wkhtmltopdf

On Windows:

1. Download the installer from the official website
2. Add the installation directory to the system PATH

On Linux:

```bash
sudo apt-get install wkhtmltopdf
```
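After installing, you can check that pdfkit is able to locate the wkhtmltopdf binary. This is a small sanity-check sketch; it assumes the binary is either on PATH or that you pass its full path explicitly:

```python
import shutil
import pdfkit

# Look up the wkhtmltopdf binary on PATH (returns None if it is not found).
wkhtmltopdf_path = shutil.which("wkhtmltopdf")
if wkhtmltopdf_path is None:
    raise RuntimeError("wkhtmltopdf was not found on PATH; install it or point to it explicitly")

# pdfkit accepts an explicit binary path through its configuration object.
config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
print(f"Using wkhtmltopdf at: {wkhtmltopdf_path}")
```

If you go this route, pass `configuration=config` to `pdfkit.from_string` in the conversion code below.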
## Web Crawling and PDF Generation

The first step downloads the page, parses it with BeautifulSoup, and strips elements that should not end up in the PDF:

```python
import requests
from bs4 import BeautifulSoup

def fetch_webpage(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Remove elements we do not want in the PDF
        for element in soup(['script', 'style', 'nav', 'footer']):
            element.decompose()
        return str(soup)
    except Exception as e:
        print(f"Failed to fetch the page: {e}")
        return None
```
The cleaned HTML is then rendered to PDF with pdfkit, which drives wkhtmltopdf under the hood, and a small wrapper ties the two steps together:

```python
import pdfkit

def html_to_pdf(html_content, output_path):
    options = {
        'encoding': 'UTF-8',
        'quiet': '',
        'page-size': 'A4',
        'margin-top': '15mm',
        'margin-right': '15mm',
        'margin-bottom': '15mm',
        'margin-left': '15mm',
    }
    try:
        pdfkit.from_string(html_content, output_path, options=options)
        print(f"PDF saved to: {output_path}")
        return True
    except Exception as e:
        print(f"PDF generation failed: {e}")
        return False

def crawl_and_save_pdf(url, output_pdf):
    print(f"Processing: {url}")
    html_content = fetch_webpage(url)
    if html_content:
        return html_to_pdf(html_content, output_pdf)
    return False
```
## PDF Text Extraction

With the PDF on disk, the next step is to pull the text back out. PyPDF2 is the first option:

```python
from PyPDF2 import PdfReader

def extract_text_pypdf2(pdf_path):
    text = ""
    try:
        with open(pdf_path, 'rb') as file:
            reader = PdfReader(file)
            for page in reader.pages:
                text += page.extract_text() + "\n"
    except Exception as e:
        print(f"Text extraction failed: {e}")
    return text
```
pdfminer tends to handle complex layouts better and serves as a fallback:

```python
from pdfminer.high_level import extract_text

def extract_text_pdfminer(pdf_path):
    try:
        return extract_text(pdf_path)
    except Exception as e:
        print(f"Text extraction failed: {e}")
        return ""
```
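The two extractors can disagree on tricky layouts. An optional heuristic (a sketch, not part of the main flow above) is to run both and keep whichever result is longer:

```python
def extract_text_best(pdf_path):
    # Run both extractors and keep the longer (usually more complete) result.
    candidates = [extract_text_pypdf2(pdf_path), extract_text_pdfminer(pdf_path)]
    return max(candidates, key=len)
```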
Raw PDF text usually contains hyphenation and hard line breaks left over from the page layout, so a small cleanup pass is applied before translation:

```python
import re

def preprocess_text(text):
    # Re-join words hyphenated across line breaks
    text = re.sub(r'-\n', '', text)
    # Collapse runs of newlines
    text = re.sub(r'\n+', '\n', text)
    # Drop non-ASCII characters (the source documents are English)
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    return text.strip()
```
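For example (a quick illustration with made-up input):

```python
sample = "Machine trans-\nlation systems\n\n\nare widely used."
print(preprocess_text(sample))
# Output:
# Machine translation systems
# are widely used.
```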
## Automatic Translation

The first option uses the googletrans library. Texts longer than the per-request limit are split into chunks, with a short pause between requests to avoid rate limiting:

```python
import time
from googletrans import Translator

def translate_text(text, dest='zh-cn'):
    translator = Translator(service_urls=['translate.google.com'])
    try:
        # Hand long texts off to the chunked helper below
        if len(text) > 5000:
            return translate_long_text(text, dest)
        translation = translator.translate(text, dest=dest)
        return translation.text
    except Exception as e:
        print(f"Translation failed: {e}")
        return None

def translate_long_text(text, dest):
    """Handle texts longer than 5000 characters."""
    chunks = [text[i:i+4500] for i in range(0, len(text), 4500)]
    translated = []
    for chunk in chunks:
        result = translate_text(chunk, dest)
        if result:
            translated.append(result)
        time.sleep(1)  # Avoid rate limiting
    return '\n'.join(translated)
```
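Splitting at a fixed 4,500 characters can cut a sentence in half, which hurts translation quality. A small variation (a sketch that splits on the line breaks left by `preprocess_text`) packs whole lines into each chunk instead:

```python
def split_into_chunks(text, limit=4500):
    """Group lines into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for line in text.split("\n"):
        # The +1 accounts for the newline re-inserted between lines.
        if current and len(current) + len(line) + 1 > limit:
            chunks.append(current)
            current = line
        else:
            current = f"{current}\n{line}" if current else line
    if current:
        chunks.append(current)
    return chunks  # note: a single line longer than `limit` would still need a hard split
```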
An alternative is the DeepL API; this example uses the free-tier endpoint and reads the key from the DEEPL_API_KEY environment variable:

```python
import os
import requests

def translate_deepl(text, target_lang='ZH'):
    DEEPL_KEY = os.getenv('DEEPL_API_KEY')
    if not DEEPL_KEY:
        raise ValueError("DeepL API key is not set")
    url = "https://api-free.deepl.com/v2/translate"
    params = {
        'auth_key': DEEPL_KEY,
        'text': text,
        'target_lang': target_lang
    }
    try:
        response = requests.post(url, data=params)
        response.raise_for_status()
        return response.json()['translations'][0]['text']
    except Exception as e:
        print(f"DeepL translation failed: {e}")
        return None
```
## Complete Code Implementation

Putting the pieces together into a single script:

```python
import time
from pathlib import Path

def main():
    # Configuration
    url = "https://example.com/english-document"
    output_pdf = "document.pdf"
    translated_txt = "translated.txt"

    # Step 1: crawl the page and save it as a PDF
    if not crawl_and_save_pdf(url, output_pdf):
        return

    # Step 2: extract the text from the PDF
    print("Extracting text from the PDF...")
    raw_text = extract_text_pypdf2(output_pdf)
    if not raw_text:
        raw_text = extract_text_pdfminer(output_pdf)
    cleaned_text = preprocess_text(raw_text)
    print(f"Extracted {len(cleaned_text)} characters")

    # Step 3: translate the text
    print("Translating...")
    start_time = time.time()
    translated = translate_text(cleaned_text)
    print(f"Translation finished in {time.time() - start_time:.2f} seconds")

    # Save the result
    if translated:
        with open(translated_txt, 'w', encoding='utf-8') as f:
            f.write(translated)
        print(f"Translation saved to: {translated_txt}")

if __name__ == "__main__":
    main()
```
For processing several documents in one run, a simple batch helper loops over a list of URLs:

```python
def batch_process(url_list, output_dir="output"):
    Path(output_dir).mkdir(exist_ok=True)
    for i, url in enumerate(url_list, 1):
        print(f"\nProcessing document {i}/{len(url_list)}")
        try:
            # Build unique file names for each document
            pdf_path = Path(output_dir) / f"doc_{i}.pdf"
            txt_path = Path(output_dir) / f"translated_{i}.txt"
            # Run the conversion pipeline
            if crawl_and_save_pdf(url, pdf_path):
                text = extract_text_pypdf2(pdf_path)
                translated = translate_text(text)
                if translated:
                    with open(txt_path, 'w', encoding='utf-8') as f:
                        f.write(translated)
        except Exception as e:
            print(f"Failed to process {url}: {e}")
            continue
```
## Common Issues and Optimization

The most frequent problems and their fixes:

| Problem | Fix |
| --- | --- |
| Garbled characters in the generated PDF | Keep `'encoding': 'UTF-8'` in the pdfkit options |
| Translation API rate limits | Insert `time.sleep(1)` delays between requests |
| Mangled layout in the extracted text | Fall back to pdfminer instead of PyPDF2 |
For larger batches, the work can be spread across threads. Each worker gets its own slice of the URL list and its own output directory so that the file names generated inside `batch_process` do not collide:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_process(url_list, workers=4):
    # One slice of URLs and one output directory per worker
    slices = [url_list[i::workers] for i in range(workers)]
    out_dirs = [f"output_worker_{i}" for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        executor.map(batch_process, slices, out_dirs)
```
Repeated runs over the same documents waste API calls, so translation results can be cached under a hash of the input text:

```python
import hashlib
import json

translation_cache = {}

def get_cache_key(text, target_lang):
    return hashlib.md5(f"{text}_{target_lang}".encode()).hexdigest()

def cached_translate(text, dest='zh-cn'):
    cache_key = get_cache_key(text, dest)
    if cache_key in translation_cache:
        return translation_cache[cache_key]
    result = translate_text(text, dest)
    if result:
        translation_cache[cache_key] = result
        # Optionally persist the cache to disk
        with open('translation_cache.json', 'w') as f:
            json.dump(translation_cache, f)
    return result
```
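The snippet above only writes the cache file; after a restart the in-memory cache starts out empty. A small loader (an optional sketch reusing the same translation_cache.json file) restores it at startup:

```python
import json
from pathlib import Path

def load_translation_cache(path='translation_cache.json'):
    # Restore a previously persisted cache, or start with an empty one.
    if Path(path).exists():
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    return {}

translation_cache = load_translation_cache()
```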
## Use Cases and Extensions

1. **Add a GUI**
2. **Support more input formats**, for example Word documents via python-docx:

   ```python
   from docx import Document

   def docx_to_text(filepath):
       doc = Document(filepath)
       return '\n'.join([p.text for p in doc.paragraphs])
   ```

3. **Integrate machine-learning translation** (see the sketch after this list)
   - Use HuggingFace Transformers
   - Load a local translation model
4. **Automate deployment**
   - Package the tool as a Docker container
   - Schedule recurring crawls with a timed task
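For point 3, here is a minimal sketch of local translation with HuggingFace Transformers. It assumes the transformers and sentencepiece packages are installed and uses the Helsinki-NLP/opus-mt-en-zh model as an example; long documents should still be split into chunks as in the earlier sections:

```python
from transformers import pipeline

# Downloads the model on first use, then runs entirely locally.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

def translate_local(text, max_length=512):
    # The pipeline returns a list of dicts with a 'translation_text' field.
    result = translator(text, max_length=max_length)
    return result[0]["translation_text"]

print(translate_local("Machine translation has improved dramatically in recent years."))
```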
### Final Recommendations

1. Have machine translations of important documents proofread by a human
2. Respect the target site's robots.txt rules
3. For commercial use, purchase a professional translation API plan
---
The complete code for this article is hosted on GitHub: [example repository link]()

Note that for real-world use you will need to:

1. Replace the example URL with the actual target address
2. Configure API keys and other credentials
3. Adjust the PDF generation options to your needs
4. Comply with each translation service's API terms of use