How to Use jsoup in Java

Published: 2022-01-26 15:22:58 Author: iii
Source: Yisu Cloud

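# How to Use jsoup in Java

## I. Introduction to jsoup

jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:

1. Parsing HTML from a URL, file, or string
2. Finding and extracting data using DOM traversal or CSS selectors
3. Manipulating HTML elements, attributes, and text
4. Cleaning user-submitted content to prevent XSS attacks
5. Outputting tidy HTML

jsoup is well suited to:
- Web scraping and data extraction
- Parsing and cleaning HTML
- Analyzing and processing web page content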

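## II. Environment Setup

### 1. Adding the jsoup Dependency

For a Maven project, add the following to pom.xml:
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version> <!-- use the latest version -->
</dependency>
```

For a Gradle project:

```groovy
implementation 'org.jsoup:jsoup:1.16.1'
```

### 2. Manual Download

You can also download the jar file from the jsoup website and add it to your project manually.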

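## III. Basic Usage

### 1. Parsing an HTML Document

#### Parsing from a String

```java
String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
```

#### Loading from a URL

```java
Document doc = Jsoup.connect("https://example.com/").get();
String title = doc.title();
```

#### Loading from a File

```java
File input = new File("/path/to/input.html");
// The third argument is the base URI used to resolve relative links in the file
Document doc = Jsoup.parse(input, "UTF-8", "https://example.com/");
```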

### 2. Data Extraction

#### Using DOM Methods

```java
Document doc = Jsoup.connect("https://example.com").get();

// Get the page title
String title = doc.title();

// Get the element with a specific id (null if there is no such element)
Element content = doc.getElementById("content");

// Get all links
Elements links = doc.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
```

#### Using CSS Selectors

```java
// Select a elements that have an href attribute
Elements links = doc.select("a[href]");

// Select div elements with class masthead
Elements masthead = doc.select("div.masthead");

// Select direct children: a elements directly inside h3.r
Elements resultLinks = doc.select("h3.r > a");
```
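To make the difference between these selectors concrete, here is a small offline sketch. The markup and the class name `SelectorDemo` are made up for illustration; only `Jsoup.parse`, `select`, `text`, and `absUrl` are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class SelectorDemo {

    // Made-up markup, chosen so that each selector matches a different set.
    static final String HTML =
            "<div class=\"masthead\"><h1>Site</h1></div>"
            + "<h3 class=\"r\"><a href=\"/first\">First result</a></h3>"
            + "<h3 class=\"r\"><span><a href=\"/nested\">Nested link</a></span></h3>"
            + "<p><a>No href here</a></p>";

    public static void main(String[] args) {
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(HTML, "https://example.com/");

        Elements withHref = doc.select("a[href]");      // skips the <a> without href
        Elements masthead = doc.select("div.masthead");
        Elements direct = doc.select("h3.r > a");       // direct children only

        System.out.println(withHref.size());  // prints 2
        System.out.println(masthead.size());  // prints 1
        System.out.println(direct.size());    // prints 1

        for (Element link : direct) {
            // prints "First result -> https://example.com/first"
            System.out.println(link.text() + " -> " + link.absUrl("href"));
        }
    }
}
```

The nested link is matched by `a[href]` but not by `h3.r > a`, because the `>` combinator only accepts direct children.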

### 3. Modifying Data

```java
Document doc = Jsoup.parse("<div><p>Lorem ipsum.</p></div>");

// Set an attribute
Element div = doc.select("div").first();
div.attr("class", "newClass");

// Add a class
div.addClass("anotherClass");

// Replace the text content
div.text("New text content");

// Replace the inner HTML
div.html("<p>New <b>HTML</b> content</p>");

// Append content as the last child of the element
div.append("<p>Appended paragraph</p>");

// Insert content as the first child of the element
div.prepend("<p>Prepended paragraph</p>");
```

## IV. Advanced Features

### 1. Handling Forms

```java
// Fetch the login form (inspect it to find the field names the server expects)
Document doc = Jsoup.connect("http://example.com/login").get();
Element loginForm = doc.selectFirst("form#login");

// Submit the form data with a POST request
Connection.Response res = Jsoup.connect("http://example.com/login")
        .data("username", "myUser")
        .data("password", "myPass")
        .method(Connection.Method.POST)
        .execute();

// Get the session cookies set after login
Map<String, String> cookies = res.cookies();

// Use the cookies to access a protected page
Document protectedPage = Jsoup.connect("http://example.com/protected")
        .cookies(cookies)
        .get();
```

### 2. Handling Relative Paths

```java
Document doc = Jsoup.connect("https://example.com/news").get();

// Get absolute URLs
Elements links = doc.select("a[href]");
for (Element link : links) {
    String absUrl = link.attr("abs:href"); // resolve against the document's base URI
    System.out.println(absUrl);
}
```
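jsoup performs this resolution for you through the `abs:` prefix, but it helps to know how a relative link combines with the base URI. The sketch below uses only `java.net.URI` from the standard library to illustrate the general rule; it is not jsoup's implementation, and jsoup may behave differently in edge cases. The class name `BaseUriDemo` and the sample URLs are made up.

```java
import java.net.URI;

final class BaseUriDemo {

    /** Resolves a possibly relative href against the URL the page was loaded from. */
    public static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Base ends with a slash: "news/" is treated as a directory.
        // prints https://example.com/news/story/1.html
        System.out.println(resolve("https://example.com/news/", "story/1.html"));

        // No trailing slash: the last segment "news" is replaced.
        // prints https://example.com/story/1.html
        System.out.println(resolve("https://example.com/news", "story/1.html"));

        // A path starting with "/" replaces the whole base path.
        // prints https://example.com/about
        System.out.println(resolve("https://example.com/news/", "/about"));

        // An absolute URL is returned unchanged.
        // prints https://other.example.org/x
        System.out.println(resolve("https://example.com/news/", "https://other.example.org/x"));
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` for strings that are not valid URIs, such as an href containing a space.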

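### 3. Cleaning HTML

```java
String unsafeHtml = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";

// Clean with a safelist. Safelist replaces the older Whitelist class,
// which is not available in jsoup 1.16.1.
// Safelist.basic() already allows <p> and a[href]; the explicit calls show how to extend it.
String safeHtml = Jsoup.clean(unsafeHtml,
        Safelist.basic()
        .addTags("p")
        .addAttributes("a", "href"));

System.out.println(safeHtml);
// Output: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
```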

### 4. Proxy Settings

```java
Document doc = Jsoup.connect("https://example.com")
        .proxy("proxy.example.com", 8080) // set the proxy host and port
        .userAgent("Mozilla/5.0") // set the User-Agent
        .timeout(10000) // set the timeout in milliseconds
        .get();
```

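## V. Practical Examples

### Example 1: Scraping News Headlines and Links

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewsCrawler {
    public static void main(String[] args) throws IOException {
        String url = "https://news.example.com";
        Document doc = Jsoup.connect(url).get();

        // Each headline is a link inside an h3 within a .news-item block
        Elements newsHeadlines = doc.select(".news-item h3 a");

        for (Element headline : newsHeadlines) {
            String title = headline.text();
            String link = headline.attr("abs:href");

            System.out.println("Title: " + title);
            System.out.println("Link: " + link);
            System.out.println("------------------");
        }
    }
}
```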

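### Example 2: Extracting Table Data

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableExtractor {
    public static void main(String[] args) throws IOException {
        String url = "https://example.com/data-table";
        Document doc = Jsoup.connect(url).get();

        // first() returns null when nothing matches, so check before using it
        Element table = doc.select("table.data").first();
        if (table == null) {
            System.err.println("No table.data element found");
            return;
        }
        Elements rows = table.select("tr");

        for (Element row : rows) {
            // Only td cells are printed; header rows made of th cells print as empty lines
            Elements cols = row.select("td");
            for (Element col : cols) {
                System.out.print(col.text() + "\t");
            }
            System.out.println();
        }
    }
}
```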

### Example 3: Building an HTML Document

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlBuilder {
    public static void main(String[] args) {
        // Create an empty document skeleton (html, head, body) with an empty base URI
        Document doc = Document.createShell("");
        doc.title("Generated Page");

        Element body = doc.body();
        body.appendElement("h1").text("Welcome to my page");

        Element div = body.appendElement("div")
                .attr("class", "content");

        div.appendElement("p")
                .text("This is a paragraph.")
                .addClass("highlight");

        System.out.println(doc);
    }
}
```

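## VI. Performance Optimization

1. **Cache parsed results**: for frequently accessed pages, consider caching the Document object
2. **Narrow the selection scope**: narrow the scope first, then apply a finer selector
   ```java
   // Not recommended
   doc.select("div.content p.small");

   // Recommended
   Element content = doc.selectFirst("div.content");
   content.select("p.small");
   ```
3. **Set sensible timeouts**: adjust the connection timeout to suit network conditions
4. **Use a connection pool**: for large numbers of requests, consider using a connection pool
5. **Process in parallel**: independent tasks can be handled with multiple threads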
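As a minimal sketch of the parallel-processing point, the helper below runs independent fetches on a fixed-size thread pool. The class name `ParallelFetch` and its `fetchAll` method are made up for illustration and use only the standard library; in a jsoup program the `fetcher` function would wrap a call such as `Jsoup.connect(url).get()`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of jsoup.
final class ParallelFetch {

    /** Applies fetcher to every URL on a bounded pool and returns the results in input order. */
    public static <T> Map<String, T> fetchAll(List<String> urls,
                                              Function<String, T> fetcher,
                                              int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every task first so that they can run concurrently.
            Map<String, Future<T>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                Callable<T> task = () -> fetcher.apply(url);
                pending.put(url, pool.submit(task));
            }
            // Then collect the results; get() blocks until each task has finished.
            Map<String, T> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<T>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }
}
```

A small pool size keeps the load on the target server bounded. Duplicate URLs in the input collapse into a single map entry.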

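## VII. Troubleshooting Common Problems

### 1. Handling SSL Certificate Problems

```java
// Skip SSL verification (not recommended in production)
// SSLSocketClient is a helper class you must provide yourself; it is not part of jsoup
Connection connection = Jsoup.connect("https://example.com");
connection.sslSocketFactory(SSLSocketClient.getSSLSocketFactory());
Document doc = connection.get();
```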
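The `SSLSocketClient` class used above is not defined in this article and does not ship with jsoup. One possible shape for it, as a sketch, is shown below. The class and method names are taken from the snippet; the body is an assumption built from the standard `javax.net.ssl` API. It accepts every certificate, so it should only be used against test servers.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper: the name comes from the snippet above, the body is an assumption.
final class SSLSocketClient {

    private SSLSocketClient() {}

    /** Returns a socket factory that skips certificate validation. Testing only. */
    public static SSLSocketFactory getSSLSocketFactory() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { trustAll() }, new SecureRandom());
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build trust-all SSL context", e);
        }
    }

    // A trust manager whose checks never throw, so every certificate chain is accepted.
    private static X509TrustManager trustAll() {
        return new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        };
    }
}
```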

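### 2. Handling Redirects

```java
Document doc = Jsoup.connect("https://example.com")
        .followRedirects(true) // follow redirects (this is jsoup's default)
        .get();
```

### 3. Handling 403 Forbidden

```java
// Send browser-like headers; some servers reject requests that lack them
Document doc = Jsoup.connect("https://example.com")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        .referrer("http://www.google.com")
        .header("Accept-Language", "en-US")
        .get();
```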

### 4. Handling Large Files

```java
// Let jsoup read from the stream directly instead of building one large String first.
// Joining lines by hand would also drop line breaks and use the platform default charset.
// "UTF-8" is an assumed encoding; the parsed DOM is still held fully in memory.
try (InputStream in = new FileInputStream(new File("large.html"))) {
    Document doc = Jsoup.parse(in, "UTF-8", "");
}
```

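## VIII. Best Practices

1. **Respect robots.txt**: check the target site's robots.txt file
2. **Set a reasonable crawl interval**: avoid putting excessive load on the server
3. **Handle exceptions**: deal properly with network and parsing errors
4. **Comply with laws and regulations**: make sure your crawling is legal
5. **Use logging**: record important information during the crawl
6. **Release resources**: close connections and free resources promptly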
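The points on crawl intervals and exception handling can be combined in one small wrapper. The sketch below is a hypothetical helper, not a jsoup feature: `PoliteFetcher`, its parameters, and its console logging are made up for illustration. It enforces a minimum gap between requests and retries when a request fails with `IOException`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of jsoup.
final class PoliteFetcher {

    private final long delayMillis;   // minimum gap between two requests
    private final int maxAttempts;    // total tries per request, including the first
    private long lastRequestAt = 0L;  // wall-clock time of the previous request

    public PoliteFetcher(long delayMillis, int maxAttempts) {
        if (delayMillis < 0 || maxAttempts < 1) {
            throw new IllegalArgumentException("delayMillis must be >= 0 and maxAttempts >= 1");
        }
        this.delayMillis = delayMillis;
        this.maxAttempts = maxAttempts;
    }

    /** Runs the request, spacing calls out and retrying on IOException. */
    public synchronized <T> T fetch(Callable<T> request) throws Exception {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            waitForTurn();
            try {
                return request.call();
            } catch (IOException e) {
                // Network or HTTP error: record it and try again.
                lastFailure = e;
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw lastFailure; // every attempt failed
    }

    // Sleeps until at least delayMillis has passed since the previous request.
    private void waitForTurn() throws InterruptedException {
        long wait = lastRequestAt + delayMillis - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastRequestAt = System.currentTimeMillis();
    }
}
```

With jsoup, a call might look like `Document doc = fetcher.fetch(() -> Jsoup.connect(url).get());`. Exceptions other than `IOException` are not retried and propagate to the caller immediately.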

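## IX. Comparison with Other Libraries

| Feature | jsoup | HtmlUnit | Selenium |
|---|---|---|---|
| JavaScript execution | Not supported | Supported | Supported |
| Lightweight | Yes | Medium | Heavy |
| Learning curve | Low | Medium | Medium |
| Typical use | Simple HTML parsing | Complex page interaction | Browser automation testing |
| Performance | High | Medium | Low |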

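## X. Summary

jsoup is a powerful, easy-to-use HTML parsing library, particularly well suited to Java developers who need to extract and manipulate web page content. After reading this article, you should be familiar with:

- The basic usage of jsoup
- Various data extraction techniques
- Advanced features and practical examples
- Performance optimization and troubleshooting techniques

In real projects, choose the tool that fits your specific needs. For simple HTML parsing and content extraction, jsoup is one of the best choices; for complex pages that need to execute JavaScript, consider tools such as HtmlUnit or Selenium.

## XI. Recommended Resources

I hope this article helps you get up to speed with jsoup quickly and work more efficiently in real development!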