# How to Scrape Web Images with R
## Introduction
In data science and web scraping, R has become one of the go-to tools for extracting data from web pages, thanks to its strong data-processing capabilities and rich package ecosystem. This article walks through how to scrape images from web pages with R, covering the complete workflow from basic concepts to hands-on practice.
## 1. Preparation
### 1.1 Installing the Required R Packages
Before scraping any images, make sure the following key R packages are installed:
```r
install.packages(c("rvest", "httr", "xml2", "magick", "purrr", "dplyr"))
```
- `rvest`: the core web-scraping package, providing HTML parsing
- `httr`: HTTP request handling
- `xml2`: XML/HTML document parsing
- `magick`: image processing
- `purrr`: functional-programming tools
- `dplyr`: data-manipulation tools

### 1.2 How Images Appear in HTML

Web images usually live in `<img>` tags, with the image URL specified by the `src` or `data-src` attribute.
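As a quick illustration, here is a minimal sketch (using made-up inline HTML) of how the two attributes differ once the page is parsed:

```r
library(rvest)

# Hypothetical snippet: one eagerly loaded image, one lazy-loaded image
html <- minimal_html('
  <img src="https://example.com/a.jpg">
  <img src="placeholder.gif" data-src="https://example.com/b.jpg">
')

html %>% html_nodes("img") %>% html_attr("src")
#> [1] "https://example.com/a.jpg" "placeholder.gif"
html %>% html_nodes("img") %>% html_attr("data-src")
#> [1] NA                          "https://example.com/b.jpg"
```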
## 2. Extracting Image Links

First load and parse the page, then pull the image URLs out of the document:

```r
library(rvest)
library(httr)

url <- "https://example.com"
webpage <- read_html(url)

# Method 1: via CSS selector
img_links <- webpage %>%
  html_nodes("img") %>%
  html_attr("src")

# Method 2: via XPath
img_links <- webpage %>%
  html_nodes(xpath = "//img/@src") %>%
  html_text()
```
The extracted links may be relative paths; convert them to absolute URLs before downloading:

```r
library(purrr)
library(xml2)

absolute_links <- map_chr(img_links, ~{
  ifelse(startsWith(.x, "http"), .x, url_absolute(.x, url))
})
```
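It also helps to drop missing and duplicate entries before downloading (a small cleanup step added here for illustration):

```r
# Remove NA entries (e.g. <img> tags without src) and duplicates
absolute_links <- unique(na.omit(absolute_links))
```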
## 3. Advanced Techniques

### 3.1 Lazy-Loaded Images

Modern pages often lazy-load images, so the real image URL may hide in the `data-src` attribute:

```r
lazy_links <- webpage %>%
  html_nodes("img[data-src]") %>%
  html_attr("data-src")
```
### 3.2 JavaScript-Rendered Pages

For pages rendered by JavaScript:

```r
library(RSelenium)

# Start the browser driver
rd <- rsDriver(browser = "chrome")
remDr <- rd$client

# Navigate to the target page
remDr$navigate(url)

# Grab the rendered page source
page_source <- remDr$getPageSource()[[1]]
webpage <- read_html(page_source)
```
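When you are done, shut the session down; a minimal cleanup sketch:

```r
# Close the browser session and stop the Selenium server
remDr$close()
rd$server$stop()
```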
## 4. Downloading and Processing Images

### 4.1 Batch Download

The helper below fetches each URL and writes the raw bytes to disk:

```r
library(purrr)

download_images <- function(urls, dir = "images") {
  if (!dir.exists(dir)) dir.create(dir)
  walk2(urls, seq_along(urls), ~{
    tryCatch({
      res <- httr::GET(.x)
      # file_ext() returns "" (not NULL) when no extension is found,
      # so strip any query string first and fall back to "jpg"
      ext <- tools::file_ext(sub("\\?.*$", "", .x))
      if (ext == "") ext <- "jpg"
      filename <- file.path(dir, paste0("img_", .y, ".", ext))
      writeBin(httr::content(res, "raw"), filename)
      message("Downloaded: ", filename)
    }, error = function(e) message("Failed: ", .x))
  })
}
```
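A usage sketch, assuming the `absolute_links` vector from Section 2:

```r
download_images(absolute_links, dir = "images")
```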
### 4.2 Image Processing

The `magick` package makes batch post-processing straightforward:

```r
library(magick)
library(purrr)

# Batch-resize every image in a directory to the given width
resize_images <- function(dir, width = 800) {
  imgs <- list.files(dir, full.names = TRUE)
  walk(imgs, ~{
    img <- image_read(.x)
    img <- image_scale(img, width)
    image_write(img, .x)
  })
}
```
### 4.3 Extracting Image Metadata

```r
extract_metadata <- function(img_path) {
  img <- image_read(img_path)
  info <- image_info(img)              # width, height, format, filesize, ...
  attributes <- image_attributes(img)  # EXIF and other embedded attributes
  list(info = info, attributes = attributes)
}
```
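A hedged usage sketch, collecting metadata for every downloaded file:

```r
library(purrr)

imgs <- list.files("images", full.names = TRUE)
metadata <- map(imgs, extract_metadata)
```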
## 5. Dealing with Anti-Scraping Measures

### 5.1 Setting Request Headers

```r
headers <- c(
  "User-Agent" = "Mozilla/5.0",
  "Accept" = "image/webp,*/*"
)
res <- httr::GET(url, httr::add_headers(.headers = headers))
```

### 5.2 Using a Proxy

```r
proxy <- httr::use_proxy("http://proxy.example.com", port = 8080)
res <- httr::GET(url, proxy)
```
### 5.3 Throttling Request Frequency

Pausing between requests keeps the load on the target server reasonable:

```r
library(purrr)

slow_download <- function(urls, delay = 2) {
  walk(urls, ~{
    Sys.sleep(delay)
    try(download.file(.x, destfile = basename(.x)))
  })
}
```
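For example, throttled to one request every three seconds:

```r
slow_download(head(absolute_links, 5), delay = 3)
```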
## 6. A Worked Example: Unsplash

```r
library(tidyverse)
library(rvest)

# Fetch the search-results page
unsplash_url <- "https://unsplash.com/s/photos/nature"
page <- read_html(unsplash_url)

# Extract the high-resolution image links
img_urls <- page %>%
  html_nodes("figure a[itemprop='contentUrl']") %>%
  html_attr("href") %>%
  paste0("https://unsplash.com", .) %>%
  map_chr(~read_html(.x) %>% html_node("img") %>% html_attr("src"))

# Download the first 10 images
download_images(head(img_urls, 10), "unsplash_images")
```
## 7. Storing Image Metadata

It is a good idea to keep image metadata in a database:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "images.db")

# Create the table
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY,
    url TEXT,
    filename TEXT,
    download_time DATETIME,
    size INTEGER
  )
")

# Insert one row per image; note that file.size() must point at the
# files themselves, not at the directory
filenames <- paste0("img_", seq_along(img_urls), ".jpg")
img_data <- data.frame(
  url = img_urls,
  filename = filenames,
  download_time = Sys.time(),
  size = file.size(file.path("unsplash_images", filenames))
)
dbWriteTable(con, "images", img_data, append = TRUE)
```
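To verify the inserts and tidy up, a short follow-up sketch:

```r
# Read back a few records, then release the connection
dbGetQuery(con, "SELECT filename, size FROM images LIMIT 5")
dbDisconnect(con)
```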
## 8. Error Handling and Robust Downloads

Wrap the download in `possibly()` so a single failure does not abort the whole batch:

```r
library(purrr)

safe_download <- possibly(download.file, otherwise = NA)
download_results <- map(urls, ~safe_download(.x, destfile = basename(.x)))
```

For large files, stream the response straight to disk with a progress bar:

```r
library(httr)

chunk_download <- function(url, dest) {
  # write_disk() streams the body to disk instead of buffering it in memory
  res <- GET(url, write_disk(dest, overwrite = TRUE), progress())
  stop_for_status(res)
}
```
Verify that a downloaded file is actually a readable image:

```r
library(magick)

validate_image <- function(file) {
  tryCatch({
    img <- image_read(file)
    !is.null(image_info(img))
  }, error = function(e) FALSE)
}
```
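For instance, a sketch that keeps only the files passing validation:

```r
library(purrr)

imgs <- list.files("images", full.names = TRUE)
valid_imgs <- keep(imgs, validate_image)
```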
## 9. Legal Compliance

Check the site's robots.txt before scraping:

```r
library(robotstxt)
paths_allowed("https://example.com")
```
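A minimal sketch wiring the check into the workflow, so scraping stops whenever robots.txt disallows it:

```r
if (robotstxt::paths_allowed(url)) {
  webpage <- read_html(url)
} else {
  stop("robots.txt disallows scraping: ", url)
}
```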
## 10. Performance Optimization

Parallelize downloads with `furrr`:

```r
library(furrr)
plan(multisession)

future_walk(urls, ~download.file(.x, destfile = basename(.x)))
```

Hash the URLs, for example to deduplicate them or to derive stable filenames:

```r
library(digest)
library(purrr)

url_hash <- map_chr(urls, digest)
```
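One possible use for the hashes, sketched here under the assumption that the hash doubles as a stable filename, is skipping URLs that have already been fetched:

```r
library(purrr)

# Download each URL to "<hash>.jpg", skipping files that already exist
walk2(urls, url_hash, ~{
  dest <- paste0(.y, ".jpg")
  if (!file.exists(dest)) download.file(.x, destfile = dest, mode = "wb")
})
```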
## Further Resources

- `rvest`: https://rvest.tidyverse.org/
- `httr`: https://httr.r-lib.org/
- The `plash` package: an extension dedicated to image scraping
- `webdriver`: a lightweight alternative to RSelenium

## Conclusion

Having worked through this article, you should now have the complete toolchain for scraping web images with R: from basic extraction to advanced processing, and from performance optimization to legal compliance. Keep experimenting and refining in real projects, and gradually build an image-scraping workflow that fits your own needs.