Published: 2021-11-22 15:55:46  Author: 柒染
Source: 亿速云

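# How to Draw a Scatter Plot with Group Boundaries Using ggplot2 and ggforce in R

## Preface

In data visualization, the scatter plot is one of the most basic and most useful chart types. It shows the relationship between two continuous variables directly, and when you want to see how different groups are distributed within the plot, a visible boundary around each group makes the pattern much easier to read. R's `ggplot2` package provides the plotting foundation, and the `ggforce` package extends it with geoms that draw such group boundaries.

This article walks through the full workflow with `ggplot2` and `ggforce`: preparing the data, drawing a basic scatter plot, adding group boundaries, and polishing the figure. It includes code examples and parameter explanations, and is aimed at readers from beginners to intermediate and advanced users.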

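## Preparation

### Installing and loading the packages

First make sure the `ggplot2` and `ggforce` packages are installed. If they are not, install them with:

```r
install.packages("ggplot2")
install.packages("ggforce")
# geom_mark_hull(), used later in this article, relies on the concaveman package
install.packages("concaveman")
```

Then load the packages:

```r
library(ggplot2)
library(ggforce)
```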

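### Sample dataset

We use R's built-in `iris` dataset as the example. It contains sepal and petal measurements for three iris species (setosa, versicolor and virginica).

```r
data(iris)
head(iris)
```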
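Because every boundary drawn later is computed per group, it can help to confirm the grouping variable and the group sizes before plotting. The following is a minimal check using base R only; the variable names `group_sizes` and `petal_ranges` are chosen here for illustration.

```r
# iris ships with R (package 'datasets'), so nothing extra needs installing
data(iris)

# Species is a factor; its levels are the groups the boundaries will enclose
levels(iris$Species)

# Number of observations in each group
group_sizes <- table(iris$Species)
group_sizes

# Value range of the two variables plotted in this article
petal_ranges <- sapply(iris[, c("Petal.Length", "Petal.Width")], range)
petal_ranges
```

Each species has 50 observations, so no group is too small to enclose with a boundary.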

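## Drawing a basic scatter plot

We start with a basic `ggplot2` scatter plot that shows the relationship between petal length (`Petal.Length`) and petal width (`Petal.Width`) for the three iris species.

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(size = 3) +
  labs(title = "Iris petal length vs. petal width",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()
```

This produces a scatter plot colored by `Species`. Setosa is clearly separated, but versicolor and virginica overlap, so the extent of each group is hard to judge from color alone.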

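## Adding group boundaries

### Using the `geom_mark_*` functions

The `ggforce` package provides a family of `geom_mark_*` functions that draw a boundary around each group. The most commonly used are:

- `geom_mark_hull()`: draws a hull (concave or convex) around the points
- `geom_mark_ellipse()`: draws an enclosing ellipse
- `geom_mark_rect()`: draws a rectangle
- `geom_mark_circle()`: draws a circle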
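Since the four functions are interchangeable layers, one way to compare them is to keep a shared base plot and add each boundary layer in turn. The sketch below assumes `ggplot2` and `ggforce` are installed; the names `base_plot`, `mark_layers` and `plots` are illustrative, and the hull layer is added only when the `concaveman` package is available.

```r
library(ggplot2)
library(ggforce)

# Shared base plot: points coloured by species
base_plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                              colour = Species, fill = Species)) +
  geom_point(size = 2) +
  theme_minimal()

# One layer per boundary type
mark_layers <- list(
  ellipse = geom_mark_ellipse(alpha = 0.2),
  rect    = geom_mark_rect(alpha = 0.2),
  circle  = geom_mark_circle(alpha = 0.2)
)

# geom_mark_hull() relies on concaveman, so add it only if that package is installed
if (requireNamespace("concaveman", quietly = TRUE)) {
  mark_layers$hull <- geom_mark_hull(alpha = 0.2)
}

# Build one plot per boundary type, titled with the function name
plots <- lapply(names(mark_layers), function(nm) {
  base_plot + mark_layers[[nm]] + ggtitle(paste0("geom_mark_", nm, "()"))
})
names(plots) <- names(mark_layers)

# Draw the plots one after another
for (p in plots) print(p)
```

Keeping the layers in a named list makes it easy to swap the boundary type without repeating the rest of the plot.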

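#### 1. Hull boundary (Convex Hull)

A convex hull is the smallest convex polygon that contains all the points of a group. Depending on its `concavity` setting, `geom_mark_hull()` draws either this or a tighter concave hull that follows the outline of the data more closely.

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, fill = Species)) +
  geom_point(size = 3) +
  geom_mark_hull(expand = unit(3, "mm"), alpha = 0.2) +
  labs(title = "Grouped scatter plot with hull boundaries",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()
```

Parameters:

- `expand`: how far the boundary is pushed outward from the points
- `alpha`: transparency of the fill
- `concavity`: how concave the boundary may be (default 2; a value of 1 gives a very concave outline, and larger values approach the convex hull)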
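To see what `concavity` does in practice, the following sketch draws the same data with a small and a large value. It assumes the `concaveman` package is installed for drawing; the helper `hull_plot()` is a name made up for this example.

```r
library(ggplot2)
library(ggforce)

# Hypothetical helper: same plot, different concavity value
hull_plot <- function(concavity) {
  ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                   colour = Species, fill = Species)) +
    geom_point(size = 2) +
    geom_mark_hull(concavity = concavity, expand = unit(2, "mm"), alpha = 0.2) +
    labs(title = paste("concavity =", concavity)) +
    theme_minimal()
}

p_tight  <- hull_plot(1)   # hugs the points closely and may show dents
p_convex <- hull_plot(10)  # large value, close to the convex hull

# Drawing the hull needs concaveman; skip the drawing step if it is missing
if (requireNamespace("concaveman", quietly = TRUE)) {
  print(p_tight)
  print(p_convex)
}
```

Comparing the two plots side by side is usually the quickest way to pick a value for a given dataset.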

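#### 2. Ellipse boundary (Ellipse)

An ellipse suits groups whose points form a roughly elliptical cloud, as approximately normally distributed data does. Note that `geom_mark_ellipse()` draws an ellipse that encloses all points of the group; it is not a statistical confidence ellipse.

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, fill = Species)) +
  geom_point(size = 3) +
  geom_mark_ellipse(expand = unit(3, "mm"), alpha = 0.2) +
  labs(title = "Grouped scatter plot with ellipse boundaries",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()
```

Key parameters:

- `n`: number of points used to draw the ellipse, which controls its smoothness
- `tol`: tolerance of the algorithm that estimates the enclosing ellipse

The `type` argument (`"t"`, `"norm"` or `"euclid"`) belongs to `stat_ellipse()` in `ggplot2`, not to `geom_mark_ellipse()`.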
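The difference between an enclosing ellipse and a distribution-based ellipse is easiest to see when both are drawn on the same plot. The sketch below overlays `geom_mark_ellipse()` from `ggforce` with `stat_ellipse()` from `ggplot2`; the 95% level and the dashed line style are choices made for this illustration.

```r
library(ggplot2)
library(ggforce)

p_compare <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point(size = 2) +
  # ggforce: ellipse enclosing every point of the group (estimated within `tol`)
  geom_mark_ellipse(aes(fill = Species), alpha = 0.1, expand = unit(0, "mm")) +
  # ggplot2: 95% data ellipse under a multivariate normal assumption
  stat_ellipse(type = "norm", level = 0.95, linetype = "dashed") +
  labs(title = "Enclosing ellipse (solid) vs. 95% normal ellipse (dashed)",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()

print(p_compare)
```

The dashed ellipse may leave some points outside, while the solid one is sized to contain them all, so the two answer different questions about the group.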

#### 3. Rectangle boundary (Rectangle)

A rectangle gives a simple, unambiguous way to mark a group.

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, fill = Species)) +
  geom_point(size = 3) +
  geom_mark_rect(expand = unit(3, "mm"), alpha = 0.2) +
  labs(title = "Grouped scatter plot with rectangle boundaries",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()
```

#### 4. Circle boundary (Circle)

A circle suits groups whose points are compact and concentrated.

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, fill = Species)) +
  geom_point(size = 3) +
  geom_mark_circle(expand = unit(3, "mm"), alpha = 0.2) +
  labs(title = "Grouped scatter plot with circle boundaries",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()
```

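## Advanced customization

### 1. Tuning boundary labels

The `geom_mark_*` functions draw a label for each group once a `label` aesthetic is mapped. The `label.*` parameters control the label's appearance, and the `con.*` parameters control the connector line:

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, fill = Species)) +
  geom_point(size = 3) +
  geom_mark_ellipse(
    aes(label = Species),
    label.fontsize = 12,
    label.buffer = unit(5, 'mm'),
    label.fill = "white",
    label.colour = "black",
    con.colour = "grey50",
    expand = unit(3, "mm"),
    alpha = 0.2
  ) +
  labs(title = "Grouped scatter plot with tuned labels",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()
```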
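Besides `label`, the mark geoms also accept a `description` aesthetic for a longer text under the label. The sketch below is a minimal illustration; the `notes` vector and the `note` column are made up for this example and are not part of the `iris` data.

```r
library(ggplot2)
library(ggforce)

# Hypothetical per-species notes, written only for this illustration
notes <- c(
  setosa     = "Small petals, clearly separated from the other two species",
  versicolor = "Intermediate petal size",
  virginica  = "Largest petals, overlaps slightly with versicolor"
)

# Attach the note to every row of a copy of iris
iris_notes <- iris
iris_notes$note <- unname(notes[as.character(iris_notes$Species)])

p_desc <- ggplot(iris_notes, aes(x = Petal.Length, y = Petal.Width,
                                 colour = Species, fill = Species)) +
  geom_point(size = 2) +
  geom_mark_ellipse(
    aes(label = Species, description = note),
    label.fontsize = 9,
    alpha = 0.15,
    show.legend = FALSE
  ) +
  theme_minimal()

print(p_desc)
```

Long descriptions take up space inside the panel, so if a label does not show up, a larger plot area or a smaller `label.buffer` may help.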

### 2. Combining layers

Different boundary types can be combined, one per layer, to set particular groups apart:

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(color = Species), size = 3) +
  geom_mark_ellipse(
    data = subset(iris, Species == "setosa"),
    aes(fill = Species),
    alpha = 0.2
  ) +
  geom_mark_hull(
    data = subset(iris, Species == "versicolor"),
    aes(fill = Species),
    alpha = 0.2
  ) +
  geom_mark_rect(
    data = subset(iris, Species == "virginica"),
    aes(fill = Species),
    alpha = 0.2
  ) +
  scale_color_manual(values = c("setosa" = "#1b9e77",
                               "versicolor" = "#d95f02",
                               "virginica" = "#7570b3")) +
  scale_fill_manual(values = c("setosa" = "#1b9e77",
                              "versicolor" = "#d95f02",
                              "virginica" = "#7570b3")) +
  labs(title = "Scatter plot combining different boundary types",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()
```
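When only one group needs to be highlighted, subsetting the data per layer is not the only option: the mark geoms also understand a `filter` aesthetic that selects which rows contribute to the boundary. A minimal sketch, with the choice of `"setosa"` as the highlighted group being arbitrary:

```r
library(ggplot2)
library(ggforce)

p_filter <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(colour = Species), size = 2) +
  # Only rows where the filter condition is TRUE are enclosed by the mark
  geom_mark_ellipse(
    aes(fill = Species, label = Species, filter = Species == "setosa"),
    alpha = 0.2
  ) +
  labs(x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal()

print(p_filter)
```

All points are still drawn by `geom_point()`; the filter affects only the mark layer.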

### 3. Handling overlapping boundaries

When boundaries overlap, a smaller `expand` value and a lower `alpha` improve readability:

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, fill = Species)) +
  geom_point(size = 3) +
  geom_mark_hull(expand = unit(2, "mm"), alpha = 0.15, concavity = 2) +
  labs(title = "Scatter plot with tuned overlapping boundaries",
       subtitle = "Readability improved by adjusting expand and alpha",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal() +
  theme(legend.position = "bottom")
```
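Another option for overlapping groups, offered here as a suggestion rather than something the steps above require, is to drop the fill and draw outlines only, so that overlapping regions do not blend into a third color. The sketch sets `fill = NA` explicitly and uses an ellipse so that no extra package is needed.

```r
library(ggplot2)
library(ggforce)

p_outline <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point(size = 2) +
  # fill = NA turns the shading off; only the coloured outline remains
  geom_mark_ellipse(fill = NA, expand = unit(2, "mm"), linetype = "dashed") +
  labs(title = "Outline-only boundaries",
       x = "Petal length (cm)",
       y = "Petal width (cm)") +
  theme_minimal() +
  theme(legend.position = "bottom")

print(p_outline)
```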

## Practical examples

### Example 1: Visualizing customer segments

Suppose we have a customer dataset with age, annual spending and a segment label:

```r
# Simulated customer data
set.seed(123)
customer_data <- data.frame(
  age = c(rnorm(100, 30, 5), rnorm(100, 45, 7), rnorm(100, 60, 5)),
  spending = c(rnorm(100, 500, 100), rnorm(100, 800, 150), rnorm(100, 1200, 200)),
  segment = rep(c("Young", "Middle-aged", "Senior"), each = 100)
)

ggplot(customer_data, aes(x = age, y = spending, color = segment, fill = segment)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_mark_ellipse(alpha = 0.1, expand = unit(2, "mm")) +
  labs(title = "Customer age vs. annual spending",
       subtitle = "Grouped by customer segment",
       x = "Age",
       y = "Annual spending (CNY)",
       color = "Customer segment",
       fill = "Customer segment") +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar_format(prefix = "¥"))
```

### Example 2: Gene expression data

This example shows how gene expression changes between two conditions:

```r
# Simulated gene expression data
set.seed(456)
gene_data <- data.frame(
  condition1 = c(rnorm(50, 5, 1), rnorm(50, 8, 1.5), rnorm(50, 12, 2)),
  condition2 = c(rnorm(50, 6, 1.2), rnorm(50, 7, 1), rnorm(50, 9, 1.5)),
  gene_group = rep(c("Metabolic genes", "Signaling genes", "Structural genes"), each = 50)
)

ggplot(gene_data, aes(x = condition1, y = condition2, color = gene_group, fill = gene_group)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_mark_hull(alpha = 0.1, expand = unit(3, "mm"), concavity = 1.5) +
  labs(title = "Expression of gene groups under two conditions",
       x = "Condition 1 (log2 expression)",
       y = "Condition 2 (log2 expression)") +
  theme_bw() +
  theme(panel.grid = element_blank())
```

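## Common problems and solutions

### Problem 1: The boundary shape is not what you expected

Solution: adjust the `concavity` parameter (for hulls) or try a different boundary type. For very complex distributions, you may need to draw the boundary manually.

### Problem 2: The label position is not ideal

Solution: use `label.buffer` to change the distance between the label and the boundary, or adjust the label style with parameters such as `label.family` and `label.fontsize`.

### Problem 3: Performance with large datasets

Solution: for large datasets, you can:

1. Use `geom_mark_ellipse()` instead of `geom_mark_hull()`, which may be cheaper to compute
2. Lower the `n` parameter so the boundary is drawn with fewer points
3. Sample the data and plot only a subset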
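The sampling idea from the list above can be sketched as follows. The data are simulated, and the names `big`, `small` and `max_per_group`, as well as the cap of 500 rows per group, are assumptions made for this illustration.

```r
library(ggplot2)
library(ggforce)

# Simulated large dataset with three groups (illustrative only)
set.seed(42)
n <- 30000
big <- data.frame(grp = sample(c("A", "B", "C"), n, replace = TRUE),
                  stringsAsFactors = FALSE)
centre <- unname(c(A = 0, B = 4, C = 8)[big$grp])
big$x <- rnorm(n, mean = centre)
big$y <- rnorm(n, mean = centre / 2)

# Keep at most max_per_group randomly chosen rows per group (base R only)
max_per_group <- 500
rows_by_group <- split(seq_len(nrow(big)), big$grp)
keep <- unlist(lapply(rows_by_group, function(i) {
  i[sample.int(length(i), min(length(i), max_per_group))]
}))
small <- big[keep, ]

# Plot the sample; a lower n draws each ellipse with fewer points
p_sample <- ggplot(small, aes(x = x, y = y, colour = grp, fill = grp)) +
  geom_point(alpha = 0.4, size = 1) +
  geom_mark_ellipse(alpha = 0.1, n = 50) +
  theme_minimal()

print(p_sample)
```

A boundary computed from a sample encloses only the sampled points, so a few points of the full dataset may fall outside it.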

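## Summary

This article showed how to draw scatter plots with group boundaries using `ggplot2` and `ggforce`. With functions such as `geom_mark_hull()` and `geom_mark_ellipse()`, a visual boundary can be added around each group of points, which makes the groups in a scatter plot much easier to read.

Key points:

1. `ggforce` extends the grouped-visualization capabilities of `ggplot2`
2. Different boundary types suit data with different distribution shapes
3. Boundary shape and label position can be tuned through parameters
4. Combining several boundary types allows richer visualizations

With these techniques you can build grouped scatter plots that reveal group patterns and outliers in complex datasets. The approach is widely used in fields such as bioinformatics, market segmentation and quality control.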