7  基础统计绘图

这一小节的内容将会持续更新,详情请关注更新日志

7.1 曼哈顿图

曼哈顿图是一种常用的GWAS结果可视化方法,它通过将每个SNP的p值转换为负对数尺度,并在x轴上表示SNP的位置,在y轴上表示p值,从而直观地展示GWAS结果。在图中,显著的结果(p值较小)会显示为高亮的点,而未显著的结果则显示为较暗的点。曼哈顿图可以帮助研究者快速识别出显著关联的SNP,并进一步分析其功能。

7.1.1 加载R包

library(tidyverse)
library(ggtext)
library(normentR) # https://github.com/norment/normentR
library(janitor)

7.1.2 构建GWAS数据

simulateGWAS函数模拟基因组关联研究 (GWAS) 数据, 其中有100,000个单核苷酸多态性 (SNPs) 和3个显著列。

然后,使用 janitor::clean_names() 来标准化数据框的列名。

head(gwas_data_load)
  chr        snp      bp nmiss        beta         se           r2          t
1   1 rs90578709 1792930 20000 0.028629422 0.08827353 2.524271e-05  1.5888569
2   1 rs47937938 2194939 20003 0.008846140 0.06841859 1.036697e-04 -1.4265594
3   1 rs62499553  399524 19959 0.011786914 0.16753433 6.561403e-05 -0.8456437
4   1 rs74488285 1715827 19934 0.054787018 0.18286659 2.880185e-05 -1.6648839
5   1 rs76191429  256078 19892 0.009700569 0.13516717 3.942546e-05  1.2122360
6   1 rs97368765  843372 19719 0.166006546 0.20609338 7.874865e-05  0.5277339
           p
1 0.06548946
2 0.65420950
3 0.61314289
4 0.78773539
5 0.42533221
6 0.04793131

从模拟的GWAS数据中选择所有 p 值小于 0.05 的行,这些行代表统计上显著的结果。

模拟的GWAS数据中选择所有 p 值大于或等于 0.05 的行,这些行代表非显著的结果。

然后按染色体(chr)对这些非显著结果进行分组。

对每个染色体组抽取20%的样本,进行数据降采样。

将显著和非显著的数据行合并回一个单一的数据框中。

sig_data <- gwas_data_load |> subset(p < 0.05)
notsig_data <- gwas_data_load |>
  subset(p >= 0.05) |>
  group_by(chr) |>
  sample_frac(0.2)
gwas_data <- bind_rows(sig_data, notsig_data)

对合并后的数据按染色体分组,并计算每个染色体组的最大基对位置(max_bp)。

使用 cumsum 函数计算每个染色体组的累积最大基对位置,并通过lag函数将这些值错位,以便于后续的累积位置计算。

lag函数会将数据框或向量中的每一行向前移动指定的位置,并返回移动后的值。

选择染色体和累积基对位置的错位值(bp_add)

data_cum <- gwas_data |>
  group_by(chr) |>
  summarise(max_bp = max(bp)) |>
  mutate(bp_add = lag(cumsum(max_bp), default = 0)) |>
  select(chr, bp_add)
head(data_cum)
# A tibble: 6 × 2
    chr  bp_add
  <int>   <dbl>
1     1       0
2     2 2198735
3     3 4296366
4     4 6294848
5     5 8193847
6     6 9991547

将累积基对位置数据与原始GWAS数据合并。

计算每个SNP的累积基对位置(bp_cum),这有助于后续的可视化或分析。

gwas_data <- gwas_data |>
  inner_join(data_cum, by = "chr") |>
  mutate(bp_cum = bp + bp_add)

对每个染色体分组的数据计算累积基对位置的平均值,用于确定每个染色体在后续图表的中心位置。

找出具有最小 p 值的行,计算其对数变换后的 p 值,并确定y轴的极限 (ylim) ,以便于在图表中突出显示最显著的结果。

计算Bonferroni校正后的显著性阈值。这是通过将0.05除以GWAS数据中的行数(即测试的总数)来实现的,用于调整多重比较的影响。

axis_set <- gwas_data |>
  group_by(chr) |>
  summarize(center = mean(bp_cum))
ylim <- gwas_data |>
  filter(p == min(p)) |>
  mutate(ylim = abs(floor(log10(p))) + 2) |>
  pull(ylim)
sig <- 0.05 / nrow(gwas_data)

7.1.3 绘制曼哈顿图

ggplot(gwas_data, aes(
  x = bp_cum, y = -log10(p),
  color = as_factor(chr), size = -log10(p)
)) +
  # x轴为累积基对位置(bp_cum),y轴为负对数p值(-log10(p)),颜色和大小由染色体(chr)和p值决定。
  geom_hline(
    yintercept = -log10(sig),
    color = "grey40", linetype = "dashed"
  ) +
  # 添加一条水平虚线,表示经过Bonferroni校正的显著性阈值。
  geom_point(alpha = 0.5) + # 添加散点图的点,透明度设置为0.5。
  scale_x_continuous(
    label = axis_set$chr,
    breaks = axis_set$center
  ) +
  # 设置x轴的刻度标签和断点,使其对应于染色体的中心位置
  scale_y_continuous(expand = c(0, 0), limits = c(0, ylim)) + # 设置y轴的范围和扩展。
  scale_color_manual(values = rep(
    c("#788FCE", "#E6956F"),
    length(axis_set$chr)
  )) +
  # 手动设置颜色,为每个染色体分配不同的颜色。
  scale_size_continuous(range = c(0.5, 3)) + # 设置点的大小范围。
  labs(x = NULL, y = "-log<sub>10</sub>(p)") + # 设置图形的标签,x轴标签为空,y轴标签为负对数p值。
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.title.y = element_markdown(),
    axis.text.x = element_text(angle = 45, size = 8, vjust = 0.5)
  )

曼哈顿图由ggplot2的点图+一条直线绘制而成。gwas数据由分布在染色体上的单核苷酸突变位点构成,每个突变位点记录了它在染色体上的绝对位置和它的p-value。为了对这个数据进行可视化,我们将每个单核苷酸突变位点的位置作为x轴,它的p-value作为y轴,然后沿y轴绘制一条p-value的阈值线,就完成可视化。在实际操作过程中,需要注意每个单核苷酸突变位点在染色体上的位置,这里的数据中,单核苷酸突变位点在染色体上的位置表示为在单个染色体上的位置,比如在2号染色体的1253位点,表示为1253,而非加上染色体1的长度,所以需要转换数据。另外,x轴和y轴的标签也做了相应转换,x轴取了每个染色体的中心位置写标签,而y轴做了-log10转换。

7.2 分裂小提琴图

分裂小提琴图能够很方便的展示嵌套数据中两组数据的分布情况。分裂小提琴图来自introdatavizgeom_split_violin()函数,有关如何实现这个函数,请参见 Chapter 10

7.2.1 加载R包

library(tidyverse)
library(introdataviz) # https://github.com/PsyTeachR/introdataviz
library(ggpubr)
library(scales)
library(patchwork)

7.2.2 加载数据

读取数据,将Treatment列数据按-拆分成A, B两列,并将A列转换成因子类型,除去B列中的数字。

df1 <- read_tsv("data/bsg/F1ag.txt") |>
  select(2, 3, -1) |>
  separate(`Treatment`, into = c("A", "B"), sep = "-", convert = TRUE) |>
  mutate(
    B = str_replace_all(B, "[0-9]", ""),
    A = as.factor(A)
  )
df2 <- read_tsv("data/bsg/F1ag.txt") %>%
  select(2, 4, -1) |>
  separate(`Treatment`, into = c("A", "B"), sep = "-", convert = TRUE) |>
  mutate(
    B = str_replace_all(B, "[0-9]", ""),
    A = as.factor(A)
  )
df3 <- read_tsv("data/bsg/F1BC.txt") |>
  select(2, 3, -1) |>
  separate(`Treatment`, into = c("A", "B"), sep = "-", convert = TRUE) |>
  mutate(
    B = str_replace_all(B, "[0-9]", ""),
    A = as.factor(A)
  )

head(df1)
# A tibble: 6 × 3
  A       B     `SOC (g kg-1)`
  <fct>   <chr>          <dbl>
1 Conserv Amb             12.9
2 Conserv Amb             12.1
3 Conserv Amb             11.6
4 Conserv Amb             12.8
5 Conserv Warm            12.1
6 Conserv Warm            12.1

7.2.3 画图

根据图层顺序依次添加

  1. 分裂小提琴图
  2. 误差线图
  3. 均值点图
  4. 两组均值t检验结果
# 定义公共主题
theme_cus <- function(...) {
  theme_cus <- theme(
    axis.text.x = element_text(
      angle = 0, hjust = 0.5, vjust = 0.5, colour = "black",
      size = 10, margin = margin(b = 2)
    ),
    axis.text.y = element_text(
      color = "black", size = 10,
      margin = margin(r = 1)
    ),
    panel.background = element_rect(fill = NA, color = NA),
    panel.grid.minor = element_line(linewidth = 0.2, color = "#e5e5e5"),
    panel.grid.major = element_line(linewidth = 0.2, color = "#e5e5e5"),
    panel.border = element_rect(
      fill = NA, color = "black",
      linewidth = 1, linetype = "solid"
    ),
    legend.key = element_blank(),
    legend.title = element_blank(),
    legend.text = element_text(color = "black", size = 8),
    legend.spacing.x = unit(0.1, "cm"),
    legend.key.width = unit(0.5, "cm"),
    legend.key.height = unit(0.5, "cm"),
    legend.justification = c(1, 0),
    legend.background = element_blank(),
    ...
  )
  return(theme_cus)
}

# 绘制分裂小提琴图
a <- ggplot(df1, aes(x = A, y = `SOC (g kg-1)`, fill = B)) +
  # 分裂小提琴图
  geom_split_violin(trim = FALSE, color = NA) +
  # 添加图例
  guides(fill = guide_legend(title = "group")) +
  # 绘制误差线
  stat_summary(
    fun.data = "mean_sd", position = position_dodge(0.15),
    geom = "errorbar", width = 0.1
  ) +
  # 将均值以点的形式添加到误差线上
  stat_summary(
    fun = mean, geom = "point",
    position = position_dodge(0.15), show.legend = FALSE
  ) +
  # 添加B列中两组均值的t检验结果
  stat_compare_means(aes(group = B),
    label = "p.signif", label.y = 15.1,
    method = "t.test", size = 5
  ) +
  scale_fill_manual(values = c("#788FCE", "#e6956f")) +
  labs(x = NULL, y = "SOC (g kg-1)") +
  theme_cus(legend.position = c(1, 0))
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
g <- ggplot(df2, aes(x = A, y = `DOC(mg kg-1)`, fill = B)) +
  geom_split_violin(trim = F, color = NA, adjust = 1.5) +
  guides(fill = guide_legend(title = "group")) +
  stat_summary(
    fun.data = "mean_sd",
    position = position_dodge(0.15), geom = "errorbar", width = .1
  ) +
  stat_summary(
    fun = "mean", geom = "point",
    position = position_dodge(0.15), show.legend = F
  ) +
  stat_compare_means(aes(group = B),
    label = "p.signif",
    label.y = 210, method = "t.test", size = 5
  ) +
  scale_fill_manual(values = c("#788FCE", "#E6956F")) +
  labs(x = NULL, y = "DOC(mg kg-1)") +
  theme_cus(legend.position = "non")

b <- ggplot(df3, aes(x = A, y = `Soil temperate (℃)`, fill = B)) +
  geom_split_violin(trim = F, color = NA, adjust = 1.5) +
  guides(fill = guide_legend(title = "group")) +
  stat_summary(
    fun.data = "mean_sd", position = position_dodge(0.15),
    geom = "errorbar", width = .1
  ) +
  stat_summary(
    fun = "mean", geom = "point",
    position = position_dodge(0.15), show.legend = F
  ) +
  stat_compare_means(aes(group = B),
    label = "p.signif",
    label.y = 17.1, method = "t.test", size = 5
  ) +
  scale_fill_manual(values = c("#788FCE", "#E6956F")) +
  labs(x = NULL, y = "Soil temperate") +
  theme_cus(legend.position = "non")

# 将3个图形按行排列在一起

# a + b + g + plot_annotation(tag_levels = "A")

plot(a)
plot(b)
plot(g)

7.3 阴影条带哑铃图

用于展示每行数据从x1增长到x2比较合适,但是该图比较难以绘制。下面的示例使用基础ggplot2对象绘制,需要使用者对ggplot2很熟悉。这个示例利用geom_rect()绘制矩形框架,然后利用geom_text()将结果绘制到对应位置。

7.3.1 加载R包

7.3.2 数据处理

df <- read_csv("data/bsg/values.csv")
df <- df |>
  mutate(across(2:4, ~ .x / 100)) |>
  arrange(Overall) |>
  mutate(
    # 添加一个新列order,包含每行的行号
    order = row_number(), 
    # 根据order列的值重新排序Institution列
    Institution = fct_reorder(Institution, order)
  ) |> 
  rename(cont = 1, overall = 2, x1 = 3, x2 = 4) |>
  mutate(
    diff = abs(x2 - x1),
    x1pr = percent(x1, accuracy = 1),
    x2pr = percent(x2, accuracy = 1),
    dipr = percent(diff, accuracy = 1),
    ovpr = percent(overall, accuracy = 1),
    x1nu = if_else(x1 > x2, 1, if_else(x1 < x2, -1, -1)) / 16,
    x2nu = -1 * x1nu
  )
df2 <- df |>
  # 从df中筛选出行号为偶数的行
  filter(row_number() %% 2 == 0) |> 
  mutate(xmin = -2, xmax = 2)
df <- left_join(df, df2, by = colnames(df)) |>
  head(10)

df
# A tibble: 10 × 14
   cont         overall    x1    x2 order   diff x1pr  x2pr  dipr  ovpr     x1nu
   <fct>          <dbl> <dbl> <dbl> <int>  <dbl> <chr> <chr> <chr> <chr>   <dbl>
 1 Congress        0.12  0.07  0.17     1 0.1    7%    17%   10%   12%   -0.0625
 2 Television …    0.16  0.06  0.25     2 0.19   6%    25%   19%   16%   -0.0625
 3 Big business    0.18  0.19  0.17     3 0.0200 19%   17%   2%    18%    0.0625
 4 The crimina…    0.2   0.2   0.19     4 0.0100 20%   19%   1%    20%    0.0625
 5 Newspapers      0.21  0.08  0.35     5 0.27   8%    35%   27%   21%   -0.0625
 6 Organized l…    0.28  0.16  0.39     6 0.23   16%   39%   23%   28%   -0.0625
 7 Large techn…    0.29  0.22  0.34     7 0.12   22%   34%   12%   29%   -0.0625
 8 The public …    0.32  0.2   0.43     8 0.23   20%   43%   23%   32%   -0.0625
 9 Banks           0.33  0.35  0.33     9 0.0200 35%   33%   2%    33%    0.0625
10 The U.S. Su…    0.36  0.39  0.35    10 0.0400 39%   35%   4%    36%    0.0625
# ℹ 3 more variables: x2nu <dbl>, xmin <dbl>, xmax <dbl>

7.3.3 定义主题

txt_sz <- 3.5
theme_cus <- theme(
  text = element_text(size = 14),
  panel.background = element_blank(),
  axis.text.y = element_text(
    color = "black", hjust = 0,
    size = 14 / 5 * txt_sz
  ),
  axis.text.x = element_blank(),
  axis.ticks = element_blank(),
  # 这里直接设置axis.title=element_blank()不会生效,需要对x,y单独设置
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  # 同上
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.border = element_blank(),
  legend.position = "none",
  plot.title = element_text(size = rel(0.8), hjust = -1),
  plot.margin = unit(c(1, 1, 1, 1), "cm")
)

7.3.4 数据可视化

txt.sz <- 3.5
ggplot(df, aes(y = cont)) +
  # 添加矩形图层,用于背景标记,颜色为浅灰色
  geom_rect(aes(
    xmin = xmin, xmax = xmax,
    # xmin,xmax有缺失值,只绘制没有缺失值的
    ymin = as.numeric(cont) - 1.5,
    ymax = as.numeric(cont) - 0.5
    # xmin - (required) left edge of rectangle
    # xmax - (required) right edge of rectangle
    # ymin - (required) bottom edge of rectangle
    # ymax - (required) top edge of rectangle
  ), fill = "#f6f6f6") + 
  # 添加文本标签"Institution"
  geom_text(
    label = "Institution", x = -0.675, y = dim(df)[1] + 1,
    size = txt.sz, color = "black", fontface = 2, hjust = 0
  ) + 
  # 添加文本标签"Overall"
  geom_text(
    label = "Overall", x = 1.075, y = dim(df)[1] + 1,
    size = txt.sz, color = "black", fontface = 2
  ) + 
  geom_text(
    label = "\u0394", x = 1.2, y = dim(df)[1] + 1,
    size = txt_sz, color = "black", fontface = 2, hjust = 0
  ) +
  geom_text(aes(x = x1, label = x1pr),
    size = txt_sz, color = "#e6956f",
    nudge_x = df$x1nu
  ) +
  # 在x2位置添加文本,显示x2的百分比值
  geom_text(aes(x = x2, label = x2pr),
    size = txt.sz,
    color = "#112fb3", nudge_x = df$x2nu
  ) + 
  # 在固定位置添加文本,显示overall的百分比值
  geom_text(aes(x = x1 * 0 + 1.075, label = ovpr),
    size = txt.sz,
    color = "black"
  ) + 
  # 在固定位置添加文本,显示差异的百分比值
  geom_text(aes(x = x1 * 0 + 1.2, label = dipr),
    size = txt.sz,
    color = "black"
  ) + 
  # 添加线段,连接x1和x2点
  geom_segment(aes(x = x1, xend = x2, yend = cont),
    color = "gray40", linewidth = 1
  ) + 
  # 在x1位置添加点
  geom_point(aes(x = x1),
    shape = 16, size = 2.5,
    color = "#E6956F"
  ) + 
  # 在x2位置添加点
  geom_point(aes(x = x2),
    shape = 16, size = 2.5,
    color = "#E6956F"
  ) + 
  # 设置y轴为离散值并调整扩展
  scale_y_discrete(expand = c(0, 0)) + 
  # 设置x轴为连续值
  scale_x_continuous() + 
  # 设置坐标轴范围和裁剪行为
  coord_cartesian(
    ylim = c(0.2, 12.5),
    xlim = c(0, 1.2), clip = "off"
  ) + 
   # 应用自定义主题
  theme_cus