由于课题可能用上cut&tag这个技术,遂跟教程学习一波,记录一下以便后续的学习(主要是怕忘了)
教程网址cut&tag教程
背景知识:靶标下裂解与标记(Cleavage Under Targets & Tagmentation,CUT&Tag)是一种使用蛋白-A-Tn5(pA-Tn5)转座子融合蛋白的系留方法(图 1b)。在 CUT&Tag 中,用特定染色质蛋白抗体孵育透化细胞或细胞核,然后将装有镶嵌末端适配体的 pA-Tn5 依次拴系在抗体结合位点上。通过添加镁离子激活转座子,将适配体整合到附近的 DNA 中。然后对这些基因进行扩增,生成测序文库。基于抗体拴系 Tn5 的方法灵敏度高,因为 pA-Tn5 拴系后会对样本进行严格清洗,而且适配体整合效率高。与 ChIP-seq 相比,信噪比的提高使绘制染色质特征图所需的测序量减少了一个数量级,这样就可以通过对文库进行条形码 PCR,在 Illumina NGS 测序仪上进行成对端测序。通过CUT&Tag可以检测组蛋白修饰状态。
环境要求:
一、安装环境所需要的所有包
参考我前面的博客,通过conda安装
二、数据准备
1.下载数据
首先获取SRR_Acc_List.txt,然后批量下载,下载的SRA数据存放路径~/my_project/PRJNA606273/raw
cd raw/
cat ../SRR_Acc_List.txt |while read id;do (prefetch ${id} &);done
2.SRR数据转fastq
cd ../
mkdir -p PRJNA606273/data
for i in `ls raw`;do fastq-dump --split-3 $i --gzip -O ./data;done
3.FastQC质量检测
cd data/
ls *fastq.gz|while read id;do fastqc $id;done
二、比对
1.Alignment to HG38
1)需要先建立hg38索引
cd my_project/PRJNA606273
mkdir -p ref/hg38
mkdir -p ref/index
cd ref/hg38
nohup wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz &
cd ref
bowtie2-build ./hg38/hg38.fa ./index/hg38 #耗时比较久,可以挂后台运行
2)比对
#!/bin/bash
cores=8
projPath=~/my_project/PRJNA606273
ref="${projPath}/ref/index/hg38"
mkdir -p ${projPath}/alignment/sam/bowtie2_summary
mkdir -p ${projPath}/alignment/bam
mkdir -p ${projPath}/alignment/bed
mkdir -p ${projPath}/alignment/bedgraph
ls *_1.fastq.gz|while read id; do
id=${id/_1.fastq.gz/} #将字符串 id 中的_1.fq.gz 部分替换为空字符串,即将_1.fq.gz删除
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${ref} -1 \
${projPath}/data/${id}_1.fastq.gz -2 ${projPath}/data/${id}_2.fastq.gz \
-S ${projPath}/alignment/sam/${id}_bowtie2.sam &> ${projPath}/alignment/sam/bowtie2_summary/${id}_bowtie2.txt
done
比对完之后可以在alignment/sam/bowtie2_summary/${id}_bowtie2.txt看到比对结果的summary,如下
我发现我做出来的结果测序深度一致,但是比对的结果有一点差异,不知道是我的操作问题,还是参考基因组更新了导致比对结果出现了些许不同
2.Alignment to spike-in genome for spike-in calibration
1)需要先建立Ecoli索引
cd my_project/PRJNA606273
mkdir -p ref/Ecoli
cd ./ref/Ecoli
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
gzip -dc GCF_000005845.2_ASM584v2_genomic.fna.gz > E.coli_K12_MG1655.fa
cd ../
bowtie2-build ./Ecoli/E.coli_K12_MG1655.fa ./index/Ecoli
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes
2)比对
#!/bin/bash
cores=8
projPath=~/my_project/PRJNA606273
spikeInRef="${projPath}/ref/index/Ecoli"
chromSize="~/my_project/PRJNA606273/ref/hg38.chrom.sizes"
ls *_1.fastq.gz|while read id; do
id=${id/_1.fastq.gz/} #将字符串 id 中的_1.fq.gz 部分替换为空字符串,即将_1.fq.gz删除
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${spikeInRef} -1 ${projPath}/data/${id}_1.fastq.gz -2 ${projPath}/data/${id}_2.fastq.gz -S $projPath/alignment/sam/${id}_bowtie2_spikeIn.sam &> $projPath/alignment/sam/bowtie2_summary/${id}_bowtie2_spikeIn.txt
seqDepthDouble=`samtools view -F 0x04 $projPath/alignment/sam/${id}_bowtie2_spikeIn.sam | wc -l`
seqDepth=$((seqDepthDouble/2))
echo $seqDepth >$projPath/alignment/sam/bowtie2_summary/${id}_bowtie2_spikeIn.seqDepth
done
在运行上面的代码的时候遇到一个关于samtools的报错
samtools: /home/administrator/miniconda3/envs/breakseq/bin/…/lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /home/administrator/miniconda3/envs/breakseq/bin/…/lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /home/administrator/miniconda3/envs/breakseq/bin/…/lib/libncursesw.so.6: no version information available (required by samtools)
解决办法:
conda install -c conda-forge ncurses
3.汇报比对结果
此部分在R上进行
因为前面部分我一直采用都是SRR的编号命名,而教程提供的代码都是用K4me3和K27me3命名,为了直接套用这里的代码,我就按照对应的关系将文件重命名
cd sam
rename 's/SRR11074254/K4me3_rep1/g' *
rename 's/SRR11074240/K27me3_rep2/g' *
rename 's/SRR11074258/K4me3_rep2/g' *
rename 's/SRR12246717/K27me3_rep1/g' *
rename 's/SRR8754611/IgG_rep2/g' *
rename 's/SRR8754612/IgG_rep1/g' *
cd bowtie2_summary/
rename 's/SRR11074254/K4me3_rep1/g' *
rename 's/SRR11074240/K27me3_rep2/g' *
rename 's/SRR11074258/K4me3_rep2/g' *
rename 's/SRR12246717/K27me3_rep1/g' *
rename 's/SRR8754611/IgG_rep2/g' *
rename 's/SRR8754612/IgG_rep1/g' *
1)Sequencing depth
library(dplyr)
library(stringr)
library(ggplot2)
library(viridis)
library(GenomicRanges)
library(chromVAR) ## For FRiP analysis and differential analysis
library(DESeq2) ## For differential analysis section
library(ggpubr) ## For customizing figures
library(corrplot) ## For correlation plot
##=== R command ===##
## Path to the project and histone list
projPath = "~/my_project/PRJNA606273"
sampleList = c("K27me3_rep1", "K27me3_rep2", "K4me3_rep1", "K4me3_rep2", "IgG_rep1", "IgG_rep2")
histList = c("K27me3", "K4me3", "IgG")
## Collect the alignment results from the bowtie2 alignment summary files
alignResult = c()
for(hist in sampleList){
alignRes = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2.txt"), header = FALSE, fill = TRUE)
alignRate = substr(alignRes$V1[6], 1, nchar(as.character(alignRes$V1[6]))-1)
histInfo = strsplit(hist, "_")[[1]]
alignResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2],
SequencingDepth = alignRes$V1[1] %>% as.character %>% as.numeric,
MappedFragNum_hg38 = alignRes$V1[4] %>% as.character %>% as.numeric + alignRes$V1[5] %>% as.character %>% as.numeric,
AlignmentRate_hg38 = alignRate %>% as.numeric) %>% rbind(alignResult, .)
}
alignResult$Histone = factor(alignResult$Histone, levels = histList)
alignResult %>% mutate(AlignmentRate_hg38 = paste0(AlignmentRate_hg38, "%"))
上图是我得到的结果,下图是教程的结果,比对结果有一些差异,我也不知道具体原因是啥。特别是IgG_rep1,我后续直接按照教程下载的fastq文件重新跑也是一样结果,奇怪的很。
2)Spike-in alignment
##=== R command ===##
spikeAlign = c()
for(hist in sampleList){
spikeRes = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2_spikeIn.txt"), header = FALSE, fill = TRUE)
alignRate = substr(spikeRes$V1[6], 1, nchar(as.character(spikeRes$V1[6]))-1)
histInfo = strsplit(hist, "_")[[1]]
spikeAlign = data.frame(Histone = histInfo[1], Replicate = histInfo[2],
SequencingDepth = spikeRes$V1[1] %>% as.character %>% as.numeric,
MappedFragNum_spikeIn = spikeRes$V1[4] %>% as.character %>% as.numeric + spikeRes$V1[5] %>% as.character %>% as.numeric,
AlignmentRate_spikeIn = alignRate %>% as.numeric) %>% rbind(spikeAlign, .)
}
spikeAlign$Histone = factor(spikeAlign$Histone, levels = histList)
spikeAlign %>% mutate(AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))
上图是我得到的结果,下图是教程的结果,除了IgG_rep1外比对结果一致。感觉是IgG_rep1给错了数据编号。
3)Summarize the alignment to hg38 and E.coli
##=== R command ===##
alignSummary = left_join(alignResult, spikeAlign, by = c("Histone", "Replicate", "SequencingDepth")) %>%
mutate(AlignmentRate_hg38 = paste0(AlignmentRate_hg38, "%"),
AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))
alignSummary
4)Visualizing the sequencing depth and alignment results
##=== R command ===##
## Generate sequencing depth boxplot
fig3A = alignResult %>% ggplot(aes(x = Histone, y = SequencingDepth/1000000, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("Sequencing Depth per Million") +
xlab("") +
ggtitle("A. Sequencing Depth")
fig3B = alignResult %>% ggplot(aes(x = Histone, y = MappedFragNum_hg38/1000000, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("Mapped Fragments per Million") +
xlab("") +
ggtitle("B. Alignable Fragment (hg38)")
fig3C = alignResult %>% ggplot(aes(x = Histone, y = AlignmentRate_hg38, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("% of Mapped Fragments") +
xlab("") +
ggtitle("C. Alignment Rate (hg38)")
fig3D = spikeAlign %>% ggplot(aes(x = Histone, y = AlignmentRate_spikeIn, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("Spike-in Alignment Rate") +
xlab("") +
ggtitle("D. Alignment Rate (E.coli)")
ggarrange(fig3A, fig3B, fig3C, fig3D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")
三、去重
1.安装picard
conda install picard
在使用picard时出现了报错,Unable to access jarfile picard.jar
谷歌了一下,发现原因可能是因为环境变量的问题,系统不知道picard的具体位置,需要设置全局环境变量,遂查了一下picard的位置
应该是位于上一级share目录中,知道具体picard具体位置后,就去设置环境变量
cd
vim .bashrc
#将下面这两句命令添加至.bashrc文件末尾
PICARD=~/miniconda3/envs/rna_seq/share/picard-3.1.1-0/picard.jar
alias picard="java -jar $PICARD"
当我以为成功的时候,再用picard的时候再次出现了报错
Error: A JNI error has occurred, please check your installation and try again
Exception in thread “main” java.lang.UnsupportedClassVersionError: picard/cmdline/PicardCommandLine has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader
1.
r
u
n
(
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
363
)
a
t
j
a
v
a
.
s
e
c
u
r
i
t
y
.
A
c
c
e
s
s
C
o
n
t
r
o
l
l
e
r
.
d
o
P
r
i
v
i
l
e
g
e
d
(
N
a
t
i
v
e
M
e
t
h
o
d
)
a
t
j
a
v
a
.
n
e
t
.
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
f
i
n
d
C
l
a
s
s
(
U
R
L
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
362
)
a
t
j
a
v
a
.
l
a
n
g
.
C
l
a
s
s
L
o
a
d
e
r
.
l
o
a
d
C
l
a
s
s
(
C
l
a
s
s
L
o
a
d
e
r
.
j
a
v
a
:
418
)
a
t
s
u
n
.
m
i
s
c
.
L
a
u
n
c
h
e
r
1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher
1.run(URLClassLoader.java:363)atjava.security.AccessController.doPrivileged(NativeMethod)atjava.net.URLClassLoader.findClass(URLClassLoader.java:362)atjava.lang.ClassLoader.loadClass(ClassLoader.java:418)atsun.misc.LauncherAppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:621)
发现可能是因为picard版本太新的原因,遂卸载当前版本,下载旧版本2.25.5的picard
conda remove picard
conda install picard=2.25.5
然后终于成功了,可以使用了
这个时候再去更新.bashrc文件,就可以随处可用picard了
2.去重
#!/bin/bash
projPath=~/my_project/PRJNA606273
mkdir -p $projPath/alignment/removeDuplicate/picard_summary
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
## Sort by coordinate
picard SortSam I=$projPath/alignment/sam/${histName}_bowtie2.sam O=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam SORT_ORDER=coordinate
## mark duplicates
picard MarkDuplicates I=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam O=$projPath/alignment/removeDuplicate/${histName}_bowtie2.sorted.dupMarked.sam METRICS_FILE=$projPath/alignment/removeDuplicate/picard_summary/${histName}_picard.dupMark.txt
## remove duplicates
picard MarkDuplicates I=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam O=$projPath/alignment/removeDuplicate/${histName}_bowtie2.sorted.rmDup.sam REMOVE_DUPLICATES=true METRICS_FILE=$projPath/alignment/removeDuplicate/picard_summary/${histName}_picard.rmDup.txt
done
3.汇报结果
##=== R command ===##
## Summarize the duplication information from the picard summary outputs.
dupResult = c()
for(hist in sampleList){
dupRes = read.table(paste0(projPath, "/alignment/removeDuplicate/picard_summary/", hist, "_picard.rmDup.txt"), header = TRUE, fill = TRUE)
histInfo = strsplit(hist, "_")[[1]]
dupResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], MappedFragNum_hg38 = dupRes$READ_PAIRS_EXAMINED[1] %>% as.character %>% as.numeric, DuplicationRate = dupRes$PERCENT_DUPLICATION[1] %>% as.character %>% as.numeric * 100, EstimatedLibrarySize = dupRes$ESTIMATED_LIBRARY_SIZE[1] %>% as.character %>% as.numeric) %>% mutate(UniqueFragNum = MappedFragNum_hg38 * (1-DuplicationRate/100)) %>% rbind(dupResult, .)
}
dupResult$Histone = factor(dupResult$Histone, levels = histList)
alignDupSummary = left_join(alignSummary, dupResult, by = c("Histone", "Replicate", "MappedFragNum_hg38")) %>% mutate(DuplicationRate = paste0(DuplicationRate, "%"))
alignDupSummary
上图是我的结果,下图是教程的结果,结果不太一致,不知道是具体的原因是什么。
可视化结果:
##=== R command ===##
## generate boxplot figure for the duplication rate
fig4A = dupResult %>% ggplot(aes(x = Histone, y = DuplicationRate, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("Duplication Rate (*100%)") +
xlab("")
fig4B = dupResult %>% ggplot(aes(x = Histone, y = EstimatedLibrarySize, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("Estimated Library Size") +
xlab("")
fig4C = dupResult %>% ggplot(aes(x = Histone, y = UniqueFragNum, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("# of Unique Fragments") +
xlab("")
ggarrange(fig4A, fig4B, fig4C, ncol = 3, common.legend = TRUE, legend="bottom")
四. Assess mapped fragment size distribution
#!/bin/bash
projPath=~/my_project/PRJNA606273
mkdir -p $projPath/alignment/sam/fragmentLen
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
## Extract the 9th column from the alignment sam file which is the fragment length
samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam | awk -F'\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print abs($9)}' | sort | uniq -c | awk -v OFS="\t" '{print $2, $1/2}' >$projPath/alignment/sam/fragmentLen/${histName}_fragmentLen.txt
done
##=== R command ===##
## Collect the fragment size information
fragLen = c()
for(hist in sampleList){
histInfo = strsplit(hist, "_")[[1]]
fragLen = read.table(paste0(projPath, "/alignment/sam/fragmentLen/", hist, "_fragmentLen.txt"), header = FALSE) %>% mutate(fragLen = V1 %>% as.numeric, fragCount = V2 %>% as.numeric, Weight = as.numeric(V2)/sum(as.numeric(V2)), Histone = histInfo[1], Replicate = histInfo[2], sampleInfo = hist) %>% rbind(fragLen, .)
}
fragLen$sampleInfo = factor(fragLen$sampleInfo, levels = sampleList)
fragLen$Histone = factor(fragLen$Histone, levels = histList)
## Generate the fragment size density plot (violin plot)
fig5A = fragLen %>% ggplot(aes(x = sampleInfo, y = fragLen, weight = Weight, fill = Histone)) +
geom_violin(bw = 5) +
scale_y_continuous(breaks = seq(0, 800, 50)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 20) +
ggpubr::rotate_x_text(angle = 20) +
ylab("Fragment Length") +
xlab("")
fig5B = fragLen %>% ggplot(aes(x = fragLen, y = fragCount, color = Histone, group = sampleInfo, linetype = Replicate)) +
geom_line(size = 1) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma") +
theme_bw(base_size = 20) +
xlab("Fragment Length") +
ylab("Count") +
coord_cartesian(xlim = c(0, 500))
ggarrange(fig5A, fig5B, ncol = 2)
五.Alignment results filtering and file format conversion
1.Filtering mapped reads by the mapping quality filtering (可选)
#!/bin/bash
projPath=~/my_project/PRJNA606273
minQualityScore=2
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
samtools view -q $minQualityScore ${projPath}/alignment/sam/${histName}_bowtie2.sam >${projPath}/alignment/sam/${histName}_bowtie2.qualityScore$minQualityScore.sam
done
2.File format conversion
#!/bin/bash
projPath=~/my_project/PRJNA606273
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
## Filter and keep the mapped read pairs
samtools view -bS -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam >$projPath/alignment/bam/${histName}_bowtie2.mapped.bam
## Convert into bed file format
bedtools bamtobed -i $projPath/alignment/bam/${histName}_bowtie2.mapped.bam -bedpe >$projPath/alignment/bed/${histName}_bowtie2.bed
## Keep the read pairs that are on the same chromosome and fragment length less than 1000bp.
awk '$1==$4 && $6-$2 < 1000 {print $0}' $projPath/alignment/bed/${histName}_bowtie2.bed >$projPath/alignment/bed/${histName}_bowtie2.clean.bed
## Only extract the fragment related columns
cut -f 1,2,6 $projPath/alignment/bed/${histName}_bowtie2.clean.bed | sort -k1,1 -k2,2n -k3,3n >$projPath/alignment/bed/${histName}_bowtie2.fragments.bed
done
3.Assess replicate reproducibility
#!/bin/bash
projPath=~/my_project/PRJNA606273
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
## We use the mid point of each fragment to infer which 500bp bins does this fragment belong to.
binLen=500
awk -v w=$binLen '{print $1, int(($2 + $3)/(2*w))*w + w/2}' $projPath/alignment/bed/${histName}_bowtie2.fragments.bed | sort -k1,1V -k2,2n | uniq -c | awk -v OFS="\t" '{print $2, $3, $1}' | sort -k1,1V -k2,2n >$projPath/alignment/bed/${histName}_bowtie2.fragmentsCount.bin$binLen.bed
done
##== R command ==##
reprod = c()
fragCount = NULL
for(hist in sampleList){
if(is.null(fragCount)){
fragCount = read.table(paste0(projPath, "/alignment/bed/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE)
colnames(fragCount) = c("chrom", "bin", hist)
}else{
fragCountTmp = read.table(paste0(projPath, "/alignment/bed/", hist, "_bowtie2.fragmentsCount.bin500.bed"), header = FALSE)
colnames(fragCountTmp) = c("chrom", "bin", hist)
fragCount = full_join(fragCount, fragCountTmp, by = c("chrom", "bin"))
}
}
M = cor(fragCount %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs")
corrplot(M, method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 3, rect.col = "black", rect.lwd = 3,cl.pos = "b", tl.col = "indianred4", tl.cex = 1, cl.cex = 1, addCoef.col = "black", number.digits = 2, number.cex = 1, col = colorRampPalette(c("midnightblue","white","darkred"))(100))
六. Spike-in calibration
#!/bin/bash
projPath=~/my_project/PRJNA606273
chromSize=$projPath/ref/hg38.chrom.sizes
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
seqDepth=`cat $projPath/alignment/sam/bowtie2_summary/${histName}_bowtie2_spikeIn.seqDepth`
if [[ "$seqDepth" -gt "1" ]]; then
mkdir -p $projPath/alignment/bedgraph
scale_factor=`echo "10000 / $seqDepth" | bc -l`
echo "Scaling factor for $histName is: $scale_factor!"
bedtools genomecov -bg -scale $scale_factor -i $projPath/alignment/bed/${histName}_bowtie2.fragments.bed -g $chromSize > $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph
fi
done
##=== R command ===##
scaleFactor = c()
multiplier = 10000
for(hist in sampleList){
spikeDepth = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2_spikeIn.seqDepth"), header = FALSE, fill = TRUE)$V1[1]
histInfo = strsplit(hist, "_")[[1]]
scaleFactor = data.frame(scaleFactor = multiplier/spikeDepth, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(scaleFactor, .)
}
scaleFactor$Histone = factor(scaleFactor$Histone, levels = histList)
left_join(alignDupSummary, scaleFactor, by = c("Histone", "Replicate"))
##=== R command ===##
## Generate sequencing depth boxplot
fig6A = scaleFactor %>% ggplot(aes(x = Histone, y = scaleFactor, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 20) +
ylab("Spike-in Scalling Factor") +
xlab("")
normDepth = inner_join(scaleFactor, alignResult, by = c("Histone", "Replicate")) %>% mutate(normDepth = MappedFragNum_hg38 * scaleFactor)
fig6B = normDepth %>% ggplot(aes(x = Histone, y = normDepth, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 20) +
ylab("Normalization Fragment Count") +
xlab("") +
coord_cartesian(ylim = c(1000000, 130000000))
ggarrange(fig6A, fig6B, ncol = 2, common.legend = TRUE, legend="bottom")
七. Peak calling
#!/bin/bash
projPath=~/my_project/PRJNA606273
mkdir -p $projPath/peakCalling/SEACR
seacr="$projPath/SEACR_1.3.sh"
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
histControl=IgG_rep1
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph \
$projPath/alignment/bedgraph/${histControl}_bowtie2.fragments.normalized.bedgraph \
non stringent $projPath/peakCalling/SEACR/${histName}_seacr_control.peaks
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks
done
1.Number of peaks called
##=== R command ===##
peakN = c()
peakWidth = c()
peakType = c("control", "top0.01")
for(hist in sampleList){
histInfo = strsplit(hist, "_")[[1]]
if(histInfo[1] != "IgG"){
for(type in peakType){
peakInfo = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE) %>% mutate(width = abs(V3-V2))
peakN = data.frame(peakN = nrow(peakInfo), peakType = type, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(peakN, .)
peakWidth = data.frame(width = peakInfo$width, peakType = type, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(peakWidth, .)
}
}
}
peakN %>% select(Histone, Replicate, peakType, peakN)
IgG_rep1
IgG_rep2
2.Reproducibility of the peak across biological replicates
##=== R command ===##
histL = c("K27me3", "K4me3")
repL = paste0("rep", 1:2)
peakType = c("control", "top0.01")
peakOverlap = c()
for(type in peakType){
for(hist in histL){
overlap.gr = GRanges()
for(rep in repL){
peakInfo = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE)
peakInfo.gr = GRanges(peakInfo$V1, IRanges(start = peakInfo$V2, end = peakInfo$V3), strand = "*")
if(length(overlap.gr) >0){
overlap.gr = overlap.gr[findOverlaps(overlap.gr, peakInfo.gr)@from]
}else{
overlap.gr = peakInfo.gr
}
}
peakOverlap = data.frame(peakReprod = length(overlap.gr), Histone = hist, peakType = type) %>% rbind(peakOverlap, .)
}
}
peakReprod = left_join(peakN, peakOverlap, by = c("Histone", "peakType")) %>% mutate(peakReprodRate = peakReprod/peakN * 100)
peakReprod %>% select(Histone, Replicate, peakType, peakN, peakReprodNum = peakReprod, peakReprodRate)
3.FRagment proportion in Peaks regions (FRiPs)
##=== R command ===##
library(chromVAR)
bamDir = paste0(projPath, "/alignment/bam")
inPeakData = c()
## overlap with bam file to get count
for(hist in histL){
for(rep in repL){
peakRes = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_control.peaks.stringent.bed"), header = FALSE, fill = TRUE)
peak.gr = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*")
bamFile = paste0(bamDir, "/", hist, "_", rep, "_bowtie2.mapped.bam")
fragment_counts <- getCounts(bamFile, peak.gr, paired = TRUE, by_rg = FALSE, format = "bam")
inPeakN = counts(fragment_counts)[,1] %>% sum
inPeakData = rbind(inPeakData, data.frame(inPeakN = inPeakN, Histone = hist, Replicate = rep))
}
}
frip = left_join(inPeakData, alignResult, by = c("Histone", "Replicate")) %>% mutate(frip = inPeakN/MappedFragNum_hg38 * 100)
frip %>% select(Histone, Replicate, SequencingDepth, MappedFragNum_hg38, AlignmentRate_hg38, FragInPeakNum = inPeakN, FRiPs = frip)
4.Visualization of peak number, peak width, peak reproducibility and FRiPs
fig7A = peakN %>% ggplot(aes(x = Histone, y = peakN, fill = Histone)) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
facet_grid(~peakType) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("Number of Peaks") +
xlab("")
fig7B = peakWidth %>% ggplot(aes(x = Histone, y = width, fill = Histone)) +
geom_violin() +
facet_grid(Replicate~peakType) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
scale_y_continuous(trans = "log", breaks = c(400, 3000, 22000)) +
theme_bw(base_size = 18) +
ylab("Width of Peaks") +
xlab("")
fig7C = peakReprod %>% ggplot(aes(x = Histone, y = peakReprodRate, fill = Histone, label = round(peakReprodRate, 2))) +
geom_bar(stat = "identity") +
geom_text(vjust = 0.1) +
facet_grid(Replicate~peakType) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("% of Peaks Reproduced") +
xlab("")
fig7D = frip %>% ggplot(aes(x = Histone, y = frip, fill = Histone, label = round(frip, 2))) +
geom_boxplot() +
geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.55, option = "magma", alpha = 0.8) +
scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
theme_bw(base_size = 18) +
ylab("% of Fragments in Peaks") +
xlab("")
ggarrange(fig7A, fig7B, fig7C, fig7D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")
八.Visualization
1.Heatmap over transcription units
#!/bin/bash
projPath=~/my_project/PRJNA606273
mkdir -p $projPath/alignment/bigwig
ls *bowtie2.sam|while read histName; do
histName=${histName/_bowtie2.sam/} #将字符串 id 中的_bowtie2.sam 部分替换为空字符串,即将_bowtie2.sam删除
samtools sort -o $projPath/alignment/bam/${histName}.sorted.bam $projPath/alignment/bam/${histName}_bowtie2.mapped.bam
samtools index $projPath/alignment/bam/${histName}.sorted.bam
bamCoverage -b $projPath/alignment/bam/${histName}.sorted.bam -o $projPath/alignment/bigwig/${histName}_raw.bw
cores=8
computeMatrix scale-regions -S $projPath/alignment/bigwig/K27me3_rep1_raw.bw \
$projPath/alignment/bigwig/K27me3_rep2_raw.bw \
$projPath/alignment/bigwig/K4me3_rep1_raw.bw \
$projPath/alignment/bigwig/K4me3_rep2_raw.bw \
-R $projPath/hg38_gene \
--beforeRegionStartLength 3000 \
--regionBodyLength 5000 \
--afterRegionStartLength 3000 \
--skipZeros -o $projPath/hg38_heatmap/matrix_gene.mat.gz -p $cores
plotHeatmap -m $projPath/hg38_heatmap/matrix_gene.mat.gz -out $projPath/hg38_heatmap/Histone_gene.png --sortUsing sum
done
hg38_gene获取:
https://genome.ucsc.edu/cgi-bin/hgTables
output format改为BED,其他不变
跑出了一个奇奇怪怪的图,怎么那么多黑线啊
2.Heatmap on CUT&Tag peaks
#!/bin/bash
projPath=~/my_project/PRJNA606273
for histName in K27me3 K4me3
do
for repName in rep1 rep1 rep2
do
awk '{split($6, summit, ":"); split(summit[2], region, "-"); print summit[1]"\t"region[1]"\t"region[2]}' $projPath/peakCalling/SEACR/${histName}_${repName}_seacr_control.peaks.stringent.bed >$projPath/peakCalling/SEACR/${histName}_${repName}_seacr_control.peaks.summitRegion.bed
computeMatrix reference-point -S $projPath/alignment/bigwig/${histName}_${repName}_raw.bw \
-R $projPath/peakCalling/SEACR/${histName}_${repName}_seacr_control.peaks.summitRegion.bed \
--skipZeros -o $projPath/peakCalling/SEACR/${histName}_${repName}_SEACR.mat.gz -p 8 -a 3000 -b 3000 --referencePoint center
plotHeatmap -m $projPath/peakCalling/SEACR/${histName}_${repName}_SEACR.mat.gz -out $projPath/peakCalling/SEACR/${histName}_${repName}_SEACR_heatmap.png --sortUsing sum --startLabel "Peak Start" --endLabel "Peak End" --xAxisLabel "" --regionsLabel "Peaks" --samplesLabel "${histName} ${repName}"
done
done
九. Differential analysis
- Create a master peak list merging all the peaks called for each sample
##=== R command ===##
mPeak = GRanges()
## overlap with bam file to get count
for(hist in histL){
for(rep in repL){
peakRes = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_", rep, "_seacr_control.peaks.stringent.bed"), header = FALSE, fill = TRUE)
mPeak = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*") %>% append(mPeak, .)
}
}
masterPeak = reduce(mPeak)
- Get the fragment counts for each peak in the master peak list
##=== R command ===##
library(DESeq2)
bamDir = paste0(projPath, "/alignment/bam")
countMat = matrix(NA, length(masterPeak), length(histL)*length(repL))
## overlap with bam file to get count
i = 1
for(hist in histL){
for(rep in repL){
bamFile = paste0(bamDir, "/", hist, "_", rep, "_bowtie2.mapped.bam")
fragment_counts <- getCounts(bamFile, masterPeak, paired = TRUE, by_rg = FALSE, format = "bam")
countMat[, i] = counts(fragment_counts)[,1]
i = i + 1
}
}
colnames(countMat) = paste(rep(histL, 2), rep(repL, each = 2), sep = "_")
- Sequencing depth normalization and differential enriched peaks detection
##=== R command ===##
selectR = which(rowSums(countMat) > 5) ## remove low count genes
dataS = countMat[selectR,]
condition = factor(rep(histL, each = length(repL)))
dds = DESeqDataSetFromMatrix(countData = dataS,
colData = DataFrame(condition),
design = ~ condition)
DDS = DESeq(dds)
normDDS = counts(DDS, normalized = TRUE) ## normalization with respect to the sequencing depth
colnames(normDDS) = paste0(colnames(normDDS), "_norm")
res = results(DDS, independentFiltering = FALSE, altHypothesis = "greaterAbs")
countMatDiff = cbind(dataS, normDDS, res)
head(countMatDiff)
总算是跑完一轮没有报错了,但是结果有一些差异,还不清楚原因是什么,后面再看看是啥问题吧
长腿猴子请来的救兵写于2024.5.13