B cell lineage tracing
MiXCR allows V- and J-gene allele inference and somatic hypermutation lineage trees reconstruction. This guide covers these two major advances in BCR repertoire analysis and provides examples.
See full webinar video:
Data and experimental design
This tutorial uses the longitudinal data on IGH repertoires obtained for one healthy middle aged donor who experienced two COVID-19 infections during the course of the last two years. Peripheral blood samples were obtained at 8 timepoints in replicates.
PBMC was isolated using ficoll density gradient centrifugation and RNA was isolated for cDNA library preparation using MiLaboratories Human IG RNA Multiplex kit.
Upstream analysis
Running upstream analysis is easy with the mixcr analyze
command, which requires only a few arguments:
mixcr analyze milab-human-bcr-multiplex-full-length \
file_R1.fastq.gz \
file_R2.fastq.gz \
output_id
milab-human-bcr-multiplex-full-length
- is the preset name for MiLaboratories Human IG RNA Multiplex kit. The presets for most of the commercially available platforms are supported along with generic presets allowing running the pipeline for custom protocols.
file_R1.fastq.gz
andfile_R2.fastq.gz
- the filenames for raw files with sequencing reads, no preprocessing is required (e.g. primers/adapters trimming or UMI-processing).
output_id
- output prefix which will be used in the names of the output files.
The command above will generate the following set of output files:
> ls mixcr/DCh_T1_pbmc_H_2*
# human-readable reports for every step in txt and json format
DCh_T1_pbmc_H_2.align.report.json
DCh_T1_pbmc_H_2.align.report.txt
DCh_T1_pbmc_H_2.assemble.report.json
DCh_T1_pbmc_H_2.assemble.report.txt
DCh_T1_pbmc_H_2.refine.report.json
DCh_T1_pbmc_H_2.refine.report.txt
# alignments (highly compressed binary file)
DCh_T1_pbmc_H_2.vdjca
# alignments with corrected UMI barcode sequences
DCh_T1_pbmc_H_2.refined.vdjca
# IG clonotypes (highly compressed binary file)
DCh_T1_pbmc_H_2.clns
We can utilize GNU parallel to run the pipeline for multiple files simultaneously:
ls raw/*R1*.fastq.gz | parallel --line-buffer 'mixcr analyze \
milab-human-bcr-multiplex-full-length \
--remove-step exportClones \
{} {=s:R1:R2:=} \
mixcr/{=s:.*/:: ; s:_R1.fastq.gz:: =}'
Note that we also have added --remove-step exportClones
'mix-in' option. Initially this preset includes mixcr exportClones
step, but we will skip it for now, as we need to infer the allelic variants first and reassign clonotypes to a newly generated individualized gene segment reference library.
Quality control
Before we move on to lineage trees reconstruction, we will look at the basic QC plots.
First, let’s generate and investigate alignment rates QC plot.
mixcr exportQc align \
mixcr/*.clns \
figures/alignQc.pdf
We observe that all the libraries aligned very well, more than 90% of the reads were successfully aligned to the reference gene segment library.
Next we would like to inspect the number of sequencing reads for each of the libraries, we add --absolute-values
flag for the same command as above:
mixcr exportQc align \
--absolute-values \
mixcr/*.clns \
figures/alignQc.pdf
We see that depth by reads is a little uneven between our samples, however for each of them we have at least one million sequencing reads, which is generally enough for this analysis.
We can investigate this further exporting UMI coverage by reads plot:
mixcr exportQc tags \
mixcr/*.clns \
figures/readsPerUmiQc.pdf
This plot demonstrates the distribution of how many times unique cDNA molecules (each represented by a UMI) were covered by sequencing reads. It is common to observe a bimodal distribution, with the lower peak attributing to the dubious sequences, generated by PCR- and sequencing errors. MiXCR automatically calculates the optimal threshold and filters out UMI groups with low coverage by reads to account for this.
Allele inference
The next step, after obtaining clonotypes is to perform allele inference to separate true allelic variants from somatic hypermutations. This is done using mixcr findAlleles
command which works in two steps:
- a new personalized reference gene library is generated based on the provided samples;
- the newly generated reference is used to realign each clone. New .clns files are generated.
Only samples obtained from a single donor (or genetic mice strain) should be provided for allele inference.
mixcr findAlleles \
--report mixcr/DCh.findAlleles.report.txt \
--export-alleles-mutations mixcr/DCh_alleles.tsv \
--export-library mixcr/DCh_alleles.json \
--output-template {file_dir_path}/{file_name}.reassigned.clns \
mixcr/*.clns
--export-alleles-mutations
- specifies a path to write the descriptions and stats for all original and new alleles
--export-library
- specifies a path where to write the new library with found alleles. Supports exporting in
.fasta
or MiXCR-compatible json formats, which is defined by output filename extension --output-template {file_dir_path}/{file_name}_.reassigned.clns
- specifies the output file path.
{file_dir_path}/{file_name}_.reassigned.clns
means that each realigned.clns
file will be placed in the same folder and will have the same base name as the original.clns
file.
For more information on how to read the output tables see mixcr findAlleles
.
Export clones
Now, when the alleles have been reassigned we can export clonotype tables in human-readable format:
mixcr exportClones \
mixcr/DCh_T1_pbmc_H_2.reassigned.clns \
mixcr/DCh_T1_pbmc_H_2.txt
Generate lineage trees
Now we can reconstruct clonal lineage trees with mixcr findShmTrees
command:
mixcr findShmTrees \
--report mixcr/DCh_trees.log \
mixcr/*.reassigned.clns \
mixcr/DCh.shmt
To export information on lineage trees in human-readable format use mixcr exportShmTreesWithNodes
:
mixcr exportShmTreesWithNodes \
mixcr/DCh.shmt \
mixcr/DCh_trees.txt
Lineage analysis
For further downstream analysis we proceed with R language. Load necessary packages, define several handy helper functions, and color pallets:
Expand
#| code-fold: true
library(tidyr)
library(readr)
library(stringr)
library(stringdist)
library(ggplot2)
library(msa)
library(magrittr)
library(ggbeeswarm)
library(tibble)
library(dplyr)
library(plotly)
library(ggtree)
library(ggrepel)
library(data.table)
`%+%` <- function(a, b) paste0(a, b)
`+.uneval` <- function(a,b) {
`class<-`(modifyList(a,b), "uneval")
}
alignment2Fasta <- function(alignment, filename) {
sink(filename)
n <- length(rownames(alignment))
for(i in seq(1, n)) {
cat(paste0('>', rownames(alignment)[i]))
cat('\n')
the.sequence <- toString(unmasked(alignment)[[i]])
cat(the.sequence)
cat('\n')
}
sink(NULL)
}
mi_27<-c(
"#99E099","#198020","#42B842","#C1ADFF","#845CFF","#5F31CC","#FFCB8F",
"#FF9429","#C26A27","#90E0E0","#27C2C2","#068A94","#FAAAFA","#E553E5",
"#A324B2","#CBEB67","#95C700","#659406","#99CCFF","#2D93FA","#105BCC",
"#FFADBA","#F05670","#AD3757","#D3D7E0","#929BAD","#5E5E70")
mi_dark<-c(
"#198020",
"#5F31CC",
"#C26A27",
"#068A94",
"#A324B2",
"#659406",
"#105BCC",
"#AD3757",
"#5E5E70")
aminoacids_pal<-c(
"#99E099","#198020","#42B842","#C1ADFF","#845CFF","#5F31CC","#FFCB8F",
"#FF9429","#C26A27","#90E0E0","#27C2C2","#068A94","#FAAAFA","#E553E5",
"#A324B2","#CBEB67","#95C700","#659406","#99CCFF","#2D93FA","#105BCC",
"#FFADBA","#F05670","#AD3757","#D3D7E0","#929BAD","#5E5E70")
isotype_pal<-c(
IGHA1 = "#F05670",
IGHA2 = "#FFADBA",
IGHM = "#99CCFF",
IGHD = "#105BCC",
IGHG1 ="#198020",
IGHG2 ="#42B842",
IGHG3 ="#99E099",
IGHG4 ="#95C700")
Load clonesets, lineages and calculate statistics
Loading mixcr exportClones
command output and calculating number of clonotypes and unique cDNA molecules
folder<-"~/webinar/mixcr/"
cloneset_files<-dir(folder,".reassigned_IGH.tsv",full.names = T)
clonesets<-read_tsv(cloneset_files,id="fileName") %>%
mutate(id=str_remove_all(fileName,".*\\/|.reassigned_IGH.tsv"),
isotype=str_remove(allCHitsWithScore, "\\*.*"))
stats_clonesets<-clonesets %>%
group_by(id) %>%
summarise(totalCloneInSample=n(),
totalUMIsInSample=sum(uniqueUMICount))
Load mixcr exportShmTreesWithNodes
command output, parse metadata from filenames, and join with statistics table generated on previous step
lng_db<-read_tsv("~/ch_covid/mixcr/shmTrees/DCh_trees.tsv") %>%
mutate(id=str_remove_all(fileName,".*\\/|.reassigned.clns"),
isotype=str_remove(bestCHit,"\\*00")) %>%
separate("id",c("donor","timepoint","tissue","status","repnum"),
sep="_",
remove=F) %>%
left_join(stats_clonesets,by ="id")
# Note rows with NA in cloneId and fileName fields - these are the inferred nodes
For further processing it is useful to pivot to R's wide format:
Pivot to wide format
lngs_trees_wide<-lng_db %>%
filter(!fileName=="") %>%
group_by(treeId,timepoint,repnum) %>%
summarise(nClones=n(),
nClonesNormalized=n()/dplyr::first(totalCloneInSample),
nUMIsNormalized=sum(uniqueUMICount)/dplyr::first(totalUMIsInSample),
nUMIs=sum(uniqueUMICount),
isotypes=list(unique(isotype)),
nConditions=n_distinct(status),
conditions=paste(unique(status),collapse=","),
nTissues=n_distinct(tissue),
tissues=paste(unique(tissue),collapse=","),
totalCloneInSample=dplyr::first(totalCloneInSample),
totalUMIsInSample=dplyr::first(totalUMIsInSample)
) %>%
ungroup() %>%
group_by(treeId) %>%
mutate(timepoint=factor(timepoint,levels=c("T1","T2","T3","T4","T5","T6","T7","T8")),
timepoints=timepoint %>%unique() %>% sort() %>% paste(collapse=","),
isotypes=c(isotypes) %>% unlist() %>% unique() %>%sort() %>% paste(collapse=","),
totalNClones=sum(nClones),
totalCount=sum(nUMIs)) %>%
ungroup() %>%
arrange(timepoint) %>%
pivot_wider(id_cols = c(treeId,isotypes,timepoints,totalNClones,totalCount),
names_from = c(timepoint,repnum),
values_from = c(nClones,nUMIs,nClonesNormalized,nUMIsNormalized),
values_fill = 0.00) %>%
mutate(across(starts_with("nClonesN"),.fns = ~ ifelse(.x==0, min(.x[.x>0])/2, .x )),
across(starts_with("nUmisN"),.fns = ~ ifelse(.x==0, min(.x[.x>0])/2, .x ))) %>%
mutate(switchedIsotypePresent=ifelse(str_detect(isotypes, "IGHA|IGHG|IGHE"), TRUE, FALSE))
Visualizations
Define function to draw lineage abundance scatter plots
drawScatter<-function(sample1,sample2,metric,coloring){
#Two possible options for metric - nClonesNormalized and nUMIsNormalized
#Two possible options for coloring - changedTrees and switchedIsotypePresent
#Set limits for plots depending on metric
if (metric=="nClonesNormalized"){
upperLimit=0.2*1e-2
lowerLimit=0.8*1e-05
} else if (metric=="nUMIsNormalized"){
upperLimit=1*1e-1
lowerLimit=0.5*1e-05
}
g<-ggplot(lngs_trees_wide,
aes_string(x=metric %+% "_" %+% sample1,
y=metric %+% "_" %+% sample2) +
aes(text = "timepoints: " %+% timepoints %+% "\n" %+%
"isotypes: " %+% isotypes %+% "\n" %+%
"N Clones in " %+% eval(parse(text = "sample1")) %+% ": " %+%
eval(parse(text = "nClones" %+% "_" %+%
eval(parse(text = "sample1")))) %+% "\n" %+%
"N Clones in " %+% eval(parse(text = "sample2")) %+% ": " %+%
eval(parse(text = "nClones" %+% "_" %+%
eval(parse(text = "sample2")))) %+% "\n" %+%
"total N Clonotypes: " %+% totalNClones %+% "\n" %+%
"total N UMIs: " %+% totalCount
))+
geom_point(aes_string(color=coloring),
alpha=0.5,size=4)+
theme_bw()+
scale_x_log10()+
scale_y_log10()+
theme(legend.position = "none") +
geom_abline(slope=1, intercept=0, linetype=2,color="grey28")
if(exists("upperLimit")) {
g<-g+ expand_limits(y=c(lowerLimit,upperLimit),x=c(lowerLimit,upperLimit))
}
if(coloring=="changedTrees") {
g<-g+geom_point(data=lngs_trees_wide %>% filter(changedTrees),
color="#F05670",
alpha=0.4,
size=4) +
scale_color_manual(values= c("grey76","#F05670"))
} else if (coloring=="switchedIsotypePresent") {
g<-g+ scale_color_manual(values= c(mi_dark[2],mi_dark[1]))
}
}
Each point represents a lineage, values represent number of cDNA molecules (each represented by one UMI) normalized by total number of cDNA molecules in the corresponding sample. In the steady state time point (T6) many of the most abundant (by cDNA) lineages are not replicated. Presumably these lineages are derived from antibody-secreting cells (ASCs), with high expression of IGH mRNA, but with low number of cells in them. At peak of response timepoint (T7) ASCs subset is represented by lineages undergoing active clonal expansions, with many cells present in peripheral blood, that's why it is easier to detect them in replicates.
g1<-drawScatter("T6_1","T6_2","nUMIsNormalized",coloring="switchedIsotypePresent")+
labs(x="T6, pre-COVID19, replicate 1",y="T6 pre-COVID19, replicate 2")
g2<-drawScatter("T7_1","T7_2","nUMIsNormalized",coloring="switchedIsotypePresent")+
labs(x="T7, d7 COVID19, replicate 1",y="T7, d7 COVID19, replicate 2")
g3<-drawScatter("T6_1","T7_2","nUMIsNormalized",coloring="switchedIsotypePresent")+labs(x="T6, pre-COVID19, replicate 1",y="T7, d7 COVID19, replicate 2")
ggplotly(g1)
ggplotly(g2)
ggplotly(g3)
Now we can label the responding lineages, looking at the deviation of abundance between timepoints T6 and T7 and comparing it to deviation between replicates at peak of response (T7). We also filter lineages which have more than 5 different clonotypes at T7.
Define function to draw changed lineages
lngs_trees_wide %<>%
mutate(logFCp61_71=log10(nUMIsNormalized_T6_1/nUMIsNormalized_T7_1),
logFCp61_72=log10(nUMIsNormalized_T6_1/nUMIsNormalized_T7_2),
logFCp62_71=log10(nUMIsNormalized_T6_2/nUMIsNormalized_T7_1),
logFCp62_72=log10(nUMIsNormalized_T6_1/nUMIsNormalized_T7_2)) %>%
#Calculate standard deviation
mutate(
sd71_72=(nUMIsNormalized_T7_1-nUMIsNormalized_T7_2)^2 / ((nUMIsNormalized_T7_1+nUMIsNormalized_T7_2)/2),
sd61_71=(nUMIsNormalized_T6_1-nUMIsNormalized_T7_1)^2 / ((nUMIsNormalized_T6_1+nUMIsNormalized_T7_1)/2),
sd61_72=(nUMIsNormalized_T6_1-nUMIsNormalized_T7_2)^2 / ((nUMIsNormalized_T6_1+nUMIsNormalized_T7_2)/2),
sd62_71=(nUMIsNormalized_T6_2-nUMIsNormalized_T7_1)^2 / ((nUMIsNormalized_T6_2+nUMIsNormalized_T7_1)/2),
sd62_72=(nUMIsNormalized_T6_2-nUMIsNormalized_T7_2)^2 / ((nUMIsNormalized_T6_2+nUMIsNormalized_T7_2)/2)) %>%
#Label lineages that have sd between T6 and T7 higher than between replicates in T7
#and havr more than 5 unique clonotypes at T7
#and also increased abundance in T7 compared to T6 by comparison of all replicates
mutate(
changedTrees=sd71_72<sd61_71 & sd71_72<sd61_72 & sd71_72<sd62_71 & sd71_72<sd62_72 &(nClones_T7_1+nClones_T7_2>5) & logFCp61_71<0 & logFCp61_72<0 & logFCp62_71<0 & logFCp62_72<0)
Draw responding lineages:
g1<-drawScatter("T6_1","T6_2","nUMIsNormalized",coloring="changedTrees")+
labs(x="T6, pre-COVID19, replicate 1",y="T6 pre-COVID19, replicate 2")
g2<-drawScatter("T7_1","T7_2","nUMIsNormalized",coloring="changedTrees")+
labs(x="T7, d7 COVID19, replicate 1",y="T7, d7 COVID19, replicate 2")
g3<-drawScatter("T6_1","T7_2","nUMIsNormalized",coloring="changedTrees")+labs(x="T6, pre-COVID19, replicate 1",y="T7, d7 COVID19, replicate 2")
ggplotly(g1)
ggplotly(g2)
ggplotly(g3)
In pink we show lineages presumably responding to second COVID19 infection, defined at the previous step. We observe that some of these lineages also increased their abundance at first COVID19 response time point (T4).
Draw lineage tracing plot
#Reshape a little bit to account to the replicates present at some of the timepoints
lngs_trees_tracking<-lngs_trees_wide %>%
select(treeId,changedTrees, nUMIsNormalized_T1_1:nUMIsNormalized_T8_1) %>%
pivot_longer(cols=nUMIsNormalized_T1_1:nUMIsNormalized_T8_1,names_to= "timepoint",values_to = "nUMIsNormalized") %>%
mutate(timepoint=str_remove(timepoint,"nUMIsNormalized_")) %>%
separate("timepoint",into=c("timepoint","repnum")) %>%
group_by(timepoint) %>%
mutate(nReplicates=n_distinct(repnum)) %>%
ungroup() %>%
pivot_wider(id_cols = c(treeId,changedTrees,timepoint,nReplicates),
names_from = repnum,
names_prefix = "rep_",
values_from = nUMIsNormalized,
values_fill = 0.0) %>%
rowwise() %>%
mutate(meanNUMIsNormalized=ifelse(nReplicates==1, rep_1 , mean(c(rep_1,rep_2))),
minNUMIsNormalized=ifelse(nReplicates==1, rep_1 , min(rep_1,rep_2)),
maxNUMIsNormalized=ifelse(nReplicates==1, rep_1 , max(rep_1,rep_2)))
#Draw lineage tracing plot
lngs_trees_tracking %>%
ggplot()+
geom_line(data= lngs_trees_tracking %>% filter(!changedTrees),
aes(x=timepoint, y=meanNUMIsNormalized, group=treeId),
color="#929BAD",
alpha=0.2,size=1)+
geom_line(data= lngs_trees_tracking %>% filter(changedTrees),
aes(x=timepoint, y=meanNUMIsNormalized, group=treeId),
color="#F05670",
alpha=0.8,size=1)+
theme_bw()+
labs(y="Fraction of the lineage by cDNA molecules",
x="Time point")+
geom_segment(aes(x = 5, y = 0.085, xend = 4, yend = 0.08),
arrow = arrow(length = unit(0.2, "cm")))+
geom_segment(aes(x = 6, y = 0.085, xend =7, yend = 0.08),
arrow = arrow(length = unit(0.2, "cm")))+
annotate("text", x=5.5, y=0.086, label= "COVID19") +
geom_vline(xintercept = 4,linetype=2,color="grey76")+
geom_vline(xintercept = 7,linetype=2,color="grey76")+
scale_color_discrete(name = "",
labels = c("COVID19 responding", "Non-responding"))+
theme(legend.position = "bottom")
Analysis for individual trees
Define function for drawing phylogenetic trees with amino acid sequence alignments
drawTreeWithMSA<- function(trId){
library(ggtree)
tree<-read.tree("~/ch_covid/mixcr/shmTrees/newick/" %+% trId %+% ".tree")
tree_meta<-lng_db %>%
filter(treeId==trId) %>%
select(nodeId,
isotype,
timepoint,
tissue,
aaSeqCDR3,
uniqueUMICount,
cloneId) %>%
dplyr::rename(ID=nodeId)
## draw initial tree
g<-ggtree(tree) %<+% tree_meta +
geom_text(aes(label=timepoint), hjust=-.5,size=4)+
geom_tippoint(aes(color =isotype , size=uniqueUMICount))+
scale_color_manual(values=isotype_pal)+
theme(legend.position = 'bottom')+
scale_size_continuous(range = c(2, 7))
MiXCRtreeVDJ<-lng_db %>%
filter(treeId==trId) %>%
pull(`aaSeq{CDR1Begin:FR4End}`) %>% AAStringSet()
names(MiXCRtreeVDJ)<-lng_db %>%
filter(treeId==trId) %>%
pull(nodeId)
msaMiXCRtreeVDJ<-msa(MiXCRtreeVDJ)
alignment2Fasta(msaMiXCRtreeVDJ,"~/ch_covid/mixcr/shmTrees/export/" %+% trId %+% "_aln.fasta")
pdf(file = "~/ch_covid/mixcr/shmTrees/export/" %+% trId %+% ".pdf",
width = 16, # The width of the plot in inches
height = 12)
print(msaplot(g,"~/ch_covid/mixcr/shmTrees/export/" %+% trId %+% "_aln.fasta",
offset=8,width = 2,color=c("grey88",mi_27))
)
}
Define function for drawing SHM frequencies by position plot
mutationPositionPlot<-function(trId){
treeDb<-lng_db %>% filter(!is.na(cloneId),treeId==trId) %>%
mutate(CDR1BeginPosition=0,
FR2BeginPosition=positionInReferenceOfFR2Begin-positionInReferenceOfCDR1Begin,
CDR2BeginPosition=positionInReferenceOfCDR2Begin-positionInReferenceOfCDR1Begin,
FR3BeginPosition=positionInReferenceOfFR3Begin-positionInReferenceOfCDR1Begin,
CDR3BeginPosition=positionInReferenceOfCDR3Begin-positionInReferenceOfCDR1Begin,
FR4BeginPosition=positionInReferenceOfCDR3Begin-positionInReferenceOfCDR1Begin+nchar(nSeqCDR3))
FR2Pos<-treeDb %>%
pull(FR2BeginPosition) %>%
unique()%>% min()
CDR2Pos<-treeDb %>%
pull(CDR2BeginPosition) %>%
unique() %>% min()
FR3Pos<-treeDb %>%
pull(FR3BeginPosition) %>%
unique()%>% min()
CDR3Pos<-treeDb %>%
pull(CDR3BeginPosition) %>%
unique()%>% min()
FR4Pos<-treeDb %>%
pull(FR4BeginPosition) %>%
unique()%>% min()
mutByRegion<-str_extract_all(treeDb %>%
pull(`nMutations{CDR1Begin:FR3End}BasedOnGermline`),
"[0-9]*[0-9][0-9]*") %>% unlist() %>% table() %>%
enframe(name = "position",value="nMut") %>%
mutate(position=as.integer(position),nMut=as.integer(nMut),
region=case_when(
position<FR2Pos ~ "CDR1",
position<CDR2Pos ~ "FR2",
position<FR3Pos ~ "CDR2",
position<CDR3Pos ~ "FR3"
)) %>%
bind_rows(
str_extract_all(treeDb %>% pull(nMutationsCDR3BasedOnGermline),
"[0-9]*[0-9][0-9]*") %>% unlist() %>% table() %>%
enframe(name = "position",value="nMut") %>%
mutate(position=as.integer(position)+CDR3Pos,nMut=as.integer(nMut),
region="CDR3")
) %>%
bind_rows(
str_extract_all(treeDb %>% pull(nMutationsFR4BasedOnGermline),
"[0-9]*[0-9][0-9]*") %>% unlist() %>% table() %>%
enframe(name = "position",value="nMut") %>%
mutate(position=as.integer(position)+FR4Pos,nMut=as.integer(nMut),
region="FR4")
) %>%
mutate(freqMut=nMut/nrow(treeDb))
g<-mutByRegion%>%
ggplot(aes(x=position,y=freqMut,fill=region)) +
theme_bw()+
geom_bar(stat="identity")+
scale_fill_manual(values=mi_dark)+
geom_vline(xintercept = c(FR2Pos,CDR2Pos,FR3Pos,CDR3Pos,FR4Pos),color="grey28",linetype=2)+
theme(legend.position = "none")+
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(expand=c(0,0)) +
labs(y="Mutation freq.", x="Position")
print(g)
}
Define function for drawing lineage longitudinal trajectory
lineageTrajectoryPlot<-function(trId){
g<-lngs_trees_tracking %>%
filter(treeId==trId) %>%
ggplot(aes(x=timepoint, y=meanNUMIsNormalized, group=treeId))+
geom_line(color="#F05670",size=1)+
geom_point(color="grey20")+
theme_bw()+
labs(y="Lineage fraction in\nrepertoire by cDNA count",
x="Time point")+
geom_errorbar(aes(ymin=minNUMIsNormalized,
ymax=maxNUMIsNormalized),
width = 0.15,
color="grey26")
print(g)
}
Define function for drawing lineage isotype abundance by timepoints
lineageIsotypesPlot<-function(trId){
g<- lngs_trees_tracking %>%
filter(treeId==trId) %>%
#calculate isotype proportions in each timepoint and join to tracking data.frame
left_join(
lng_db %>%
filter(treeId==trId,!is.na(cloneId)) %>%
group_by(isotype,timepoint) %>%
summarise(UMICount=sum(uniqueUMICount)) %>%
ungroup() %>%
group_by(timepoint) %>%
mutate(UMIFraction=UMICount/sum(UMICount)),
by=c("timepoint")) %>%
mutate(isotypeUMIFraction=UMIFraction*meanNUMIsNormalized) %>%
ggplot(aes(x=timepoint,y=isotypeUMIFraction,fill=isotype))+
geom_histogram(stat="identity")+
theme_bw()+
scale_fill_manual(values=isotype_pal)+
labs(y="Isotype fraction in\nrepertoire by cDNA count",
x="Time point")+
theme(legend.position = "none")
print(g)
}
Draw all plots for lineage 13628
#| warning: false
drawTreeWithMSA(13628)
mutationPositionPlot(13628)
lineageTrajectoryPlot(13628)
lineageIsotypesPlot(13628)
Draw all plots for lineage 10511
#| warning: false
drawTreeWithMSA(10511)
mutationPositionPlot(10511)
lineageTrajectoryPlot(10511)
lineageIsotypesPlot(10511)
Draw all plots for lineage 12796
#| warning: false
drawTreeWithMSA(12796)
mutationPositionPlot(12796)
lineageTrajectoryPlot(12796)
lineageIsotypesPlot(12796)
Alternative representations for lineage trees
Lets calculate the number of synonymous and non-synonymous mutations in clonotypes within lineages by parsing mutationsDetailed
columns from 'mixcr exportShmTreesWithNodes
' command output:
Code
suppressWarnings(
lng_db %<>%
mutate(
across(starts_with("mutationsDetailed"),
.fns = ~ str_split(.x,pattern = ",") %>%
lapply(function(mut){
if (is.na(mut)) {
return(0)
} else {
str_split(mut,pattern=":") %>%
transpose() %>%
as.data.table() %>%
summarise(nMut=n()) %>%
pull(nMut)}}) %>%
unlist(),
.names = "mutNum_{.col}" ),
across(starts_with("mutationsDetailed"),
.fns = ~ str_split(.x,pattern = ",") %>%
lapply(function(mut){
if (is.na(mut)) {
return(0)
} else {
str_split(mut,pattern=":") %>%
transpose() %>%
as.data.table() %>%
summarise(nMutSyn=sum(V2=="")) %>%
pull(nMutSyn)}}) %>%
unlist(),
.names = "mutNumSyn_{.col}" ),
across(starts_with("mutationsDetailed"),
.fns = ~ str_split(.x,pattern = ",") %>%
lapply(function(mut){
if (is.na(mut)) {
return(0)
} else {
str_split(mut,pattern=":") %>%
transpose() %>%
as.data.table() %>%
summarise(nMutNonSyn=sum(V2!="")) %>%
pull(nMutNonSyn)}}) %>%
unlist(),
.names = "mutNumNonSyn_{.col}" )
) %>%
rename_at(vars(matches("^mutNum_|^mutNumSyn_|^mutNumNonSyn_")), ~ str_remove(., "_mutationsDetailed")) )
lng_db%<>%
rowwise() %>%
mutate(totalSynMut=sum(across(mutNumSynInCDR1BasedOnGermline:mutNumSynInFR4BasedOnGermline)),
totalNonSynMut=sum(across(mutNumNonSynInCDR1BasedOnGermline:mutNumNonSynInFR4BasedOnGermline))) %>%
ungroup()
Let's plot the phylogenetic tree for one of the lineages in three alternative colorings: by isotype, by timepoint and by number of non-synonymous mutations
trId=12796
tree<-read.tree("~/ch_covid/mixcr/shmTrees/newick/" %+% trId %+% ".tree")
tree_meta<-lng_db %>%
filter(treeId==trId) %>%
select(nodeId,
isotype,
timepoint,
totalNonSynMut,
uniqueUMICount,
cloneId) %>%
dplyr::rename(ID=nodeId)
## draw initial tree
g1<-ggtree(tree) %<+% tree_meta +
geom_text_repel(aes(label=isotype))+
geom_tippoint(aes(color =isotype, size=uniqueUMICount))+
scale_color_manual(values=isotype_pal)+
theme(legend.position = 'bottom')+
labs(color="Isotype")
g2<-ggtree(tree) %<+% tree_meta +
geom_text_repel(aes(label=timepoint))+
geom_tippoint(aes(color =timepoint, size=uniqueUMICount ))+
scale_colour_manual(values=mi_dark)+
theme(legend.position = 'bottom')
g3<-ggtree(tree) %<+% tree_meta +
geom_text_repel(aes(label=totalNonSynMut))+
geom_tippoint(aes(color =totalNonSynMut, size=uniqueUMICount ))+
scale_colour_gradientn(colours=c("#DFFADC", "#611347"))+
theme(legend.position = 'bottom')+
labs(color="# of non-syn. SHM")
cowplot::plot_grid(g1,g2,g3,nrow=1)