MiLaboratories RNA 5'RACE TCR repertoire kit
In this tutorial we analyze data generated with the MiLaboratories Human RNA 5'RACE TCR repertoire kit. The data consists of 12 samples. For each sample, two cDNA libraries were prepared, one for each TCR chain (TRA, TRB). The samples differ from each other in the amount of RNA used for first-strand cDNA synthesis.
Below you can see the structure of the cDNA library.
Upstream analysis
One-line solution
The most straightforward way to get clonotype tables is to use the universal mixcr analyze command.
According to the library preparation protocol, the library has no V primers on the 5'-end (it is a 5'RACE library) and C primers on the 3'-end, so the command for a single sample is the following:
> mixcr analyze amplicon \
--species hsa \
--starting-material rna \
--receptor-type tcr \
--umi-pattern-name MiLaboratoriesRACETCR \
--5-end no-v-primers \
--3-end c-primers \
--adapters adapters-present \
--report results/RACE_TRA_10ng_3.report \
fastq/RACE_TRA_10ng_3_S167_R1_001.fastq.gz \
fastq/RACE_TRA_10ng_3_S167_R2_001.fastq.gz \
results/RACE_TRA_10ng_3
The meaning of these options is the following:
--species is set to hsa for Homo sapiens. Set this parameter to mmu for Mus musculus.
--starting-material is set to rna and corresponds to the VTranscriptWithout5UTRWithP alignment feature for the V gene (see Gene features and anchor points for details).
--receptor-type is set to tcr. It affects the choice of alignment algorithms.
--umi-pattern-name is set to MiLaboratoriesRACETCR, which specifies a built-in UMI pattern for the MiLaboratories RNA 5'RACE TCR repertoire kit. This name stands for the following pattern: "^(R1:*) \ ^tggtatcaacgcagagt(UMI:NNNNTNNNNTNNNN)(R2:*)". It specifies the position of the UMI barcode. MiXCR provides a powerful regex-like language that allows specifying almost arbitrary barcode structures.
--5-end is set to no-v-primers because the data was generated using the 5'RACE protocol. This choice leads to a global alignment algorithm at the left bound of V.
--3-end is set to c-primers since the primer used for library preparation is complementary to the C region of TCR genes. This leads to a global alignment algorithm at the right bound of J and a local alignment at the right bound of the C gene.
--adapters is set to adapters-present because primer sequences were not trimmed from the reads beforehand.
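To make the UMI pattern more concrete, the following minimal sketch mimics what the second half of the pattern describes: the read starts with the adapter tggtatcaacgcagagt, followed by a 14-nt UMI with fixed T spacers at positions 5 and 10. The helper name extract_umi is hypothetical and is not part of MiXCR. The sketch uses exact, case-insensitive matching with sed, which is an assumption made for illustration; it does not reproduce how MiXCR itself matches tag patterns.

```shell
# Hypothetical helper (not a MiXCR command): print the UMI of a read that
# starts with the adapter, or print nothing if the read does not match.
extract_umi() {
    # $1: nucleotide sequence of the read that carries the adapter and UMI
    printf '%s\n' "$1" | tr 'acgt' 'ACGT' |
        sed -n -E 's/^TGGTATCAACGCAGAGT([ACGT]{4}T[ACGT]{4}T[ACGT]{4}).*$/\1/p'
}

# Adapter (17 nt) + UMI (ACGT T ACGT T ACGT) + rest of the read
extract_umi TGGTATCAACGCAGAGTACGTTACGTTACGTGGGG   # prints ACGTTACGTTACGT
```

A read without the adapter, or with a base other than T at a spacer position, yields empty output in this sketch.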
Running the command above will generate the following files:
> ls results/
# human-readable reports
RACE_TRA_10ng_3.report
# raw alignments (highly compressed binary file)
RACE_TRA_10ng_3.vdjca
# alignments with corrected UMI barcode sequences
RACE_TRA_10ng_3.corrected.vdjca
# TCRα & TCRβ CDR3 clonotypes (highly compressed binary file)
RACE_TRA_10ng_3.clns
# TCRα & TCRβ CDR3 clonotypes exported in tab-delimited txt
RACE_TRA_10ng_3.clonotypes.TRA.tsv
RACE_TRA_10ng_3.clonotypes.TRB.tsv
RACE_TRA_10ng_3.clonotypes.TRD.tsv
RACE_TRA_10ng_3.clonotypes.TRG.tsv
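Before moving on, it can be useful to confirm that the core files from the listing above were actually written. The sketch below assumes the output prefix used in the single-sample command (results/RACE_TRA_10ng_3); the helper name check_outputs is hypothetical, and the list of extensions is taken from the listing above.

```shell
# Hypothetical helper: return non-zero if any core output file is missing or empty.
check_outputs() {
    prefix=$1          # e.g. results/RACE_TRA_10ng_3
    missing=0
    for ext in report vdjca corrected.vdjca clns; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (left commented so that the snippet has no side effects):
# check_outputs results/RACE_TRA_10ng_3 && echo "all core outputs present"
```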
Clonotype tables are the main result of the upstream analysis. They are stored in a highly compressed and efficient binary .clns file and can be exported in many ways: a detailed tab-delimited format with dozens of customizable columns, a human-readable format for manual inspection, and the AIRR format suitable for many scientific downstream analysis tools. By default, MiXCR exports clonotypes in a tab-delimited format separately for each immunological chain.
To run the analysis for all samples in the project on Linux, we can, for example, use GNU Parallel in the following way:
> ls fastq/*R1* | \
parallel -j2 \
'mixcr analyze amplicon \
--species hsa \
--starting-material rna \
--receptor-type tcr \
--umi-pattern-name MiLaboratoriesRACETCR \
--5-end no-v-primers \
--3-end c-primers \
--adapters adapters-present \
--report {=s:.*/:results/:;s:_S.*:.report:=} \
{} \
{=s:R1:R2:=} \
{=s:.*/:results/:;s:_S.*::=}'
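The {= ... =} replacement strings in the GNU Parallel command are substitutions applied to each input path. The sketch below reproduces the same substitutions with sed so that the derived names can be inspected before anything is run. The helper name derive_names is hypothetical, and the third printed line (the output prefix) is an assumption that mirrors the output argument of the single-sample command.

```shell
# Hypothetical helper: print the R2 file, the report file and the output
# prefix that correspond to one R1 FASTQ path.
derive_names() {
    r1=$1
    # same substitution as {=s:R1:R2:=}
    r2=$(printf '%s\n' "$r1" | sed -e 's:R1:R2:')
    # same substitutions as {=s:.*/:results/:;s:_S.*:.report:=}, minus the suffix
    prefix=$(printf '%s\n' "$r1" | sed -e 's:.*/:results/:' -e 's:_S.*::')
    printf '%s\n%s\n%s\n' "$r2" "$prefix.report" "$prefix"
}

derive_names fastq/RACE_TRA_10ng_3_S167_R1_001.fastq.gz
# prints:
# fastq/RACE_TRA_10ng_3_S167_R2_001.fastq.gz
# results/RACE_TRA_10ng_3.report
# results/RACE_TRA_10ng_3
```

Note that s:R1:R2: replaces the first occurrence of R1 anywhere in the path, so a directory or sample name that itself contains R1 would be rewritten incorrectly.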
Under the hood pipeline
Under the hood, mixcr analyze amplicon executes the following pipeline of MiXCR actions:
align
- alignment of raw sequencing reads against the reference database of V, D, J and C gene segments
- matching of the tag pattern and extraction of barcodes
> mixcr align \
--species hsa \
--tag-pattern '^(R1:*) \ ^tggtatcaacgcagagt(UMI:NNNNTNNNNTNNNN)(R2:*)' \
--report results/RACE_TRA_10ng_3.report \
--json-report results/RACE_TRA_10ng_3.report.json \
-OvParameters.geneFeatureToAlign="VTranscriptWithout5UTRWithP" \
-OvParameters.parameters.floatingLeftBound=false \
-OjParameters.parameters.floatingRightBound=false \
-OcParameters.parameters.floatingRightBound=true \
fastq/RACE_TRA_10ng_3_S167_R1_001.fastq.gz \
fastq/RACE_TRA_10ng_3_S167_R2_001.fastq.gz \
results/RACE_TRA_10ng_3.vdjca
Options --report and --json-report are specified here explicitly. Since we start from RNA data, we use VTranscriptWithout5UTRWithP for the alignment of V segments (see Gene features and anchor points). Because no primers were used for the V segment, we use global alignment on the left bound of V, and since we have primers on the C segment, we use global alignment for J and local alignment on the right bound of C. This behavior is defined by the following options: -OvParameters.parameters.floatingLeftBound=false, -OjParameters.parameters.floatingRightBound=false, -OcParameters.parameters.floatingRightBound=true.
This step uses all available CPUs and scales well with their number. With many CPUs, the limiting factor becomes the speed of disk I/O. To limit the number of CPUs used, pass the --threads N option.
correctAndSortTags
Corrects sequencing and PCR errors inside barcode sequences. This step is extremely important because it removes the artificial diversity caused by errors in barcodes. In this example project it corrects only UMI sequences.
> mixcr correctAndSortTags \
--report results/RACE_TRA_10ng_3.report \
--json-report results/RACE_TRA_10ng_3.report.json \
results/RACE_TRA_10ng_3.vdjca \
results/RACE_TRA_10ng_3.corrected.vdjca
Options --report and --json-report are specified here explicitly so that the report files will be appended with the barcode correction report.
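To see why barcode errors create artificial diversity, consider two UMIs that differ at a single position: they most likely come from the same original molecule but would be counted as two. The sketch below only illustrates the notion of distance between barcodes with a plain Hamming distance. The helper name umi_hamming is hypothetical, and this is not a description of the algorithm that correctAndSortTags actually applies.

```shell
# Hypothetical helper: number of mismatching positions between two UMIs
# of equal length; prints -1 when the lengths differ.
umi_hamming() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) { print -1; exit }
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}

umi_hamming ACGTTACGTTACGT ACGTTACGTTACGA   # prints 1
```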
assemble
Assembles clonotypes and applies several layers of error correction. In the current example project we consider TCRα and TCRβ separately and define a clonotype by its CDR3 sequence. The layers of correction applied in this example are:
- assembly of a consensus CDR3 sequence for each UMI
- quality-aware correction of sequencing errors
- clustering to correct PCR errors, which may still be present even in UMI data, since an error may be introduced, for example, during the first reverse-transcription cycle
> mixcr assemble \
--report results/RACE_TRA_10ng_3.report \
--json-report results/RACE_TRA_10ng_3.report.json \
results/RACE_TRA_10ng_3.corrected.vdjca \
results/RACE_TRA_10ng_3.clns
Options --report and --json-report are specified here explicitly so that the report files will be appended with the assembly report.
exportClones
Finally, exportClones is used to export clonotype tables in tabular form:
> mixcr exportClones \
-p full \
-uniqueTagCount UMI \
results/RACE_TRA_10ng_3.clns \
results/RACE_TRA_10ng_3.tsv
Here -p full is a shorthand for the full preset of common export columns and -uniqueTagCount UMI adds a column with the UMI count for each clone.
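Once the table is exported, a quick sanity check is to total the per-clone UMI counts. The sketch below sums an arbitrary numeric column of a tab-delimited file with a header row. The helper name tsv_column_sum is hypothetical, and the column name is passed as an argument because the exact header written by exportClones is not shown here; check the first line of the exported file for the actual name.

```shell
# Hypothetical helper: sum a named column of a tab-delimited file with a header.
tsv_column_sum() {
    # $1: path to the TSV file; $2: header name of the column to total
    awk -F '\t' -v col="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { total += $idx }
        END { if (idx) print total + 0 }
    ' "$1"
}

# Usage (COLUMN_NAME is a placeholder for the UMI count header in your file):
# tsv_column_sum RACE_TRA_10ng_3.tsv COLUMN_NAME
```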
Quality control
Now that the upstream analysis is finished, we can move on to quality control. First, let's look at the alignment report plot.
# export the alignment quality control plot
> mixcr exportQc align \
results/*.vdjca \
alignQc.pdf
Let's examine the plot.