mixcr assemblePartial
Overlaps alignments that come from the same molecule and only partially cover the CDR3 region.
This step is used in two cases:
- non-targeted RNA-Seq data, where only a tiny fraction of the reads come from TCR/BCR transcripts; this step rescues more informative data from the input
- fragmented TCR/BCR data, e.g. from 10x VDJ protocols, where each read covers a random part of the VDJ region
To extract a repertoire from such data efficiently, one has to reconstruct the original CDR3s from fragments scattered across the sequencing dataset.
To find pairs of alignments derived from the same molecule, MiXCR uses a sufficiently long part of the NDN region (which gives high enough entropy) and, when the data carry UMIs and cell barcodes, those tags in addition to NDN. Once such pairs are found, MiXCR merges each of them into a single alignment that fully covers the CDR3 region. The default thresholds in this procedure were optimized to assemble as many contigs as possible while producing zero false overlaps.
To use the assemblePartial step, one has to specify the following parameter for align:
mixcr align -p <name>
--keep-non-CDR3-alignments
[...]
input_R1.fastq[.gz] [input_R2.fastq[.gz]]
alignments.vdjca
where
--keep-non-CDR3-alignments
- required to prevent MiXCR from filtering out partial alignments that don't fully cover CDR3
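The two steps above can be chained in a small shell sketch. The function name `align_then_rescue`, the `run` helper and the `DRY_RUN` variable are hypothetical and not part of MiXCR; the sketch also assumes paired-end input and uses the file names from this page. With `DRY_RUN=1` the commands are only printed, which makes it easy to review them before running anything.

```shell
# Hypothetical helper (not part of MiXCR): prints the command when DRY_RUN=1,
# otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: align_then_rescue <preset> <R1.fastq[.gz]> <R2.fastq[.gz]>
# Assumes paired-end input; the preset name is supplied by the caller.
align_then_rescue() {
    preset=$1
    r1=$2
    r2=$3
    # Keep partial alignments so that assemblePartial has something to merge.
    run mixcr align -p "$preset" --keep-non-CDR3-alignments \
        "$r1" "$r2" alignments.vdjca || return 1
    # Merge partial alignments derived from the same molecule.
    run mixcr assemblePartial alignments.vdjca alignments.recovered.vdjca || return 1
}
```

Calling `DRY_RUN=1 align_then_rescue <preset> input_R1.fastq.gz input_R2.fastq.gz` prints both commands without executing them.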
Command line options
mixcr assemblePartial
[--overlapped-only]
[--drop-partial]
[--cell-level]
[-O <key=value>]...
[--report <path>]
[--json-report <path>]
[--force-overwrite]
[--no-warnings]
[--verbose]
[--help]
alignments.vdjca alignments.recovered.vdjca
The command takes a .vdjca file containing initial alignments as input and writes a new .vdjca file with corrected alignments. It can be useful to inspect the resulting alignments with exportAlignmentsPretty. Additionally, MiXCR produces a comprehensive report with a detailed summary of each stage of the partial assembly pipeline. Basic command line options are:
alignments.vdjca
- Path to input alignments file.
alignments.recovered.vdjca
- Path where to write recovered alignments.
-o, --overlapped-only
- Write only overlapped sequences (needed for testing). Default value determined by the preset.
-d, --drop-partial
- Drop partial sequences which were not assembled. Can be used to reduce output file size if no additional rounds of assemblePartial are required. Default value determined by the preset.
--cell-level
- Overlap sequences on the cell level instead of UMIs for tagged data with molecular and cell barcodes. Default value determined by the preset.
-O <key=value>
- Overrides default partial assembler parameter values.
-r, --report <path>
- Report file (human readable version, see -j / --json-report for machine readable report).
-j, --json-report <path>
- JSON formatted report file.
-f, --force-overwrite
- Force overwrite of output file(s).
-nw, --no-warnings
- Suppress all warning messages.
--verbose
- Verbose messages.
-h, --help
- Show this help message and exit.
Partial assembler parameters
The following options are available for assemblePartial:
-OkValue=12
- Length of k-mer taken from VJ junction region and used for searching potentially overlapping sequences.
-OkOffset=-7
- Offset taken from VEndTrimmed/JBeginTrimmed
-OminimalAssembleOverlap=12
- Minimal length of the overlapped VJ region: two sequences can be potentially merged only if they have at least minimalAssembleOverlap-wide overlap in the VJJunction region. No mismatches are allowed in the overlapped region.
-OminimalNOverlap=5
- Minimal number of non-template nucleotides (N region) that the overlap region must cover to accept the overlap.
Example usage:
> mixcr assemblePartial -OminimalAssembleOverlap=10 alignments.vdjca alignmentsRescued.vdjca
Multiple runs
The partial assembly algorithm works in a pairwise manner, merging one pair of alignments at a time. Efficiency sometimes increases if you perform two consecutive rounds of assemblePartial.
> mixcr assemblePartial alignments.vdjca alignments_rescued_1.vdjca
> mixcr assemblePartial alignments_rescued_1.vdjca alignments_rescued_2.vdjca
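The chaining of rounds can be generalized with a short loop in which the output of one round becomes the input of the next. This is only a sketch: the function name `assemble_rounds`, the `run` helper, the `DRY_RUN` variable and the `<prefix>_<round>.vdjca` naming scheme are illustrative assumptions, not MiXCR features.

```shell
# Hypothetical helper (not part of MiXCR): prints the command when DRY_RUN=1,
# otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: assemble_rounds <input.vdjca> <output-prefix> <number-of-rounds>
# Round i writes <output-prefix>_<i>.vdjca and feeds it into round i+1.
assemble_rounds() {
    input=$1
    prefix=$2
    rounds=$3
    i=1
    while [ "$i" -le "$rounds" ]; do
        output="${prefix}_${i}.vdjca"
        run mixcr assemblePartial "$input" "$output" || return 1
        input=$output      # the next round starts from this round's result
        i=$((i + 1))
    done
}
```

With `DRY_RUN=1 assemble_rounds alignments.vdjca alignments_rescued 2` the sketch prints the same two commands as shown above.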
Very short reads
With short-read input, some contigs/reads still only partially cover CDR3 even after assemblePartial. A substantial fraction of such contigs need only several nucleotides on the 5′ or the 3′ end to fill the sequence up to a complete CDR3. These sequence parts can be taken from the germline if the corresponding V or J gene for the contig is uniquely determined (e.g. from the second mate of a read pair). Such a procedure is not safe for IGs because of hypermutations, but for TCRs, which have a relatively conserved sequence near the conserved Cys and Phe/Trp, it can reconstruct additional clonotypes with a relatively small chance of introducing false ones. The described procedure is implemented in the mixcr extend action; by default it acts only on TCR sequences.
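A possible way to combine partial assembly with the extension step is sketched below. It assumes that mixcr extend takes an input and an output .vdjca file in the same positional form as assemblePartial; check `mixcr extend --help` for the exact usage. The function name `rescue_and_extend`, the `run` helper, the `DRY_RUN` variable and the output file names are hypothetical.

```shell
# Hypothetical helper (not part of MiXCR): prints the command when DRY_RUN=1,
# otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: rescue_and_extend <input.vdjca>
rescue_and_extend() {
    input=$1
    # First merge partial alignments derived from the same molecule.
    run mixcr assemblePartial "$input" alignments_rescued.vdjca || return 1
    # Then fill the missing CDR3 edges from the germline.
    # Assumption: extend accepts <input> <output> positionally.
    run mixcr extend alignments_rescued.vdjca alignments_extended.vdjca || return 1
}
```

Running `DRY_RUN=1 rescue_and_extend alignments.vdjca` prints the two commands so they can be reviewed first.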