Overlaps alignments that come from the same molecule and only partially cover the CDR3 region.
This step is used in two cases:
- non-targeted RNA-Seq data, where only a tiny fraction of the reads come from TCR/BCR transcripts and this step makes it possible to rescue more informative data from the input
- fragmented TCR/BCR data from e.g. 10x VDJ protocols, where each read covers a random part of the VDJ region
To efficiently extract a repertoire from such data, one has to reconstruct the initial CDR3s from fragments scattered all over the sequencing dataset.
Depending on whether the initial data carry UMIs and cell barcodes, MiXCR finds pairs of alignments derived from the same molecule using either a sufficiently long part of the NDN region (which has high enough entropy) or, in addition to NDN, the UMIs and cell barcodes. Once such pairs are determined, MiXCR merges each pair into a single alignment fully covering the CDR3 region. Default thresholds in this procedure were optimized to assemble as many contigs as possible while producing zero false overlaps.
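The pairing idea described above can be illustrated with a small Python sketch. It is not MiXCR's implementation: the `Fragment` type, the combined `tag` field (standing in for cell barcode plus UMI) and the plain suffix/prefix comparison are assumptions made for illustration, and real alignments also carry gene feature coordinates that are ignored here.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Fragment(NamedTuple):
    seq: str                   # nucleotides of a partial alignment
    tag: Optional[str] = None  # hypothetical cell barcode + UMI; None for untagged data


def suffix_prefix_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def candidate_pairs(fragments, min_overlap):
    """Yield (left, right, merged) for fragments that may come from one molecule."""
    groups = defaultdict(list)
    for f in fragments:
        groups[f.tag].append(f)  # untagged fragments all share the key None
    for group in groups.values():
        for left in group:
            for right in group:
                if left is right:
                    continue
                n = suffix_prefix_overlap(left.seq, right.seq)
                if n >= min_overlap:
                    # exact overlap found: glue the non-overlapping tail of `right`
                    yield left, right, left.seq + right.seq[n:]


fragments = [
    Fragment("AAACCCGGG", "u1"),
    Fragment("CCGGGTTT", "u1"),
    Fragment("CCGGGTTT", "u2"),  # same sequence, different tag: never paired with u1
]
for left, right, merged in candidate_pairs(fragments, min_overlap=5):
    print(left.tag, merged)
```

With tags, only fragments sharing a tag are compared, so the sequence overlap carries less of the burden of proof. Without tags, all fragments fall into one group and the overlap alone has to be specific enough, which is why a sufficiently long stretch of NDN is needed.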
To use the assemblePartial step, one has to specify the following parameter for the align step:
mixcr align -p <name> --keep-non-CDR3-alignments [...] input_R1.fastq[.gz] [input_R2.fastq[.gz]] alignments.vdjca
--keep-non-CDR3-alignments
- required to prevent MiXCR from filtering out partial alignments that don’t fully cover CDR3
Command line options
mixcr assemblePartial [--overlapped-only] [--drop-partial] [--cell-level] [-O <key=value>]... [--report <path>] [--json-report <path>] [--force-overwrite] [--no-warnings] [--verbose] [--help] alignments.vdjca alignments.recovered.vdjca
The command takes a .vdjca file containing the initial alignments as input and writes a new .vdjca file with corrected alignments. Sometimes it may be useful to inspect the resulting alignments with exportAlignmentsPretty. Additionally, MiXCR produces a comprehensive report which provides a detailed summary of each stage of this partial assembly pipeline.
Basic command line options are:
alignments.vdjca
- Path to input alignments file.
alignments.recovered.vdjca
- Path where to write recovered alignments.
--overlapped-only
- Write only overlapped sequences (needed for testing). Default value determined by the preset.
--drop-partial
- Drop partial sequences which were not assembled. Can be used to reduce output file size if no additional rounds of assemblePartial are required. Default value determined by the preset.
--cell-level
- Overlap sequences on the cell level instead of UMIs for tagged data with molecular and cell barcodes. Default value determined by the preset.
-O <key=value>
- Overrides default partial assembler parameter values.
-r, --report <path>
- Report file (human readable version, see -j / --json-report for machine readable report).
-j, --json-report <path>
- JSON formatted report file.
--force-overwrite
- Force overwrite of output file(s).
--no-warnings
- Suppress all warning messages.
--verbose
- Verbose messages.
--help
- Show this help message and exit.
Partial assembler parameters
The following options are available for assemblePartial:
- Length of k-mer taken from VJ junction region and used for searching potentially overlapping sequences.
- Offset taken from
- Minimal length of the overlapped VJ region: two sequences can be potentially merged only if they have at least minimalAssembleOverlap-wide overlap in the VJJunction region. No mismatches are allowed in the overlapped region.
- Minimal number of non-template nucleotides (N region) that overlap region must cover to accept the overlap.
Default values can be overridden with the -O option, for example:
> mixcr assemblePartial -OminimalAssembleOverlap=10 alignments.vdjca alignmentsRescued.vdjca
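How the parameters listed above interact can be sketched in Python. This is a simplified illustration under stated assumptions, not MiXCR's code: the function names, the `left_n_start` coordinate and the rule that everything after it counts as N region are hypothetical, and the threshold values in the usage lines are arbitrary rather than MiXCR defaults.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map every k-mer to the (read index, position) pairs where it occurs."""
    index = defaultdict(list)
    for i, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((i, pos))
    return index


def find_overlaps(left, left_n_start, rights, k, min_assemble_overlap, min_n_overlap):
    """Return (right index, merged sequence) for every accepted overlap.

    left_n_start is the index in `left` where non-template nucleotides begin;
    as a simplification, everything from there to the end of `left` is
    treated as N region.
    """
    if len(left) < k:
        return []
    index = build_kmer_index(rights, k)
    seed = left[-k:]  # seed k-mer taken from the 3' end of the left read
    accepted = []
    for i, pos in index.get(seed, []):
        right = rights[i]
        overlap = pos + k  # the seed ends at the end of `left`
        if overlap > len(left):
            continue  # right read would start before left; skipped in this sketch
        if left[-overlap:] != right[:overlap]:
            continue  # no mismatches are allowed in the overlapped region
        # number of N-region nucleotides of `left` that fall inside the overlap
        n_covered = len(left) - max(left_n_start, len(left) - overlap)
        if overlap >= min_assemble_overlap and n_covered >= min_n_overlap:
            accepted.append((i, left + right[overlap:]))
    return accepted


left = "TGTGCCAGCAGCGGACTAGC"  # N region assumed to start at index 12
rights = ["CAGCGGACTAGCAATGAGCAGTTC", "CAGCGTACTAGCAATG"]  # second one has a mismatch
print(find_overlaps(left, 12, rights, k=5, min_assemble_overlap=12, min_n_overlap=7))
print(find_overlaps(left, 12, rights, k=5, min_assemble_overlap=13, min_n_overlap=7))
```

The k-mer lookup only proposes candidates cheaply; the exact comparison and the two minimum thresholds decide whether a candidate is accepted. Raising either threshold trades recovered alignments for a lower risk of false overlaps.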
The partial assembly algorithm works in a pairwise manner, merging one pair of alignments at a time. Efficiency is sometimes increased if you perform two consecutive rounds of assemblePartial:
> mixcr assemblePartial alignments.vdjca alignments_rescued_1.vdjca
> mixcr assemblePartial alignments_rescued_1.vdjca alignments_rescued_2.vdjca
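Why a second round can help is easy to see in a toy model. The Python sketch below is an assumption-laden illustration, not MiXCR's algorithm: `one_round` is a hypothetical greedy pass that uses each fragment in at most one merge, and sequences are compared as plain strings.

```python
def suffix_prefix_overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


def one_round(fragments, min_overlap):
    """Merge disjoint pairs of fragments; each fragment is used at most once."""
    used = set()
    merged = []
    for i, left in enumerate(fragments):
        if i in used:
            continue
        for j, right in enumerate(fragments):
            if j == i or j in used:
                continue
            n = suffix_prefix_overlap(left, right)
            if n >= min_overlap:
                merged.append(left + right[n:])
                used.update((i, j))
                break
    # fragments that found no partner are passed through unchanged
    leftovers = [f for i, f in enumerate(fragments) if i not in used]
    return merged + leftovers


fragments = ["AAAACCCC", "CCCCGGGG", "GGGGTTTT"]  # three pieces of one molecule
round_1 = one_round(fragments, min_overlap=4)
round_2 = one_round(round_1, min_overlap=4)
print(round_1)
print(round_2)
```

In this model a molecule split into three fragments cannot be fully rebuilt in one pass, because the third fragment has no free partner until the first two have been merged. The second pass joins it to the contig produced by the first.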
Very short reads
For short-read input, some contigs/reads still only partially cover CDR3 even after assemblePartial. A substantial fraction of such contigs lack only several nucleotides on the 5’ or the 3’ end to make up a complete CDR3. These missing parts can be taken from the germline if the corresponding V or J gene of the contig is uniquely determined (e.g. from the second mate of a read pair). Such a procedure is not safe for IGs because of hypermutations, but for TCRs, whose sequence is relatively conserved near the conserved Cys and Phe/Trp, it can reconstruct additional clonotypes with a relatively small chance of introducing false ones. This procedure is implemented in the mixcr extend action; by default it acts only on TCR sequences.
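The germline fill-in idea can be sketched for the 5’ side as follows. This Python example is only a hedged illustration of the reasoning above and does not reproduce the extend action: the function name, its arguments, the coordinate convention and the toy germline sequence are all hypothetical.

```python
TCR_CHAINS = ("TRA", "TRB", "TRG", "TRD")


def extend_5prime(contig, chain, v_germlines, germline_pos, cdr3_begin):
    """Fill the missing 5' part of CDR3 from the V germline when it is safe.

    contig       : sequence whose first nucleotide aligns to germline index germline_pos
    chain        : chain name of the contig, e.g. "TRB" or "IGH"
    v_germlines  : germline sequences of all candidate V genes for this contig
    germline_pos : index in the germline where the contig starts
    cdr3_begin   : index in the germline of the first CDR3 nucleotide
    """
    if chain not in TCR_CHAINS:
        return contig  # IG chains: hypermutations make germline fill-in unsafe
    if len(v_germlines) != 1:
        return contig  # V gene is not uniquely determined
    if germline_pos <= cdr3_begin:
        return contig  # contig already reaches the CDR3 start
    germline = v_germlines[0]
    return germline[cdr3_begin:germline_pos] + contig


germline_v = "ACTTCTGTGCCAGCAGC"  # toy sequence; CDR3 assumed to start at index 5 (TGT)
contig = "CCAGCAGCGGAC"           # starts 4 nt downstream of the CDR3 start
print(extend_5prime(contig, "TRB", [germline_v], germline_pos=9, cdr3_begin=5))
print(extend_5prime(contig, "IGH", [germline_v], germline_pos=9, cdr3_begin=5))
```

The same logic would apply on the 3’ side with the J gene. The two guards mirror the conditions stated in the text: the gene has to be uniquely determined, and the chain has to be one where the region near the conserved residues is unlikely to be mutated.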