Barcode pattern syntax
Barcode patterns are used to extract barcodes (sample barcodes, UMIs, cell barcodes) from raw sequences, to trim sequencing reads, or to filter sequences out. MiXCR/MiTool provides a powerful regex-like pattern-matching language that can describe almost arbitrary barcode structures. In MiXCR, the pattern can be specified directly at the align step or with analyze, using the --tag-pattern option. In MiTool, the pattern must be provided for the parse step.
Example:
> mixcr align -s mmu \
--tag-pattern "^N{0:3}(UMI:N{12})attcGCCA(R1:*)\^N{18}(R2:*)" \
input_R1.fastq input_R2.fastq output.vdjca
Basic Syntax Elements
Uppercase/lowercase letters
Uppercase and lowercase letters specify the sequence that must be matched. Uppercase letters require a perfect match. Lowercase letters allow a fuzzy match, in which the maximum total number of mismatches is determined by --tag-max-error-budget (indels are not supported). The --tag-max-error-budget value (default 10) is a total penalty score in bits:
- one mismatch with a normal nucleotide (a, t, g, c) costs 2 bits,
- one mismatch with an IUPAC wildcard (N, w, s, m, etc.) costs 1 bit.
Each pattern has to start with ^, which marks the beginning of the read. Additionally, $ can be used to mark the end of the read.
Examples:
"^ATGCctaggcTTCGA"
matches:
ATGCCTAGGCTTCGA....
ATGCGTAGGCTTCGA....
ATGCCAACGCTTCGA....
"^ATGCctaggcTTCGA$"
matches:
ATGCCTAGGCTTCGA
ATGCGTAGGCTTCGA
ATGCCAACGCTTCGA
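The penalty arithmetic described above can be sketched in Python. This is only an illustration, not MiXCR/MiTool code: the function name penalty_bits is hypothetical, and treating the budget as an inclusive upper limit is an assumption.

```python
# Sketch only: a hypothetical scorer, not the MiXCR/MiTool implementation.
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def penalty_bits(pattern, read):
    """Return the mismatch penalty in bits, or None if an uppercase
    (exact) position fails or the read is shorter than the pattern."""
    if len(read) < len(pattern):
        return None
    bits = 0
    for p, base in zip(pattern, read):
        if base in IUPAC[p.upper()]:
            continue                      # position matches, no cost
        if p.isupper():
            return None                   # uppercase must match exactly
        bits += 2 if p in "acgt" else 1   # plain letter: 2 bits, wildcard: 1 bit
    return bits

BUDGET = 10  # default --tag-max-error-budget; inclusive limit is an assumption

for read in ("ATGCCTAGGCTTCGA", "ATGCGTAGGCTTCGA", "ATGCCAACGCTTCGA"):
    bits = penalty_bits("ATGCctaggcTTCGA", read)
    print(read, bits, bits is not None and bits <= BUDGET)
```

For the three example reads the sketch reports penalties of 0, 2 and 4 bits, all within the default budget.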
Backslash \
The backslash symbol \ is a mate-pair separator. The pattern to the left of the separator is matched against the first input file (usually R1), and the pattern to the right is matched against the second input file (usually R2). For single-read input, the \ character should be omitted. By default, during barcode extraction MiTool checks input reads in the order in which they are specified in the --input argument. If the --tag-parse-unstranded argument is specified, it also tries to match the pattern against the reads in swapped order.
Examples:
"^ATTAGACA \ ^CACATATA"
By default matches:
R1: ATTAGACA.......
R2: CACATATA.......
but not:
R2: ATTAGACA......
R1: CACATATA......
With --tag-parse-unstranded the last match is also allowed.
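The read-order behaviour can be illustrated with a small Python sketch. It handles only literal uppercase prefixes (no wildcards, groups or fuzzy matching), and the function match_pair and its unstranded parameter are hypothetical names chosen for this illustration.

```python
# Sketch only: literal prefixes, no wildcards, groups or fuzzy matching.
def match_pair(pattern, r1, r2, unstranded=False):
    """Match a two-read pattern whose halves are separated by a backslash."""
    left, right = (part.strip().lstrip("^") for part in pattern.split("\\"))
    if r1.startswith(left) and r2.startswith(right):
        return True
    # With unstranded parsing, also try the reads in swapped order.
    return unstranded and r2.startswith(left) and r1.startswith(right)

pattern = r"^ATTAGACA \ ^CACATATA"
print(match_pair(pattern, "ATTAGACAGGT", "CACATATACCA"))                   # True
print(match_pair(pattern, "CACATATACCA", "ATTAGACAGGT"))                   # False
print(match_pair(pattern, "CACATATACCA", "ATTAGACAGGT", unstranded=True))  # True
```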
Wildcard *
The asterisk symbol * matches any nucleotide any number of times.
Examples:
"^* \ ^CACATATA"
matches:
R1: TGGATTCAGCGC...
R2: CACATATA...
Repeat {<X>}
{<X>} repeats the preceding symbol X times. A range can be given as {<min>:<max>}, and either bound may be omitted.
Examples:
"^G{4}*"
matches:
GGGGTCCACAT
"^A{1:3}*"
matches:
ATGGGCAT
AATGGGCAT
AAATGGGCAT
"^A{:3}*"
matches:
TGGGCAT
ATGGGCAT
AATGGGCAT
AAATGGGCAT
"^A{3:}*"
matches:
AAATGGGCAT
AAAATGGGCAT
AAAAATGGGCAT
The * wildcard matches the same pattern as N{:}.
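One way to get a feel for the repeat syntax is to translate it into an ordinary regular expression. The Python sketch below does that for a small subset of the language (^, $, uppercase A/C/G/T, N, * and repeats). The helper pattern_to_regex is hypothetical, and the assumption that the repeat bounds behave like regex quantifiers is mine, not something taken from the MiXCR/MiTool sources.

```python
import re

def pattern_to_regex(pattern):
    """Translate a small subset of the barcode pattern syntax to a regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "^$":
            out.append(ch)
        elif ch == "*":
            out.append("[ACGTN]*")          # any nucleotide, any number of times
        elif ch == "N":
            out.append("[ACGTN]")           # any single nucleotide
        elif ch in "ACGT":
            out.append(ch)                  # exact nucleotide
        elif ch == "{":
            end = pattern.index("}", i)
            lo, sep, hi = pattern[i + 1:end].partition(":")
            if sep:                         # range form {min:max}
                out.append("{%s,%s}" % (lo or "0", hi))
            else:                           # fixed count {X}
                out.append("{%s}" % lo)
            i = end
        else:
            raise ValueError("unsupported symbol: %r" % ch)
        i += 1
    return "".join(out)

for pat, read in [("^G{4}*", "GGGGTCCACAT"), ("^A{1:3}*", "AATGGGCAT"),
                  ("^A{:3}*", "TGGGCAT"), ("^A{3:}*", "AATGGGCAT")]:
    regex = pattern_to_regex(pat)
    print(pat, "->", regex, "|", read, bool(re.match(regex, read)))
```

Under this interpretation the first three pairs match, while the last one does not because the read starts with only two A nucleotides.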
Trimming
<{<X>} and >{<X>} allow trimming from 0 to X nucleotides from the left or the right side of the pattern, respectively.
Examples:
"^<{3}ATTAGACA"
equals:
^<<<ATTAGACA
Matches:
ATTAGACAATTAGACAATTAGACA...
TTAGACAATTAGACAATTAGACA...
TAGACAATTAGACAATTAGACA...
AGACAATTAGACAATTAGACA...
"ATTAGACA>{3}(R1:*)"
Matches:
GATGTATTAGACAGACGAGTCATGCGTATT...
========[------R1-----------
GATGTATTAGACGACGAGTCATGCGTATT....
=======[-------R1-----------
GATGTATTAGAGACGAGTCATGCGTATT.....
======[-------R1------------
GATGTATTAGGACGAGTCATGCGTATT......
=====[--------R1------------
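The effect of trimming can be pictured as a list of progressively shortened motifs, any of which is accepted. The following Python sketch enumerates them for the two examples above. The function trim_variants is a hypothetical helper written for this illustration and does not reflect how MiXCR/MiTool implements trimming internally.

```python
def trim_variants(motif, max_trim, side):
    """List the motif with 0..max_trim nucleotides removed from one side."""
    if side == "left":      # corresponds to <{max_trim}motif
        return [motif[k:] for k in range(max_trim + 1)]
    if side == "right":     # corresponds to motif>{max_trim}
        return [motif[:len(motif) - k] for k in range(max_trim + 1)]
    raise ValueError("side must be 'left' or 'right'")

print(trim_variants("ATTAGACA", 3, "left"))
# ['ATTAGACA', 'TTAGACA', 'TAGACA', 'AGACA']
print(trim_variants("ATTAGACA", 3, "right"))
# ['ATTAGACA', 'ATTAGAC', 'ATTAGA', 'ATTAG']
```

The left-trimmed variants are the prefixes of the four reads in the first example, and the right-trimmed variants are the motif remnants that precede R1 in the second.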
Capture groups
MiTool can extract multiple groups and assign them to the read (or to the alignment in the case of MiXCR). Typical groups are different types of barcodes: molecular barcode, sample barcode, cell barcode, etc.
A group is defined inside round brackets () in the following manner:
(GROUP_NAME:pattern)
Examples:
(UMI:NNNNANNNNNANNNN)
(SMPL:NNNN)
(CELL:atgcTTGANNNNNNNNTGAATCCNN)
(R1:*)
(SMPL:NNNN)(UMI:N{12})(R1:*)
Some rules apply to group names:
- Everything that starts with CELL is treated as a cell barcode.
- Everything that starts with UMI or MI (e.g. MIG) is used as a molecular barcode.
- Everything that starts with S is a sample barcode.
- R1, R2, etc. groups define the payload read sequence.
- Groups with names that don't fall under the rules above are ignored.
Important: sequences outside the R1, R2, etc. groups are ignored and are not used in the analysis.
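The naming rules can be restated as a small Python function. This is a sketch for illustration only: classify_group is a hypothetical name, and checking the rules in the order they are listed is an assumption.

```python
import re

def classify_group(name):
    """Map a group name to its role using the prefix rules listed above."""
    if name.startswith("CELL"):
        return "cell barcode"
    if name.startswith(("UMI", "MI")):
        return "molecular barcode"
    if name.startswith("S"):
        return "sample barcode"
    if re.fullmatch(r"R\d+", name):
        return "payload"
    return "ignored"

for name in ("CELL", "UMI", "MIG", "SMPL", "R1", "R2", "XYZ"):
    print(name, "->", classify_group(name))
```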
Examples:
"^(CELL:N{4})(UMI:N{5})\^(R2:*)"
matches:
R1: ATGCGGGTGACCTTGAGGTGGACC...
R2: TGGGGTAGCCTACCGTGGACACTG...
The whole sequence of the read from the second file will be extracted with the R2 tag and used in the downstream analysis. This pattern is commonly used when only one read contains the CDR3 sequence (R2 in this case) and the other one is used only for extracting molecular and/or cell barcodes.
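To make the extraction concrete, the pattern above can be mimicked with Python named groups. The two regular expressions are hand-written approximations of the two halves of the pattern, and extract_tags is a hypothetical helper. None of this is MiXCR/MiTool code.

```python
import re

# Hand-written regex approximation of "^(CELL:N{4})(UMI:N{5})\^(R2:*)";
# the two halves are applied to R1 and R2 separately.
R1_RE = re.compile(r"^(?P<CELL>[ACGTN]{4})(?P<UMI>[ACGTN]{5})")
R2_RE = re.compile(r"^(?P<R2>[ACGTN]*)")

def extract_tags(r1, r2):
    """Return a dict of extracted groups, or None if either read fails."""
    m1, m2 = R1_RE.match(r1), R2_RE.match(r2)
    if m1 is None or m2 is None:
        return None
    return {**m1.groupdict(), **m2.groupdict()}

tags = extract_tags("ATGCGGGTGACCTTGAGGTGGACC", "TGGGGTAGCCTACCGTGGACACTG")
print(tags)
# {'CELL': 'ATGC', 'UMI': 'GGGTG', 'R2': 'TGGGGTAGCCTACCGTGGACACTG'}
```

The rest of R1 after the first nine nucleotides lies outside any group, so it is dropped, in line with the note that sequences outside the payload groups are not used.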
Logical OR
There are two levels at which logical "OR" can be applied:
- | is a single-read-level "or",
- || is a whole-pattern-level "or".
Constraints:
- There must be the same set of matching groups on both sides of "|" and "||".
- There must be the same number of sub-read patterns on both sides of "||".
Examples:
"^ATTAGACA(UMI:NNNN)(R1:*) | ^TGCTTGCA(UMI:NNNN)(R1:*) \ ^(R2:*)"
matches:
R1: ATTAGACATTGCCCTGGGATCCG...
R2: TGCCGTGATTATGCCGTGATTGT...
and
R1: TGCTTGCATTGCCCTGGGATCCG...
R2: TGCCGTGATTATGCCGTGATTGT...
"^ATTAGACA(UMI:NNNN) | ^AGGACACA(UMI:NNNN) \ ^GATACGA || ^GATAGAC \ ^TAGCA(UMI:NNNNNNN)"
matches:
R1: ATTAGACAtgctaagc....
R2: GATACGAgtacgttgt....
and
R1: AGGACACAgctaagct....
R2: GATACGAgtacgttgt....
and
R1: GATAGACtgctaagc....
R2: TAGCAgtacgttgtt....
The following patterns will result in an error due to violation of the constraints mentioned above:
^ATTAGACA(UMI:NNNN) | ^ATTACACA \ ^GATACGA || ^GATAGAC \ ^TAGCA(UMI:NNNNNNN)
^ATTAGACA(UMI:NNNN) | ^ATTACACA(UMI:NNNNNNN) \ ^GATACGA || ^GATAGAC(UMI1:NNNNNNN) \ ^TAGCA(UMI:NNNNNNN)
^ATTAGACA(UMI:NNNN) | ^ATTACACA(UMI:NNNNNNN) \ ^GATACGA || ^GATAGAC \ ^TAGCA
^ATTAGACA | ^ATTACACA \ ^GATACGA || ^GATAGAC
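The two constraints can be checked mechanically. The Python sketch below splits a pattern on ||, then on \, then on |, and compares group names and sub-read counts. The helpers group_names and or_violations are hypothetical, the splitting is purely textual, and the real parser may report errors differently.

```python
import re

def group_names(fragment):
    """Set of capture-group names, e.g. {'UMI'}, found in a pattern fragment."""
    return frozenset(re.findall(r"\((\w+):", fragment))

def or_violations(pattern):
    """Return the constraints a pattern violates (empty list means OK)."""
    problems = []
    # Each whole-pattern alternative becomes a list of per-read sub-patterns.
    wholes = [w.split("\\") for w in pattern.split("||")]
    if len({len(reads) for reads in wholes}) > 1:
        problems.append("'||': different number of sub-read patterns")
    if len({group_names("".join(reads)) for reads in wholes}) > 1:
        problems.append("'||': different sets of groups")
    for reads in wholes:
        for read in reads:
            if len({group_names(alt) for alt in read.split("|")}) > 1:
                problems.append("'|': different sets of groups")
    return problems

ok = r"^ATTAGACA(UMI:NNNN) | ^AGGACACA(UMI:NNNN) \ ^GATACGA || ^GATAGAC \ ^TAGCA(UMI:NNNNNNN)"
bad = r"^ATTAGACA | ^ATTACACA \ ^GATACGA || ^GATAGAC"
print(or_violations(ok))   # []
print(or_violations(bad))  # ["'||': different number of sub-read patterns"]
```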