
Creating RepSeq.io formatted JSON library

Suppose we have a bunch of de-novo discovered V/D/J/C sequences in fasta files with the following content:

my_genes.v.fasta (contains VRegion, i.e. from the beginning of FR1 to the last nucleotide right before the RSS, normally somewhere after the conserved cysteine)

>TRBV12-348*00|F
GATGCTGGAGTTATCCAGTCACCCCGCCATGAGGTGACAGAGATGGGACAAGAAGTGACTCTGAGATGTAAACCA
ATTTCAGGCCACAACTCCCTTTTCTGGTACAGACAGACCATGATGCGGGGACTGGAGTTGCTCATTTACTTTAAC
AACAACGTTCCGATAGATGATTCAGGGATGCCCGAGGATCGATTCTCAGCTAAGATGCCTAATGCATCATTCTCC
ACTCTGAAGATCCAGCCCTCAGAACCCAGGGACTCAGCTGTGTACTTCTGTGCCAGCAGTTTAGC

my_genes.j.fasta (contains JRegion, i.e. from the first J gene nucleotide, right after the RSS, to the end of FR4)

>TRBJ1-528*00|F
TAACAACCAGGCCCAGTATTTTGGAGAAGGGACTCGGCTCTCTGTTCTAG

To use these sequences in MiXCR or any other repseqio-based software, we have to create a JSON library file for them (see format description here).

There are two main ways to create a library file:

  • create a repseqio-JSON formatted library in two automated steps and then, if required, fill in the information that was not detected automatically
  • write the JSON file manually from scratch, providing meta information and the positions of CDRs (complementarity determining regions) and FRs (framework regions), along with the positions of other important gene features required by downstream software (see list of available anchor points here).

Here we cover the automatic import procedure. Please see the library format description for more details.

Automatically create boilerplate library

We can do this using the fromFasta action, once per gene type, and then merge the results:

> repseqio fromFasta --taxon-id 9606 \
    --species-name hs --species-name homsap \
    --chain TRB --name-index 0 \
    --gene-type V --gene-feature VRegion \
    my_genes.v.fasta my_library.v.json

> repseqio fromFasta --taxon-id 9606 \
    --species-name hs --species-name homsap \
    --chain TRB --name-index 0 \
    --gene-type D --gene-feature DRegion \
    my_genes.d.fasta my_library.d.json

> repseqio fromFasta --taxon-id 9606 \
    --species-name hs --species-name homsap \
    --chain TRB --name-index 0 \
    --gene-type J --gene-feature JRegion \
    my_genes.j.fasta my_library.j.json

> repseqio merge my_library.v.json my_library.d.json my_library.j.json my_library.json

> rm my_library.v.json my_library.d.json my_library.j.json
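
The three fromFasta calls differ only in gene type, gene feature and file names, so they can also be generated in a loop. The following is a minimal Python sketch that builds the same argument lists as the commands above. The helper names (from_fasta_args, build_commands) are hypothetical and not part of repseqio, and only options already shown in this section are used.

```python
import shlex


def from_fasta_args(gene_type, fasta, out_json):
    """Argument list for one `repseqio fromFasta` call.

    All options are copied from the commands shown above; nothing else
    about the repseqio command line is assumed.
    """
    return [
        "repseqio", "fromFasta",
        "--taxon-id", "9606",
        "--species-name", "hs", "--species-name", "homsap",
        "--chain", "TRB", "--name-index", "0",
        "--gene-type", gene_type,
        "--gene-feature", gene_type + "Region",  # VRegion, DRegion, JRegion
        fasta, out_json,
    ]


def build_commands(prefix="my_genes", library="my_library"):
    """Return the fromFasta commands for V, D and J plus the final merge."""
    commands = []
    parts = []
    for gene_type in ("V", "D", "J"):
        suffix = gene_type.lower()
        out_json = f"{library}.{suffix}.json"
        commands.append(from_fasta_args(gene_type, f"{prefix}.{suffix}.fasta", out_json))
        parts.append(out_json)
    commands.append(["repseqio", "merge", *parts, f"{library}.json"])
    return commands


if __name__ == "__main__":
    # Only print the commands; run them with subprocess.run(cmd, check=True)
    # once repseqio is installed and the fasta files exist.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Building argument lists instead of shell strings avoids quoting problems with file names and makes it easy to review the commands before anything is executed.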

To check the library file we have so far, we can run the debug command:

> repseqio debug my_library.json

This will print the following (assuming we used the files mentioned above):

TRBV12-348*00 (F) TRB

WARNINGS:
Unable to find CDR3 start


V5UTRGermline
N:   Not Available
AA:  Not Available

...

GermlineVCDR3Part
N:   Not Available
AA:  Not Available

VRegion
N:   GATGCTGGAGTTATCCAGTCACCCCGCCATGAGGTGACAGAGATGGGACAAGAAGTGACTCTGAGATGTAAACCAATTTCAGGCCACAACTCCCTTTTCTGGTACAGACAGACCATGATGCGGGGACTGGAGTTGCTCATTTACTTTAACAACAACGTTCCGATAGATGATTCAGGGATGCCCGAGGATCGATTCTCAGCTAAGATGCCTAATGCATCATTCTCCACTCTGAAGATCCAGCCCTCAGAACCCAGGGACTCAGCTGTGTACTTCTGTGCCAGCAGTTTAGC
AA:  DAGVIQSPRHEVTEMGQEVTLRCKPISGHNSLFWYRQTMMRGLELLIYFNNNVPIDDSGMPEDRFSAKMPNASFSTLKIQPSEPRDSAVYFCASSL_

...

=========

TRBJ1-528*00 (F) TRB

WARNINGS:
Unable to find CDR3 end


JRegion
N:   TAACAACCAGGCCCAGTATTTTGGAGAAGGGACTCGGCTCTCTGTTCTAG
AA:  Not Available

...

FR4
N:   Not Available
AA:  Not Available
=========

This shows how repseqio sees the library content. After the fromFasta action, the library only contains the begin and end positions of genes (strictly speaking, the begin and end positions of the gene feature we specified with the --gene-feature option), so the only regions it can extract are VRegion for V genes and JRegion for J genes (see illustration here). For normal repertoire extraction we must, at a minimum, specify the positions of CDR3Begin (in V genes) and CDR3End (in J genes), and we probably also need the FRs if we plan to extract the corresponding regions from repertoire data. Here we again have two options:

  • manually specify corresponding positions by adding new items to the anchorPoints field (see library format description)
  • let repseqio find sequences with known anchor points that are homologous to ours in another library (the built-in library in this case) and infer the missing anchor points from them.

The first option may be the only way if the target V/J segments are not homologous to any sequence in the available library.

For the second approach we can use the inferPoints action of the repseqio utility, with the built-in repseqio library as a reference (used by default) (see library repo here):

> repseqio inferPoints -g VRegion -g JRegion -f my_library.json my_library.json

Here we inferred points for V genes based on alignment of VRegion with V genes from the built-in repseqio library, and for J genes based on alignment of JRegion. my_library.json is specified as both the input and the output file, together with the -f option, so it is overwritten in place with the result.

!! Don't use this execution pattern for libraries containing manual edits or other hands-on time investment: this command may delete or corrupt the file.

!! The output (alignments) of this command should be carefully analysed to detect inconsistencies this automated procedure may introduce, and to spot genes for which repseqio failed to find homologous genes.
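
Since the in-place call can destroy the input file, one cautious pattern is to copy the library before running inferPoints. The sketch below is only an illustration: backup_library and infer_points_args are hypothetical helper names, the ".bak" suffix is an arbitrary choice, and the repseqio options are exactly the ones used in the command above.

```python
import shutil
from pathlib import Path


def backup_library(path):
    """Copy the library to '<name>.bak' next to it and return the backup path."""
    src = Path(path)
    backup = src.with_name(src.name + ".bak")
    shutil.copy2(src, backup)  # copy2 also preserves timestamps
    return backup


def infer_points_args(path):
    """Argument list for the in-place inferPoints call shown above."""
    return [
        "repseqio", "inferPoints",
        "-g", "VRegion", "-g", "JRegion",
        "-f", str(path), str(path),
    ]
```

A typical use would be to call backup_library("my_library.json") first and then pass infer_points_args("my_library.json") to subprocess.run(..., check=True); if the result looks wrong, the ".bak" copy can be restored.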

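The output file (my_library.json when overwritten in place as above; the commands further below assume it was saved as my_library2.json) will contain the library with inferred points: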

[ {
  "taxonId": 9606,
  "speciesNames": [ "homosapiens", "homsap", "hs", "hsa", "human" ],
  "genes": [ {
    "baseSequence": "file://my_genes.fasta#TRBV12-348*00",
    "name": "TRBV12-348*00",
    "geneType": "V",
    "isFunctional": true,
    "chains": [ "TRB" ],
    "anchorPoints": {
      "FR1Begin": 0,
      "CDR1Begin": 78,
      "FR2Begin": 93,
      "CDR2Begin": 144,
      "FR3Begin": 162,
      "CDR3Begin": 273,
      "VEnd": 290
    }
  }, {
    "baseSequence": "file://my_genes.fasta#TRBJ1-528*00",
    "name": "TRBJ1-528*00",
    "geneType": "J",
    "isFunctional": true,
    "chains": [ "TRB" ],
    "anchorPoints": {
      "JBegin": 0,
      "FR4Begin": 22,
      "FR4End": 50
    }
  } ]
} ]
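
To make the meaning of the anchor points concrete, the short Python sketch below slices the J gene sequence from my_genes.j.fasta at the positions listed above. It assumes that anchor points are 0-based offsets into the base sequence and that each region runs from one anchor up to, but not including, the next one; this is consistent with FR4End being 50 for a 50-nucleotide sequence, but it is an interpretation, not a statement of the library format. The function name regions_between_anchors is hypothetical.

```python
# J gene sequence and anchor points copied from the examples above.
J_SEQUENCE = "TAACAACCAGGCCCAGTATTTTGGAGAAGGGACTCGGCTCTCTGTTCTAG"
J_ANCHOR_POINTS = {"JBegin": 0, "FR4Begin": 22, "FR4End": 50}


def regions_between_anchors(sequence, anchor_points):
    """Cut the sequence into the regions delimited by consecutive anchor points.

    Assumption: positions are 0-based and the end of each region is exclusive.
    """
    ordered = sorted(anchor_points.items(), key=lambda item: item[1])
    regions = {}
    for (start_name, start), (end_name, end) in zip(ordered, ordered[1:]):
        regions[f"{start_name}..{end_name}"] = sequence[start:end]
    return regions


if __name__ == "__main__":
    for name, nucleotides in regions_between_anchors(J_SEQUENCE, J_ANCHOR_POINTS).items():
        print(name, nucleotides)
```

Under these assumptions the FR4Begin..FR4End slice is the FR4 region that the debug command could not report before the points were inferred.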

After the final library is built, consider running repseqio debug -p my_library.json. This checks the library and prints information about any problems it detects.

To simplify further distribution of the library, one may want to compile it into a single file containing all required sequence information; see the repseqio compile docs.

Creating library from IMGT-style padded fasta file

Note: you can download an already converted IMGT library here.

The repseqio utility contains a special action, fromPaddedFasta, to convert IMGT-style libraries to JSON format.

Example input file with V genes (say imgt_lib_v.fasta):

>AE000659|TRAV12-3*01|Homo sapiens|F|V-REGION|221187..221463|277 nt|1| | | | |277+45=322| | |
cagaaggaggtggagcaggatcctggaccactcagtgttccagagggagccattgtttct
ctcaactgcacttacagcaacagtgct..................tttcaatacttcatg
tggtacagacagtattccagaaaaggccctgagttgctgatgtacacatactcc......
......agtggtaacaaagaagat...............ggaaggtttacagcacaggtc
gataaatccagcaagtatatctccttgttcatcagagactcacagcccagtgattcagcc
acctacctctgtgcaatgagcg
>M17656|TRAV12-3*02|Homo sapiens|(F)|V-REGION|67..343|277 nt|1| | | | |277+45=322| | |
cagaaggaggtggagcaggatcctggaccactcagtgttccagagggagccattgtttct
ctcaactgcacttacagcaacagtgct..................tttcaatacttcatg
tggtacagacagtattccagaataggccctgagttgctgatgtacacatactcc......
......agtggtaacaaagaagat...............ggaaggtttacagcacaggtc
gataaatccagcaagtatatctccttgttcatcagagactcacagcccagtgattcagcc
acctacctctgtgcaatgagcg

Example input file with J genes (say imgt_lib_j.fasta):

>X02885|TRAJ12*01|Homo sapiens|F|J-REGION|53..112|60 nt|3| | | | |60+0=60| | |
ggatggatagcagctataaattgatcttcgggagtgggaccagactgctggtcaggcctg
>M94081|TRAJ13*01|Homo sapiens|F|J-REGION|71280..71342|63 nt|3| | | | |63+0=63| | |
tgaattctgggggttaccagaaagttacctttggaattggaacaaagctccaagtcatcc
caa
>AC023226|TRAJ13*02|Homo sapiens|F|J-REGION|51292..51354|63 nt|3| | | | |63+0=63| |rev-compl|
tgaattctgggggttaccagaaagttacctttggaactggaacaaagctccaagtcatcc
caa

To use the fromPaddedFasta action, you should specify the positions of anchor points (see here) in the padded file. Here are the most common options for V genes in IMGT:

-PFR1Begin=0 -PCDR1Begin=78 -PFR2Begin=114 -PCDR2Begin=165 -PFR3Begin=195 -PCDR3Begin=309 -PVEnd=-1
and for J genes:
-PJBegin=0 -PFR4Begin=-31 -LFR4Begin='[WF](G.G)' -PFR4End=-1
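
The positions given with -P refer to the padded (dotted) sequence, while the resulting library points into un-padded sequences. The Python sketch below illustrates how a non-negative padded position presumably maps to an un-padded one: by subtracting the number of padding dots that precede it. This is an illustration of the idea, not repseqio's implementation; unpadded_position is a hypothetical helper, and negative values such as -1 or -31 from the options above are deliberately not handled. The example uses the first two lines of TRAV12-3*01 from the V gene file.

```python
def unpadded_position(padded_sequence, padded_pos, pad="."):
    """Map a 0-based position in a padded sequence to the un-padded sequence.

    Only non-negative positions are supported in this sketch.
    """
    if not 0 <= padded_pos <= len(padded_sequence):
        raise ValueError("position outside the padded sequence")
    return padded_pos - padded_sequence[:padded_pos].count(pad)


# First two 60-character lines of TRAV12-3*01 (18 padding dots in the second line).
PADDED = (
    "cagaaggaggtggagcaggatcctggaccactcagtgttccagagggagccattgtttct"
    "ctcaactgcacttacagcaacagtgct" + "." * 18 + "tttcaatacttcatg"
)
UNPADDED = PADDED.replace(".", "")

cdr1_begin = unpadded_position(PADDED, 78)   # -PCDR1Begin=78
fr2_begin = unpadded_position(PADDED, 114)   # -PFR2Begin=114

if __name__ == "__main__":
    print(cdr1_begin, fr2_begin, UNPADDED[cdr1_begin:fr2_begin])
```

For this gene the padded CDR1 span of 36 positions shrinks to 18 nucleotides once the dots are removed, which is why the anchor points stored in the JSON library differ from the padded values passed on the command line.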

Here are example commands for input files provided above:

> repseqio fromPaddedFasta -t 9606 -c TRA -j 3 -n 1 -g V -PFR1Begin=0 -PCDR1Begin=78 -PFR2Begin=114 -PCDR2Begin=165 -PFR3Begin=195 -PCDR3Begin=309 -PVEnd=-1 imgt_lib_v.fasta imgt_lib_v.json.fasta imgt_lib_v.json

> repseqio fromPaddedFasta -t 9606 -c TRA -j 3 -n 1 -g J -PJBegin=0 -PFR4Begin=-31 -LFR4Begin='[WF](G.G)' -PFR4End=-1 imgt_lib_j.fasta imgt_lib_j.json.fasta imgt_lib_j.json

This will create the library files imgt_lib_j.json and imgt_lib_v.json, along with the un-padded imgt_lib_j.json.fasta and imgt_lib_v.json.fasta files that the libraries refer to (see the section above for more information on the JSON library format).
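
The -n 1 and -j 3 options in these commands pick fields from the fasta header. The Python sketch below shows the idea under two assumptions that are not confirmed by this page: that header fields are separated by "|" and counted from zero, and that -n selects the gene name while -j selects the functionality flag. header_fields is a hypothetical helper written for illustration only.

```python
def header_fields(header_line, name_index, functionality_index=None):
    """Split a fasta header on '|' and pick the gene name and functionality fields.

    Assumption: indices are zero-based positions in the '|'-separated header.
    """
    fields = header_line.lstrip(">").split("|")
    name = fields[name_index]
    functionality = None
    if functionality_index is not None:
        functionality = fields[functionality_index]
    return name, functionality


IMGT_HEADER = (
    ">AE000659|TRAV12-3*01|Homo sapiens|F|V-REGION|221187..221463|277 nt|1"
    "| | | | |277+45=322| | |"
)

if __name__ == "__main__":
    # IMGT-style header, as with -n 1 -j 3
    print(header_fields(IMGT_HEADER, 1, 3))
    # Header from my_genes.v.fasta, as with --name-index 0
    print(header_fields(">TRBV12-348*00|F", 0))
```

With these assumptions the IMGT header yields the name TRAV12-3*01 and the functionality flag F, matching the second and fourth fields of the header.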

Using the library

To use your library with MiXCR, copy the JSON file and all referenced fasta files to the ~/.mixcr/libraries folder (example for the fasta-based library built in the first section):

> mkdir -p ~/.mixcr/libraries
> cp my_library2.json ~/.mixcr/libraries/my_library.json

Then run mixcr as follows:

> mixcr align --library my_library -s homsap ...

To simplify library distribution, the library can be packed into a single file along with all sequence information (note that this procedure only incorporates the regions of the sequences that are used inside the library, so it will not pack a whole chromosome sequence, only the parts referenced in the library):

> repseqio compile my_library2.json my_library.compiled.json.gz
(repseqio also supports direct reading from gzipped files)

Now only a single file has to be copied to the library folder:

> cp my_library.compiled.json.gz ~/.mixcr/libraries/my_library.compiled.json.gz