2015-07-04

NCBI working on SAM output from BLAST+

Recently NCBI BLAST+ 2.2.31 was released, and it contains an undocumented "Easter Egg" - this is still very rough around the edges but they're working on SAM format output!

The command line help in BLAST+ 2.2.31 only describes output formats 0 to 14:

$ ~/Downloads/ncbi-blast-2.2.31+/bin/blastp -help
USAGE
...

DESCRIPTION
   Protein-Protein BLAST 2.2.31+

...

 *** Formatting options
 -outfmt 
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1),
    12 = JSON Seqalign output,
    13 = JSON Blast output,
    14 = XML2 Blast output
   
   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
...

I discovered by accident that 15 offers SAM format output. First, using the sample files from this repository, here's an example using -outfmt 6, the concise tabular output:

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 6
gi|57163783|ref|NP_001009242.1| sp|P08100|OPSD_HUMAN 96.55 348 12 0 1 348 1 348 0.0 701
gi|3024260|sp|P56514.1|OPSD_BUFBU sp|P08100|OPSD_HUMAN 83.33 354 53 2 1 354 1 348 0.0 605
gi|283855846|gb|ADB45242.1| sp|P08100|OPSD_HUMAN 94.82 328 17 0 1 328 11 338 0.0 630
gi|283855823|gb|ADB45229.1| sp|P08100|OPSD_HUMAN 94.82 328 17 0 1 328 11 338 0.0 630
gi|223523|prf||0811197A sp|P08100|OPSD_HUMAN 93.10 348 23 1 1 347 1 348 0.0 651
gi|12583665|dbj|BAB21486.1| sp|P08100|OPSD_HUMAN 81.09 349 65 1 1 349 1 348 0.0 587

Here's what BLAST+ 2.2.31 gives with -outfmt 15 instead:

$ blastp -query rhodopsin_proteins.fasta -db four_human_proteins.fasta -evalue 0.0001 -outfmt 15
@HD VN:1.2 GO:query
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
@SQ SN:gnl|BL_ORD_ID|3 LN:348
lcl|Query_1 0 gnl|BL_ORD_ID|3 1 255 348M * 0 0 * * AS:i:1808 EV:f:NM:i:0 PI:f:96.55 BS:f:701.049
lcl|Query_2 0 gnl|BL_ORD_ID|3 1 255 333M1I8M5I7M * 0 0 * * AS:i:1560 EV:f:0 NM:i:6 PI:f:84.77 BS:f:605.52
lcl|Query_3 0 gnl|BL_ORD_ID|3 11 255 328M * 0 0 * * AS:i:1625 EV:f:NM:i:0 PI:f:94.82 BS:f:630.558
lcl|Query_4 0 gnl|BL_ORD_ID|3 11 255 328M * 0 0 * * AS:i:1625 EV:f:NM:i:0 PI:f:94.82 BS:f:630.558
lcl|Query_5 0 gnl|BL_ORD_ID|3 1 255 190M1D157M * 0 0 * * AS:i:1680 EV:f:0 NM:i:1 PI:f:93.37 BS:f:651.744
lcl|Query_6 0 gnl|BL_ORD_ID|3 1 255 328M1I20M5H * 0 0 * * AS:i:1512 EV:f:0 NM:i:1 PI:f:81.32 BS:f:587.03

Note the use of standard SAM/BAM format tags - looking at the first match:
  • AS:i:1808 - integer alignment score, here 1808
  • NM:i:0 - integer edit distance, here 0
Plus non-standard tags:
  • EV:f:0 - float e-value, here 0
  • PI:f:96.55 - float percent identify, here 96.55
  • BS:f:701.049 - float bit-score, here 701.049

Sadly rather than the real names for the query and matches, BLAST's internal names are being used (e.g. lcl|Query_1 and gnl|BL_ORD_ID|3) rather than gi|57163783|ref|NP_001009242.1| and sp|P08100|OPSD_HUMAN). This reminds me of earlier problems, see my older post "BLAST+ should keep its BL_ORD_ID identifiers to itself".

Also there's an obvious bug in the duplication of the "reference" @SQ lines in the header.

For bonus points they should add a @PG line to the header as well, giving the BLAST+ version etc.

Still, as long as these niggles are sorted out for the first official release to offer SAM output, I think this will be surprisingly useful. For now, there are options like the BLAST XML to SAM tool Pierre Lindenbaum's student Aurélien Guy-Duché wrote (blog post, repository).

Update (6 July 2015)

Last night on Zach Charlop-powers got in touch on Twitter (@zach_cp) to report using -parse_deflines fixes the query names, and if the database was built using makeblastdb -parse_seqids ... then this fixes the match names. He put together an example too.

Unfortunately that doesn't work if you are not using the NCBI specific pipe-character based naming scheme (see also My IDs not good enough for NCBI BLAST+).

P.S. To date SAM/BAM has only been used for nucleotides, but my example above was from a protein BLAST so it was apparently using SAM format working in amino acid residues rather than base pairs.

Update (31 December 2015)

SAM output is now officially documented as a beta feature in BLAST+ 2.3.0 (released 21 December 2015).


Update (2 June 2017)

Note in the official support, SAM output is available via -outfmt 17 (not 15 as in the examples above), and is only in blastn. It is not in blastp etc (which makes sense as SAM is about nucleotide alignments).

7 comments:

  1. Thanks for keeping this updated with the latest news:)!

    Also...

    >P.S. To date SAM/BAM has only been used for nucleotides, but my example above was from a protein BLAST so it was apparently using SAM format working in amino acid residues rather than base pairs.

    I get this error when using it with BLASTP

    "Error BLAST query/options error: SAM format is only applicable to blastn results"

    ReplyDelete
    Replies
    1. Its probably deliberate - SAM output makes sense for blastn (nucleotides versus nucleotides), but is a bit of a stretch when you have protein sequences.

      Delete
  2. Thanks for this very informative post.
    I'm working with blastn and the SAM output, I've noticed that the version in 2.2.31 works perfectly for my needs, while the newer versions are broken!
    Specifically, in the newer versions, in the sam output the query is taken to be the reference and the subject the reads, which makes no sense.

    ReplyDelete
  3. Query as the reference works for my purposes, because I can then view the results in IGV. I can see that both could be useful though. Each should have its own outfmt number, query as reads and query as reference.

    The reference IDs are set to Query_# rather than the query sequence's FASTA ID, which is not not at all helpful!

    ReplyDelete
  4. I just submitted a bug report to NCBI regarding the "Query_#" issue, which still exists...

    ReplyDelete
  5. Thanks Tomer, post what you hear.

    ReplyDelete
  6. The -parse_deflines still works to fix the query_# issue

    ReplyDelete