Introduction to BLAST

The Basic Local Alignment Search Tool (BLAST) is a heuristic system, meaning it uses shortcuts that speed up the search at the cost of not guaranteeing the optimal alignment. BLAST performs "local" alignments. Most proteins are modular in nature, with functional domains often repeated within the same protein as well as across different proteins from different species. The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity. The local alignment approach also means that an mRNA can be aligned with a piece of genomic DNA, as is frequently required in genome assembly and analysis. Rather than trying to align two sequences over their entire lengths (known as a global alignment), BLAST identifies shorter regions of similarity, such as shared domains and motifs.
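
The difference between local and global alignment can be illustrated with a small dynamic-programming sketch. This is not BLAST's heuristic; it is an exact Smith-Waterman (local) and Needleman-Wunsch (global) scorer, and the match, mismatch, and gap values as well as the toy sequences are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local (Smith-Waterman) score: cells are floored at zero,
    so an alignment may start and end anywhere in either sequence."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) score: both sequences must be aligned
    end to end, so unrelated flanking regions are penalized."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


# Two toy sequences that share only the "domain" HEAWK.
seq_a = "GGGG" + "HEAWK" + "PPPP"
seq_b = "SS" + "HEAWK" + "TT"

print("local :", local_alignment_score(seq_a, seq_b))
print("global:", global_alignment_score(seq_a, seq_b))
```

The local score reflects only the shared region, while the global score is dragged down by the unrelated flanks, which is why a local method suits modular proteins.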

 

When a query is submitted through one of the BLAST Web pages, the sequence is passed to the algorithm on the BLAST server along with any other input, such as the database to be searched, the word size, and the expect value threshold. BLAST first builds a look-up table of all the "words" in the query sequence (short subsequences; the default for proteins is three letters) together with their "neighboring words", i.e., similar words. The sequence database is then scanned for these "hot spots." When a match is found, it is used to seed gap-free and then gapped extensions of the "word."
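
The seed-and-extend idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the BLAST implementation: it indexes exact words only (no neighboring words, which would require a substitution matrix and a score threshold), uses a hypothetical +1/-1 scoring scheme, performs only gap-free extension, and the function names and the `x_drop` cutoff value are assumptions.

```python
def build_word_table(query, word_size=3):
    """Map every word of length word_size in the query to its start positions."""
    table = {}
    for i in range(len(query) - word_size + 1):
        table.setdefault(query[i:i + word_size], []).append(i)
    return table


def find_seeds(table, subject, word_size=3):
    """Scan a database sequence and return (query_pos, subject_pos) word hits."""
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk in one direction; return (best score gain, residues kept)."""
    score = best = 0
    kept = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, kept = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        sj += step
    return best, kept


def extend_ungapped(query, subject, i, j, word_size=3,
                    match=1, mismatch=-1, x_drop=2):
    """Extend a seed in both directions without gaps.
    Returns (score, query_start, query_end) with query_end exclusive."""
    right_gain, right_len = _extend(query, subject, i + word_size,
                                    j + word_size, 1, match, mismatch, x_drop)
    left_gain, left_len = _extend(query, subject, i - 1, j - 1, -1,
                                  match, mismatch, x_drop)
    score = word_size * match + left_gain + right_gain
    return score, i - left_len, i + word_size + right_len


query = "MKVLAAGIW"
subject = "TTKVLAAGPP"
table = build_word_table(query)
seeds = find_seeds(table, subject)
score, start, end = extend_ungapped(query, subject, *seeds[0])
print(seeds, score, query[start:end])
```

Indexing the query once and scanning each database sequence for table hits is what lets the search skip most of the database; only positions with a word hit pay the cost of extension.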

 

BLAST does not search GenBank flatfiles, or any subset of them, directly. Instead, sequences are built into BLAST databases. Each entry is split into two files, one containing only the header information and the other containing only the sequence information. These files are the data the algorithm actually uses. When BLAST is run in "stand-alone" mode, the database can consist of local, private data, downloaded NCBI BLAST databases, or a mix of the two.
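
The header/sequence split can be pictured with a small sketch. Real BLAST databases are binary files produced by NCBI's own formatting tools; the function below is a hypothetical in-memory analogue that only shows the idea of keeping headers and sequences in two parallel stores addressed by the same entry number.

```python
def split_fasta(fasta_text):
    """Return (headers, sequences): two parallel lists in which
    index n of each list describes the same database entry."""
    headers, sequences = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])   # description only
            sequences.append("")       # residues are accumulated below
        elif sequences:
            sequences[-1] += line
    return headers, sequences


toy_fasta = """>sp1 toy protein one
MKVL
AAGI
>sp2 toy protein two
TTKV
"""

headers, sequences = split_fasta(toy_fasta)
print(headers)
print(sequences)
```

With this layout a search only needs to read the sequence store; the header store is consulted afterwards, when hits are reported.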

 

After looking up all possible "words" from the query sequence and extending them maximally, the algorithm assembles the best alignment for each query–sequence pair and writes this information to a SeqAlign data structure. The SeqAlign structure does not itself contain the sequence data; rather, it refers to the sequences in the BLAST database.

 

The BLAST Formatter, which runs on the BLAST server, can use the information in the SeqAlign to retrieve the similar sequences that were found and display them in a variety of ways. As a result, once a query has completed, the results can be reformatted without re-running the search. The QBLAST system makes this possible.
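
Why a stored alignment can be reformatted without a new search becomes clearer with a toy model. The sketch below is not the actual SeqAlign specification; `AlignmentRecord`, its fields, and both formatter functions are hypothetical, and only gap-free alignments are handled. The point is that the record holds identifiers and coordinates rather than residues, so any number of views can be produced from it later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentRecord:
    """Stores references to sequences, not the sequences themselves."""
    query_id: str
    subject_id: str
    query_start: int     # 0-based start in the query
    subject_start: int   # 0-based start in the subject
    length: int          # aligned length (ungapped in this sketch)
    score: int


def format_pairwise(rec, sequences):
    """Resolve the references and render a three-line pairwise view."""
    q = sequences[rec.query_id][rec.query_start:rec.query_start + rec.length]
    s = sequences[rec.subject_id][rec.subject_start:rec.subject_start + rec.length]
    mid = "".join(a if a == b else " " for a, b in zip(q, s))
    return "\n".join([f"Query  {q}", f"       {mid}", f"Sbjct  {s}"])


def format_tabular(rec):
    """Render the same record as one tab-separated line."""
    fields = [rec.query_id, rec.subject_id, rec.query_start,
              rec.subject_start, rec.length, rec.score]
    return "\t".join(str(f) for f in fields)


# A stand-in for the sequence database, keyed by identifier.
sequences = {"q1": "MKVLAAGIW", "s1": "TTKVLAAGPP"}
record = AlignmentRecord("q1", "s1", 1, 2, 6, 6)

print(format_pairwise(record, sequences))
print(format_tabular(record))
```

Both views come from the same stored record, so switching the display format requires only a lookup in the sequence store, never a repeat of the search.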

 

BLAST Scores and Statistics

Once BLAST has found a sequence in the database that is similar to the query, it is useful to know whether the alignment is "good" and reflects a possible biological relationship, or whether the similarity is attributable to chance alone. BLAST uses statistical theory to generate a bit score and an expect value (E-value) for each alignment pair (query to hit).

 

The bit score indicates how good the alignment is; the higher the score, the better the alignment. The score is computed from a formula that takes into account the alignment of similar or identical residues as well as any gaps introduced to align the sequences. A key element of this calculation is the "substitution matrix," which assigns a score for aligning any possible pair of residues. Most BLAST programs use the BLOSUM62 matrix by default; the exceptions are blastn and MegaBLAST, which perform nucleotide–nucleotide comparisons and therefore do not use protein-specific matrices. Bit scores are normalized, which means that bit scores from different alignments can be compared even if different scoring matrices were used.
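
The normalization can be made concrete with the standard Karlin-Altschul relation, in which a raw score S is converted to a bit score S' = (lambda * S - ln K) / ln 2. The sketch below assumes that formula; the lambda and K values in the example are illustrative placeholders, since the real values depend on the scoring matrix and gap penalties and are reported in the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score.

    lam and k are the statistical parameters of the scoring system;
    dividing by ln 2 expresses the result in bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values only (assumed, not taken from a real run).
example_lambda = 0.267
example_k = 0.041

print(round(bit_score(100, example_lambda, example_k), 2))
```

Because lambda and K absorb the scale of the scoring system, two alignments scored with different matrices end up on the same bit scale.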

 

The E-value indicates the statistical significance of a given pairwise alignment and takes into account both the size of the database and the scoring system used. The lower the E-value, the more significant the hit. An E-value of 0.05 means that about 0.05 alignments with an equal or better score are expected by chance alone in a search of this database, which corresponds to roughly a 5 in 100 (1 in 20) chance of seeing such a match at random. Even if a statistician would consider this significant, it may not be a biologically meaningful result, and the alignments themselves must be examined to determine "biological" significance.
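
The dependence of the E-value on database size can be sketched with the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. This is a simplified illustration: BLAST itself uses effective lengths that correct for edge effects, and the function names and input numbers here are assumptions.

```python
import math


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`
    in a search space of query_len * db_len."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E).
    For small E this is nearly equal to E itself."""
    return -math.expm1(-e_value)


# Same bit score, two database sizes: the larger search space
# makes the same alignment less significant.
small = expect_value(40.0, query_len=300, db_len=1_000_000)
large = expect_value(40.0, query_len=300, db_len=100_000_000)
print(small, large)
print(p_value(0.05))
```

This is why an identical alignment can look convincing against a small custom database and marginal against a large public one, and why E-values from searches of different databases should not be compared directly.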

 


 
