bio-local-blast

Name: bio-local-blast
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-local-blast

Execute unlimited local BLAST searches without NCBI dependencies.

Enables custom database building and large-scale sequence analysis.
Depends on NCBI BLAST+ CLI tools and Python subprocess.
Selects algorithms based on query type and database format.
Outputs tabular results with identity and e-value metrics.

SKILL.md

.github/skills/bio-local-blastView on GitHub ↗

---
name: bio-local-blast
description: Run local BLAST searches using BLAST+ command-line tools. Use when running fast unlimited searches, building custom databases, performing large-scale analysis, or when NCBI servers are slow or unavailable.
tool_type: cli
primary_tool: BLAST+
---

## Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Local BLAST

**"Run a BLAST search against my custom database"** → Build a local BLAST database and search it with query sequences, returning tabular results with identity and e-value.
- CLI: `makeblastdb`, `blastn`/`blastp` (NCBI BLAST+)
- Python: `subprocess` wrapper for BLAST+

Run BLAST searches locally using NCBI BLAST+ command-line tools.

## Installation (NCBI BLAST+)

```bash
# macOS
brew install blast

# Ubuntu/Debian
sudo apt install ncbi-blast+

# conda
conda install -c bioconda blast

# Verify installation
blastn -version
```

## BLAST+ Programs

| Command | Query | Database | Description |
|---------|-------|----------|-------------|
| `blastn` | DNA | DNA | Nucleotide-nucleotide |
| `blastp` | Protein | Protein | Protein-protein |
| `blastx` | DNA | Protein | Translated query vs protein |
| `tblastn` | Protein | DNA | Protein vs translated DB |
| `tblastx` | DNA | DNA | Translated vs translated |
| `makeblastdb` | - | - | Create BLAST database |

## Creating BLAST Databases

### makeblastdb - Create Database (NCBI BLAST+)

```bash
# Create nucleotide database
makeblastdb -in sequences.fasta -dbtype nucl -out my_db

# Create protein database
makeblastdb -in proteins.fasta -dbtype prot -out my_proteins

# With title and parse sequence IDs
makeblastdb -in sequences.fasta -dbtype nucl -out my_db \
    -title "My Reference Database" -parse_seqids
```

**Key Options:**
| Option | Description | Values |
|--------|-------------|--------|
| `-in` | Input FASTA file | Path |
| `-dbtype` | Database type | `nucl`, `prot` |
| `-out` | Output database name | Path prefix |
| `-title` | Database title | String |
| `-parse_seqids` | Enable ID-based retrieval | Flag |
| `-taxid` | Assign taxonomy ID | Integer |
| `-taxid_map` | Taxonomy ID mapping file | Path |

### Database Files Created

```
my_db.nhr  # Header file (nucl) / .phr (prot)
my_db.nin  # Index file (nucl) / .pin (prot)
my_db.nsq  # Sequence file (nucl) / .psq (prot)
my_db.ndb  # Alias file (optional)
my_db.not  # ID index (if parse_seqids)
my_db.ntf  # Index (if parse_seqids)
my_db.nto  # Index (if parse_seqids)
```

## Running BLAST Searches

### Basic Usage (NCBI BLAST+)

```bash
# BLASTN
blastn -query query.fasta -db my_db -out results.txt

# BLASTP
blastp -query proteins.fasta -db my_proteins -out results.txt

# BLASTX (translate query, search protein DB)
blastx -query genes.fasta -db nr -out results.txt
```

### Common Options

| Option | Description | Example |
|--------|-------------|---------|
| `-query` | Query FASTA file | `-query seq.fa` |
| `-db` | Database name | `-db nt` |
| `-out` | Output file | `-out results.txt` |
| `-outfmt` | Output format | `-outfmt 6` |
| `-evalue` | E-value threshold | `-evalue 1e-5` |
| `-num_threads` | CPU threads | `-num_threads 8` |
| `-max_target_seqs` | Max hits | `-max_target_seqs 100` |
| `-max_hsps` | Max HSPs per hit | `-max_hsps 1` |
| `-word_size` | Word size | `-word_size 11` |
| `-dust` | Filter low complexity (nucl) | `-dust yes` |
| `-seg` | Filter low complexity (prot) | `-seg yes` |

### Output Formats (-outfmt)

| Value | Format |
|-------|--------|
| `0` | Pairwise (default) |
| `1` | Query-anchored with identities |
| `5` | BLAST XML |
| `6` | Tabular |
| `7` | Tabular with comments |
| `10` | CSV |

### Tabular Output Fields (-outfmt 6)

Default columns: `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`

Custom columns:
```bash
blastn -query query.fa -db my_db -outfmt "6 qseqid sseqid pident length evalue stitle"
```

**Available Fields:**
| Field | Description |
|-------|-------------|
| `qseqid` | Query ID |
| `sseqid` | Subject ID |
| `pident` | Percent identity |
| `length` | Alignment length |
| `mismatch` | Mismatches |
| `gapopen` | Gap openings |
| `qstart` | Query start |
| `qend` | Query end |
| `sstart` | Subject start |
| `send` | Subject end |
| `evalue` | E-value |
| `bitscore` | Bit score |
| `stitle` | Subject title |
| `qcovs` | Query coverage |
| `qcovhsp` | Query coverage per HSP |

## Code Patterns

### Create Database and Search

**Goal:** Build a custom BLAST database from reference sequences and search it with a query.

**Approach:** Index the reference with `makeblastdb`, then run the appropriate BLAST program with tabular output.

**Reference (NCBI BLAST+ 2.15+):**
```bash
#!/bin/bash
makeblastdb -in reference.fasta -dbtype nucl -out ref_db -parse_seqids

blastn -query query.fasta -db ref_db -out results.txt \
    -outfmt 6 -evalue 1e-10 -num_threads 4

head results.txt
```

### BLAST with Tabular Output (NCBI BLAST+)

```bash
#!/bin/bash
blastn -query query.fasta -db my_db \
    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
    -evalue 1e-5 \
    -max_target_seqs 10 \
    -num_threads 8 \
    -out results.tsv
```

### Filter and Sort Results (NCBI BLAST+)

```bash
# Get hits with >90% identity
awk -F'\t' '$3 >= 90' results.tsv

# Sort by E-value
sort -t$'\t' -k11 -g results.tsv

# Get best hit per query
sort -t$'\t' -k1,1 -k11,11g results.tsv | sort -t$'\t' -k1,1 -u
```

### Batch BLAST Multiple Files (NCBI BLAST+)

```bash
#!/bin/bash
for query_file in queries/*.fasta; do
    base=$(basename "$query_file" .fasta)
    echo "Processing $base..."

    blastn -query "$query_file" -db my_db \
        -outfmt 6 -evalue 1e-5 -num_threads 4 \
        -out "results/${base}_blast.tsv"
done
```

### Python Wrapper

**Goal:** Run a complete local BLAST workflow (build database, search, parse results) from Python.

**Approach:** Wrap `makeblastdb` and BLAST programs via `subprocess`, then parse the default tabular output into structured dictionaries.

**Reference (NCBI BLAST+ 2.15+):**
```python
import subprocess
import os

def make_blast_db(fasta_file, db_name, db_type='nucl'):
    cmd = ['makeblastdb', '-in', fasta_file, '-dbtype', db_type, '-out', db_name, '-parse_seqids']
    subprocess.run(cmd, check=True)

def run_blast(query, db, output, program='blastn', evalue=1e-5, threads=4, outfmt=6):
    cmd = [program, '-query', query, '-db', db, '-out', output,
           '-outfmt', str(outfmt), '-evalue', str(evalue), '-num_threads', str(threads)]
    subprocess.run(cmd, check=True)

def parse_blast_tabular(filename):
    columns = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
               'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
    hits = []
    with open(filename) as f:
        for line in f:
            values = line.strip().split('\t')
            hit = dict(zip(columns, values))
            hit['pident'] = float(hit['pident'])
            hit['evalue'] = float(hit['evalue'])
            hit['length'] = int(hit['length'])
            hits.append(hit)
    return hits

make_blast_db('reference.fasta', 'ref_db')
run_blast('query.fasta', 'ref_db', 'results.tsv')
hits = parse_blast_tabular('results.tsv')
for hit in hits[:5]:
    print(f"{hit['qseqid']} -> {hit['sseqid']}: {hit['pident']}% identity, E={hit['evalue']}")
```

### Reciprocal Best BLAST

**Goal:** Identify putative orthologs between two species using reciprocal best BLAST hits.

**Approach:** BLAST species A against B and B against A (keeping best hit each), then find pairs where each is the other's top hit.

**Reference (NCBI BLAST+ 2.15+):**
```bash
#!/bin/bash
blastp -query species_A.fasta -db species_B_db -outfmt 6 -evalue 1e-5 \
    -max_target_seqs 1 -out A_vs_B.tsv

blastp -query species_B.fasta -db species_A_db -outfmt 6 -evalue 1e-5 \
    -max_target_seqs 1 -out B_vs_A.tsv

awk 'NR==FNR {a[$1]=$2; next} $2 in a && a[$2]==$1' A_vs_B.tsv B_vs_A.tsv
```

### Extract Hit Sequences (NCBI BLAST+)

```bash
# Get subject sequence by ID (requires -parse_seqids)
blastdbcmd -db my_db -entry "sequence_id" -out hit.fasta

# Get multiple sequences
blastdbcmd -db my_db -entry_batch ids.txt -out hits.fasta

# Get all sequences from database
blastdbcmd -db my_db -entry all -out all_seqs.fasta
```

## Prebuilt Databases (NCBI BLAST+)

Download from NCBI:
```bash
# Download and extract (uses update_blastdb.pl)
update_blastdb.pl --decompress nt

# Or download manually from:
# https://ftp.ncbi.nlm.nih.gov/blast/db/
```

Common databases:
- `nt` - All nucleotide sequences
- `nr` - Non-redundant protein
- `refseq_rna` - RefSeq RNA
- `swissprot` - UniProt SwissProt

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `BLAST Database error` | Database not found | Check path, rebuild database |
| `No hits found` | No matches or wrong DB type | Verify database type matches query |
| `Sequence too short` | Query below word size | Lower word_size or use longer query |
| `Out of memory` | Large database | Reduce threads, use -num_threads 1 |

## Local vs Remote BLAST

| Aspect | Local | Remote |
|--------|-------|--------|
| Speed | Fast | Can be slow |
| Databases | Must download/create | All NCBI DBs available |
| Throughput | Unlimited | Rate limited |
| Setup | Requires installation | Just Biopython |
| Updates | Manual | Automatic |

## Decision Tree

```
Running BLAST locally?
├── Have reference sequences?
│   └── makeblastdb to create database
├── Download NCBI database?
│   └── update_blastdb.pl or manual download
├── Need tabular output?
│   └── -outfmt 6 (or 7 with headers)
├── Filter low-complexity?
│   └── -dust yes (nucl) or -seg yes (prot)
├── Multiple queries?
│   └── Put all in one FASTA, use -num_threads
├── Need XML output?
│   └── -outfmt 5
└── Extract hit sequences?
    └── blastdbcmd -entry
```

## Related Skills

- blast-searches - Remote BLAST via NCBI (no installation needed)
- sequence-io - Read/write FASTA files for queries
- batch-downloads - Download sequences to build local databases

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.