bio-local-blast
$
npx mdskill add GPTomics/bioSkills/bio-local-blastExecute unlimited local BLAST searches without NCBI dependencies.
- Enables custom database building and large-scale sequence analysis.
- Depends on NCBI BLAST+ CLI tools and Python subprocess.
- Selects algorithms based on query type and database format.
- Outputs tabular results with identity and e-value metrics.
SKILL.md
.github/skills/bio-local-blastView on GitHub ↗
---
name: bio-local-blast
description: Run local BLAST searches using BLAST+ command-line tools. Use when running fast unlimited searches, building custom databases, performing large-scale analysis, or when NCBI servers are slow or unavailable.
tool_type: cli
primary_tool: BLAST+
---
## Version Compatibility
Reference examples tested with: NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Local BLAST
**"Run a BLAST search against my custom database"** → Build a local BLAST database and search it with query sequences, returning tabular results with identity and e-value.
- CLI: `makeblastdb`, `blastn`/`blastp` (NCBI BLAST+)
- Python: `subprocess` wrapper for BLAST+
Run BLAST searches locally using NCBI BLAST+ command-line tools.
## Installation (NCBI BLAST+)
```bash
# macOS
brew install blast
# Ubuntu/Debian
sudo apt install ncbi-blast+
# conda
conda install -c bioconda blast
# Verify installation
blastn -version
```
## BLAST+ Programs
| Command | Query | Database | Description |
|---------|-------|----------|-------------|
| `blastn` | DNA | DNA | Nucleotide-nucleotide |
| `blastp` | Protein | Protein | Protein-protein |
| `blastx` | DNA | Protein | Translated query vs protein |
| `tblastn` | Protein | DNA | Protein vs translated DB |
| `tblastx` | DNA | DNA | Translated vs translated |
| `makeblastdb` | - | - | Create BLAST database |
## Creating BLAST Databases
### makeblastdb - Create Database (NCBI BLAST+)
```bash
# Create nucleotide database
makeblastdb -in sequences.fasta -dbtype nucl -out my_db
# Create protein database
makeblastdb -in proteins.fasta -dbtype prot -out my_proteins
# With title and parse sequence IDs
makeblastdb -in sequences.fasta -dbtype nucl -out my_db \
-title "My Reference Database" -parse_seqids
```
**Key Options:**
| Option | Description | Values |
|--------|-------------|--------|
| `-in` | Input FASTA file | Path |
| `-dbtype` | Database type | `nucl`, `prot` |
| `-out` | Output database name | Path prefix |
| `-title` | Database title | String |
| `-parse_seqids` | Enable ID-based retrieval | Flag |
| `-taxid` | Assign taxonomy ID | Integer |
| `-taxid_map` | Taxonomy ID mapping file | Path |
### Database Files Created
```
my_db.nhr # Header file (nucl) / .phr (prot)
my_db.nin # Index file (nucl) / .pin (prot)
my_db.nsq # Sequence file (nucl) / .psq (prot)
my_db.ndb # Alias file (optional)
my_db.not # ID index (if parse_seqids)
my_db.ntf # Index (if parse_seqids)
my_db.nto # Index (if parse_seqids)
```
## Running BLAST Searches
### Basic Usage (NCBI BLAST+)
```bash
# BLASTN
blastn -query query.fasta -db my_db -out results.txt
# BLASTP
blastp -query proteins.fasta -db my_proteins -out results.txt
# BLASTX (translate query, search protein DB)
blastx -query genes.fasta -db nr -out results.txt
```
### Common Options
| Option | Description | Example |
|--------|-------------|---------|
| `-query` | Query FASTA file | `-query seq.fa` |
| `-db` | Database name | `-db nt` |
| `-out` | Output file | `-out results.txt` |
| `-outfmt` | Output format | `-outfmt 6` |
| `-evalue` | E-value threshold | `-evalue 1e-5` |
| `-num_threads` | CPU threads | `-num_threads 8` |
| `-max_target_seqs` | Max hits | `-max_target_seqs 100` |
| `-max_hsps` | Max HSPs per hit | `-max_hsps 1` |
| `-word_size` | Word size | `-word_size 11` |
| `-dust` | Filter low complexity (nucl) | `-dust yes` |
| `-seg` | Filter low complexity (prot) | `-seg yes` |
### Output Formats (-outfmt)
| Value | Format |
|-------|--------|
| `0` | Pairwise (default) |
| `1` | Query-anchored with identities |
| `5` | BLAST XML |
| `6` | Tabular |
| `7` | Tabular with comments |
| `10` | CSV |
### Tabular Output Fields (-outfmt 6)
Default columns: `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`
Custom columns:
```bash
blastn -query query.fa -db my_db -outfmt "6 qseqid sseqid pident length evalue stitle"
```
**Available Fields:**
| Field | Description |
|-------|-------------|
| `qseqid` | Query ID |
| `sseqid` | Subject ID |
| `pident` | Percent identity |
| `length` | Alignment length |
| `mismatch` | Mismatches |
| `gapopen` | Gap openings |
| `qstart` | Query start |
| `qend` | Query end |
| `sstart` | Subject start |
| `send` | Subject end |
| `evalue` | E-value |
| `bitscore` | Bit score |
| `stitle` | Subject title |
| `qcovs` | Query coverage |
| `qcovhsp` | Query coverage per HSP |
## Code Patterns
### Create Database and Search
**Goal:** Build a custom BLAST database from reference sequences and search it with a query.
**Approach:** Index the reference with `makeblastdb`, then run the appropriate BLAST program with tabular output.
**Reference (NCBI BLAST+ 2.15+):**
```bash
#!/bin/bash
makeblastdb -in reference.fasta -dbtype nucl -out ref_db -parse_seqids
blastn -query query.fasta -db ref_db -out results.txt \
-outfmt 6 -evalue 1e-10 -num_threads 4
head results.txt
```
### BLAST with Tabular Output (NCBI BLAST+)
```bash
#!/bin/bash
blastn -query query.fasta -db my_db \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
-evalue 1e-5 \
-max_target_seqs 10 \
-num_threads 8 \
-out results.tsv
```
### Filter and Sort Results (NCBI BLAST+)
```bash
# Get hits with >90% identity
awk -F'\t' '$3 >= 90' results.tsv
# Sort by E-value
sort -t$'\t' -k11 -g results.tsv
# Get best hit per query
sort -t$'\t' -k1,1 -k11,11g results.tsv | sort -t$'\t' -k1,1 -u
```
### Batch BLAST Multiple Files (NCBI BLAST+)
```bash
#!/bin/bash
for query_file in queries/*.fasta; do
base=$(basename "$query_file" .fasta)
echo "Processing $base..."
blastn -query "$query_file" -db my_db \
-outfmt 6 -evalue 1e-5 -num_threads 4 \
-out "results/${base}_blast.tsv"
done
```
### Python Wrapper
**Goal:** Run a complete local BLAST workflow (build database, search, parse results) from Python.
**Approach:** Wrap `makeblastdb` and BLAST programs via `subprocess`, then parse the default tabular output into structured dictionaries.
**Reference (NCBI BLAST+ 2.15+):**
```python
import subprocess
import os
def make_blast_db(fasta_file, db_name, db_type='nucl'):
cmd = ['makeblastdb', '-in', fasta_file, '-dbtype', db_type, '-out', db_name, '-parse_seqids']
subprocess.run(cmd, check=True)
def run_blast(query, db, output, program='blastn', evalue=1e-5, threads=4, outfmt=6):
cmd = [program, '-query', query, '-db', db, '-out', output,
'-outfmt', str(outfmt), '-evalue', str(evalue), '-num_threads', str(threads)]
subprocess.run(cmd, check=True)
def parse_blast_tabular(filename):
columns = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
hits = []
with open(filename) as f:
for line in f:
values = line.strip().split('\t')
hit = dict(zip(columns, values))
hit['pident'] = float(hit['pident'])
hit['evalue'] = float(hit['evalue'])
hit['length'] = int(hit['length'])
hits.append(hit)
return hits
make_blast_db('reference.fasta', 'ref_db')
run_blast('query.fasta', 'ref_db', 'results.tsv')
hits = parse_blast_tabular('results.tsv')
for hit in hits[:5]:
print(f"{hit['qseqid']} -> {hit['sseqid']}: {hit['pident']}% identity, E={hit['evalue']}")
```
### Reciprocal Best BLAST
**Goal:** Identify putative orthologs between two species using reciprocal best BLAST hits.
**Approach:** BLAST species A against B and B against A (keeping best hit each), then find pairs where each is the other's top hit.
**Reference (NCBI BLAST+ 2.15+):**
```bash
#!/bin/bash
blastp -query species_A.fasta -db species_B_db -outfmt 6 -evalue 1e-5 \
-max_target_seqs 1 -out A_vs_B.tsv
blastp -query species_B.fasta -db species_A_db -outfmt 6 -evalue 1e-5 \
-max_target_seqs 1 -out B_vs_A.tsv
awk 'NR==FNR {a[$1]=$2; next} $2 in a && a[$2]==$1' A_vs_B.tsv B_vs_A.tsv
```
### Extract Hit Sequences (NCBI BLAST+)
```bash
# Get subject sequence by ID (requires -parse_seqids)
blastdbcmd -db my_db -entry "sequence_id" -out hit.fasta
# Get multiple sequences
blastdbcmd -db my_db -entry_batch ids.txt -out hits.fasta
# Get all sequences from database
blastdbcmd -db my_db -entry all -out all_seqs.fasta
```
## Prebuilt Databases (NCBI BLAST+)
Download from NCBI:
```bash
# Download and extract (uses update_blastdb.pl)
update_blastdb.pl --decompress nt
# Or download manually from:
# https://ftp.ncbi.nlm.nih.gov/blast/db/
```
Common databases:
- `nt` - All nucleotide sequences
- `nr` - Non-redundant protein
- `refseq_rna` - RefSeq RNA
- `swissprot` - UniProt SwissProt
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `BLAST Database error` | Database not found | Check path, rebuild database |
| `No hits found` | No matches or wrong DB type | Verify database type matches query |
| `Sequence too short` | Query below word size | Lower word_size or use longer query |
| `Out of memory` | Large database | Reduce threads, use -num_threads 1 |
## Local vs Remote BLAST
| Aspect | Local | Remote |
|--------|-------|--------|
| Speed | Fast | Can be slow |
| Databases | Must download/create | All NCBI DBs available |
| Throughput | Unlimited | Rate limited |
| Setup | Requires installation | Just Biopython |
| Updates | Manual | Automatic |
## Decision Tree
```
Running BLAST locally?
├── Have reference sequences?
│ └── makeblastdb to create database
├── Download NCBI database?
│ └── update_blastdb.pl or manual download
├── Need tabular output?
│ └── -outfmt 6 (or 7 with headers)
├── Filter low-complexity?
│ └── -dust yes (nucl) or -seg yes (prot)
├── Multiple queries?
│ └── Put all in one FASTA, use -num_threads
├── Need XML output?
│ └── -outfmt 5
└── Extract hit sequences?
└── blastdbcmd -entry
```
## Related Skills
- blast-searches - Remote BLAST via NCBI (no installation needed)
- sequence-io - Read/write FASTA files for queries
- batch-downloads - Download sequences to build local databases