bio-sequence-slicing
$
npx mdskill add GPTomics/bioSkills/bio-sequence-slicingSlice, extract, and concatenate biological sequences using Biopython.
- Handles subsequence extraction, exon joining, and region manipulation.
- Depends on Bio.Seq objects and Python slicing syntax.
- Adapts code patterns to match installed Biopython and samtools versions.
- Returns sliced or joined sequence strings via Python objects.
SKILL.md
.github/skills/bio-sequence-slicingView on GitHub ↗
---
name: bio-sequence-slicing
description: Slice, extract, and concatenate biological sequences using Biopython. Use when extracting subsequences, joining sequences, or manipulating sequence regions by position.
tool_type: python
primary_tool: Bio.Seq
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Sequence Slicing
Extract, slice, and concatenate sequences using Biopython's Seq objects.
**"Extract a subsequence"** → Use Python slicing on Seq objects with 0-based half-open coordinates.
- Python: `seq[start:end]` (BioPython Seq)
**"Join exons into mRNA"** → Extract multiple regions and concatenate them.
- Python: `sum((seq[s:e] for s, e in coords), Seq(''))` or `+` operator
## Required Import
```python
from Bio.Seq import Seq
```
## Core Operations
### Indexing (Single Position)
```python
seq = Seq('ATGCGATCG')
seq[0] # 'A' - first base (0-indexed)
seq[-1] # 'G' - last base
seq[3] # 'C' - fourth base
```
### Slicing (Extract Region)
```python
seq = Seq('ATGCGATCGATCG')
seq[0:3] # Seq('ATG') - first 3 bases
seq[3:6] # Seq('CGA') - positions 3-5
seq[:5] # Seq('ATGCG') - first 5
seq[-5:] # Seq('GATCG') - last 5
seq[::2] # Seq('AGGTGTG') - every 2nd base
seq[::-1] # Seq('GCTAGCTAGCGTA') - reversed
```
**Note:** Slicing returns a Seq object, not a string.
### Concatenation
```python
seq1 = Seq('ATGC')
seq2 = Seq('GGGG')
combined = seq1 + seq2 # Seq('ATGCGGGG')
```
Can also concatenate with strings:
```python
seq = Seq('ATGC')
extended = seq + 'NNNN' # Seq('ATGCNNNN')
```
## Code Patterns
### Extract CDS by Coordinates
```python
genome = Seq('NNNNATGCGATCGATCGTAANNN')
cds_start, cds_end = 4, 21
cds = genome[cds_start:cds_end]
```
### Extract with 1-Based Coordinates
Biology often uses 1-based coordinates. Convert to 0-based:
```python
def extract_1based(seq, start, end):
'''Extract using 1-based inclusive coordinates'''
return seq[start - 1:end]
genome = Seq('ATGCGATCGATCG')
region = extract_1based(genome, 1, 3) # Seq('ATG')
```
### Split Sequence into Codons
```python
def split_codons(seq):
return [seq[i:i+3] for i in range(0, len(seq) - len(seq) % 3, 3)]
seq = Seq('ATGCGATCGATCG')
codons = split_codons(seq) # [Seq('ATG'), Seq('CGA'), ...]
```
### Split into Fixed-Length Chunks
```python
def chunk_sequence(seq, size):
return [seq[i:i+size] for i in range(0, len(seq), size)]
seq = Seq('ATGCGATCGATCGATCGATCG')
chunks = chunk_sequence(seq, 10)
```
### Join Sequences with Linker
```python
seqs = [Seq('ATGC'), Seq('GGGG'), Seq('TTTT')]
linker = Seq('NNN')
joined = linker.join(seqs) # Seq('ATGCNNNGGGGNNTTTT')
```
Or manually:
```python
linker = 'NNN'
joined = Seq(linker.join(str(s) for s in seqs))
```
### Extract Multiple Regions
**Goal:** Splice non-contiguous regions (e.g., exons) into a single continuous sequence.
**Approach:** Extract each region by coordinates and concatenate with `+` or `sum()`.
```python
def extract_regions(seq, regions):
'''Extract and concatenate multiple regions'''
return sum((seq[start:end] for start, end in regions), Seq(''))
exon_coords = [(0, 50), (100, 150), (200, 250)]
mrna = extract_regions(genomic_seq, exon_coords)
```
### Extract Flanking Regions
```python
def get_flanking(seq, position, flank_size):
'''Get sequence around a position'''
start = max(0, position - flank_size)
end = min(len(seq), position + flank_size + 1)
return seq[start:end]
seq = Seq('ATGCGATCGATCGATCGATCG')
flanking = get_flanking(seq, 10, 5) # 5 bp on each side of position 10
```
### Tile Sequence into Overlapping Windows
```python
def sliding_windows(seq, window_size, step=1):
for i in range(0, len(seq) - window_size + 1, step):
yield seq[i:i + window_size]
seq = Seq('ATGCGATCGATCG')
for window in sliding_windows(seq, 5, 2):
print(window)
```
### Extract Feature from SeqRecord
```python
from Bio import SeqIO
for record in SeqIO.parse('sequence.gb', 'genbank'):
for feature in record.features:
if feature.type == 'CDS':
cds_seq = feature.extract(record.seq)
print(f'{feature.qualifiers.get("gene", ["?"])[0]}: {cds_seq[:30]}...')
```
### Create New SeqRecord from Slice
```python
from Bio.SeqRecord import SeqRecord
original = SeqRecord(Seq('ATGCGATCGATCGATCG'), id='full', description='Full sequence')
subset = SeqRecord(original.seq[5:15], id='subset', description=f'Positions 5-15 of {original.id}')
```
## Coordinate Systems
| System | Position 1 | Example |
|--------|------------|---------|
| 0-based (Python) | Index 0 | `seq[0:3]` gets positions 0, 1, 2 |
| 1-based (Biology) | Index 1 | Position 1-3 = `seq[0:3]` |
| 0-based half-open | Start inclusive, end exclusive | Standard Python slicing |
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `IndexError` | Index out of range | Check sequence length first |
| Unexpected length | Off-by-one error | Remember end index is exclusive |
| Empty result | Start >= end | Check coordinate order |
| Wrong positions | 1-based vs 0-based confusion | Convert coordinates explicitly |
## Decision Tree
```
Need to extract or combine sequences?
├── Single position?
│ └── Use indexing: seq[i]
├── Contiguous region?
│ └── Use slicing: seq[start:end]
├── Multiple non-contiguous regions?
│ └── Extract each, concatenate with +
├── Join sequences?
│ ├── No linker: seq1 + seq2
│ └── With linker: linker.join(seqs)
├── Split into parts?
│ └── List comprehension with slicing
└── From GenBank features?
└── Use feature.extract(record.seq)
```
## Related Skills
- seq-objects - Create Seq and SeqRecord objects
- sequence-io/read-sequences - Parse GenBank files with features to extract
- transcription-translation - Translate extracted CDS regions
- alignment-files - Extract sequences from BAM using samtools fasta/fastq