bio-expression-matrix-metadata-joins

Name: bio-expression-matrix-metadata-joins
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-expression-matrix-metadata-joins

Merge sample metadata into count matrices for analysis

Aligns gene counts and attributes to enable differential expression studies.
Depends on pandas library with version compatibility checks required.
Intersects sample identifiers between data sources before merging tables.
Returns unified DataFrame ready for downstream statistical computations.

SKILL.md

.github/skills/bio-expression-matrix-metadata-joinsView on GitHub ↗

---
name: bio-expression-matrix-metadata-joins
description: Merge sample metadata with count matrices and add gene annotations. Use when preparing data for differential expression analysis or visualization.
tool_type: mixed
primary_tool: pandas
---

## Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Metadata Joins

## Load Sample Metadata

**Goal:** Read sample metadata into a DataFrame aligned with the count matrix columns.

**Approach:** Load metadata CSV with sample IDs as the index, matching count matrix column names.

```python
import pandas as pd

# Load metadata
metadata = pd.read_csv('sample_info.csv', index_col=0)

# Metadata should have samples as rows, attributes as columns
# Index should match count matrix column names
```

## Basic Join

**Goal:** Align count matrix columns with metadata rows so samples are in matching order.

**Approach:** Find common samples between both data sources, subset and reorder to ensure alignment.

**"Match my sample metadata to my count matrix"** → Intersect sample identifiers between the count matrix and metadata, then reorder both to match.

```python
import pandas as pd

# Count matrix: genes x samples
counts = pd.read_csv('counts.tsv', sep='\t', index_col=0)

# Metadata: samples x attributes
metadata = pd.read_csv('metadata.csv', index_col=0)

# Ensure sample order matches
common_samples = counts.columns.intersection(metadata.index)
counts = counts[common_samples]
metadata = metadata.loc[common_samples]

# Verify alignment
assert all(counts.columns == metadata.index)
```

## Handle Sample Name Mismatches

**Goal:** Identify and resolve discrepancies between count matrix column names and metadata row names.

**Approach:** Report samples present in only one data source and subset to the intersection.

```python
def harmonize_sample_names(counts, metadata):
    '''Match sample names between counts and metadata.'''
    count_samples = set(counts.columns)
    meta_samples = set(metadata.index)

    common = count_samples & meta_samples
    only_counts = count_samples - meta_samples
    only_meta = meta_samples - count_samples

    if only_counts:
        print(f'Samples in counts but not metadata: {only_counts}')
    if only_meta:
        print(f'Samples in metadata but not counts: {only_meta}')

    counts = counts[sorted(common)]
    metadata = metadata.loc[sorted(common)]
    return counts, metadata

counts, metadata = harmonize_sample_names(counts, metadata)
```

## Flexible Sample Name Matching

**Goal:** Automatically reconcile sample names that differ by common formatting variations.

**Approach:** Try a series of string transformations (underscore/dash swap, suffix removal, case change) until names match.

```python
def fuzzy_match_samples(counts, metadata):
    '''Try to match sample names with common transformations.'''
    count_cols = counts.columns.tolist()
    meta_idx = metadata.index.tolist()

    # Try exact match first
    if set(count_cols) == set(meta_idx):
        return counts, metadata

    # Common transformations
    transformations = [
        lambda x: x.replace('_', '-'),
        lambda x: x.replace('-', '_'),
        lambda x: x.split('_')[0],
        lambda x: x.replace('.bam', ''),
        lambda x: x.upper(),
        lambda x: x.lower(),
    ]

    for transform in transformations:
        transformed = {transform(c): c for c in count_cols}
        matches = {m: transformed[transform(m)] for m in meta_idx if transform(m) in transformed}
        if len(matches) == len(meta_idx):
            print(f'Matched using transformation')
            counts = counts[[matches[m] for m in meta_idx]]
            return counts, metadata

    raise ValueError('Could not match sample names')
```

## Add Gene Annotations

**Goal:** Enrich the count matrix with gene-level annotations (symbol, name, biotype).

**Approach:** Query mygene for annotation fields and merge with the count matrix on gene IDs.

```python
import mygene

def add_gene_annotations(counts, fields=['symbol', 'name', 'type_of_gene']):
    '''Add gene annotation columns to count matrix.'''
    mg = mygene.MyGeneInfo()

    clean_ids = [g.split('.')[0] for g in counts.index]
    results = mg.querymany(clean_ids, scopes='ensembl.gene',
        fields=fields, species='human', as_dataframe=True)

    # Merge annotations
    results = results.reset_index().rename(columns={'query': 'gene_id'})
    counts_reset = counts.reset_index().rename(columns={counts.index.name: 'gene_id'})
    counts_reset['clean_id'] = counts_reset['gene_id'].str.split('.').str[0]

    annotated = counts_reset.merge(
        results[['gene_id'] + fields].drop_duplicates(),
        left_on='clean_id', right_on='gene_id', how='left', suffixes=('', '_anno'))

    annotated = annotated.drop(['clean_id', 'gene_id_anno'], axis=1, errors='ignore')
    annotated = annotated.set_index('gene_id')

    return annotated
```

## R: Create DESeq2 Data

```r
library(DESeq2)

# Load data
counts <- read.delim('counts.tsv', row.names=1)
metadata <- read.csv('metadata.csv', row.names=1)

# Ensure matching samples
common <- intersect(colnames(counts), rownames(metadata))
counts <- counts[, common]
metadata <- metadata[common, , drop=FALSE]

# Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(
    countData=as.matrix(counts),
    colData=metadata,
    design=~condition  # Adjust to your design
)
```

## R: Create edgeR DGEList

```r
library(edgeR)

# Load data
counts <- read.delim('counts.tsv', row.names=1)
metadata <- read.csv('metadata.csv', row.names=1)

# Match samples
common <- intersect(colnames(counts), rownames(metadata))
counts <- counts[, common]
metadata <- metadata[common, , drop=FALSE]

# Create DGEList
y <- DGEList(counts=as.matrix(counts), group=metadata$condition)
y$samples <- cbind(y$samples, metadata)
```

## Create AnnData with Metadata

```python
import anndata as ad
import pandas as pd

def create_annotated_anndata(counts, sample_metadata, gene_metadata=None):
    '''Create AnnData object with full metadata.'''
    # AnnData expects samples as rows
    adata = ad.AnnData(X=counts.T)

    # Add sample metadata (obs)
    adata.obs = sample_metadata.loc[counts.columns].copy()

    # Add gene metadata (var)
    if gene_metadata is not None:
        adata.var = gene_metadata.loc[counts.index].copy()
    else:
        adata.var_names = counts.index

    return adata

# Usage
adata = create_annotated_anndata(counts, metadata)
adata.write_h5ad('annotated_counts.h5ad')
```

## Validate Metadata

```python
def validate_metadata(counts, metadata, required_columns=['condition']):
    '''Check metadata validity.'''
    issues = []

    count_samples = set(counts.columns)
    meta_samples = set(metadata.index)

    if count_samples != meta_samples:
        missing = count_samples - meta_samples
        extra = meta_samples - count_samples
        if missing:
            issues.append(f'Samples missing metadata: {missing}')
        if extra:
            issues.append(f'Extra metadata samples: {extra}')

    for col in required_columns:
        if col not in metadata.columns:
            issues.append(f'Missing required column: {col}')
        elif metadata[col].isna().any():
            n_na = metadata[col].isna().sum()
            issues.append(f'Column {col} has {n_na} missing values')

    if issues:
        for issue in issues:
            print(f'WARNING: {issue}')
        return False

    print('Metadata validation passed')
    return True
```

## Sample Swap Detection

**Goal:** Identify mislabeled samples before they contaminate downstream analysis.

**Approach:** Use sex-linked gene expression (XIST for females, Y-chromosome genes for males) and within-pair clustering to detect swaps.

### Sex Check

```python
import pandas as pd
import numpy as np

def sex_check(counts, metadata, sex_column='sex'):
    '''Verify reported sex matches gene expression.

    Males: high DDX3Y/RPS4Y1/UTY, low XIST
    Females: high XIST, low/zero Y-linked genes
    '''
    female_markers = ['XIST']
    male_markers = ['DDX3Y', 'RPS4Y1', 'UTY', 'KDM5D', 'EIF1AY']

    available_female = [g for g in female_markers if g in counts.index]
    available_male = [g for g in male_markers if g in counts.index]

    if not available_female or not available_male:
        print('Sex marker genes not found -- are gene symbols in the index?')
        return None

    female_expr = counts.loc[available_female].sum()
    male_expr = counts.loc[available_male].sum()

    predicted_sex = pd.Series('unknown', index=counts.columns)
    predicted_sex[male_expr > female_expr] = 'M'
    predicted_sex[female_expr > male_expr] = 'F'

    if sex_column in metadata.columns:
        mismatches = predicted_sex != metadata[sex_column]
        if mismatches.any():
            print(f'SEX MISMATCHES DETECTED ({mismatches.sum()} samples):')
            for sample in metadata.index[mismatches]:
                print(f'  {sample}: reported={metadata.loc[sample, sex_column]}, predicted={predicted_sex[sample]}')
        else:
            print('Sex check passed -- all samples match')
    return predicted_sex
```

```r
# R sex check
sex_check <- function(counts, coldata, sex_col='sex') {
    xist <- counts['XIST', ]
    y_genes <- c('DDX3Y', 'RPS4Y1', 'UTY', 'KDM5D', 'EIF1AY')
    y_expr <- colSums(counts[intersect(y_genes, rownames(counts)), , drop=FALSE])
    predicted <- ifelse(y_expr > xist, 'M', 'F')
    mismatches <- predicted != coldata[[sex_col]]
    if (any(mismatches)) cat('Sex mismatches:', colnames(counts)[mismatches], '\n')
    return(predicted)
}
```

## Experimental Design Guidance

### Reference Level Selection

The reference level determines the direction of fold changes and the baseline for model coefficients. Set it explicitly **before** creating the DE dataset.

```r
# Always set reference explicitly -- do not rely on alphabetical order
metadata$condition <- factor(metadata$condition, levels=c('control', 'treated'))
# or
metadata$condition <- relevel(factor(metadata$condition), ref='control')
```

```python
metadata['condition'] = pd.Categorical(metadata['condition'], categories=['control', 'treated'], ordered=True)
```

Common mistake: forgetting to relevel, resulting in fold changes in the wrong direction (e.g., control vs treated instead of treated vs control).

### Paired Designs

For paired samples (e.g., tumor vs normal from the same patient), include the pairing variable in the model **before** the condition of interest. This absorbs inter-subject variability and dramatically increases statistical power.

```r
# Pairing variable MUST come before condition
design = ~ patient + condition

# edgeR alternative: blocking factor in design matrix
design_matrix <- model.matrix(~ patient + condition, data=coldata)
```

### Interaction Terms

Interaction models test whether the effect of one factor differs across levels of another (e.g., does drug response differ between genotypes?).

```r
# Full interaction model
design = ~ genotype + treatment + genotype:treatment

# The interaction coefficient represents the DIFFERENCE in treatment effect
# between genotypes -- NOT the overall treatment effect
# Main effect 'treatment' = treatment effect at REFERENCE level of genotype only
```

When interactions are present, the main effect coefficient applies only at the reference level of the other factor. This is a common misinterpretation.

### Confounding vs Batch Effects

If all treated samples were processed in batch 1 and all controls in batch 2, the batch effect is **perfectly confounded** with treatment and cannot be corrected by any statistical method. Check for confounding early:

```python
# Check for confounding between condition and batch
ct = pd.crosstab(metadata['condition'], metadata['batch'])
print(ct)
# Rows/columns with all zeros indicate complete confounding
```

## Merge Multiple Metadata Files

```python
def merge_metadata_files(files, on='sample_id'):
    '''Merge multiple metadata files.'''
    dfs = [pd.read_csv(f) for f in files]
    merged = dfs[0]
    for df in dfs[1:]:
        merged = merged.merge(df, on=on, how='outer')
    return merged.set_index(on)

# Usage
metadata = merge_metadata_files(['clinical.csv', 'sequencing.csv', 'qc.csv'])
```

## Related Skills

- expression-matrix/counts-ingest - Load count data
- expression-matrix/gene-id-mapping - Convert gene IDs
- expression-matrix/normalization - Normalize before visualization
- differential-expression/deseq2-basics - Downstream DE analysis
- differential-expression/batch-correction - Batch effect correction
- single-cell/preprocessing - Single-cell metadata handling

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.