reference-finder

$ npx mdskill add aipoch/medical-research-skills/reference-finder

Match scientific claims to PubMed papers instantly.

  • Grounds every sentence with top-ranked citation evidence.
  • Connects to the official PubMed E-utilities API exclusively.
  • Ranks matches by keyword overlap, year, and citation count.
  • Delivers structured titles, DOIs, PMIDs, and reasoning per sentence.

---
name: reference-finder
description: Automatically finds and ranks PubMed references for each sentence in scientific text; use when you need titles, DOIs, and brief recommendation reasons from the PubMed E-utilities API.
license: MIT
author: aipoch
---
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- You have a scientific paragraph and want suggested PubMed papers for **each sentence**.
- You need **top-ranked references** with **title, DOI, PMID, year**, and a short **why recommended** explanation.
- You are drafting or reviewing a manuscript and want quick **literature grounding** for key claims.
- You want a lightweight reference matcher that uses **only the official PubMed E-utilities API** (no third-party services).
- You need a scriptable tool for batch or CLI workflows to generate candidate citations.

## Key Features

- Sentence-level reference matching for scientific text.
- Returns the **top N (default: 3)** most relevant PubMed records per sentence.
- Outputs structured fields: **title, DOI, PMID, year, recommendation reason**.
- Relevance ranking based on:
  - keyword overlap / match strength,
  - publication year preference,
  - citation-count signal (when available/derivable).
- Safety constraints:
  - Network access restricted to `eutils.ncbi.nlm.nih.gov`.
  - No local filesystem writes except to `outputs/` during execution.
  - Request timeout set to **30 seconds** with clear error messages.
- Supports Python API usage and CLI usage (including interactive mode).

## Dependencies

- Python **3.x** (standard library only; no third-party packages required). The f-strings in the examples below need Python 3.6 or later.

## Example Usage

### Python (direct call)

```python
from reference_finder import find_references

text = "CRISPR-Cas9 gene editing has revolutionized biomedical research."

# Query PubMed for candidate references matching the input text.
results = find_references(text)

# Print the top three results with their identifiers and the ranking rationale.
for ref in results[:3]:
    print(f"- {ref['title']} ({ref['year']})")
    print(f"  DOI: {ref['doi']}")
    print(f"  PMID: {ref['pmid']}")
    print(f"  Reason: {ref['reason']}")
```

### CLI (single input)

```bash
python scripts/find_refs.py "CRISPR-Cas9 gene editing has revolutionized biomedical research."
```

### CLI (interactive mode)

```bash
python scripts/find_refs.py
```

### Example output (JSON)

```json
[
  {
    "pmid": "PMID:",
    "title": "A Programmable Dual-RNA-Guided DNA Endonuclease in Adaptive Bacterial Immunity",
    "doi": "10.1126/science.1225829",
    "year": 2012,
    "reason": "Highest keyword match for 'CRISPR-Cas9', foundational paper"
  }
]
```

## Implementation Details

### Data flow

1. **Sentence splitting**: The input text is split into sentences (implementation-defined; typically punctuation-based).
2. **PubMed search (ESearch)**: For each sentence, a query is sent to:
   - `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi`
3. **Record retrieval (EFetch)**: The top candidate PMIDs are fetched via:
   - `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi`
4. **Field extraction**: Title, year, PMID, and DOI (when present) are extracted from the returned metadata.
5. **Ranking and selection**: Candidates are scored and the top **N** are returned with a short recommendation reason.
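
Steps 1 and 4 above can be tried offline. The following is a minimal sketch, not the skill's actual code: `split_sentences` and `extract_fields` are hypothetical helper names, the regex splitter is a naive assumption, and `SAMPLE_XML` is a hand-written placeholder record in the general shape of PubMed EFetch XML (the PMID, title, and DOI in it are dummy values).

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text):
    """Naive punctuation-based splitter (assumption; the real splitter may differ)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_fields(efetch_xml):
    """Extract title, year, PMID, and DOI (when present) from EFetch-style XML."""
    records = []
    root = ET.fromstring(efetch_xml)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")

        # Titles may contain inline markup, so join all nested text.
        title_node = article.find("MedlineCitation/Article/ArticleTitle")
        title = "".join(title_node.itertext()).strip() if title_node is not None else None

        # Some records carry a free-text date instead of <Year>; treat that as missing.
        year_text = article.findtext(
            "MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"
        )
        year = int(year_text) if year_text and year_text.isdigit() else None

        # Only look at the article's own ID list, not IDs of cited references.
        doi = None
        for id_node in article.findall("PubmedData/ArticleIdList/ArticleId"):
            if id_node.get("IdType") == "doi":
                doi = id_node.text
                break

        records.append({"pmid": pmid, "title": title, "year": year, "doi": doi})
    return records


# Placeholder record for illustration only (dummy identifiers).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue></Journal>
        <ArticleTitle>Placeholder title for illustration</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">00000000</ArticleId>
        <ArticleId IdType="doi">10.0000/placeholder</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

if __name__ == "__main__":
    print(split_sentences("CRISPR works. It is fast! Is it safe?"))
    print(extract_fields(SAMPLE_XML))
```

A record without a DOI or a `<Year>` element yields `None` for that field rather than raising, which matches the "DOI (when present)" wording in step 4.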

### Ranking signals

- **Keyword match**: Measures overlap between sentence terms and retrieved record metadata (e.g., title/abstract terms when available).
- **Publication year**: Used as a preference signal (e.g., favoring more recent work unless a classic/foundational match is strong).
- **Citation count**: Incorporated when available/derivable; otherwise treated as missing without failing the run.
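
One way to combine these three signals is a weighted sum. The sketch below is illustrative only and is not taken from the skill's source: the weights, the stopword list, the 50-year recency window, the 1000-citation cap, and the `abstract` and `citations` record keys are all assumptions chosen for the example.

```python
import re
from datetime import date

# Small illustrative stopword list (assumption).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "has", "is", "for", "to", "with"}


def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def score(sentence, record, current_year=None, w_kw=0.6, w_year=0.2, w_cite=0.2):
    """Weighted sum of keyword overlap, recency, and citation signal (all in 0..1)."""
    if current_year is None:
        current_year = date.today().year

    # Keyword match: share of sentence terms found in the title/abstract.
    s_terms = tokenize(sentence)
    r_text = (record.get("title") or "") + " " + (record.get("abstract") or "")
    overlap = len(s_terms & tokenize(r_text)) / len(s_terms) if s_terms else 0.0

    # Publication year: linear decay over an assumed 50-year window.
    year = record.get("year")
    if year is None:
        recency = 0.0
    else:
        recency = min(1.0, max(0.0, 1.0 - (current_year - year) / 50.0))

    # Citation count: a missing value contributes nothing instead of failing.
    cites = record.get("citations")
    cite_signal = 0.0 if cites is None else min(cites, 1000) / 1000.0

    return w_kw * overlap + w_year * recency + w_cite * cite_signal


def top_n(sentence, records, n=3, current_year=None):
    """Return the n highest-scoring records for a sentence."""
    ranked = sorted(
        records, key=lambda r: score(sentence, r, current_year), reverse=True
    )
    return ranked[:n]
```

Giving keyword overlap the largest weight lets a strongly matching older paper outrank a recent but unrelated one, which is the behavior the publication-year bullet describes for classic or foundational matches.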

### Operational constraints and safety

- **Allowed network host**: `eutils.ncbi.nlm.nih.gov` only.
- **Prohibited**: Any third-party URLs.
- **Filesystem**: Do not write outside `outputs/` during execution.
- **Rate limiting**: Space requests out (e.g., **~0.5s** apart) to stay within NCBI's limit of 3 requests per second without an API key.
- **Timeout**: **30 seconds** per request.
- **Error handling**: Return descriptive, user-readable error messages for network, API, and parse failures.
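
The host restriction, request spacing, timeout, and error reporting listed above could be enforced in a single request wrapper. This is a hedged sketch using only the standard library; `check_url` and `guarded_get` are hypothetical names, and rejecting non-HTTPS URLs is an extra assumption, not something the constraints above state.

```python
import socket
import time
import urllib.error
import urllib.request
from urllib.parse import urlparse

ALLOWED_HOST = "eutils.ncbi.nlm.nih.gov"
TIMEOUT_SECONDS = 30
MIN_INTERVAL_SECONDS = 0.5

_last_request = None  # monotonic timestamp of the previous request, if any


def check_url(url):
    """Raise ValueError unless the URL targets the allowed host over HTTPS."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname != ALLOWED_HOST:
        raise ValueError(f"Blocked URL (only https://{ALLOWED_HOST} is allowed): {url}")
    return url


def guarded_get(url):
    """Fetch a URL with host check, request spacing, timeout, and readable errors."""
    global _last_request
    check_url(url)

    # Wait until at least MIN_INTERVAL_SECONDS have passed since the last request.
    if _last_request is not None:
        wait = MIN_INTERVAL_SECONDS - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
    _last_request = time.monotonic()

    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:  # must come before URLError (subclass)
        raise RuntimeError(f"PubMed API returned HTTP {exc.code} for {url}") from exc
    except urllib.error.URLError as exc:
        raise RuntimeError(f"Network error contacting PubMed: {exc.reason}") from exc
    except (TimeoutError, socket.timeout) as exc:
        raise RuntimeError(
            f"PubMed request timed out after {TIMEOUT_SECONDS}s"
        ) from exc
```

Comparing the parsed hostname exactly, rather than checking whether the URL string contains the host name, keeps look-alike hosts such as `eutils.ncbi.nlm.nih.gov.example.com` from passing the check.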

### Defaults

- **Top references per sentence**: 3
- **Endpoints**:
  - ESearch: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi`
  - EFetch: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi`
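
As a rough illustration of how these defaults might be wired into request URLs, the sketch below builds ESearch and EFetch query strings with `urllib.parse.urlencode`. The function names, the `retmax=20` candidate pool size, and the choice of `retmode=json` for ESearch are assumptions for the example; `db`, `term`, `id`, `retmax`, `retmode`, and `sort` are standard E-utilities parameters.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
TOP_N = 3  # default number of references returned per sentence


def build_esearch_url(query, retmax=20):
    """Build an ESearch URL that returns up to retmax PMIDs, sorted by relevance."""
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": retmax,  # candidate pool size (assumption)
        "retmode": "json",
        "sort": "relevance",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def build_efetch_url(pmids):
    """Build an EFetch URL that retrieves the given PMIDs as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url("CRISPR-Cas9 gene editing"))
```

Fetching a candidate pool larger than `TOP_N` leaves room for the ranking step to reorder results before the top three are kept.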

### Related project files

- Main script: `scripts/find_refs.py`
- Tests: `tests/test_finder.py`
- Evaluation checklist: `references/evaluation-checklist.md`
- PubMed E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25504/
