baseline-extraction-for-clinical-trials

Name: baseline-extraction-for-clinical-trials
Author: aipoch/medical-research-skills

$npx mdskill add aipoch/medical-research-skills/baseline-extraction-for-clinical-trials

Extracts clinical trial baseline data from text or PMID.

Delivers structured study details for analytics workflows.
Integrates PubMed API with large language models.
Validates existence before generating detailed outputs.
Returns structured data ready for downstream processing.

SKILL.md

.github/skills/baseline-extraction-for-clinical-trialsView on GitHub ↗

---
name: baseline-extraction-for-clinical-trials
description: Extracts clinical trial baseline data (study, region, participants, etc.) from article text or PMID. Checks PubMed for metadata; always falls back to LLM extraction for full details.
license: MIT
author: aipoch
---
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)
# Baseline Extraction (RCT)

This skill extracts 10 key baseline characteristics from clinical trial articles. It implements a hybrid workflow:
1. **PubMed Lookup**: Checks PubMed API using the PMID to verify existence and get basic metadata.
2. **LLM Extraction**: Analyzes the article text to extract detailed baseline data (since PubMed metadata is limited).

## When to Use

- Use this skill when you need extracts clinical trial baseline data (study, region, participants, etc.) from article text or pmid. checks pubmed for metadata; always falls back to llm extraction for full details in a reproducible workflow.
- Use this skill when a data analytics task needs a packaged method instead of ad-hoc freeform output.
- Use this skill when the user expects a concrete deliverable, validation step, or file-based result.
- Use this skill when `scripts/extract_pdf.py` is the most direct path to complete the request.
- Use this skill when you need the `baseline-extraction for clinical trials` package behavior rather than a generic answer.

## Key Features

- Scope-focused workflow aligned to: Extracts clinical trial baseline data (study, region, participants, etc.) from article text or PMID. Checks PubMed for metadata; always falls back to LLM extraction for full details.
- Packaged executable path(s): `scripts/extract_pdf.py`.
- Reference material available in `references/` for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.

## Dependencies

- `Python`: `3.10+`. Repository baseline for current packaged skills.
- `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control.

## Example Usage

```bash
cd "20260316/scientific-skills/Data Analytics/baseline-extraction-for-clinical-trials"
python -m py_compile scripts/extract_pdf.py
python scripts/extract_pdf.py --help
```

Example run plan:
1. Confirm the user input, output path, and any required config values.
2. Edit the in-file `CONFIG` block or documented parameters if the script uses fixed settings.
3. Run `python scripts/extract_pdf.py` with the validated inputs.
4. Review the generated output and return the final artifact with any assumptions called out.

## Implementation Details

See `## Workflow` above for related details.

- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: `scripts/extract_pdf.py`.
- Reference guidance: `references/` contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

## Workflow

### Step 1: Check PubMed (Deterministic)

If the user provides a PMID, use the `baseline_extractor.py` script to check PubMed.

```python
import subprocess
import json

# Replace <PMID> with actual PMID
result = subprocess.run(["python", "scripts/baseline_extractor.py", "<PMID>"], capture_output=True, text=True)
print(result.stdout)
```

**Analyze the Script Output:**
* If `status` is `"success"`: **Stop here.** Return the `data` JSON to the user.
* If `status` is `"not_found"`, `"incomplete"`, or `"error"` (or if no PMID was provided): Proceed to Step 2.

### Step 2: LLM Extraction (Fallback)

If Step 1 did not yield a complete result, use the LLM to extract the information from the **full article text**.

**Input:**
* Full article text provided by the user.

**Instructions:**
1. Read the [Extraction Schema](references/extraction_schema.md) carefully.
2. Analyze the text to identify all 10 required fields.
3. Ensure the output is strictly in the JSON format defined in the schema.
4. **Constraint**: Do not hallucinate. If a field is not mentioned in the text, set it to `null` or an empty string.

## Output

Return the final result as a Markdown code block containing the JSON object.

```json
{
"study": "...",
"region": "...",
...
}
```

## Helper Scripts

### PDF Text Extraction

When the user provides a PDF file path, use `extract_pdf.py` to extract the text content before assessment:

More from aipoch/medical-research-skills

Skill	Description
3d-molecule-ray-tracer	Generate photorealistic rendering scripts for PyMOL and UCSF ChimeraX.
abstract-summarizer	Transform lengthy academic papers into concise, structured 250-word abstracts.
abstract-trimmer	Precision editing tool that reduces abstract word count through intelligent compression techniques, maintaining scientific rigor while meeting strict journal and conference requirements.
academic-abstract-refiner	Refines long medical academic texts into SCI-style unstructured Chinese and English abstracts; use when you need to condense drafts/reports/summaries into bilingual abstracts and generate Summary_Report.md.
academic-cv-generator	Generate structured academic CVs from free-form Chinese/English text and export to Word (.docx). Use this skill when you are asked to organize, generate, or optimize an academic CV (e.g., publications/projects/awards) into a consistent, formatted document with uniform-colored section headers and optional bilingual output.
academic-highlight-generator	Generates submission-ready Elsevier/SCI Highlights from manuscript text or extracted PDF/DOCX/TXT content. Use when a user needs 3-5 concise, evidence-grounded highlight bullets for a research paper, review, meta-analysis, case report, or bioinformatics manuscript.
academic-norm-review	Detects content similarity, verifies standardized citations and abbreviations, and flags potential academic integrity risks; use it before submission, during academic writing QA, or for compliance reviews.
academic-poster-generator	Complete workflow for generating academic research posters from PDF literature; use when you need to extract paper content from PDFs and produce a LaTeX-based poster (beamerposter/tikzposter/baposter) with mandatory figure generation and a final rendered HTML deliverable.
acronym-unpacker	Intelligent medical abbreviation disambiguation tool that resolves ambiguous acronyms using clinical context, specialty-specific knowledge, and document-level semantic analysis.
active-comparator-single-soc-faers-safety-comparison	Generates complete FAERS pharmacovigilance study designs for multi-drug or class-level safety comparison inside one predefined SOC or AE family using active comparators, disproportionality analysis, subgroup characterization, and reviewer-facing evidence control.