aliyun-wan-digital-human
$
npx mdskill add cinience/alicloud-skills/aliyun-wan-digital-humanGenerates talking, singing, or presentation videos from image and audio using Alibaba Cloud Model Studio
- Solves the task of creating animated characters for narration, singing, or presentations
- Relies on Alibaba Cloud Model Studio APIs and the `wan2.2-s2v` digital-human model
- Validates input image with `wan2.2-s2v-detect` before video generation
- Delivers output videos by processing image and audio inputs into synchronized character animations
SKILL.md
.github/skills/aliyun-wan-digital-humanView on GitHub ↗
--- name: aliyun-wan-digital-human description: Use when generating talking, singing, or presentation videos from a single character image and audio with Alibaba Cloud Model Studio digital-human model `wan2.2-s2v`. Use when creating narrated avatar videos, singing portraits, or broadcast-style talking-head clips. version: 1.0.0 --- Category: provider # Model Studio Digital Human ## Validation ```bash mkdir -p output/aliyun-wan-digital-human python -m py_compile skills/ai/video/aliyun-wan-digital-human/scripts/prepare_digital_human_request.py && echo "py_compile_ok" > output/aliyun-wan-digital-human/validate.txt ``` Pass criteria: command exits 0 and `output/aliyun-wan-digital-human/validate.txt` is generated. ## Output And Evidence - Save normalized request payloads, chosen resolution, and task polling snapshots under `output/aliyun-wan-digital-human/`. - Record image/audio URLs and whether the input image passed detection. Use this skill for image + audio driven speaking, singing, or presenting characters. ## Critical model names Use these exact model strings: - `wan2.2-s2v-detect` - `wan2.2-s2v` Selection guidance: - Run `wan2.2-s2v-detect` first to validate the image. - Use `wan2.2-s2v` for the actual video generation job. ## Prerequisites - China mainland (Beijing) only. - Set `DASHSCOPE_API_KEY` in your environment, or add `dashscope_api_key` to `~/.alibabacloud/credentials`. - Input audio should contain clear speech or singing, and input image should depict a clear subject. ## Normalized interface (video.digital_human) ### Detect Request - `model` (string, optional): default `wan2.2-s2v-detect` - `image_url` (string, required) ### Generate Request - `model` (string, optional): default `wan2.2-s2v` - `image_url` (string, required) - `audio_url` (string, required) - `resolution` (string, optional): `480P` or `720P` - `scenario` (string, optional): `talk`, `sing`, or `perform` ### Response - `task_id` (string) - `task_status` (string) - `video_url` (string, when finished) ## Quick start ```bash python skills/ai/video/aliyun-wan-digital-human/scripts/prepare_digital_human_request.py \ --image-url "https://example.com/anchor.png" \ --audio-url "https://example.com/voice.mp3" \ --resolution 720P \ --scenario talk ``` ## Operational guidance - Use a portrait, half-body, or full-body image with a clear face and stable framing. - Match audio length to the desired output duration; the output follows the audio length up to the model limit. - Keep image and audio as public HTTP/HTTPS URLs. - If the image fails detection, do not proceed directly to video generation. ## Output location - Default output: `output/aliyun-wan-digital-human/request.json` - Override base dir with `OUTPUT_DIR`. ## References - `references/sources.md`