Text to Image with Locked Variations
Welcome to the Cadillac of ComfyUI workflows — this one’s designed to give you stunning image variations while preserving your original composition like it owes you rent. With a ControlNet depth map and strategic prompt conditioning, this setup keeps your scene structure locked in while letting your creativity run wild in style, lighting, or mood. Perfect for when you want to say “mountains,” but in 16 different dialects of awesome.
🧠 What This Workflow Does
This ComfyUI workflow:
- Generates a base image using a stable prompt and ControlNet depth conditioning.
- Reuses the same latent and depth structure across multiple prompt variations.
- Produces visually consistent scenes with stylistic and time-of-day variations (think: “mountains by day” vs “mountains at sunset”).
- Saves all outputs for your convenience (because we're civilized).
🗺️ Workflow Overview
The pipeline can be conceptually broken down into 3 main stages:
1. Core Image Composition
- Prompt: "mountain landscape, digital painting, masterpiece"
- ControlNet (Depth Preprocessor): enforces structure via a depth map
- Generates an initial image and latent state.
2. Prompt Variation with Latent Reuse
- Prompt changes (e.g., "mountain landscape, at night..." and "mountain landscape, at sunset...")
- Reuses the same latent and ControlNet depth map
- Creates stylistic variations with identical composition.
3. Output & Preview
- Each variation is decoded and saved
- Optional image preview node included (because seeing is believing).
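If you ever want to drive this from a script instead of the canvas, the same baseline pass can be expressed in the API-format JSON that ComfyUI exports via "Save (API Format)". Below is a minimal Python sketch of that stage-1 wiring; the node IDs are illustrative rather than copied from this exact graph, so treat it as a map, not the territory.

```python
# Minimal sketch of the baseline pass (stage 1) in ComfyUI API-format JSON.
# Node IDs ("1", "2", ...) are illustrative; export your own graph with
# "Save (API Format)" to get the real ones.
baseline_graph = {
    "2": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "dreamshaper_8.safetensors"}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["2", 1],
                     "text": "mountain landscape, digital painting, masterpiece"}},
    "4": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["2", 1], "text": "ugly, deformed"}},
    "5": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 768, "batch_size": 1}},
    "7": {"class_type": "VAELoader",
          "inputs": {"vae_name": "vae-ft-mse-840000-ema-pruned.safetensors"}},
    "1": {"class_type": "KSampler",
          "inputs": {"model": ["2", 0], "positive": ["3", 0], "negative": ["4", 0],
                     "latent_image": ["5", 0], "seed": 42, "steps": 25, "cfg": 7,
                     "sampler_name": "dpmpp_2m", "scheduler": "karras", "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["1", 0], "vae": ["7", 0]}},
    "9": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "ComfyUI"}},
}
```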
🧩 Node-by-Node Breakdown
Setup & Base Latent Creation
- CheckpointLoaderSimple (Node 2): Loads dreamshaper_8.safetensors. Also supplies the model + CLIP backbone.
- EmptyLatentImage (Node 5): Sets the image dimensions to 512x768, batch size of 1. Think of it as the blank canvas—before we start slapping paint on.
- VAELoader (Node 7): Uses vae-ft-mse-840000-ema-pruned.safetensors for image decoding.
- CLIPTextEncode (Node 3): Encodes the main prompt "mountain landscape, digital painting, masterpiece".
- CLIPTextEncode (Node 4): Encodes the negative prompt "ugly, deformed"—because no one asked for cursed mountain goblins.
The Breakdown
- CheckpointLoaderSimple (Node 2)
- CLIPTextEncode (Nodes 3, 4, 13, 17)
- EmptyLatentImage (Node 5)
- KSampler (Nodes 1, 11, 15)
- VAELoader (Node 7)
- VAEDecode (Nodes 6, 12, 16)
- SaveImage (Nodes 9, 10, 14)
- ControlNetLoader (Nodes 22, 26)
- AV_ControlNetPreprocessor (Node 18)
- ControlNetApplyAdvanced (Nodes 24, 25)
- PreviewImage (Node 19)
- Models Used
CheckpointLoaderSimple (Node 2)
Purpose: Loads the base model that actually knows how to paint pixels into dreams.
Model: dreamshaper_8.safetensors
Outputs:
- MODEL → used by all KSampler nodes
- CLIP → used for text encoding
- VAE → (optional; not used here since the VAE is loaded explicitly)
Notes: DreamShaper is popular for striking a nice balance between realism and stylization. Good for both fantasy and photorealistic content.
CLIPTextEncode (Nodes 3, 4, 13, 17)
Purpose: Converts text prompts into vectorized concepts. It's the translator from human to AI whisperer.
Inputs:
- CLIP → comes from the CheckpointLoader
- text → your juicy prompts
Outputs:
- CONDITIONING → goes into samplers and ControlNet magic
Key Prompts Used:
- "mountain landscape, digital painting, masterpiece"
- "ugly, deformed" (negative prompt)
- "mountain landscape, at night, digital painting, masterpiece"
- "mountain landscape, at sunset, digital painting, masterpiece"
EmptyLatentImage (Node 5)
Purpose: Creates the blank latent canvas that every sampling pass starts from.
Settings:
- Width: 512
- Height: 768
- Batch Size: 1
Outputs: LATENT tensor → used in all KSampler passes.
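Side note on what that LATENT actually is: SD 1.5's VAE downsamples each spatial dimension by 8 and uses 4 latent channels, so the 512x768 canvas becomes a much smaller tensor under the hood. A quick back-of-the-envelope check:

```python
# EmptyLatentImage produces a zero tensor of shape [batch, 4, height // 8, width // 8]
# for SD 1.5-family models (the VAE downsamples each spatial dimension by 8).
width, height, batch_size = 512, 768, 1

latent_shape = (batch_size, 4, height // 8, width // 8)
print(latent_shape)  # (1, 4, 96, 64)
```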
KSampler (Nodes 1, 11, 15)
Purpose: The engine room. Takes prompts, latents, and models to produce new latent images.
Settings Shared Across All Instances:
- Sampler: dpmpp_2m
- Scheduler: karras
- Steps: 25
- CFG: 7
- Denoise: 1
- Seed: Random (unless you want reproducibility)
Input Triplets:
- MODEL + CONDITIONING (positive/negative) + LATENT → LATENT
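Since every KSampler here shares the same settings, scripting types may want to factor them out and vary only the conditioning and seed. A small helper sketch (field names follow ComfyUI's API-format export; node references are illustrative, so verify them against your own graph):

```python
# Shared sampler settings for every KSampler in this workflow.
SHARED_SAMPLER_SETTINGS = {
    "steps": 25,
    "cfg": 7,
    "sampler_name": "dpmpp_2m",
    "scheduler": "karras",
    "denoise": 1.0,
}

def ksampler_node(model, positive, negative, latent, seed):
    """Build one API-format KSampler node from the shared settings.

    `model`, `positive`, `negative`, and `latent` are [node_id, output_index]
    references into the rest of the graph.
    """
    return {
        "class_type": "KSampler",
        "inputs": {
            "model": model,
            "positive": positive,
            "negative": negative,
            "latent_image": latent,
            "seed": seed,
            **SHARED_SAMPLER_SETTINGS,
        },
    }

# Baseline and night passes differ only in their conditioning sources.
base_pass = ksampler_node(["2", 0], ["3", 0], ["4", 0], ["5", 0], seed=42)
night_pass = ksampler_node(["2", 0], ["24", 0], ["24", 1], ["5", 0], seed=42)
```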
VAELoader (Node 7)
Purpose: Loads the VAE model used to decode latent images back into full-res output.
Model: vae-ft-mse-840000-ema-pruned.safetensors (yes, she’s got a long name, but she delivers)
Output: VAE → connected to all VAEDecode nodes
VAEDecode (Nodes 6, 12, 16)
Purpose: Translates final latent tensors back into actual images. This is where your art comes alive.
Input: LATENT + VAE
Output: IMAGE → saved or previewed
SaveImage (Nodes 9, 10, 14)
Purpose: Saves generated images with filenames like “ComfyUI_####.png”
Input: IMAGE
Output: To your filesystem, obviously.
ControlNetLoader (Nodes 22, 26)
Purpose: Loads a ControlNet module for enforcing structure using an auxiliary signal (depth, in this case).
Model: control_v11f1p_sd15_depth_fp16.safetensors
Output: CONTROL_NET → fed into the advanced ControlNet processor
AV_ControlNetPreprocessor (Node 18)
Purpose: Generates a depth map from the base image using a fancy pants preprocessor.
Settings:
- Preprocessor: depth_midas
- SD Version: sd15
- Resolution: 512
Output: IMAGE (depth map) → sent to ControlNetApplyAdvanced
ControlNetApplyAdvanced (Nodes 24, 25)
Purpose: Applies ControlNet conditioning to your prompt vectors.
Inputs:
- CONDITIONING (positive + negative)
- CONTROL_NET (from Loader)
- IMAGE (depth map from Preprocessor)
Strength: 0.83
Start/End %: 0 → 1 (applies throughout the entire diffusion process)
Output: New conditioned prompts → fed to KSampler
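In API-format terms, that boils down to one node taking both conditioning streams plus the ControlNet and the depth image, and handing back new positive/negative conditioning. A rough sketch with illustrative node references (double-check the IDs and field names against your own export):

```python
# ControlNetApplyAdvanced: wraps the prompt conditioning with depth guidance.
# ["18", 0] is assumed to be the depth map from the preprocessor node,
# ["22", 0] the loaded ControlNet, ["13", 0] / ["4", 0] the prompt encodings.
controlnet_apply = {
    "class_type": "ControlNetApplyAdvanced",
    "inputs": {
        "positive": ["13", 0],      # variation prompt conditioning
        "negative": ["4", 0],       # shared negative prompt conditioning
        "control_net": ["22", 0],   # control_v11f1p_sd15_depth_fp16.safetensors
        "image": ["18", 0],         # depth map from AV_ControlNetPreprocessor
        "strength": 0.83,
        "start_percent": 0.0,
        "end_percent": 1.0,
    },
}
```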
PreviewImage (Node 19)
Purpose: Displays the ControlNet depth map as a quick visual sanity check. Optional, but helpful.
Models Used
| Model Type | File Used | Purpose |
|---|---|---|
| Checkpoint | dreamshaper_8.safetensors | Core image generation model |
| VAE | vae-ft-mse-840000-ema-pruned.safetensors | Decoding latent to image |
| ControlNet | control_v11f1p_sd15_depth_fp16.safetensors | Depth conditioning |
| CLIP Text Encoder | Included in the base checkpoint | Text-to-conditioning encoder |
| Preprocessor | depth_midas (via AV_ControlNetPreprocessor) | Generates the depth input image |
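Before hitting Queue, it's worth confirming those files actually live where ComfyUI looks for them (checkpoints under models/checkpoints, VAEs under models/vae, ControlNets under models/controlnet). A quick sanity-check script; the install path is an assumption, so point it at your own setup:

```python
from pathlib import Path

# Adjust to wherever your ComfyUI install lives (assumption, not a universal path).
COMFYUI_DIR = Path.home() / "ComfyUI"

REQUIRED_MODELS = {
    "models/checkpoints/dreamshaper_8.safetensors": "checkpoint",
    "models/vae/vae-ft-mse-840000-ema-pruned.safetensors": "VAE",
    "models/controlnet/control_v11f1p_sd15_depth_fp16.safetensors": "ControlNet",
}

for rel_path, kind in REQUIRED_MODELS.items():
    path = COMFYUI_DIR / rel_path
    status = "found" if path.exists() else "MISSING"
    print(f"[{status}] {kind}: {path}")
```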
🔄 First Image Generation Pass (Baseline)
- KSampler (Node 1) Takes in the base latent plus positive and negative conditioning, and outputs a latent image.
- VAEDecode (Node 6) Decodes the latent into an actual image.
- SaveImage (Node 9) Saves the image. You're welcome.
- AV_ControlNetPreprocessor (Node 18)
Extracts a depth map from the decoded base image using the depth_midas preprocessor. Resolution: 512.
🎨 Prompt Variations (Same Composition, Different Mood)
Each variation follows this trio:
➕ New Prompt Conditioning
- CLIPTextEncode (Nodes 13 & 17)
New positive prompts:
- "mountain landscape, at night, digital painting, masterpiece"
- "mountain landscape, at sunset, digital painting, masterpiece"
🔗 ControlNet Conditioning
- ControlNetLoader (Nodes 22 & 26)
Loads control_v11f1p_sd15_depth_fp16.safetensors for both variations.
- ControlNetApplyAdvanced (Nodes 24 & 25)
Applies ControlNet to each prompt with:
- Strength: 0.83
- Range: 0 to 1 (full generation span)
- Shares the preprocessed depth image from Node 18.
🌀 Sampling Passes (Reusing Latent)
- KSampler (Nodes 11 & 15)
Feeds in:
- Same latent from Node 5
- Prompt variations + negative conditioning
- Outputs new latent samples for decoding
- VAEDecode (Nodes 12 & 16) Converts those latents back into images.
- SaveImage (Nodes 10 & 14) Saves those glorious variations.
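To make the "same latent, different mood" idea concrete, here's a trimmed sketch of how the two variation samplers could sit side by side in API-format JSON. The node numbers mirror the ones used above but are illustrative; the point is that only the positive/negative conditioning changes between them:

```python
# Both variation KSamplers reuse the same empty latent (node 5) and model (node 2);
# only the conditioning (night vs. sunset, already run through ControlNet) differs.
shared = {"model": ["2", 0], "latent_image": ["5", 0],
          "seed": 42, "steps": 25, "cfg": 7,
          "sampler_name": "dpmpp_2m", "scheduler": "karras", "denoise": 1.0}

variation_samplers = {
    "11": {"class_type": "KSampler",  # night pass
           "inputs": {**shared, "positive": ["24", 0], "negative": ["24", 1]}},
    "15": {"class_type": "KSampler",  # sunset pass
           "inputs": {**shared, "positive": ["25", 0], "negative": ["25", 1]}},
}
```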
🔍 Bonus: Image Preview
- PreviewImage (Node 19) Linked to the ControlNet-preprocessed image. Lets you visually confirm the depth map. Optional but helpful when tweaking.
🛠️ Recommended Usage Tips
- Change only the text prompt on the variation CLIP encoders (Nodes 13/17) to explore lighting, color styles, or artistic direction without breaking composition.
- Keep the latent image and depth ControlNet the same to retain scene structure.
- Adjust the denoise strength (default = 1) in the KSamplers (Nodes 11 & 15) to control how heavily each variation pass reworks its starting latent; for prompt adherence, reach for CFG instead.
- Seed randomization is enabled. Lock it if you want reproducibility; a scripted version of this prompt-swap loop is sketched below.
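If you'd rather script that prompt-swap loop than click through it, one option is to load an API-format export and push each variation to a running ComfyUI server (POST /prompt on the default 127.0.0.1:8188). The sketch below assumes a workflow_api.json export and that node "13" / node "11" are your variation CLIPTextEncode and KSampler; adjust those IDs to whatever your export actually contains:

```python
import copy
import json
import urllib.request

# Assumptions: workflow exported via "Save (API Format)" to workflow_api.json,
# node "13" is a variation CLIPTextEncode, node "11" its KSampler, and a
# ComfyUI server is listening on the default address.
COMFYUI_URL = "http://127.0.0.1:8188/prompt"
PROMPT_NODE_ID = "13"
SAMPLER_NODE_ID = "11"

with open("workflow_api.json", "r", encoding="utf-8") as f:
    base_workflow = json.load(f)

variations = [
    "mountain landscape, at night, digital painting, masterpiece",
    "mountain landscape, at sunset, digital painting, masterpiece",
    "mountain landscape, in heavy fog, digital painting, masterpiece",  # extra example
]

for text in variations:
    wf = copy.deepcopy(base_workflow)
    wf[PROMPT_NODE_ID]["inputs"]["text"] = text   # swap only the prompt
    wf[SAMPLER_NODE_ID]["inputs"]["seed"] = 42    # lock the seed for reproducibility
    payload = json.dumps({"prompt": wf}).encode("utf-8")
    req = urllib.request.Request(COMFYUI_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(text, "->", resp.status)
```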
📦 Output Summary
| Image Type | Description | Saved? |
|---|---|---|
| Base image | Pure prompt output | ✅ |
| Depth map preview | Preprocessed ControlNet input | 👁️ |
| Night variation | Prompt: "at night" | ✅ |
| Sunset variation | Prompt: "at sunset" | ✅ |
🔥 What Not to Do Unless You Want a Fire
⚠️ Go rogue with dimensions: Changing the image size mid-workflow (in EmptyLatentImage or ControlNet Preprocessor) breaks alignment. You’ll get Picasso faces in a Dali background.
⚠️ Mix ControlNet types mid-stream: Don’t swap depth_midas for pose, lineart, or anything else unless you’re also updating the conditioning method, prompts, and probably sacrificing a goat.
⚠️ Use wildly unrelated style prompts: Throwing "cyberpunk chicken nugget tornado" at a base image of a serene forest won’t result in inspired fusion — just chaotic soup.
⚠️ Mismatch VAEs and checkpoints: Some VAEs work better with certain model families. If you mix and match, expect weird color shifts or melted features.
⚠️ Overcook CFG or Steps: CFG > 15? You’re asking for prompt obsession. Steps > 50? Diminishing returns and slower gen for zero payoff.
⚠️ Don’t forget the negative prompt: Seriously, use "ugly, deformed" or your mountains will have six eyeballs.
🚀 Conclusion
This workflow is a power user’s dream: it gives you structured, repeatable image generation with the flexibility to explore multiple artistic angles. And thanks to ControlNet’s depth preservation and ComfyUI’s node magic, you can get Pinterest-perfect results with just a prompt tweak.
So go forth, vary your vibes—but keep your mountains steady.