CLIP Text Encode (Prompt)
Welcome to the beautiful mess of natural language encoding in machine learning, where “a fox wearing sunglasses in the style of Blade Runner” is magically converted into something the model can actually understand. The CLIP Text Encode (Prompt) node in ComfyUI is your front door to this black box of sorcery.
🧠 What Does This Node Do?
The CLIP Text Encode (Prompt) node takes human-readable text prompts and encodes them into a numerical representation (also called an embedding) using the CLIP (Contrastive Language–Image Pre-training) model. This embedding is what downstream nodes use to guide image generation.
In other words, this node turns “cyberpunk samurai with glowing katana” into multi-dimensional fairy dust that the diffusion model will happily interpret as art. No, it doesn’t make coffee — yet.
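To make “embedding” a little more concrete, here is a rough, standalone sketch using the Hugging Face transformers CLIP text encoder (ViT-L/14, the text model used by SD 1.x-style checkpoints). This is not ComfyUI’s internal code path; the model ID and output shape are assumptions about that particular setup, but the idea is the same: tokens in, a grid of numbers out.

```python
# Rough illustration of "prompt -> embedding", NOT ComfyUI's actual code path.
# Assumes the ViT-L/14 text encoder that SD 1.x-style checkpoints use.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "cyberpunk samurai with glowing katana"

# Text -> token IDs, padded/truncated to CLIP's fixed 77-token context window.
tokens = tokenizer(
    prompt,
    padding="max_length",
    truncation=True,
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)

with torch.no_grad():
    output = text_encoder(**tokens)

# One vector per token position: shape [1, 77, 768] for this encoder.
# Downstream, a sampler uses this grid of numbers as its conditioning.
embeddings = output.last_hidden_state
print(embeddings.shape)
```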
🔧 Inputs
• clip (CLIP model)
  - Type: CLIP
  - Required: Yes
  - Description: The CLIP model used to process the text prompt. This typically comes from the Load Checkpoint node, or it can be overridden using a CLIP Set Last Layer node.
  - Gotchas:
    - Must be a CLIP model that is compatible with the checkpoint used in your pipeline.
    - Mismatching CLIP and checkpoint will result in output weirdness or complete garbage. You’ve been warned.
📤 Outputs
• CONDITIONING
  - Type: CONDITIONING
  - Description: The encoded result of your text prompt, which is used by samplers (like KSampler) to generate the actual image.
🛠️ Settings & Parameters
• text (Prompt Input)
  - Type: string
  - Required: Yes
  - Description: This is your prompt — the creative command center of your image. This string is sent through the CLIP model for embedding.
  - Tips:
    - More descriptive prompts yield better results (usually).
    - Use commas to break concepts cleanly: portrait, cyberpunk lighting, intense gaze, soft shadows.
    - Avoid overly complex grammar. The model’s not writing a novel — it just needs concept clarity. (A quick token-count sketch follows after this list.)
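If you want to see how a comma-separated prompt tokenizes, and how much of CLIP’s 77-token window it uses, here is a quick diagnostic sketch. It assumes the standard ViT-L/14 tokenizer from Hugging Face; ComfyUI does its own tokenization internally, but for SD 1.x-style models the underlying vocabulary should be the same.

```python
# Quick diagnostic: how does a comma-separated prompt tokenize, and how much
# of CLIP's 77-token window does it use? Assumes the ViT-L/14 tokenizer;
# ComfyUI tokenizes internally, but the BPE vocabulary should match for SD 1.x.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "portrait, cyberpunk lighting, intense gaze, soft shadows"
token_ids = tokenizer(prompt)["input_ids"]  # includes start/end markers

print(f"{len(token_ids)} of {tokenizer.model_max_length} token slots used")
print(tokenizer.convert_ids_to_tokens(token_ids))
```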
💡 Recommended Use Cases
- Standard Prompt Encoding: Feeding your primary prompt into the diffusion pipeline.
- Multi-Condition Workflows: Combine multiple encoded prompts using nodes like Combine Conditioning.
- CLIP Layer Customization: Works with CLIP Set Last Layer for advanced prompt tuning via specific layer truncation.
🔄 Workflow Setup Example
Here’s a simple chain to illustrate how this node fits into your workflow:
```
[Load Checkpoint]
        ↓ CLIP
[CLIP Text Encode (Prompt)]
        ↓ CONDITIONING
[KSampler or Sampler]
```
For workflows using both positive and negative prompts:
```
[CLIP Text Encode (Prompt)]    [CLIP Text Encode (negative prompt)]
            ↓                                  ↓
      [CONDITIONING]                    [CONDITIONING]
            ↓                                  ↓
       [KSampler: prompt / negative prompt inputs]
```
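If you drive ComfyUI through its HTTP API rather than the graph editor, the positive/negative wiring above looks roughly like the sketch below, written as a Python dict in the API-format workflow JSON (in API exports, the CLIP Text Encode (Prompt) node appears as the class_type CLIPTextEncode). The checkpoint filename, prompts, node IDs, and the default 127.0.0.1:8188 address are placeholders for your own install, so treat this as a template rather than a drop-in script.

```python
# Sketch of the positive/negative wiring in ComfyUI's API-format workflow JSON,
# written as a Python dict and submitted to a locally running server.
# Checkpoint filename, prompts, and server address are placeholders.
import json
import urllib.request

workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",          # [Load Checkpoint]
          "inputs": {"ckpt_name": "v1-5-pruned-emaonly.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",                  # positive prompt
          "inputs": {"clip": ["1", 1],                     # CLIP output of the loader
                     "text": "cyberpunk samurai with glowing katana"}},
    "3": {"class_type": "CLIPTextEncode",                  # negative prompt
          "inputs": {"clip": ["1", 1],
                     "text": "ugly, blurry, extra limbs"}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0],
                     "positive": ["2", 0],                 # CONDITIONING from node 2
                     "negative": ["3", 0],                 # CONDITIONING from node 3
                     "latent_image": ["4", 0],
                     "seed": 42, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "clip_encode_demo"}},
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",                        # default local address
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```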
✨ Prompting Tips
- Keyword Order Matters: a blue robot with wings ≠ wings with a blue robot.
- Use Art Style Tags: Things like oil painting, low-poly, cinematic lighting can drastically affect results.
- Weighting: While the base node doesn’t support weights in text directly, you can modulate prompt strength using LoRA or prompt mixing techniques.
- Negatives: Pair with CLIP Text Encode (negative prompt) to tell the model what you don’t want (e.g., “ugly, blurry, extra limbs”).
🔥 What-Not-To-Do-Unless-You-Want-a-Fire
Welcome to the section where we lovingly walk you through the most common (and catastrophic) mistakes made with this node. Do these at your own risk. Or better yet — don’t.
🚫 Mismatch the CLIP Model and Checkpoint
Why it's a problem:
CLIP embeddings are not universal. Using a CLIP model that doesn’t match your active checkpoint is like asking an Italian chef to cook sushi — wrong tools, wrong expectations, disaster imminent.
What happens:
- Weird generations
- Model ignores parts of the prompt
- Images that look like AI forgot what it was doing halfway through
Solution:
Always use the CLIP output that matches your checkpoint from the Load Checkpoint node — don’t get clever unless you really know what you’re doing.
🧪 Expect Prompt Weighting to Work Magically
Why it's a problem:
Typing something like ((cyberpunk city:1.4)) and expecting it to just "get it" will not end well.
What happens:
CLIP will treat it as a literal phrase — not weighted — so your generation might be overrun with weird syntax interpretation or simply ignore weighting cues altogether.
Solution:
Use proper conditioning combination nodes (like Combine Conditioning) or external prompt weighting techniques if you need nuanced emphasis.
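For example, here is a minimal sketch of blending two separately encoded prompts, in the same API-dict style as the workflow example earlier. The combine node’s class_type and input names shown here (ConditioningCombine, conditioning_1, conditioning_2) are my assumptions about the stock node definitions, so verify them against the node list in your own ComfyUI build.

```python
# Sketch: blend two separately encoded prompts instead of relying on in-text
# weight syntax. Same API-dict style as the earlier workflow example.
# The combine node's class_type and input names are assumptions; verify them.
emphasis_nodes = {
    "10": {"class_type": "CLIPTextEncode",
           "inputs": {"clip": ["1", 1], "text": "cyberpunk city at night"}},
    "11": {"class_type": "CLIPTextEncode",
           "inputs": {"clip": ["1", 1], "text": "neon signs, rain-slick streets"}},
    "12": {"class_type": "ConditioningCombine",           # assumed class_type
           "inputs": {"conditioning_1": ["10", 0],        # assumed input names
                      "conditioning_2": ["11", 0]}},
    # Feed ["12", 0] into the KSampler "positive" input in place of a single encode.
}
```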
💥 Assume This Node Understands Grammar Like Shakespeare
Why it's a problem:
CLIP isn’t parsing sentence structure the way humans do. Complex grammar and nested ideas confuse it.
What happens:
You get images where a “man holding a cat wearing a hat riding a horse” becomes a terrifying three-headed creature.
Solution:
Break your prompt into short, clear, comma-separated concepts. Think of it like feeding a toddler — one idea at a time.
🔧 Ignore Layer Tweaking with CLIP Set Last Layer
Why it's a problem:
If you're using CLIP Set Last Layer and don’t understand what truncating to layer 6 vs. layer 12 means, you might unknowingly sabotage your outputs.
What happens:
You get "less semantic" or "more literal" generations than expected, and you have no idea why.
Solution:
Only truncate CLIP layers if you’re customizing behavior with intent. Otherwise, leave it alone — the defaults exist for a reason.
🧩 Forget the Difference Between Positive and Negative Conditioning
Why it's a problem:
Confusing this node with its evil twin — CLIP Text Encode (negative prompt) — leads to flipped intentions.
What happens:
Your image gets worse the more you try to make it better.
Solution:
Keep your CLIP Text Encode (Prompt) for the stuff you want and its sibling for the stuff you don’t. Keep them in their lanes.
🪤 Use Unicode, Emojis, or Fancy Punctuation
Why it's a problem:
CLIP isn't winning a Unicode beauty pageant. Emojis and special characters can break tokenization.
What happens:
- The prompt becomes unrecognizable
- You get output that’s oddly irrelevant
- 🦄 suddenly turns into...a toaster?
Solution:
Stick to plain ASCII text. Keep it simple, clean, and emoji-free.
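If your prompts come from copy-and-paste-heavy sources, a tiny pre-cleaning helper like the hypothetical one below can scrub them before they ever reach the node. The function is made up for illustration and is not part of ComfyUI.

```python
# Hypothetical helper (not part of ComfyUI): strip emojis and fancy punctuation
# from a prompt before pasting it into the text field.
import unicodedata

def ascii_clean(prompt: str) -> str:
    # Normalize accented characters to their closest ASCII form, drop the rest.
    normalized = unicodedata.normalize("NFKD", prompt)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse any stray whitespace left behind by removed characters.
    return " ".join(ascii_only.split())

print(ascii_clean("café portrait 🦄, cinematic lighting"))
# -> cafe portrait , cinematic lighting
```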
Bottom line? If you want your prompt to sing and not explode into an interpretive mess, stick to compatible models, clear phrases, and intentional design. Otherwise… well, enjoy the fire. 🔥
🧪 Advanced Notes
- Layer Tweaking with CLIP Set Last Layer: You can truncate the CLIP encoding to use only up to a specific transformer layer. Lower layers favor literal/textual understanding, while higher layers emphasize more abstract/semantic interpretations; see the wiring sketch after this list.
- Reuse for Prompt Injection: Useful in scenarios where you want to encode control prompts for ControlNet or multi-modal workflows.
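As a concrete version of the layer-tweaking note above, here is a minimal API-dict sketch that routes the loader’s CLIP output through CLIP Set Last Layer before encoding. The class_type and parameter names (CLIPSetLastLayer, stop_at_clip_layer) reflect my understanding of the stock node definitions; double-check them against your ComfyUI build.

```python
# Sketch: apply CLIP Set Last Layer before encoding, same API-dict style as above.
# stop_at_clip_layer = -1 means "use all layers"; more negative values truncate
# earlier (the classic "clip skip" effect). Names are assumptions; verify them.
clip_skip_nodes = {
    "20": {"class_type": "CLIPSetLastLayer",
           "inputs": {"clip": ["1", 1],                   # CLIP output from Load Checkpoint
                      "stop_at_clip_layer": -2}},         # stop one layer early
    "21": {"class_type": "CLIPTextEncode",
           "inputs": {"clip": ["20", 0],                  # encode with the truncated CLIP
                      "text": "cyberpunk samurai with glowing katana"}},
}
```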
🧼 TL;DR
| Feature | Summary |
|---|---|
| Primary Role | Converts text to CLIP embeddings for guiding generation |
| Input | clip (CLIP model), text prompt |
| Output | CONDITIONING |
| Required for | Almost every image generation workflow |
| Supports weights? | Not natively, but pair with LoRA or Combine nodes |
| Known Issues | Checkpoint/CLIP mismatch, no sub-prompt weighting |