CLIP Vision Encode
"When your image needs to speak fluent CLIP, this is the translator."
The CLIP Vision Encode node is a powerful utility that encodes an image into CLIP Vision's latent embedding space. This allows you to leverage CLIP’s understanding of visual content for all kinds of fun and/or chaotic AI generation tasks — from style transfer and similarity search to image-to-image guidance and multi-modal workflows.
This node takes your input image, compresses it via a VAE (if needed), optionally augments it, and runs it through a CLIP Vision model to output image embeddings — both positive and negative — plus latent samples if you need to push things further.
Essential for AI artists who don’t just want pretty pixels — they want semantically meaningful ones.
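If you want a feel for what those embeddings look like outside of a node graph, here is a minimal sketch using the Hugging Face `transformers` library. This is for illustration only; the checkpoint name is an assumption, and the node itself works with whatever model the CLIP Vision Loader provides.

```python
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Example checkpoint for illustration; the node uses whatever model
# the CLIP Vision Loader hands it.
model_id = "openai/clip-vit-large-patch14"
processor = CLIPImageProcessor.from_pretrained(model_id)
model = CLIPVisionModelWithProjection.from_pretrained(model_id)

image = Image.open("reference_image.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")  # resize, center-crop, normalize
outputs = model(**inputs)

print(outputs.image_embeds.shape)  # e.g. torch.Size([1, 768]): one embedding per image
```

That single vector per image is the "semantically meaningful" representation the rest of this page keeps referring to.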
🧩 Node Inputs
| Input | Type | Description |
|---|---|---|
| clip_vision | CLIP Vision | The loaded CLIP Vision model. Must be initialized with a CLIP Vision Loader. This is the core encoder that makes the magic happen. |
| init_image | IMAGE | The image you want to encode. Garbage in, garbage embeddings out; high-quality images are highly recommended. |
| vae | VAE | Variational Autoencoder used to encode the image into latent space. Required for downstream workflows that operate in latent format. |
| width | INT | Width (in pixels) to resize the image to before encoding. Must match model expectations. Default is typically 512. |
| height | INT | Height (in pixels) to resize the image to before encoding. Use the same guidance as width. |
| video_frames | INT | (Optional) Number of frames to generate for video workflows. Only needed if you’re encoding frame sequences. |
| motion_bucket_id | INT | (Optional) Identifier used for motion-sequence conditioning in video tasks. Groups sequences together. |
| fps | INT | (Optional) Frames-per-second metadata for video playback. |
| augmentation_level | FLOAT (0.0–1.0) | Controls how much noise/augmentation is applied during encoding. Helps improve generalization; too high and you’ll just confuse the model (see the noise sketch after this table). |
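The `augmentation_level` input is easiest to picture as a controlled amount of noise blended into the image before encoding. Here is a rough sketch of that idea; the node's actual noise injection may differ in detail.

```python
import torch

def augment(image: torch.Tensor, augmentation_level: float) -> torch.Tensor:
    """Blend Gaussian noise into an image tensor (values in [0, 1]).

    This mirrors the *idea* behind augmentation_level, not the node's
    exact implementation.
    """
    return image + augmentation_level * torch.randn_like(image)

frames = torch.rand(1, 512, 512, 3)   # one image, height x width x channels
subtle = augment(frames, 0.2)         # gentle variation, embeddings stay faithful
mush = augment(frames, 1.0)           # heavy noise: the "glitchcore" zone
```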
🎯 Node Outputs
| Output | Type | Description |
|---|---|---|
| positive | LIST | Embeddings and metadata representing positive conditioning. These are used to guide generation toward your input image’s features (see the shape sketch after this table). |
| negative | LIST | Embeddings and metadata representing negative conditioning. Great for telling the model what not to do. |
| samples | LATENT | Latent-space tensor derived from the input image. Feed this into other nodes for diffusion, transformation, or image generation. |
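For orientation, here is roughly what those three outputs look like as ComfyUI-style nodes pass them around. The structure and shapes below are assumptions based on common SD-style setups (4 latent channels at 1/8 of the pixel resolution), not a guarantee for every model.

```python
import torch

# Hypothetical values for a single 512x512 input; real sizes depend on the
# CLIP Vision model and the VAE you loaded.
image_embed = torch.zeros(1, 1024)                # pooled CLIP Vision embedding
positive = [[image_embed, {}]]                    # list of (embedding, metadata-dict) pairs
negative = [[torch.zeros_like(image_embed), {}]]  # contrastive counterpart, not an "opposite"
samples = {"samples": torch.zeros(1, 4, 64, 64)}  # latent: 4 channels at 1/8 resolution
```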
🛠️ Recommended Use Cases
- Image-to-text alignment: Use embeddings to match images with prompts (see the similarity sketch after this list).
- Image-guided generation: Feed the `positive` output into workflows where you want image features to guide the result.
- Reference style matching: Encode the "vibe" of an image and apply it elsewhere.
- Training augmentation: Use `augmentation_level` to simulate variation in the same input for robust downstream training.
- Video frame encoding: Turn sequences of frames into CLIP embeddings for multi-frame workflows.
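The image-to-text alignment use case boils down to cosine similarity between image and prompt embeddings. A self-contained sketch with `transformers` follows; the checkpoint and prompts are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"  # placeholder; any CLIP checkpoint with text + vision towers works
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("reference_image.png").convert("RGB")
prompts = ["a moody forest at dusk", "a bowl of fruit on a table"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Normalize, then compare: higher cosine similarity = closer prompt.
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(image_embeds @ text_embeds.T)
```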
⚙️ Usage Tips
- Keep your `init_image` clean and high-quality. CLIP is good, not psychic.
- Use matching dimensions for `width` and `height`. Usually divisible by 8 or 64; 512x512 and 768x768 are safe bets (see the resize sketch after this list).
- For multi-modal workflows, try moderate `augmentation_level` values (0.2–0.4). Higher values inject useful noise but can distort the image’s intent.
- When using video-related inputs, ensure `video_frames` > 1 and your CLIP model can handle batch inputs.
- Not using video? Leave `video_frames`, `motion_bucket_id`, and `fps` at default or 0. They won’t hurt anything.
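To keep `width` and `height` on safe multiples, a quick pre-resize before the image ever reaches the node goes a long way. A minimal Pillow sketch, where the round-to-64 rule is the conservative assumption from the tip above:

```python
from PIL import Image

def snap_to_multiple(img: Image.Image, multiple: int = 64) -> Image.Image:
    """Resize so both sides land on the nearest non-zero multiple of `multiple`."""
    w = max(multiple, round(img.width / multiple) * multiple)
    h = max(multiple, round(img.height / multiple) * multiple)
    return img.resize((w, h), Image.Resampling.LANCZOS)

init_image = snap_to_multiple(Image.open("reference_image.png").convert("RGB"))
print(init_image.size)  # e.g. (512, 512)
```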
🔥 What-Not-To-Do-Unless-You-Want-a-Fire
- ❌ Feed in mismatched resolutions: If the width and height don’t play nicely with your CLIP model, expect crashes or distorted embeddings.
- ❌ Skip the VAE: You need it for anything latent-related. No VAE, no samples.
- ❌ Abuse augmentation: Setting `augmentation_level` to 1.0 will mangle your image into unrecognizable mush. Not ideal unless you’re trying to invent glitchcore.
- ❌ Use negative embeddings without understanding them: Negative outputs are not the opposite of positive; they’re meant for contrastive use in workflows that support them.
⚠️ Known Issues
- VRAM usage can spike if you feed in large image resolutions or lots of video frames. Resize before encoding if needed.
- Some CLIP Vision models are picky — make sure the one you loaded supports the resolution you’re using.
- Augmentation noise isn’t standardized. What’s “moderate” for one model might be “absolute chaos” for another.
🧪 Example Node Setup
{
"clip_vision": "ViT-H-14-CLIP-Vision",
"init_image": "reference_image.png",
"vae": "vae_model.safetensors",
"width": 512,
"height": 512,
"video_frames": 0,
"motion_bucket_id": 0,
"fps": 0,
"augmentation_level": 0.3
}
This setup encodes a 512x512 image into CLIP Vision space using a standard VAE with light augmentation.
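If you drive ComfyUI through its HTTP API rather than the graph editor, the same setup looks roughly like the sketch below. The node id, the `class_type` string, and the upstream link ids are hypothetical placeholders; copy the exact names from an API-format export of your own workflow.

```python
import json
import urllib.request

prompt = {
    "3": {
        "class_type": "CLIPVisionEncode",  # hypothetical: use the class name your ComfyUI install reports
        "inputs": {
            "clip_vision": ["1", 0],       # output 0 of the CLIP Vision Loader node (id "1")
            "init_image": ["2", 0],        # output 0 of an image loader node (id "2")
            "vae": ["4", 0],               # output 0 of a VAE loader node (id "4")
            "width": 512,
            "height": 512,
            "video_frames": 0,
            "motion_bucket_id": 0,
            "fps": 0,
            "augmentation_level": 0.3,
        },
    },
    # ...loader nodes "1", "2", and "4" omitted for brevity
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default local ComfyUI address; adjust to match your setup
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```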
📝 Notes
- This node plays beautifully with DualCLIP, Prompt Conditioners, and KSampler workflows that take positive/negative embeddings.
- If you're doing anything with image + text matching, this node is practically a requirement.
- This is not a text encoder. Only visual embeddings here, folks.