
CLIP Vision Encode

"When your image needs to speak fluent CLIP, this is the translator."

The CLIP Vision Encode node is a powerful utility that encodes an image into CLIP Vision's latent embedding space. This allows you to leverage CLIP’s understanding of visual content for all kinds of fun and/or chaotic AI generation tasks — from style transfer and similarity search to image-to-image guidance and multi-modal workflows.

This node takes your input image, compresses it via a VAE (if needed), optionally augments it, and runs it through a CLIP Vision model to output image embeddings — both positive and negative — plus latent samples if you need to push things further.

Essential for AI artists who don’t just want pretty pixels — they want semantically meaningful ones.
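
To make that flow concrete, here is a rough sketch in Python. Nothing below is the node's actual source: clip_vision_encode, dummy_clip, and dummy_vae are illustrative stand-ins, and the zeroed negative embedding is a common convention rather than a guarantee of this node's behavior.

```python
import torch

def clip_vision_encode(image, clip_vision, vae, augmentation_level=0.0):
    """Illustrative flow: optional noise augmentation, a CLIP Vision pass for
    conditioning embeddings, and a VAE pass for latent samples."""
    if augmentation_level > 0.0:
        # Scaled Gaussian noise as a simple stand-in for augmentation
        image = image + augmentation_level * torch.randn_like(image)

    embedding = clip_vision(image)                  # image embedding used for conditioning
    positive = [(embedding, {})]                    # (embedding, metadata) pairs
    negative = [(torch.zeros_like(embedding), {})]  # zeroed negative is an assumption
    samples = {"samples": vae(image)}               # latent tensor for downstream nodes
    return positive, negative, samples

# Toy stand-ins so the sketch runs; real models come from loader nodes
dummy_clip = lambda img: img.mean(dim=(1, 2))       # (B, H, W, C) -> (B, C)
dummy_vae = lambda img: img.permute(0, 3, 1, 2)     # pretend "latent"
positive, negative, samples = clip_vision_encode(
    torch.rand(1, 512, 512, 3), dummy_clip, dummy_vae, augmentation_level=0.3
)
```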


🧩 Node Inputs

| Input | Type | Description |
| --- | --- | --- |
| clip_vision | CLIP Vision | The loaded CLIP Vision model. Must be initialized with a CLIP Vision Loader. This is the core encoder that makes the magic happen. |
| init_image | IMAGE | The image you want to encode. Garbage in, garbage embeddings out; high-quality images are highly recommended (see the loading sketch after this table). |
| vae | VAE | Variational Autoencoder used to encode the image into latent space. Required for downstream workflows that operate in latent format. |
| width | INT | Width (in pixels) to resize the image to before encoding. Must match model expectations. Default is typically 512. |
| height | INT | Height (in pixels) to resize the image to before encoding. Use the same guidance as width. |
| video_frames | INT | (Optional) Number of frames to generate for video workflows. Only needed if you're encoding frame sequences. |
| motion_bucket_id | INT | (Optional) Identifier used for motion-sequence conditioning in video tasks. Groups sequences together. |
| fps | INT | (Optional) Frames-per-second metadata for video playback. |
| augmentation_level | FLOAT (0.0–1.0) | Controls how much noise/augmentation is applied during encoding. Helps improve generalization. Too high and you'll just confuse the model. |
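
For illustration, one way to load and prepare init_image at the requested width and height is sketched below. The batch-first, [0, 1] float layout is a common convention for IMAGE tensors and an assumption here, not a guarantee of this node's exact format; load_init_image is a hypothetical helper.

```python
import numpy as np
import torch
from PIL import Image

def load_init_image(path: str, width: int = 512, height: int = 512) -> torch.Tensor:
    """Load an image, resize it to (width, height), and convert it to a
    float tensor in [0, 1] with shape (1, height, width, 3)."""
    img = Image.open(path).convert("RGB").resize((width, height), Image.LANCZOS)
    arr = np.asarray(img).astype(np.float32) / 255.0
    return torch.from_numpy(arr).unsqueeze(0)  # add a batch dimension

init_image = load_init_image("reference_image.png", width=512, height=512)
print(init_image.shape)  # torch.Size([1, 512, 512, 3])
```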

🎯 Node Outputs

| Output | Type | Description |
| --- | --- | --- |
| positive | LIST | Embeddings and metadata representing positive conditioning. These are used to guide generation toward your input image's features. |
| negative | LIST | Embeddings and metadata representing negative conditioning. Great for telling the model what not to do. |
| samples | LATENT | Latent-space tensor derived from the input image. Feed this into other nodes for diffusion, transformation, or image generation. |

💡 Use Cases

  • Image-to-text alignment: Use embeddings to match images with prompts (a similarity sketch follows this list).
  • Image-guided generation: Feed the positive output into workflows where you want image features to guide the result.
  • Reference style matching: Encode the "vibe" of an image and apply it elsewhere.
  • Training augmentation: Use augmentation_level to simulate variation in the same input for robust downstream training.
  • Video frame encoding: Turn sequences of frames into CLIP embeddings for multi-frame workflows.
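
As a concrete example of the image-to-text alignment use case, the sketch below scores an image against a few prompts using the Hugging Face transformers CLIP implementation. The checkpoint name and prompts are arbitrary examples; this runs outside the node graph and only illustrates what CLIP-space similarity looks like.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("reference_image.png").convert("RGB")
prompts = ["a watercolor landscape", "a photo of a cat"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities: shape (num_images, num_prompts)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```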

⚙️ Usage Tips

  • Keep your init_image clean and high-quality. CLIP is good, not psychic.
  • Use matching dimensions for width and height, ideally divisible by 8 or 64; 512x512 and 768x768 are safe bets (see the sanity-check sketch after these tips).
  • For multi-modal workflows, try moderate augmentation_level values (0.2–0.4). Higher values inject useful noise but can distort the image’s intent.
  • When using video-related inputs, ensure video_frames > 1 and your CLIP model can handle batch inputs.
  • Not using video? Leave video_frames, motion_bucket_id, and fps at default or 0. They won’t hurt anything.
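
A small sanity-check helper along the lines of these tips might look like the following sketch. The thresholds, defaults, and messages are illustrative assumptions, not values enforced by the node.

```python
def check_settings(width: int, height: int,
                   augmentation_level: float,
                   video_frames: int = 0) -> list[str]:
    """Return a list of warnings for settings that commonly cause trouble."""
    warnings = []
    if width % 64 or height % 64:
        warnings.append("width/height are not multiples of 64 (512x512 or 768x768 are safe bets)")
    if not 0.0 <= augmentation_level <= 1.0:
        warnings.append("augmentation_level should stay within 0.0-1.0")
    elif augmentation_level > 0.4:
        warnings.append("augmentation_level above ~0.4 may distort the image's intent")
    if video_frames == 1:
        warnings.append("video_frames should be 0 (disabled) or greater than 1")
    return warnings

print(check_settings(500, 512, 0.9))
```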

🔥 What-Not-To-Do-Unless-You-Want-a-Fire

  • ❌ Feed in mismatched resolutions: If the width and height don’t play nicely with your CLIP model, expect crashes or distorted embeddings.
  • ❌ Skip the VAE: You need it for anything latent-related. No VAE, no samples.
  • ❌ Abuse augmentation: Setting augmentation_level to 1.0 will mangle your image into unrecognizable mush. Not ideal unless you’re trying to invent glitchcore.
  • ❌ Use negative embeddings without understanding them: Negative outputs are not the opposite of positive; they're meant for contrastive use in workflows that support them (see the guidance sketch below).
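
For context on that last point, positive and negative conditioning are typically combined contrastively, classifier-free-guidance style, rather than treated as opposites. The sketch below shows the standard guidance formula with placeholder tensors; guidance_scale and the tensor shapes are arbitrary examples, not values tied to this node.

```python
import torch

def guided_prediction(pred_positive: torch.Tensor,
                      pred_negative: torch.Tensor,
                      guidance_scale: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance: push the result away from the negative
    prediction and toward the positive one, scaled by guidance_scale."""
    return pred_negative + guidance_scale * (pred_positive - pred_negative)

# Placeholder model outputs conditioned on the positive / negative embeddings
pred_pos = torch.randn(1, 4, 64, 64)
pred_neg = torch.randn(1, 4, 64, 64)
guided = guided_prediction(pred_pos, pred_neg)
```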

⚠️ Known Issues

  • VRAM usage can spike if you feed in large image resolutions or lots of video frames. Resize before encoding, or process frames in smaller chunks (see the sketch after this list).
  • Some CLIP Vision models are picky — make sure the one you loaded supports the resolution you’re using.
  • Augmentation noise isn’t standardized. What’s “moderate” for one model might be “absolute chaos” for another.
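
If VRAM is tight, one mitigation is to encode frames in small chunks rather than one big batch, as in this sketch. encode_in_chunks and the stand-in encoder are hypothetical helpers, not part of the node.

```python
import torch

def encode_in_chunks(frames: torch.Tensor, encode_fn, chunk_size: int = 8) -> torch.Tensor:
    """Encode an (N, C, H, W) batch of frames in small chunks to keep peak VRAM low."""
    outputs = []
    with torch.no_grad():
        for chunk in frames.split(chunk_size, dim=0):
            outputs.append(encode_fn(chunk))
    return torch.cat(outputs, dim=0)

# Placeholder encoder standing in for a CLIP Vision forward pass
fake_encoder = lambda x: x.mean(dim=(2, 3))
frames = torch.randn(32, 3, 224, 224)
embeddings = encode_in_chunks(frames, fake_encoder)
print(embeddings.shape)  # torch.Size([32, 3])
```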

🧪 Example Node Setup

```json
{
  "clip_vision": "ViT-H-14-CLIP-Vision",
  "init_image": "reference_image.png",
  "vae": "vae_model.safetensors",
  "width": 512,
  "height": 512,
  "video_frames": 0,
  "motion_bucket_id": 0,
  "fps": 0,
  "augmentation_level": 0.3
}
```

This setup encodes a 512x512 image into CLIP Vision space using a standard VAE with light augmentation.

📝 Notes

  • This node plays beautifully with DualCLIP, Prompt Conditioners, and KSampler workflows that take positive/negative embeddings.
  • If you're doing anything with image + text matching, this node is practically a requirement.
  • This is not a text encoder. Only visual embeddings here, folks.