
CLIP Vision Encode

"When your image needs to speak fluent CLIP, this is the translator."

The CLIP Vision Encode node is a powerful utility that encodes an image into CLIP Vision's latent embedding space. This allows you to leverage CLIP’s understanding of visual content for all kinds of fun and/or chaotic AI generation tasks — from style transfer and similarity search to image-to-image guidance and multi-modal workflows.

This node takes your input image, compresses it via a VAE (if needed), optionally augments it, and runs it through a CLIP Vision model to output image embeddings — both positive and negative — plus latent samples if you need to push things further.

Essential for AI artists who don’t just want pretty pixels — they want semantically meaningful ones.
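
To make that flow concrete, here is a rough sketch in Python. Nothing below is the node's actual source: clip_vision_encode, dummy_clip, and dummy_vae are illustrative stand-ins, and the zeroed negative embedding is a common convention rather than a guarantee of this node's behavior.

```python
import torch

def clip_vision_encode(image, clip_vision, vae, augmentation_level=0.0):
    """Illustrative flow: optional noise augmentation, a CLIP Vision pass for
    conditioning embeddings, and a VAE pass for latent samples."""
    if augmentation_level > 0.0:
        # Scaled Gaussian noise as a simple stand-in for augmentation
        image = image + augmentation_level * torch.randn_like(image)

    embedding = clip_vision(image)                  # image embedding used for conditioning
    positive = [(embedding, {})]                    # (embedding, metadata) pairs
    negative = [(torch.zeros_like(embedding), {})]  # zeroed negative is an assumption
    samples = {"samples": vae(image)}               # latent tensor for downstream nodes
    return positive, negative, samples

# Toy stand-ins so the sketch runs; real models come from loader nodes
dummy_clip = lambda img: img.mean(dim=(1, 2))       # (B, H, W, C) -> (B, C)
dummy_vae = lambda img: img.permute(0, 3, 1, 2)     # pretend "latent"
positive, negative, samples = clip_vision_encode(
    torch.rand(1, 512, 512, 3), dummy_clip, dummy_vae, augmentation_level=0.3
)
```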


🧩 Node Inputs

| Input | Type | Description |
| --- | --- | --- |
| clip_vision | CLIP Vision | The loaded CLIP Vision model. Must be initialized with a CLIP Vision Loader. This is the core encoder that makes the magic happen. |
| init_image | IMAGE | The image you want to encode. Garbage in, garbage embeddings out; high-quality images are highly recommended (see the loading sketch after this table). |
| vae | VAE | Variational Autoencoder used to encode the image into latent space. Required for downstream workflows that operate in latent format. |
| width | INT | Width (in pixels) to resize the image to before encoding. Must match model expectations. Default is typically 512. |
| height | INT | Height (in pixels) to resize the image to before encoding. Use the same guidance as width. |
| video_frames | INT | (Optional) Number of frames to generate for video workflows. Only needed if you're encoding frame sequences. |
| motion_bucket_id | INT | (Optional) Identifier used for motion-sequence conditioning in video tasks. Groups sequences together. |
| fps | INT | (Optional) Frames-per-second metadata for video playback. |
| augmentation_level | FLOAT (0.0–1.0) | Controls how much noise/augmentation is applied during encoding. Helps improve generalization. Too high and you'll just confuse the model. |
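
For illustration, one way to load and prepare init_image at the requested width and height is sketched below. The batch-first, [0, 1] float layout is a common convention for IMAGE tensors and an assumption here, not a guarantee of this node's exact format; load_init_image is a hypothetical helper.

```python
import numpy as np
import torch
from PIL import Image

def load_init_image(path: str, width: int = 512, height: int = 512) -> torch.Tensor:
    """Load an image, resize it to (width, height), and convert it to a
    float tensor in [0, 1] with shape (1, height, width, 3)."""
    img = Image.open(path).convert("RGB").resize((width, height), Image.LANCZOS)
    arr = np.asarray(img).astype(np.float32) / 255.0
    return torch.from_numpy(arr).unsqueeze(0)  # add a batch dimension

init_image = load_init_image("reference_image.png", width=512, height=512)
print(init_image.shape)  # torch.Size([1, 512, 512, 3])
```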

🎯 Node Outputs

| Output | Type | Description |
| --- | --- | --- |
| positive | LIST | Embeddings and metadata representing positive conditioning. These are used to guide generation toward your input image's features. |
| negative | LIST | Embeddings and metadata representing negative conditioning. Great for telling the model what not to do. |
| samples | LATENT | Latent-space tensor derived from the input image. Feed this into other nodes for diffusion, transformation, or image generation. |

💡 Use Cases

  • Image-to-text alignment: Use embeddings to match images with prompts (a similarity sketch follows this list).
  • Image-guided generation: Feed the positive output into workflows where you want image features to guide the result.
  • Reference style matching: Encode the "vibe" of an image and apply it elsewhere.
  • Training augmentation: Use augmentation_level to simulate variation in the same input for robust downstream training.
  • Video frame encoding: Turn sequences of frames into CLIP embeddings for multi-frame workflows.
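
As a concrete example of the image-to-text alignment use case, the sketch below scores an image against a few prompts using the Hugging Face transformers CLIP implementation. The checkpoint name and prompts are arbitrary examples; this runs outside the node graph and only illustrates what CLIP-space similarity looks like.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("reference_image.png").convert("RGB")
prompts = ["a watercolor landscape", "a photo of a cat"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities: shape (num_images, num_prompts)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```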

⚙️ Usage Tips

  • Keep your init_image clean and high-quality. CLIP is good, not psychic.
  • Use matching dimensions for width and height, ideally divisible by 8 or 64; 512x512 and 768x768 are safe bets (see the sanity-check sketch after these tips).
  • For multi-modal workflows, try moderate augmentation_level values (0.2–0.4). Higher values inject useful noise but can distort the image’s intent.
  • When using video-related inputs, ensure video_frames > 1 and your CLIP model can handle batch inputs.
  • Not using video? Leave video_frames, motion_bucket_id, and fps at default or 0. They won’t hurt anything.
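
A small sanity-check helper along the lines of these tips might look like the following sketch. The thresholds, defaults, and messages are illustrative assumptions, not values enforced by the node.

```python
def check_settings(width: int, height: int,
                   augmentation_level: float,
                   video_frames: int = 0) -> list[str]:
    """Return a list of warnings for settings that commonly cause trouble."""
    warnings = []
    if width % 64 or height % 64:
        warnings.append("width/height are not multiples of 64 (512x512 or 768x768 are safe bets)")
    if not 0.0 <= augmentation_level <= 1.0:
        warnings.append("augmentation_level should stay within 0.0-1.0")
    elif augmentation_level > 0.4:
        warnings.append("augmentation_level above ~0.4 may distort the image's intent")
    if video_frames == 1:
        warnings.append("video_frames should be 0 (disabled) or greater than 1")
    return warnings

print(check_settings(500, 512, 0.9))
```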

🔥 What-Not-To-Do-Unless-You-Want-a-Fire

  • ❌ Feed in mismatched resolutions: If the width and height don’t play nicely with your CLIP model, expect crashes or distorted embeddings.
  • ❌ Skip the VAE: You need it for anything latent-related. No VAE, no samples.
  • ❌ Abuse augmentation: Setting augmentation_level to 1.0 will mangle your image into unrecognizable mush. Not ideal unless you’re trying to invent glitchcore.
  • ❌ Use negative embeddings without understanding them: Negative outputs are not the opposite of positive; they're meant for contrastive use in workflows that support them (see the guidance sketch below).
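
For context on that last point, positive and negative conditioning are typically combined contrastively, classifier-free-guidance style, rather than treated as opposites. The sketch below shows the standard guidance formula with placeholder tensors; guidance_scale and the tensor shapes are arbitrary examples, not values tied to this node.

```python
import torch

def guided_prediction(pred_positive: torch.Tensor,
                      pred_negative: torch.Tensor,
                      guidance_scale: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance: push the result away from the negative
    prediction and toward the positive one, scaled by guidance_scale."""
    return pred_negative + guidance_scale * (pred_positive - pred_negative)

# Placeholder model outputs conditioned on the positive / negative embeddings
pred_pos = torch.randn(1, 4, 64, 64)
pred_neg = torch.randn(1, 4, 64, 64)
guided = guided_prediction(pred_pos, pred_neg)
```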

⚠️ Known Issues

  • VRAM usage can spike if you feed in large image resolutions or lots of video frames. Resize before encoding, or process frames in smaller chunks (see the sketch after this list).
  • Some CLIP Vision models are picky — make sure the one you loaded supports the resolution you’re using.
  • Augmentation noise isn’t standardized. What’s “moderate” for one model might be “absolute chaos” for another.
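
If VRAM is tight, one mitigation is to encode frames in small chunks rather than one big batch, as in this sketch. encode_in_chunks and the stand-in encoder are hypothetical helpers, not part of the node.

```python
import torch

def encode_in_chunks(frames: torch.Tensor, encode_fn, chunk_size: int = 8) -> torch.Tensor:
    """Encode an (N, C, H, W) batch of frames in small chunks to keep peak VRAM low."""
    outputs = []
    with torch.no_grad():
        for chunk in frames.split(chunk_size, dim=0):
            outputs.append(encode_fn(chunk))
    return torch.cat(outputs, dim=0)

# Placeholder encoder standing in for a CLIP Vision forward pass
fake_encoder = lambda x: x.mean(dim=(2, 3))
frames = torch.randn(32, 3, 224, 224)
embeddings = encode_in_chunks(frames, fake_encoder)
print(embeddings.shape)  # torch.Size([32, 3])
```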

🧪 Example Node Setup

```json
{
  "clip_vision": "ViT-H-14-CLIP-Vision",
  "init_image": "reference_image.png",
  "vae": "vae_model.safetensors",
  "width": 512,
  "height": 512,
  "video_frames": 0,
  "motion_bucket_id": 0,
  "fps": 0,
  "augmentation_level": 0.3
}
```

This setup encodes a 512x512 image into CLIP Vision space using a standard VAE with light augmentation.

📝 Notes

  • This node plays beautifully with DualCLIP, Prompt Conditioners, and KSampler workflows that take positive/negative embeddings.
  • If you're doing anything with image + text matching, this node is practically a requirement.
  • This is not a text encoder. Only visual embeddings here, folks.