Computer Vision
Definition
AI that can understand and analyze images and video content.
Why It Matters
Text-only AI can't help when the data is a photo, a video frame, a scanned form, or a screenshot. Computer vision is the prerequisite for everything that touches pixels: OCR, captioning, segmentation, object detection, and image generation that respects layout.
Key Points
- CNNs dominated computer vision until 2021; Vision Transformers (ViTs) now match or beat them on most classification and detection tasks.
- ImageNet Top-1 accuracy has risen from about 63 % (AlexNet, 2012) to about 90 % (EVA, 2023) on the same benchmark.
- Object detection (YOLO, DETR) outputs bounding boxes + class labels; instance segmentation adds per-pixel masks for each detected object.
- Multimodal LLMs combine a vision encoder (like CLIP) with an LLM decoder. This design is documented for open models such as Qwen-VL and is generally assumed for GPT-4V, whose architecture has not been published.
- Video understanding adds a temporal axis: models either process sampled frames independently or use 3D convolutions / video transformers to reason across time.
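The output shapes and the frame-sampling idea from the points above can be sketched in a few lines of Python. This is a minimal illustration, not the format of any particular library: the `Detection` class, `mask_area`, and `sample_frame_indices` are hypothetical names chosen here, and real detectors such as YOLO or DETR define their own result types.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; real detectors use their own formats.
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    """One object-detection result: a bounding box plus a class label."""
    box: Box
    label: str
    score: float  # model confidence in [0, 1]
    # Instance segmentation adds a per-pixel mask; None for plain detection.
    mask: Optional[List[List[bool]]] = None


def mask_area(det: Detection) -> int:
    """Count the pixels that belong to the object, if a mask is present."""
    if det.mask is None:
        raise ValueError("plain detection has no per-pixel mask")
    return sum(sum(1 for px in row if px) for row in det.mask)


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices, one from the middle of each segment."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# Plain detection: box + label only.
car = Detection(box=(10.0, 20.0, 110.0, 80.0), label="car", score=0.93)

# Instance segmentation: the same fields plus a (tiny, 2x2) per-pixel mask.
car_with_mask = Detection(
    box=(0.0, 0.0, 2.0, 2.0),
    label="car",
    score=0.93,
    mask=[[True, False], [True, True]],
)
print(mask_area(car_with_mask))        # 3

# Sample 4 frames from a 100-frame clip for frame-by-frame processing.
print(sample_frame_indices(100, 4))    # [12, 37, 62, 87]
```

Sampling frames independently like this is the simpler of the two video strategies; models that reason across time (3D convolutions, video transformers) would consume the sampled frames as one stacked input instead of one at a time.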
Example
A computer-vision model can look at a photo of a parking meter and output "no parking 8am-6pm Mon-Fri." The same architecture, trained on different labels, can spot a tumor in an MRI scan or count cars in satellite imagery.
Common Misconception
A model that can generate images does not necessarily understand them. Image generation and image understanding are different tasks, typically built on different architectures: a diffusion model that generates photorealistic portraits cannot answer questions about those portraits. For understanding you need a vision-language model (VLM), not a generative model.
Related Terms
- Multimodal AI: AI models that can process multiple types of input, such as text, images, audio, and video.
- OCR (Optical Character Recognition): AI technology that extracts text from images, PDFs, and scanned documents.
- Diffusion Model: An AI image generation technique that starts with noise and gradually refines it into a coherent image. Used by FLUX and Stable Diffusion.
Computer Vision on Rewind.ai
Rewind.ai's image, video and OCR tools all sit on top of vision encoders. The "describe this image" button in chat uses the same model family as the OCR tool.
Quick Facts
| Term | Computer Vision |
| Related | Multimodal AI, OCR (Optical Character Recognition), Diffusion Model |