Chapter 5

Vision Specialists: CNNs, GANs, and Diffusion

The architectures built for visual data. Understanding when to analyze images versus generate them.

10 min read

Three Tools for Visual AI

When working with images and video, three architectures dominate: CNNs for analysis, GANs for generation, and Diffusion Models as the newer approach to generation. Understanding their differences is crucial for choosing the right tool.

The Core Distinction

CNNs: "What's in this image?" (Analysis)
GANs: "Create an image like this" (Generation via competition)
Diffusion: "Refine noise into an image" (Generation via denoising)

CNNs: Understanding Images

Convolutional Neural Networks are designed to analyze and understand visual data. They use convolutional layers—filters that scan across images detecting patterns like edges, shapes, and textures.
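
To make that concrete, here is a minimal sketch of what a single convolutional filter does, in plain Python with NumPy. The 3x3 edge-detecting kernel and the tiny 8x8 test image are illustrative assumptions, not components of any production CNN:

```python
# A minimal sketch of one convolutional filter: slide a small kernel over
# the image and compute a weighted sum at each position.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Sobel-style kernel that responds strongly to vertical edges.
edge_kernel = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0            # left half dark, right half bright
edges = convolve2d(image, edge_kernel)
print(edges)                  # large values exactly where the boundary sits
```

Real CNNs learn their filter weights from data rather than hard-coding them; the example only shows the sliding-window mechanic that every convolutional layer shares.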

How CNNs See

Layer 1: Detects edges and simple patterns
Layer 2: Combines edges into shapes
Layer 3: Recognizes parts (eyes, wheels, corners)
Layer 4+: Identifies objects and scenes

Each layer builds on the previous, creating hierarchical understanding.
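
As a rough sketch of that hierarchy, here is a small image classifier in PyTorch. The channel counts, input size, and ten output classes are illustrative assumptions, not a reference architecture:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # layer 1: edges, simple patterns
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # layer 2: shapes from edges
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # layer 3: object parts
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),                            # layer 4+: whole-object classes
)

logits = cnn(torch.randn(1, 3, 64, 64))  # one random 64x64 RGB "image"
print(logits.shape)                      # torch.Size([1, 10])
```

Notice how the channel count grows (16 to 32 to 64) while the spatial resolution shrinks: each stage trades pixel detail for increasingly abstract features, mirroring the layer-by-layer progression above.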

CNN Use Case         | Example                      | Industry
Image classification | Cat vs dog, tumor detection  | Healthcare, consumer
Object detection     | Find all cars in photo       | Autonomous vehicles
Facial recognition   | Unlock phone, security       | Security, consumer
Document analysis    | OCR, form processing         | Finance, legal
Quality inspection   | Detect manufacturing defects | Manufacturing

GANs: Creating Images

Generative Adversarial Networks take a completely different approach. Instead of analyzing images, they generate new ones from scratch using two competing networks.

The Adversarial Game

Generator: Creates fake images from random noise
Discriminator: Tries to distinguish real from fake

They train together in competition: the generator gets better at creating convincing fakes, and the discriminator gets better at catching them. Eventually the generator wins, producing images the discriminator can no longer reliably tell apart from real ones.
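
Here is a minimal sketch of one round of that game in PyTorch. The tiny fully connected networks, the 16-dimensional noise vector, and the flattened 784-pixel images are all illustrative assumptions; real GANs typically use convolutional generators and discriminators:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                       # real: (batch, 784) images in [-1, 1]
    batch = real.size(0)

    # Discriminator: label real images 1, generated images 0.
    fake = G(torch.randn(batch, 16)).detach()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to make the discriminator output 1 on fakes.
    fake = G(torch.randn(batch, 16))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.rand(8, 784) * 2 - 1))  # one step on random stand-in data
```

The design detail worth noticing is the detach() call: while updating the discriminator, gradients must not flow back into the generator, since each network is optimized against the other's current best effort.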

GAN Use Case      | Example                  | Notable Project
Face generation   | Synthetic portraits      | ThisPersonDoesNotExist
Style transfer    | Photo to painting        | Prisma app
Image enhancement | Upscaling, restoration   | NVIDIA DLSS
Deepfakes         | Face swapping            | Various (controversial)
Synthetic data    | Training data generation | Enterprise AI

Diffusion Models: The New Standard

Diffusion Models are the newest approach, now powering DALL-E 3, Midjourney, and Stable Diffusion.

How Diffusion Works

Training: Learn to reverse a gradual noising process: given a noisy image, predict the less noisy version (in practice, predict the noise that was added)
Generation: Start with pure noise, iteratively denoise until coherent image emerges

Think of it like sculpting: start with a rough block (noise) and gradually refine it into a detailed sculpture (image).
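
Here is a heavily simplified, DDPM-style sketch of both phases on a toy 64-value "image". The linear noise schedule, the 100 steps, and the tiny MLP noise predictor are assumptions for illustration; real systems use U-Net predictors, tuned schedules, and conditioning on text prompts:

```python
import torch
import torch.nn as nn

T = 100                                   # number of noise steps (assumption)
betas = torch.linspace(1e-4, 0.02, T)     # noise schedule
alphas_bar = torch.cumprod(1 - betas, 0)  # cumulative signal retention

# Tiny noise predictor: input is the noisy image plus a normalized timestep.
model = nn.Sequential(nn.Linear(65, 128), nn.ReLU(), nn.Linear(128, 64))

def training_step(x0, opt):
    """Learn to predict the noise that was added at a random step t."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise       # forward: add noise
    inp = torch.cat([x_t, t.float().unsqueeze(1) / T], dim=1)
    loss = ((model(inp) - noise) ** 2).mean()          # predict that noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample():
    """Generation: start from pure noise and denoise step by step."""
    x = torch.randn(1, 64)
    for t in reversed(range(T)):
        inp = torch.cat([x, torch.tensor([[t / T]])], dim=1)
        eps = model(inp)                               # predicted noise
        a, ab = 1 - betas[t], alphas_bar[t]
        x = (x - betas[t] / (1 - ab).sqrt() * eps) / a.sqrt()
        if t > 0:                                      # re-inject a little noise
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
print(training_step(torch.randn(8, 64), opt))  # loss on random stand-in data
print(sample().shape)                          # torch.Size([1, 64])
```

Note how generation is just the training objective run in reverse: the network's noise estimate is subtracted out a little at a time, one hundred small steps from static to structure.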

Why Diffusion Won

Stability: Much easier to train than GANs; no mode collapse, where a generator gets stuck producing only a narrow slice of possible outputs
Quality: Produces higher quality, more diverse outputs
Control: Better at following text prompts
Tools: DALL-E 3, Midjourney, Stable Diffusion all use diffusion

Quick Decision Guide

Task                    | Best Architecture        | Why
Analyze images          | CNN                      | Designed for understanding
Generate images         | Diffusion                | Higher quality, more stable
Generate video          | GAN                      | Still leads for motion
Synthetic training data | GAN or specialized tools | Fast, controllable
Real-time enhancement   | GAN                      | Faster inference