I was reading the AMD Neural Texture Block Compression paper (NTBC, arXiv 2407.09543). Two paragraphs in, it mentioned BC1 as the baseline it was trying to beat. The engineer in me couldn't stop going down the rabbit hole, so I spent a whole day trying to understand every bit of what BC1 does: how it encodes colors, why it's cache friendly, and the math behind it.
What I found: nearly every texture you've ever seen in a game was never stored as real colors. Not a single pixel. Your GPU reconstructs approximate colors from two numbers and a 2-bit index, billions of times per frame, in hardware. It has been doing this since 1998! What!!
This post derives BC1 from scratch. Starting from raw texture size, working down to the exact bit layout.
The problem
A raw texture stores each pixel as four channels: red, green, blue, alpha. Each channel is 8 bits. That's 32 bits per pixel, or 4 bytes per pixel. RGBA8888.
A 4K texture (3840x2160) at that rate is about 32MB. A modern game ships thousands of textures. VRAM is fast, but the GPU's texture cache is small. If your texture data doesn't fit in cache, you pay a bandwidth penalty on every single sample. Every frame. For every visible surface.
The obvious answer is compression. But we can't just zip a texture: a zip stream has no random access, and the data has to stay compressed in VRAM while the GPU decompresses individual pixels during rendering, in hardware, with zero pipeline stall. We need a formula the texture unit can execute in one clock cycle.
BC1 is that formula. Let's derive it.
Step 1: How big is the raw data?
BC1 works on 4x4 pixel blocks. The entire texture is tiled into these blocks and each one is compressed independently.
A single 4x4 block:
each pixel = RGBA = 8+8+8+8 = 32 bits = 4 bytes
16 x 4 = 64 bytes per block
Step 2: Set the compression target
The goal for BC1 is 8x compression: enough to meaningfully relieve VRAM and cache pressure while keeping the format simple enough to decode in hardware in one cycle.
everything BC1 stores for a 4x4 block must fit in 8 bytes
8 bytes. 64 bits. That's the constraint everything else is built on.
Step 3: Spend 4 bytes on two colors
Here's the key design insight. Instead of storing all 16 pixel colors, store just two: the endpoint colors that define the range of colors in this block. Call them C0 and C1. Within a tiny 4x4 block, neighboring pixels tend to be similar, so two endpoints can represent the whole range well.
Each endpoint is stored in RGB565 format: 5 bits for red, 6 bits for green, 5 bits for blue. 16 bits total. Two endpoints: 32 bits = 4 bytes.
C0: RGB565 = 5+6+5 = 16 bits = 2 bytes
C1: RGB565 = 5+6+5 = 16 bits = 2 bytes
total for endpoints = 4 bytes
remaining from budget: 8 - 4 = 4 bytes left
Why RGB565 and not RGB888? Because you need to fit two colors in 4 bytes. 2 bytes per color gets you there, and 565 is the layout that maps cleanly to 16 bits, with DirectX hardware already built around it. Green gets the extra bit because the eye is most sensitive to green.
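As a sketch, packing and unpacking RGB565 is just shifts and masks. On decode, replicating the high bits into the low bits (rather than shifting in zeros) is the usual way to stretch 5- and 6-bit values back over the full 0-255 range:

```python
def pack_rgb565(r: int, g: int, b: int) -> int:
    """Quantize 8-bit RGB channels down to 5/6/5 bits and pack into 16 bits."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_rgb565(c: int) -> tuple[int, int, int]:
    """Expand a 16-bit RGB565 value back to 8-bit channels (bit replication)."""
    r = (c >> 11) & 0x1F
    g = (c >> 5) & 0x3F
    b = c & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

c0 = pack_rgb565(255, 128, 40)
print(hex(c0), unpack_rgb565(c0))  # → 0xfc05 (255, 130, 41)
```

Note the round trip isn't exact (128 comes back as 130): quantizing the endpoints to 565 is itself a small, separate source of loss.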
Step 4: The remaining 4 bytes force everything
4 bytes left. 16 pixels to describe.
32 bits / 16 pixels = 2 bits per pixel
2 bits = 2^2 = 4 possible values: 00, 01, 10, 11
Each pixel gets 2 bits. Those 2 bits can represent exactly 4 states. It's the only budget we have. You wanted 8x compression, you spent 4 bytes on endpoints, the remaining 4 bytes give you exactly 2 bits per pixel.
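Those 16 two-bit indices pack into a single 32-bit word. A sketch, assuming pixel 0 (row-major) sits in the least-significant bits, which is how the DDS block layout stores them:

```python
def pack_indices(indices):
    """Pack 16 two-bit indices (row-major, pixel 0 in the low bits) into 32 bits."""
    word = 0
    for i, idx in enumerate(indices):
        word |= (idx & 0b11) << (2 * i)
    return word

def index_of(word, pixel):
    """Read one pixel's 2-bit index back out of the packed word."""
    return (word >> (2 * pixel)) & 0b11

word = pack_indices([3, 0, 1, 2] * 4)  # 4 rows of the same pattern
print(index_of(word, 0), index_of(word, 2))  # → 3 1
```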
Step 5: What do the 4 possible values mean?
Each pixel's 2 bits are an index: 00, 01, 10, or 11, which is 0, 1, 2, 3. They are positions on a line.
C0 and C1 define a line segment in RGB color space. C0 is a point in 3D (R, G, B). C1 is another point. Between them is a continuous range of colors. The 4 index values divide that range into 3 equal segments:
index 0 → t = 0/3 = 0.000 → C0 (stored)
index 1 → t = 1/3 = 0.333 → lerp(C0, C1, 1/3) (derived)
index 2 → t = 2/3 = 0.667 → lerp(C0, C1, 2/3) (derived)
index 3 → t = 3/3 = 1.000 → C1 (stored)
Only C0 and C1 are stored. The middle two colors are never written to disk or VRAM. The GPU always computes them on the fly from C0 and C1. Storing them would waste bytes on values perfectly predictable from what you already have.
The lerp is a parametric line equation. Same formula as basic geometry, just in 3D RGB space:
C(t) = C0 + t x (C1 - C0)
which expands per channel to:
R = R0 + t x (R1 - R0)
G = G0 + t x (G1 - G0)
B = B0 + t x (B1 - B0)
One scalar t moves you along all three channels simultaneously. That's why two endpoints are enough to describe an entire color range. One number drives three lerps in parallel. The index is just that scalar t, quantized down to 2 bits.
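Using this post's index → t mapping, the whole four-color palette falls out of a few lines. (Real BC1 hardware stores the palette in a slightly different index order, with index 1 = C1, but the math is the same.)

```python
def lerp(c0, c1, t):
    # Parametric line: one scalar t drives all three channels at once.
    return tuple(round(a + t * (b - a)) for a, b in zip(c0, c1))

def palette(c0, c1):
    """The four colors a block can use: two stored, two derived on the fly."""
    return [c0, lerp(c0, c1, 1/3), lerp(c0, c1, 2/3), c1]

print(palette((40, 0, 0), (80, 0, 0)))
# → [(40, 0, 0), (53, 0, 0), (67, 0, 0), (80, 0, 0)]
```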
Step 6: Why not just quantize each pixel directly?
A reasonable question at this point: why not skip the endpoints entirely? Give each pixel 4 bits, quantize it to one of 16 fixed levels, and call it done. Same byte count. Simpler.
The problem is what "16 fixed levels" means. If the levels are spread evenly from 0 to 255, each step covers about 17 intensity units. Now consider a block that's all dark reds. Every pixel between intensity 40 and 80. How many of your 16 fixed steps land in that range?
Two, maybe three. Every pixel gets rounded to one of two or three options. That's a coarse approximation of a smooth gradient. Visible banding.
The endpoints fix this by describing the range first. C0 = intensity 40, C1 = intensity 80. Now your 4 steps subdivide just that range:
naive (16 fixed levels): step size = 255/15 = 17 units, only 2 or 3 land in your range
BC1 (adaptive per block): step size = (C1-C0) / 3 ~ 13 units, all 4 land in your range
BC1 isn't just smaller steps. It's that all your steps are useful. None of your 2-bit budget goes to colors that don't appear in this block. The endpoints let you zoom your precision into exactly where the block's actual colors live.
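A quick numeric check of that claim, using a single channel with a block whose pixels all sit between 40 and 80:

```python
# 16 fixed levels spread evenly over 0..255 (the naive 4-bit scheme)
fixed = [round(i * 255 / 15) for i in range(16)]     # 0, 17, 34, ..., 255
usable = [l for l in fixed if 40 <= l <= 80]
print(usable)      # → [51, 68]: only 2 of the 16 levels land inside the block's range

# BC1-style: 4 levels subdividing just the block's own range [40, 80]
adaptive = [40 + round(i * (80 - 40) / 3) for i in range(4)]
print(adaptive)    # → [40, 53, 67, 80]: all 4 levels land inside the range
```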
It's the same idea as floating point vs fixed point. The exponent buys you range. The mantissa buys you precision within that range. The endpoints are the exponent. The indices are the mantissa.
Step 7: Finding C0 and C1 (the PCA step)
Given 16 pixels, how do you pick the best C0 and C1? You want the two endpoints that minimize the total error when every pixel snaps to the nearest of the 4 palette colors.
Think about what you're actually doing. Each pixel is a point in 3D RGB space. Your 16 pixels are 16 dots scattered in that 3D space. You want to find the line that best fits where those dots sit. C0 and C1 are the endpoints of that line.
That's exactly PCA. Principal Component Analysis. You find the axis of greatest variance in your point cloud, which is the direction in RGB space along which your pixels spread the most, and you project everything onto it.
1. compute the mean color u = (1/16) sum(pi)
2. build covariance matrix C = (1/16) sum((pi - u)(pi - u)^T)
3. find largest eigenvector of C → the principal axis
4. project all pixels onto that axis → each gets a scalar t
5. min projection = C0, max projection = C1
6. each pixel snaps to nearest of {0, 1/3, 2/3, 1} → 2-bit index
PCA doesn't modify any pixel's color. It just finds the best line. The color loss happens at step 6, when each pixel's exact t value gets rounded to the nearest of 4 allowed positions. That rounding error is the compression artifact. PCA minimizes it by finding the axis where the round-off is smallest across all 16 pixels.
This runs offline, when the artist exports the texture. In practice, high-quality encoders do a second refinement pass after PCA: slightly jitter C0 and C1 to minimize the actual mean squared error of the final snapped indices. PCA gets you 90% of the way there; the refinement pass closes the gap.
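The steps above can be sketched in a few dozen lines. This is a minimal illustration, with pure-Python power iteration standing in for a real eigensolver; production encoders like texconv do much more (including the refinement pass, omitted here):

```python
def fit_endpoints(pixels):
    """PCA sketch: fit a line to 16 RGB points, return (c0, c1, per-pixel indices)."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    d = [[p[c] - mean[c] for c in range(3)] for p in pixels]  # centered points

    # 3x3 covariance matrix of the point cloud
    cov = [[sum(di[a] * di[b] for di in d) / n for b in range(3)] for a in range(3)]

    # power iteration → dominant eigenvector (axis of greatest variance)
    axis = [1.0, 1.0, 1.0]
    for _ in range(50):
        axis = [sum(cov[a][b] * axis[b] for b in range(3)) for a in range(3)]
        norm = max(sum(x * x for x in axis) ** 0.5, 1e-12)
        axis = [x / norm for x in axis]

    # project every pixel onto the axis → a scalar t per pixel
    ts = [sum(di[c] * axis[c] for c in range(3)) for di in d]
    lo, hi = min(ts), max(ts)
    c0 = tuple(round(mean[c] + lo * axis[c]) for c in range(3))
    c1 = tuple(round(mean[c] + hi * axis[c]) for c in range(3))

    # snap each pixel's t to the nearest of {0, 1/3, 2/3, 1} → its 2-bit index
    span = max(hi - lo, 1e-12)
    indices = [min(3, max(0, round(3 * (t - lo) / span))) for t in ts]
    return c0, c1, indices
```

Feeding it a 16-pixel red gradient from (40, 0, 0) to (80, 0, 0) recovers exactly those two endpoints, with indices running 0 through 3 along the gradient.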
Step 8: What the 4x4 block can only be
Here is the constraint stated as plainly as possible: every pixel in a BC1 block must be one of exactly four colors. No exceptions. The whole 4x4 patch, all 16 pixels, can only draw from:
C0
lerp(C0, C1, 1/3)
lerp(C0, C1, 2/3)
C1
That's it. The original block had up to 16 unique colors. After BC1 encoding, it has exactly 4. Every pixel gets reassigned to whichever of those 4 is closest to its original color. The reassignment is the loss.
This is why BC1 breaks down on hard edges. If a 4x4 block straddles a sharp boundary between a red surface and a blue sky, those two color clusters live on opposite sides of RGB space. No single line fits both. The encoder picks the best line it can, but the four palette colors end up as mediocre approximations for both regions. You see blockiness at the boundary.
BC7 (DirectX 11) addresses this by splitting the block into sub-regions, each with its own endpoint pair. Multiple lines, better fit, more complex encoder. Same decoding principle.
When does any of this actually happen?
The timeline matters because most of the cost is invisible at runtime.
BUILD TIME: asset export
artist saves PNG / TGA (raw pixels)
→ texconv runs PCA per block
→ finds C0, C1, assigns indices
→ writes .dds file to disk
LOAD TIME: game startup
read .dds from disk
→ CreateTexture2D() uploads raw bytes to VRAM
→ no decompression, no PCA, just a memcpy
→ BC1 bytes sit in VRAM as-is
RUNTIME: every frame, every sample
shader calls myTex.Sample(sampler, uv)
→ texture unit fetches 8-byte block from L2 cache
→ reads 2-bit index for this pixel
→ computes t = index / 3
→ lerps C0 and C1
→ returns float4 to shader
→ one clock cycle, transparent to HLSL
The expensive part, PCA, endpoint search, index assignment, never runs in your game. It happened offline when the artist exported the asset. At runtime the GPU decoder is just a shift, a mask, and a lerp. That's why it runs for free at three billion samples per second.
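The decode side really is that small. A sketch of what the texture unit does per 8-byte block, again using this post's simplified palette order; real BC1 additionally compares the two raw endpoint values to select a second mode with a 1-bit transparent color:

```python
import struct

def decode_bc1_block(block: bytes):
    """Decode one 8-byte BC1 block into 16 RGB pixels (simplified palette order)."""
    c0_raw, c1_raw, indices = struct.unpack("<HHI", block)  # 2 + 2 + 4 bytes

    def expand(c):  # RGB565 → 8-bit channels, replicating high bits into low bits
        r, g, b = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
        return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

    c0, c1 = expand(c0_raw), expand(c1_raw)
    # the 4-entry palette: t = 0, 1/3, 2/3, 1
    palette = [tuple(round(a + t / 3 * (b - a)) for a, b in zip(c0, c1)) for t in range(4)]

    # 16 pixels, 2 bits each, pixel 0 in the low bits: shift, mask, look up
    return [palette[(indices >> (2 * i)) & 0b11] for i in range(16)]
```

Per pixel that's a shift, a mask, and a lerp, which is exactly why the hardware version is effectively free.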
The cache benefit is the real reason
The obvious win is smaller files. But that's not why this matters at runtime.
A 256x256 raw RGBA texture is 256KB. In BC1 it's 32KB. Eight times more texture fits in the GPU's L2 cache at the same time. At three billion texture samples per second, the cache miss rate directly determines bandwidth pressure. BC1 textures aren't just smaller. They're faster. That's why support for the format has been mandatory in DirectX-compliant GPUs since DirectX 6 in 1998.
Where this connects back to the paper
BC1's entire quality ceiling is set by one assumption: that the 16 pixels in a block cluster along a single straight line in RGB space. When that's true, the four lerp steps approximate the original colors well.
The NTBC paper (AMD, 2024) asks what happens if you remove that assumption. Instead of fitting a line in RGB space, you train a small neural network to find the best possible encoding for each block. A learned latent space that isn't constrained to be linear. The two stored values stop being geometric endpoints and become latent codes. The decoder stops being a lerp and becomes a neural network inference.
The quantization step, rounding each pixel's t to the nearest of four positions, is the main source of BC1 loss, and it is exactly what the Straight-Through Estimator in the NTBC paper handles during training. The STE lets gradients flow through that discrete rounding operation so the network can be trained end-to-end. If you understand why the rounding in BC1 is lossy, you understand what the STE is solving for.
That paper and my curiosity sent me down a five-hour rabbit hole. This is what was at the bottom of it.