Perceptual Image Codec: What Matters in Practical Learned Image Compression

ksec 107 points 33 comments May 24, 2026

Discussion Highlights (12 comments)

a-dub

this is interesting. would be cool to explore something like integrating a vlm to add a "semantic" term to the loss function. looking through the comparisons, some of the baseline codecs create meaningfully different details (as could be described by text) in the images.

dahart

Looks very cool assuming all the comparisons are correct & fair and there are no major failure cases. Quick link to the HTML version of the paper to save you a couple of clicks: https://arxiv.org/html/2605.05148v1

Since this is by Apple, I’m certainly curious if this is aimed at becoming the new default format for Apple devices. What kind of effort does it take to do that, beyond getting the paper published?

On the PR summary page, the “speed” column should be labeled “time”. Time is lower-is-better, whereas speed means higher-is-better. The BD rate column could also use a less cryptic label. (Though maybe the audience is paper reviewers and not me.) The paper itself doesn’t even write out what the BD acronym in “BD rate” stands for, but it seems like it would be fair and accurate and better to call the column maybe something like relative compressed size, and mention the exact metric in the caption, where there’s already an explanation of BD rate.

I’m somewhat confused by, and TBH slightly skeptical of, the device timings. Are they correct & fair? Why is the NN-only portion almost as fast on an iPhone 17 as on a V100 when the V100 has 4x the FP throughput? Is it comparing apples to apples (ha!), and is the GPU implementation reasonable? The data suggests the GPU implementation is not saturating the GPU. Also why are there several different GPU models? And why is V100 even used? V100 is four generations old and not even supported anymore.
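For readers who share the confusion about the column label: "BD" is conventionally short for Bjøntegaard delta, and BD rate is the average bitrate difference between two codecs at equal quality, expressed as a percentage (negative means the candidate needs fewer bits). The sketch below is a simplified illustration, not the paper's procedure: it assumes piecewise-linear interpolation of log-rate over quality (the usual tooling fits a cubic or PCHIP curve instead), and the data points are made up for the example.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of y at x; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x is outside the curve's quality range")


def bd_rate(reference, candidate, samples=1000):
    """Simplified BD rate in percent.

    Each curve is a list of (bits_per_pixel, quality) points with distinct
    quality values. Assumption: linear interpolation, not the cubic fit
    used by standard BD-rate tools.
    """
    reference = sorted(reference, key=lambda p: p[1])
    candidate = sorted(candidate, key=lambda p: p[1])
    q_ref = [q for _, q in reference]
    q_cand = [q for _, q in candidate]
    log_ref = [math.log(r) for r, _ in reference]
    log_cand = [math.log(r) for r, _ in candidate]

    # Only the quality range covered by both curves can be compared.
    lo = max(q_ref[0], q_cand[0])
    hi = min(q_ref[-1], q_cand[-1])
    if lo >= hi:
        raise ValueError("the two curves do not overlap in quality")

    # Midpoint-rule average of the log-rate gap over the shared range.
    total = 0.0
    for k in range(samples):
        q = lo + (k + 0.5) * (hi - lo) / samples
        total += _interp(q, q_cand, log_cand) - _interp(q, q_ref, log_ref)
    return (math.exp(total / samples) - 1.0) * 100.0


# Hypothetical rate/quality points, invented for illustration only.
reference = [(0.10, 30.0), (0.20, 33.0), (0.40, 36.0), (0.80, 39.0)]
candidate = [(rate / 2, quality) for rate, quality in reference]
print(f"{bd_rate(reference, candidate):.1f}%")  # roughly -50.0%: half the bits at equal quality
```

Read this way, "relative compressed size" is a fair plain-language label: a BD rate of -50% means files about half as large at matched quality.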

klodolph

Interesting, but when I look at the sweater in the second image, the knitting just looks completely lost in the PICO version. The knitting looks correct but soft in other codecs. In the PICO version, it looks just completely wrong to me. The yarn structure has been replaced with a bunch of fuzzy strips. Similar problem in the third picture. I guess this is what happens when you chase after extremely low data rates, but I’m not happy with the results.

kllrnohj

I find it very curious that their new image codec did not really compare itself against other image codecs, but instead primarily against video codecs pretending to do images. As in, no JPEG or JPEG-XL. 150ms to decode 12 MP is also incredibly slow. That's like PNG territory of slow. A more flagship 50 MP image would be... oof.
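To put a rough number on the "oof": the sketch below extrapolates the 150 ms / 12 MP figure quoted above to 50 MP. It assumes decode time grows linearly with pixel count, which is an assumption for illustration and not something the thread or the paper confirms; the function name is hypothetical.

```python
def scaled_decode_ms(measured_ms, measured_megapixels, target_megapixels):
    """Extrapolate decode time, assuming it is proportional to pixel count."""
    return measured_ms * target_megapixels / measured_megapixels


# 150 ms at 12 MP is the figure from the comment; 50 MP is its example target.
estimate = scaled_decode_ms(150, 12, 50)
print(f"{estimate:.0f} ms")  # 625 ms under the linear-scaling assumption
```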

ksec

Some notes.

According to Chrome stats from Google in 2019 [1], ~80% to 85% of images served are above 1.0 bpp (bits per pixel). Around ~95% are above 0.5 bpp. I doubt this has changed, if not gotten worse, as images have gotten larger over the years. There are images, such as logos or specific patterns, that work a lot better when compressed at low bpp (below 0.5). But those are rare on the web, and in the case here, most of the images are photographs. Which means sub-0.3 bpp is a ridiculously low bitrate even for web photos.

HEIC on iPhone 17 is based on HEVC, using the H.265 hardware encoder on the iPhone 17. VVC / VTM is the H.266 codec, the successor of HEVC / H.265; most people may have heard of x265, one of the HEVC encoders. ETM is the H.267 codec, currently the best-in-class video compression codec, and still in development. I assume everyone knows about AV1 and AV2 already since we are on HN.

CPU encoders, generally speaking, produce better image quality, while hardware encoders tend to be fast but lower quality. Both HEIC and AV1 are based on the hardware encoders on the iPhone 17. (At least that is my read of it.)

In case anyone is wondering: JPEG XL is not designed for, or is yet to be optimised for, low bpp. It excels from 0.8 bpp onwards, depending on the type of image. So the result for XL would likely be very bad. Similarly for normal JPEG and H.264 encoders.

I am wondering if this image codec is deterministic.

I would imagine that if we use this on the web, the actual image size would be a lot smaller, meaning a lot of the artifacts we see shouldn't matter. And clicking on it would bring up a different image file. There is finally a possibility that, in the future, a half-decent image could be included within the 14K frame of the webpage.

Encoding and decoding speed is actually usable on today's hardware. Decoding is sub-100ms on iPhone 17.

And on another note, VVC is doing extremely well in terms of compression rate and encoding / decoding complexity. AV2 is launching by the end of this month.
[1] Figure 1 https://www.spiedigitallibrary.org/conference-proceedings-of...
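The bpp figures above translate into file sizes with simple arithmetic: bytes = width × height × bpp / 8. The sketch below works through it. It assumes the "14K frame" means a budget of roughly 14 KB (taken here as 14 × 1024 bytes) and ignores the HTML and headers that would share that budget; the helper names are hypothetical.

```python
def compressed_bytes(pixels, bpp):
    """File size in bytes for an image of `pixels` pixels at `bpp` bits per pixel."""
    return pixels * bpp / 8


def max_pixels(budget_bytes, bpp):
    """Largest pixel count that fits in `budget_bytes` at `bpp` bits per pixel."""
    return int(budget_bytes * 8 / bpp)


BUDGET = 14 * 1024  # assumed reading of the "14K frame", in bytes

for bpp in (1.0, 0.5, 0.3):
    pixels = max_pixels(BUDGET, bpp)
    side = int(pixels ** 0.5)
    print(f"{bpp} bpp: {pixels} pixels, about {side}x{side}")

# A 12 MP photo at 0.3 bpp comes to roughly 450 KB.
print(f"{compressed_bytes(12_000_000, 0.3) / 1000:.0f} KB")
```

Under these assumptions, 1.0 bpp fits only a thumbnail of about 338×338 in the budget, while 0.3 bpp fits about 618×618, which is the size range where a small inline preview starts to look plausible.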

brcmthrowaway

What would this be used for?

jcelerier

would be great to have the weights somewhere

warumdarum

I always wondered whether an image wouldn't be best encoded in a sort of spline by color, the intensity being the curves, and then those splines just overlaid and rendered with a thickness per segment.

Dwedit

I can see by dragging the slider around and comparing to the ground truth that it is making things up that look plausible to what is there. Just keep any numbers out of there. Remember what happened with the Xerox scanners and JBIG2 compression where numbers got substituted with similar looking numbers.

xyzsparetimexyz

So the decoding step is running the latent values through a model in order to decode it, right? The artifacts are super off-putting and are just going to get this labelled as slop. I would instead follow a neural texture compression model (e.g. RTXNTC) and generate a small model & latent values when compressing an image. It'll take longer, sure, but as long as it can work in less than 10 seconds on an iPhone I think there's a use case. As it is, idk

DiogenesKynikos

The PICO images do look more realistic at first glance, but when you zoom in, it turns out that PICO has changed all sorts of small details in the images. For example, if you zoom in on the trees in the first image, PICO has moved branches around and invented new branches that didn't exist. It does so in a way that looks very realistic, but it is still altering reality. The JPEG failure mode is much different: it causes ringing artifacts that are obviously artificial, but it doesn't move a branch from location A to location B.

srean

There are a few comments on the unnatural/hallucinated look of the compressed images. I don't like it, but this is sort of the expected behavior for 'AI' based compression and denoising. Think of it as encoding followed by decoding, where encoding reduces an image to a 'prompt' and decoding is the generation of a new image from that prompt. These techniques have become quite popular in astrophotography and it makes me uncomfortable.

Note though that this strategy is not that fundamentally different from the classic method. There the steps are called analysis followed by synthesis. The analysis step detects the degree of presence of preferred basis functions (for example, a discrete cosine basis), with the undesirable components discarded or down-weighted. Then in the synthesis phase, the bases, weighted with their detected or accentuated strengths, are recombined to obtain the image. What's different this time is the nature of the bases and the Bayesian prior they encode.
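The classic analysis/synthesis loop described above can be sketched in a few lines. This is a minimal 1-D illustration with an orthonormal DCT-II written out by hand; it is not how any particular codec is implemented, the sample row of pixel values is made up, and "keep the largest coefficients" is a stand-in for the quantization a real codec would use.

```python
import math


def _scale(k, n):
    """Normalization that makes the DCT basis orthonormal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(signal):
    """Analysis: measure how strongly each cosine basis function is present."""
    n = len(signal)
    return [
        _scale(k, n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct(coeffs):
    """Synthesis: recombine the basis functions weighted by their coefficients."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        )
        for i in range(n)
    ]


def keep_largest(coeffs, keep):
    """Discard all but the `keep` largest-magnitude coefficients."""
    ranked = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(ranked[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]


# Hypothetical row of 8 pixel intensities, invented for illustration.
row = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct(row)
approx = idct(keep_largest(coeffs, 3))
print([round(v, 1) for v in approx])
```

Here the discarded components simply vanish, so the reconstruction gets softer. A learned decoder differs in that its prior fills the missing detail with plausible content, which is where the invented branches and yarn textures mentioned elsewhere in the thread come from.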
