I am excited to read about Google's efforts on the WebP format.
If we are going to work on a new image format, why not make one that loads progressively? Think back to the interlacing options of GIF, but go further than that. Much further.
If the image information were ordered in the file so that a very basic, fuzzy image (with a limited colour space) could be displayed very quickly, and the image then improved as more data was downloaded, this would benefit everyone, not just mobile devices!
I'm not suggesting we repeat the "thumbnail" image efforts of other formats, where the file essentially contains two images: a tiny one followed by a larger one. No, I mean a truly progressive file format where the longer you stay connected, the more detailed the image becomes, in both resolution and colour accuracy. A smart device could sever the connection as soon as it determines that any further information would be superfluous on its display. For example, let's say you create a very detailed 1920x1200 background image for use on iMacs, you have specified in CSS that this image must scale to fit the entire HTML window, and a visitor on an iPhone comes along... As soon as that device has downloaded enough information to render this background image at 480x320, it could sever the link to the server and save oodles of bandwidth!
Conceptually it might work like this. (I read this somewhere a long time ago, so if it is already patented by some greedy corporation, then I'm sorry but we will have to come up with something better).
First, send a very small header which tells the renderer the total pixel dimensions of the file and the total number of channels. If this is extensible so that arbitrary metadata can also be encoded, then so much the better, but encoders must give the user (the content provider) the choice of where in the file this metadata goes, defaulting to the end so that mobile devices won't have to download it first. I also advocate including a fingerprint (an MD5 hash of the entire file) in the header so that a device can determine whether it has the full content or only partial content in its cache. A simple file size or checksum is not enough, since we also want to ensure the file's integrity, including the metadata at the end. This has more to do with rights management... but I digress.
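To make the idea concrete, here is a minimal sketch of what such a header might look like. The magic string, field sizes, and layout are all my own assumptions for illustration, not any real format:

```python
import hashlib
import struct

def build_header(width, height, channels, body):
    """Hypothetical header sketch: magic bytes, pixel dimensions,
    channel count, and an MD5 fingerprint of the rest of the file,
    so a device can tell a complete cached copy from a partial one.
    All field sizes here are illustrative assumptions."""
    fingerprint = hashlib.md5(body).digest()  # 16 bytes
    # ">4sIIB16s": big-endian magic, width, height, channels, MD5
    return struct.pack(">4sIIB16s", b"PROG", width, height, channels, fingerprint)

header = build_header(1920, 1200, 4, b"...rest of file...")
```

At 29 bytes this is small enough to send before any pixel data, and the fingerprint covers everything after it, including trailing metadata.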
After the header we need to send the "average" colour for the entire image. For greyscale images this needs to be no more than 8 bits, and for colour images no more than 24 bits. If there is another channel (such as alpha) this needs to add no more than another 8 bits to store the "average" value for that channel.
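Computing that first "average" payload is trivial; a toy sketch (using a flat list of pixel tuples as a stand-in for scanning the real image) might look like:

```python
def channel_averages(pixels):
    """Average value per channel, rounded to fit in one byte each.
    `pixels` is a list of equal-length tuples, e.g. (r, g, b) or
    (r, g, b, a); a toy stand-in for scanning the full image."""
    n = len(pixels)
    num_channels = len(pixels[0])
    return tuple(round(sum(p[c] for p in pixels) / n) for c in range(num_channels))

avg = channel_averages([(255, 0, 0), (0, 0, 255)])  # one byte per channel
```

So an RGB image costs 3 bytes for this step and an RGBA image 4, matching the 24-bit and 32-bit budgets above.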
Next, we divide the image into four quarters and work out the average colour (and other channel) values for each quarter. We now need to send only the difference between each quarter's average values and the whole image's average values. Typically these differences will be small, and we can use some compression technique (perhaps Huffman or LZW) to reduce the number of bits needed to encode them.
Next, we divide each of the areas from the previous step into four quarters, and work out the difference between each quarter's average values and the average values of the area it belongs to. Encode these differences to reduce the number of bits sent.
We keep on repeating this process until we are sending only the differences between individual pixels' values and the average of the four pixels in the local 2x2 cell, which was sent in the previous step.
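The three steps above can be sketched for a single greyscale channel. This is only a conceptual model (a power-of-two square image, no entropy coding, no bitstream packing), but it shows the order in which values would hit the wire:

```python
def encode_pyramid(img):
    """Sketch of the progressive scheme: emit the global average first,
    then, level by level, the difference between each quadrant's average
    and its parent area's average, down to individual pixels.
    `img` is a square 2^n x 2^n list of lists of greyscale values;
    a real encoder would handle arbitrary sizes and entropy-code
    the differences."""
    def avg(x, y, size):
        total = sum(img[y + j][x + i] for j in range(size) for i in range(size))
        return total / (size * size)

    stream = [round(avg(0, 0, len(img)))]      # whole-image average first
    regions = [(0, 0, len(img))]
    while regions[0][2] > 1:                   # stop once regions are pixels
        next_regions = []
        for (x, y, size) in regions:
            parent = avg(x, y, size)
            half = size // 2
            for (qx, qy) in [(x, y), (x + half, y),
                             (x, y + half), (x + half, y + half)]:
                # typically a small value, cheap to compress
                stream.append(round(avg(qx, qy, half) - parent))
                next_regions.append((qx, qy, half))
        regions = next_regions
    return stream
```

A decoder can stop reading at the end of any level and still render the image at that level's resolution, which is exactly the early-disconnect behaviour described above.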
The trick to making this work well is going to be choosing the right compression algorithm. We may even need one algorithm in the early part of the file and another in the later parts, as the pattern of differences changes. Or perhaps a different algorithm for different channels (though this could be difficult to mix into one bitstream).
Some additional channels (such as refraction, reflections and bump maps) should perhaps not be mixed into the stream of colour and alpha. So the header format must be flexible enough to allow the encoder to specify which channels are mixed into which streams. This will allow simple/mobile devices to receive the information that they can work with quickly, and defer the optional extra information for more capable devices to follow later in the stream.
Comments and suggestions for bit encoding are welcome!