How MPEG video compression works
I've been playing with some MPEG-1/2 video player source code (developed by the MPEG Software Simulation Group) with a view to incorporating video playback into a game I'm writing. In doing this, I've learnt a bit about how MPEG works, and thought it might make interesting reading for anyone who doesn't know anything about it.
Since this is all gleaned from the code, there may be technical inaccuracies in it, but the gist of it should be fine!
Frame types
Each frame of an MPEG video is encoded as one of three types, commonly called "I", "P" and "B" frames.
- "I" frames are index frames (the "I" actually stands for "intra"), and they are the whole image without any reference to earlier frames.
- "P" frames are frames predicted from the previous "I" frame, and are effectively the differences between the frames.
- "B" frames are also difference files - however, they use the previous "I" or "P" frames as well as the next one as a basis for the differences.
When you encode an MPEG video, you normally use a frame pattern such as "IBBPBBPBBPBB". This pattern is repeated throughout the MPEG video file. What it means is that the first frame is an "I" frame, followed by two "B" frames, then a "P" frame, then two more "B" frames, and so forth.
Colour format
MPEG data is encoded not as RGB data, but as YUV (kind of - experts will say it's strictly YCbCr, which is slightly different).
- The Y is the brightness of the pixel - the human eye is most sensitive to this, so it is normally encoded at a higher quality than the other two.
- The U and V values are colour differences. Since the eye is less sensitive to this, they're normally encoded at lower quality.
Blocks
The image is split into blocks of 8x8 pixels, and each block in an I frame is encoded using the same sort of technique as JPEG. If lower-quality UV is used, then each UV block covers either 8x16 or 16x16 pixels of the image.
The technique JPEG uses is a Discrete Cosine Transform (DCT), where the blocks are transformed into frequencies (rather like a Fast Fourier Transform, or FFT). Ordinarily, the transformed block takes no less space than the original data block. However, if you remove the frequencies that have low amplitude, you can reduce the amount of data being used without impacting visual quality very much. Audio schemes such as MP3 do the same kind of thing, albeit in the audio spectrum. To decompress, you use an inverse DCT algorithm.
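As a very rough illustration of that idea (this is not the MSSG or JPEG code, just a naive textbook version), here is an 8x8 forward DCT followed by the step that makes compression possible: throwing away the low-amplitude frequencies.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

/* Naive 2D forward DCT of an 8x8 block (illustration only - real
 * decoders use fast integer approximations, not this direct form). */
static void dct_8x8(const double in[N][N], double out[N][N])
{
    for (int u = 0; u < N; u++) {
        for (int v = 0; v < N; v++) {
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += in[x][y]
                         * cos((2 * x + 1) * u * M_PI / (2.0 * N))
                         * cos((2 * y + 1) * v * M_PI / (2.0 * N));
            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
}

/* The "compression" step: zero the low-amplitude frequencies.  The
 * decoder then runs an inverse DCT on whatever coefficients survive. */
static void drop_small_coefficients(double coeff[N][N], double threshold)
{
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++)
            if (fabs(coeff[u][v]) < threshold)
                coeff[u][v] = 0.0;
}
```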
For P and B frames, blocks can be moved, copied and/or added (or subtracted) to produce movement.
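A hedged sketch of what that means in code (my own illustration, not the MSSG source; half-pel interpolation and the B-frame case of averaging two references are left out):

```c
#include <stdint.h>

/* Copy an 8x8 block from a reference frame at (x + dx, y + dy), where
 * (dx, dy) is the motion vector, and add the decoded difference block
 * on top.  "stride" is the frame width in bytes. */
static void predict_block(uint8_t *dst, const uint8_t *ref,
                          const int residual[8][8],
                          int stride, int x, int y, int dx, int dy)
{
    const uint8_t *src = ref + (y + dy) * stride + (x + dx);
    uint8_t *out = dst + y * stride + x;

    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            int v = src[col] + residual[row][col];
            if (v < 0)   v = 0;      /* clamp to the valid pixel range */
            if (v > 255) v = 255;
            out[col] = (uint8_t)v;
        }
        src += stride;
        out += stride;
    }
}
```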
Optimisation for ARM processors
Initially, I used a straightforward recompilation of the MSSG code, although I changed the output so that it displayed on the screen rather than saving the frames to disk.
This is quite slow due to a number of factors:
- The ARM processor doesn't handle short integers very well (later ARM cores have instructions for this, but not the target processor).
- The ARM processor doesn't have floating point capability (the MSSG code lets you use a reference floating-point inverse DCT if you want - very, very slow!).
- The ARM's cache isn't as big as that of more modern processors, so how you reference data matters - accessing memory "horizontally" rather than "vertically" is quicker, as the sketch after this list illustrates.
- The glue that C code requires ties up registers that could be (temporarily) used for other purposes (such as the frame pointer register).
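To illustrate the "horizontal" versus "vertical" point from the list above, here is a hypothetical pair of loops (not taken from the MSSG source). Both touch the same pixels, but the first walks along each row and uses every byte of a fetched cache line, while the second strides down the columns and keeps evicting lines on a small cache:

```c
/* Horizontal ("row at a time") access - consecutive addresses, so each
 * cache line that is fetched is fully used before moving on. */
void blank_rows(unsigned char *img, int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            img[y * width + x] = 0;
}

/* Vertical ("column at a time") access - each store lands "width" bytes
 * away from the last, so a small cache spends its time reloading lines. */
void blank_columns(unsigned char *img, int width, int height)
{
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++)
            img[y * width + x] = 0;
}
```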
How the code runs
The code is split into two areas - handling the data stream and creating the YUV data, and then rendering it. I label these "C" for conversion and "R" for rendering.
Optimisation 1 - changing short integers to normal integers
This seemed to make it run a bit faster - unfortunately, I hadn't added code to tell me how fast it was operating before this. At this point, the C time was 10cs (centiseconds) per frame and the R time 14cs for the test data I was using (a 352x288 MPEG-1 file). This equated to 4.2 frames per second.
Optimisation 2 - convert horizontal row YUV to RGB
Since the R time was longer than the C time, I decided this would be the first area to look at - the aim was to get C+R down to 10cs. Rewriting the code that takes a row of data from the YUV stream and converts it to RGB seemed a good first step. However, it didn't quite have the impact I was hoping for - mainly due to the simplicity of the algorithm. R time was now 13cs - 4.3FPS.
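For reference, a row converter of this kind looks roughly like the following. The constants are the usual ITU-R BT.601 factors scaled to 8.8 fixed point, which may not be exactly what the MSSG code uses, and studio-swing scaling is ignored:

```c
#include <stdint.h>

static inline uint8_t clamp8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Convert one row of 4:4:4 YUV to packed RGB using 8.8 fixed point,
 * avoiding floating point entirely (important on an FPU-less ARM). */
void yuv_row_to_rgb(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                    uint8_t *rgb, int width)
{
    for (int x = 0; x < width; x++) {
        int yy = (int)y[x] << 8;        /* luma, scaled by 256      */
        int cb = (int)u[x] - 128;       /* chroma, centred on zero  */
        int cr = (int)v[x] - 128;

        *rgb++ = clamp8((yy + 359 * cr) >> 8);             /* R */
        *rgb++ = clamp8((yy -  88 * cb - 183 * cr) >> 8);   /* G */
        *rgb++ = clamp8((yy + 454 * cb) >> 8);              /* B */
    }
}
```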
Optimisation 3 - disable TRACE and VERIFY
I'm not sure if this had an impact, but the text in the Makefile suggested it would. However, C time was still 10cs, and R time was still 13cs.
Optimisation 4 - optimisation of 420 to 422 chroma format
One of the things I mentioned earlier is that the UV colour components can be encoded at a lower quality. There are three ways to do this (a sketch after the list shows the resulting chroma plane sizes).
- "444" encoding means that the UV components are at the same quality as the Y component.
- "422" encoding means that the UV components are squashed in the Y direction, so each UV pixel refers to two Y-colour pixels.
- "420" encoding means that the UV components are squashed in both the X and Y direction, so each UV pixel refers to four Y-colour pixels.
The MSSG code converts the 420 encoding into 422 encoding, and then the 422 encoding into 444, with the renderer requiring 444 encoding. Each conversion is done via an FIR filter.
I've used FIR (Finite Impulse Response) filters before in !Sonor, so I knew what they did. The way this one was written was not particularly well suited to ARM processors (and probably not well suited to many others either). The first optimisation was to recode it so that it would use the cache more efficiently, use pointers more effectively, and be easier to convert to ARM code.
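The actual MSSG filter and its coefficients aren't reproduced here; the sketch below just shows the shape of the rewrite, using illustrative two-tap weights. The inner loop runs horizontally along the source rows, so the reads are sequential and cache-friendly, and the loop maps naturally onto ARM's post-incrementing loads:

```c
#include <stdint.h>

/* Upsample a 4:2:0 chroma plane vertically to 4:2:2 using a simple
 * two-tap (3/4, 1/4) interpolator.  These taps are illustrative, not
 * the MSSG coefficients.  cw/ch are the 4:2:0 chroma plane dimensions. */
void chroma_420_to_422(const uint8_t *src, uint8_t *dst, int cw, int ch)
{
    for (int y = 0; y < ch; y++) {
        const uint8_t *above = src + (y == 0      ? 0 : y - 1) * cw;
        const uint8_t *here  = src +  y * cw;
        const uint8_t *below = src + (y == ch - 1 ? y : y + 1) * cw;
        uint8_t *out0 = dst + (2 * y)     * cw;  /* output row nearer "above" */
        uint8_t *out1 = dst + (2 * y + 1) * cw;  /* output row nearer "below" */

        /* Inner loop walks horizontally, so all three source rows are
         * read sequentially rather than strided. */
        for (int x = 0; x < cw; x++) {
            out0[x] = (uint8_t)((3 * here[x] + above[x] + 2) >> 2);
            out1[x] = (uint8_t)((3 * here[x] + below[x] + 2) >> 2);
        }
    }
}
```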
After this, R time was now 12cs - 4.5FPS.
Optimisation 4b - conversion of 420 to 422 into ARM code
The FIR filter was then recoded in ARM. It didn't appear to make a huge difference - R time still 12cs.
Optimisation 5 - optimisation of 422 to 444
This FIR filter was terrible in the way it accessed memory - it was very vertically organised, which meant the cache's usefulness was limited. However, the change did not have much of an impact - R time was still 12cs (although it was now a low 12).
Optimisation 5b - conversion of 422 to 444 into ARM code
After some quite heavy conversion to ARM code, this really started to make a difference - R time was now 10cs. The algorithm was still organised vertically, but since it was no longer thrashing around the memory map, it was more friendly to the cache.
Optimisation 6 - removal of FIR filter
I had been thinking about the FIR filter - the coefficients of the filter (look at the references for how FIR filters work) showed that the current sample was by far the largest factor. This meant that if you removed the FIR filter and just doubled the pixels, it shouldn't look that different. And, no, it didn't look noticeably different.
R time was still 10, but a very low 10.
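A sketch of what replacing the filter with plain duplication looks like (again my own illustration, not the actual source):

```c
#include <stdint.h>

/* Vertical "upsampling" by straight duplication: each 4:2:0 chroma row
 * is simply written out twice to make the 4:2:2 plane. */
void chroma_420_to_422_double(const uint8_t *src, uint8_t *dst, int cw, int ch)
{
    for (int y = 0; y < ch; y++) {
        const uint8_t *row = src + y * cw;
        uint8_t *out = dst + 2 * y * cw;
        for (int x = 0; x < cw; x++) {
            out[x]      = row[x];   /* first copy of the row   */
            out[cw + x] = row[x];   /* second copy, one row on */
        }
    }
}
```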
Optimisation 7 - quad-pixel copying (420)
The removal of the FIR filter was simply a "does it work" test - the next step was to optimise the removal by copying 4 pixels at a time instead of just 1. R time became 8cs (5.5FPS).
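Something along these lines (hypothetical, and assuming the plane widths are word-aligned multiples of four) is what copying four pixels at a time means in practice - one 32-bit load and two 32-bit stores move four chroma samples per iteration:

```c
#include <stdint.h>

/* The same row duplication as before, but moving four 8-bit chroma
 * samples per iteration as one 32-bit word. */
void chroma_420_to_422_double4(const uint8_t *src, uint8_t *dst, int cw, int ch)
{
    for (int y = 0; y < ch; y++) {
        const uint32_t *row = (const uint32_t *)(src + y * cw);
        uint32_t *out0 = (uint32_t *)(dst + 2 * y * cw);
        uint32_t *out1 = (uint32_t *)(dst + (2 * y + 1) * cw);
        for (int x = 0; x < cw / 4; x++) {
            uint32_t w = row[x];   /* four pixels in one load */
            out0[x] = w;
            out1[x] = w;
        }
    }
}
```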
Optimisation 7b - quad-word pixel copying (420)
Instead of 4 pixels, 16 pixels were copied at once. This didn't have much of an impact (R time still 8cs).
Optimisation 8 - quad-pixel copying (422)
The same optimisation was done in the 422-to-444 conversion as above. This made the R time 6cs - 6.3FPS. The player was now about 50% faster than it was beforehand, even though the render time itself had been cut by more than half.
Optimisation 9 - Inverse DCT (row) conversion to ARM code
This was the first time I'd tackled the C time. Inverse DCT is performed on the row and the column. Since the row is horizontally aligned in memory, it was a fairly easy thing to convert. However, this didn't alter the C time by any significant amount.
Optimisation 10 - Inverse DCT (column) conversion to ARM code
This is essentially the same kind of algorithm as the IDCT row code, but with the memory vertically aligned. This, too, had no significant impact.
Optimisation 11 - Move and add blocks conversion into ARM code
As I mentioned earlier, MPEG can move and merge blocks for the in-between frames. Rewriting this in ARM code brought the C time down to 9cs (and with R at 6, that gives 6.7FPS).
Optimisation 12 - Direct chroma access
This was quite a big change to the rendering code - rather than converting 420 to 422, and then 422 to 444, I modified the code from optimisation 1 so that it would directly access the 420 or 422 data. This removed the need for the code from optimisation 2 through to optimisation 8.
The result was a big improvement - R time 4cs, giving an overall time of 13cs and 7.7FPS. Not quite my aim of 10FPS, but definitely getting there.
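A sketch of the idea (my own illustration, not the modified MSSG code): instead of upsampling U and V to full resolution first, the row converter indexes the 4:2:0 planes directly, so each chroma sample is simply re-used for the luma pixels it covers.

```c
#include <stdint.h>

static inline uint8_t clamp8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Convert one row straight from 4:2:0 planes to RGB.  "row" is the luma
 * row index; the chroma planes are half width and half height, so the
 * chroma row is row/2 and the chroma column is x/2. */
void yuv420_row_to_rgb(const uint8_t *y_plane, const uint8_t *u_plane,
                       const uint8_t *v_plane, uint8_t *rgb,
                       int width, int row)
{
    const uint8_t *y = y_plane + row * width;
    const uint8_t *u = u_plane + (row / 2) * (width / 2);
    const uint8_t *v = v_plane + (row / 2) * (width / 2);

    for (int x = 0; x < width; x++) {
        int yy = (int)y[x] << 8;
        int cb = (int)u[x / 2] - 128;   /* one chroma sample per 2x2 lumas */
        int cr = (int)v[x / 2] - 128;

        *rgb++ = clamp8((yy + 359 * cr) >> 8);
        *rgb++ = clamp8((yy -  88 * cb - 183 * cr) >> 8);
        *rgb++ = clamp8((yy + 454 * cb) >> 8);
    }
}
```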
Unfortunately, I think I've now optimised all of the code organised around short loops, so I may not be able to make it much faster than this. I know that if I change my source data to 320x240, the C time is 5cs and the R time around 4cs (maybe less), so I'll probably do some video at 320x180 (widescreen). I may also find that changing the compiler from GCC to Norcroft makes it a bit faster.