Dave Hooper | 16 Aug 2012 17:00

Re: Opus codec developments

What's the performance of the current implementation like (compared to other codecs on rockbox today)?

(Apologies for the double-post if you received this message once already - from my end it looks like it was blocked by the list server when I first sent this a couple weeks ago)

Jonas Wielicki | 31 Dec 2012 13:54

Re: Opus codec developments

This goes to the kind people who integrated opus into rockbox. Please
see my questions below; this isn't a rant.

On 16.08.2012 17:00, Dave Hooper wrote:
> What's the performance of the current implementation like (compared to
> other codecs on rockbox today)?

I tested opus on my iriver H320 yesterday. To my disappointment, the
performance was horrible. The UI was lagging (didn't even know that this
was possible in rockbox -- that's what I call true multitasking :) ),
despite the CPU being overclocked all the time to ensure fluent playback.

Well yeah. I just started to dig into the opus code, as I really feel it
should be possible to make this perform better. With vorbis, I get
fluent playback without any hassles in the UI and no CPU overclocking at
all (except while reading from the non-DMA HDD).

I am totally unfamiliar with how opus works (reading the RFC right now),
and I have no knowledge of the rockbox codec framework. Can you give
me some pointers on where to start if I were to optimize this? Are there
any known bottlenecks inside or at the interface from rockbox to opus?
Or do I have to look for opportunities in the libopus code? Is there any
hope at all for such an ancient target as the H320? Is there any
extensive documentation on how to integrate new codecs into rockbox
(which would give me some pointers on how codecs and rockbox interact,
which will be valuable)? If I have to modify the libopus code, how
would I make sure that this (a) won't break on an upstream update and
(b) might go back to upstream in case it's valuable for them too?

cheers and happy new year,
Jonas W.

Magnus Holmgren | 31 Dec 2012 14:37

Re: Opus codec developments

On 2012-12-31 13:54, Jonas Wielicki wrote:

> Well yeah. I just started to dig into the opus code, as I really feel it
> should be possible to make this perform better. With vorbis, I get
> fluent playback without any hassles in the UI and no CPU overclocking at
> all (except while reading from the non-DMA HDD).

As I recall it, Vorbis wasn't realtime in the early versions, so quite a bit of 
optimization has been done since then.

> I am totally unfamiliar with how opus works (reading the RFC right now),
> and I have no knowledge of the rockbox codec framework. Can you give
> me some pointers on where to start if I were to optimize this? Are there
> any known bottlenecks inside or at the interface from rockbox to opus?
> Or do I have to look for opportunities in the libopus code? Is there any
> hope at all for such an ancient target as the H320? Is there any
> extensive documentation on how to integrate new codecs into rockbox
> (which would give me some pointers on how codecs and rockbox interact,
> which will be valuable)? If I have to modify the libopus code, how
> would I make sure that this (a) won't break on an upstream update and
> (b) might go back to upstream in case it's valuable for them too?

Based on my experience with the H140, I'd say start looking at what data is put 
in IRAM. The external memory bus is pretty slow in comparison to IRAM, so 
putting "hot" buffers and tables in IRAM can make a big difference. The next 
step would probably be some assembler code, e.g. to exploit the 
multiply-and-accumulate unit. If you can get the profiling code working, that
could help locate hotspots.

Note that I haven't looked at the Opus code, so I don't know how much of this 
has been done.

--
   Magnus

Jonas Wielicki | 31 Dec 2012 17:15

Re: Opus codec developments

On 31.12.2012 14:37, Magnus Holmgren wrote:
> Based on my experience with the H140, I'd say start looking at what data
> is put in IRAM. The external memory bus is pretty slow in comparison to
> IRAM, so putting "hot" buffers and tables in IRAM can make a big
> difference. The next step would probably be some assembler code, e.g. to
> exploit the multiply-and-accumulate unit. If you can get the profiling
> code working, that could help locate hotspots.

Okay. A quick grep for iram_ showed that this is not used at all yet. It
seems, though, that the EMAC is already used in the celt code. One
function which could benefit from using the EMAC cannot be made to use
it, apparently due to GCC issues:

static inline int32_t MAC16_16(int32_t c, int16_t a, int16_t b) {
    register int32_t cp asm("acc0") = c;
    int32_t r;
    asm volatile ("mac.w %[a], %[b], %[c];"
                  "movclr.l %[c], %[r];"
                  : [r] "=r" (r)
                  : [a] "r" (a), [b] "r" (b), [c] "r" (cp)
                  : "cc");
    return r;
}

This won't compile, getting:

~/Builds/rockbox/lib/rbcodec/codecs/libopus/celt/fixed_generic.h:164:22:
error: invalid register name for ‘cp’

(I am compiling for the iriver target.) Note that the accumulator register
for m[s]ac.x must be an %accX register, according to the processor's user
guide.

I also was unable to find anything related to the accX registers for
Coldfire in the GCC manuals (checked [1]) after reading up on constraints
at [2]. More annoying in this context is that the SILK part of opus
also requires a very similar function, which thus cannot be made
to use EMAC instructions if this issue persists (and is not rooted in my
ignorance).

I am not at all sure where iram_ makes sense. I may have to
dig into profiling, although I'm wondering how profiling function calls
will help to trace memory use…

regards,
Jonas

   [1]: http://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
   [2]: http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-Asm

Mike Giacomelli | 1 Jan 2013 08:52

RE: Opus codec developments

> Okay. A quick grep for iram_ showed that this is not used at all yet.


The actual defines are ICODE, ICONST, IBSS, and IDATA.
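
As a rough illustration of how such attributes are typically applied, here is a minimal sketch. It assumes the macros carry an _ATTR suffix in the Rockbox tree (check the headers in your checkout), and the buffer, table, and function names are made up for the example. The empty fallback definitions are only there so the sketch builds on a host compiler.

```c
#include <stdint.h>

/* On target these macros come from the Rockbox headers and expand to
 * section attributes. The empty fallbacks are an assumption so that the
 * sketch also builds on a host compiler. */
#ifndef IBSS_ATTR
#define IBSS_ATTR
#endif
#ifndef ICONST_ATTR
#define ICONST_ATTR
#endif
#ifndef ICODE_ATTR
#define ICODE_ATTR
#endif

#define FRAME_LEN 8 /* hypothetical size, chosen for the example */

/* Zero-initialised scratch buffer touched in the inner loop. */
static int32_t work_buf[FRAME_LEN] IBSS_ATTR;

/* Read-only table that is read once per sample. */
static const int16_t window[FRAME_LEN] ICONST_ATTR = {
    1, 2, 3, 4, 5, 6, 7, 8
};

/* Hot function: the attribute goes on the declaration. */
int32_t apply_window(const int16_t *in) ICODE_ATTR;

int32_t apply_window(const int16_t *in)
{
    int32_t sum = 0;
    for (int i = 0; i < FRAME_LEN; i++) {
        work_buf[i] = (int32_t)in[i] * window[i];
        sum += work_buf[i];
    }
    return sum;
}
```

IRAM is small, so only buffers and tables that profiling shows to be hot are worth tagging this way.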


Maybe someone else can help you with your ASM problems (I don't know coldfire).

Jonas Wielicki | 1 Jan 2013 11:42

Re: Opus codec developments

On 01.01.2013 08:52, Mike Giacomelli wrote:
> 
>> Okay. A quick grep for iram_ showed that this is not used at all yet.
> 
> The actual defines are ICODE, ICONST, IBSS, and IDATA.
I see -- I was thinking more of dynamic allocations as they are used in
the vorbis codec, where iram_alloc is used if I recall correctly.

Nils Wallménius | 2 Jan 2013 10:19

Re: Opus codec developments

On Tue, Jan 1, 2013 at 11:42 AM, Jonas Wielicki
<j.wielicki <at> sotecware.net> wrote:
> On 01.01.2013 08:52, Mike Giacomelli wrote:
>>
>>> Okay. A quick grep for iram_ showed that this is not used at all yet.
>>
>> The actual defines are ICODE, ICONST, IBSS, and IDATA.
> I see -- I was thinking more of dynamic allocations as they are used in
> the vorbis codec, where iram_alloc is used if I recall correctly.
>

Hi, I've done some of the optimization work on the opus codec in
rockbox. IRAM is used for most of the hot buffers in the celt part of
the codec, afaik. The way we did it was by profiling, looking at the
hot functions to see what buffers they were using, and then of course
benchmarking. celt is slower than silk, so we have been focusing on
that first. The part taking the most cycles in celt is the fft/imdct
at the moment. I have been contributing some patches with
optimizations in this area upstream and hope that soon we will be able
to merge in the upstream codec to rockbox. It also has some other
optimizations and other improvements that might help us.

Gcc doesn't handle the coldfire EMAC at all so you can only use the
special regs in asm, not bind them to c vars. Take a look at how this
is done in other functions, there are lots of them, an example is
MULT16_32_Q15 in the fixed_generic.h file you were editing.

Nils

Jonas Wielicki | 2 Jan 2013 10:37

Re: Opus codec developments

Hi Nils,

On 02.01.2013 10:19, Nils Wallménius wrote:
> I have been contributing some patches with
> optimizations in this area upstream and hope that soon we will be able
> to merge in the upstream codec to rockbox. It also has some other
> optimizations and other improvements that might help us.
That sounds truly great!

> Gcc doesn't handle the coldfire EMAC at all so you can only use the
> special regs in asm, not bind them to c vars. Take a look at how this
> is done in other functions, there are lots of them, an example is
> MULT16_32_Q15 in the fixed_generic.h file you were editing.
I've seen that function. The problem with the
multiply-and-accumulate function (MAC16_16 it's called, iirc) is that it
takes three arguments: the accumulation variable c and the two
multiplication operands a and b. The syntax for the ASM mnemonic is (give
or take a few characters):

    mac.l %dX, %dY, %RaccX

%RaccX is required to be an accumulation register. So the only way to
handle this is to move the variable c into the %RaccX, run the
mac-instruction, and copy it out again. This sounds horribly
inefficient and I doubt that there's still any benefit over just adding
and multiplying (I wish I had the PDF with the instruction timings at hand).

-- Jonas

Nils Wallménius | 2 Jan 2013 11:43

Re: Opus codec developments

Hi again

>> Gcc doesn't handle the coldfire EMAC at all so you can only use the
>> special regs in asm, not bind them to c vars. Take a look at how this
>> is done in other functions, there are lots of them, an example is
>> MULT16_32_Q15 in the fixed_generic.h file you were editing.
> I've seen that function. The problem with the
> multiply-and-accumulate function (MAC16_16 it's called, iirc) is that it
> takes three arguments: the accumulation variable c and the two
> multiplication operands a and b. The syntax for the ASM mnemonic is (give
> or take a few characters):
>
>     mac.l %dX, %dY, %RaccX
>
> %RaccX is required to be an accumulation register. So the only way to
> handle this is to move the variable c into the %RaccX, run the
> mac-instruction, and copy it out again. This sounds horribly
> inefficient and I doubt that there's still any benefit over just adding
> and multiplying (I wish I had the PDF with the instruction timings at hand).

Yes, this is how the EMAC is used. The accumulator registers can only
be used for accumulation, so the result must be copied into an a or d
reg before it can be used for anything else, which is also what
MULT16_32_Q15 does with movclr. This is still much faster than the C
code in this case, since the C code uses a series of 16*16 multiplications,
and likely better than the same thing coded with regular mul.l. This might
of course not be true for the 16*16 multiplication you are looking at;
the only way to know is to bench it.
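
To make the "series of 16*16 multiplications" concrete, here is a minimal sketch of what the generic C versions compute, written as plain functions. It follows my recollection of the macros in libopus' fixed_generic.h, so treat the exact form as an assumption and check the header in your tree. The function names and the 64-bit reference are made up for the example. A portable reference like this is also handy for checking that an asm replacement returns the same values before benching it.

```c
#include <stdint.h>

/* Generic form: split the 32-bit operand into a signed high half and an
 * unsigned low half, then do two 16*16 multiplies. Assumes >> on a
 * negative value is an arithmetic shift, as it is with gcc. */
static inline int32_t mult16_32_q15_generic(int16_t a, int32_t b)
{
    int32_t hi = (int32_t)a * (int16_t)(b >> 16);    /* signed * signed */
    int32_t lo = (int32_t)a * (int32_t)(b & 0xffff); /* signed * unsigned */
    /* hi * 2 overflows only for a == -32768 with b >> 16 == -32768. */
    return hi * 2 + (lo >> 15);
}

/* Reference: one wide multiply followed by the 15-bit shift. */
static inline int32_t mult16_32_q15_ref(int16_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 15);
}

/* Multiply-accumulate: c + a * b, with a 16*16 -> 32 bit product. */
static inline int32_t mac16_16(int32_t c, int16_t a, int16_t b)
{
    return c + (int32_t)a * b;
}
```

The two MULT16_32_Q15 variants agree exactly whenever the generic one does not overflow, so a mismatch against the reference points at the replacement under test rather than at rounding.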

Sections 3.7 and 3.8 in [1] have the instruction timings. Worth noting
is that there's a stall when using movclr after mac on the same
accumulator.

[1] cache.freescale.com/files/32bit/doc/ref_manual/MCF5249UM.pdf

Nils

Mike Giacomelli | 3 Jan 2013 02:15

Re: Opus codec developments

>The part taking the most cycles in celt is the fft/imdct at the moment. I have been contributing some patches with optimizations in this area upstream and hope that soon we will be able to merge in the upstream codec to rockbox. It also has some other optimizations and other improvements that might help us.


For what it's worth, I've been working on this. This link, http://pastie.org/4908106, indicates that 71% of the total runtime on PP is spent in the MDCT. This is insanely high and probably indicates that gcc is screwing up. Presumably it does not do much better on ARM9/11.

My first efforts are here:


Since then I've been working on writing the FFT butterfly functions in ARMv5 assembly. It should be possible to get a large speed up by exploiting our prior knowledge about the function arguments (in practice, only the 480 point FFT significantly impacts performance, thus we know the stride and loop counts in advance) and by using 16 bit packed instructions to implement the butterflies rather than the generic 32 bit ARM ones. Together these should give a very large speed up, I think.
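
The idea of fixing the stride and loop count ahead of time can be sketched in plain C. This is only an illustration under stated assumptions, not the assembly described above: the type, the constants BFLY_COUNT and TW_STRIDE, and the function names are all hypothetical, and a radix-2 butterfly is used for brevity. With the counts known at build time the compiler (or a hand-written routine) can unroll the loop fully, and 16-bit real/imaginary pairs fit in one 32-bit word, which is what packed 16-bit instructions operate on.

```c
#include <stdint.h>

typedef struct { int16_t r, i; } cpx16; /* hypothetical 16-bit complex sample */

#define BFLY_COUNT 2 /* hypothetical: butterflies per call, fixed at build time */
#define TW_STRIDE  1 /* hypothetical: twiddle stride, fixed at build time */

/* Q15 multiply with rounding. Assumes >> on a negative int32_t is an
 * arithmetic shift, as it is with gcc. */
static inline int16_t mul_q15(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + (1 << 14)) >> 15);
}

/* Radix-2 butterflies over data[0 .. 2*BFLY_COUNT-1]. The caller is
 * assumed to have scaled the input so the sums cannot overflow 16 bits. */
static void bfly2_fixed(cpx16 *data, const cpx16 *tw)
{
    cpx16 *lo = data;
    cpx16 *hi = data + BFLY_COUNT;

    for (int k = 0; k < BFLY_COUNT; k++) {
        cpx16 w = tw[k * TW_STRIDE];
        /* t = hi[k] * w (complex multiply in Q15) */
        int16_t tr = (int16_t)(mul_q15(hi[k].r, w.r) - mul_q15(hi[k].i, w.i));
        int16_t ti = (int16_t)(mul_q15(hi[k].r, w.i) + mul_q15(hi[k].i, w.r));

        hi[k].r = (int16_t)(lo[k].r - tr);
        hi[k].i = (int16_t)(lo[k].i - ti);
        lo[k].r = (int16_t)(lo[k].r + tr);
        lo[k].i = (int16_t)(lo[k].i + ti);
    }
}
```

Whether this beats the generic 32-bit code on a given target still has to be measured.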

Mike
