Favicon

GSoC - Theora multithread decoder

Hi all,

I apologize to not keep you up to date to what is going on with my project. Portavales has worked in a desk behind me and when we go to take coffee we talk about the project. Second I didn't know we have to discuss weekly, it was my fault. I should have read the rules. Sorry.

At the first month, I studied the code and the Theora Beta implementation. The code is completely different from Alpha and I have to be familiarized with the code.
After that I started doing tests with OpenMP.

One first test was 40% faster, but unfortunately it did not decode the frame correctly, three quarters was green.

I have one implementation decoding the Y, Cb and Cr planes in parallel. The OpenMP implementation was about 5% faster. Not worthless, since it does not require any great modifications.

I looked at Ralph's implementation and merged it to the current. The speed up was about 10% but the code have to be modified in many places.

Extract parallelism from the current implementation is very difficult. Coarse grain functions are the best functions to be parallelize to become the overhead worthwhile, but the current implementation has one, at most two. The parts that I suggested in my initial plan are fine grain functions, they spend a lot of cpu time but they are called too many times. The time spent to create and synchronize threads is greater than the speed up gains. We need functions that are called a few times and spend many cpu time. Also data dependency should be the lowest as possible.

According to the model that i did (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/implementation.pdf) the decoding time should be reduced in 33%, but it was just 10% for pthread an 5% for openMP.

I used a video with 1440x1080. The pthread implementation has 3 threads and the OpenMP was executed with the environment variable OMP_NUM_THREADS=3. The results are:

                   Real(s)         User(s)             System(s)             Speed up(%)
OpenMP      25.2             29.2                  1.8                       4
PThread       23.8             28.3     &n bsp;            1.0                       10
Current        26.2             26.0                   0.3                       0

I used an Intel(R) Core(TM)2 Quad CPU with 2.4GHz and RAM of 4GB. The video has 85 seconds.
These two implementations decode the Y, Cb and Cr planes in parallel, that is why I am using OMP_NUM_THREADS=3 and the upper bound gain is 33%, that is, let To be the time spent in decoding a video with the current implementation. Let T1 be a video decoded with the parallel implementation. T1 should be at most 0.66To.


I will use the pthread implementation to try a pipelined version and see if we obtain more gains.
These version will run the functions (c_dec_dc_unpredict_mcu_plane + oc_dec_frags_recon_mcu_plane) and
(oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in parallel. The upper bound for the gain is 60%, that is, let T2 be a video decoded with the pipelined implementation. T2 should be at most 0.4To.

Here is the branch for the OpenMP implementation: http://svn.xiph.org/branches/theora_multithread_decode_omp/
Here is the branch for the PThread implementation: http://svn.xiph.org/branches/theora_multithread_decode_pthread/





Again, sorry about the long time without any feedback.

--
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://lampiao.lsc.ic.unicamp.br/~piga
_______________________________________________
theora-dev mailing list
theora-dev <at> xiph.org
http://lists.xiph.org/mailman/listinfo/theora-dev
Favicon

Re: GSoC - Theora multithread decoder

This week I will work with the pipeline and by the end of this week I will send a report.


On Sun, Jul 6, 2008 at 9:39 PM, Leonardo de Paula Rosa Piga <lpiga <at> terra.com.br> wrote:
Hi all,

I apologize to not keep you up to date to what is going on with my project. Portavales has worked in a desk behind me and when we go to take coffee we talk about the project. Second I didn't know we have to discuss weekly, it was my fault. I should have read the rules. Sorry.

At the first month, I studied the code and the Theora Beta implementation. The code is completely different from Alpha and I have to be familiarized with the code.
After that I started doing tests with OpenMP.

One first test was 40% faster, but unfortunately it did not decode the frame correctly, three quarters was green.

I have one implementation decoding the Y, Cb and Cr planes in parallel. The OpenMP implementation was about 5% faster. Not worthless, since it does not require any great modifications.

I looked at Ralph's implementation and merged it to the current. The speed up was about 10% but the code have to be modified in many places.

Extract parallelism from the current implementation is very difficult. Coarse grain functions are the best functions to be parallelize to become the overhead worthwhile, but the current implementation has one, at most two. The parts that I suggested in my initial plan are fine grain functions, they spend a lot of cpu time but they are called too many times. The time spent to create and synchronize threads is greater than the speed up gains. We need functions that are called a few times and spend many cpu time. Also data dependency should be the lowest as possible.

According to the model that i did (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/implementation.pdf) the decoding time should be reduced in 33%, but it was just 10% for pthread an 5% for openMP.

I used a video with 1440x1080. The pthread implementation has 3 threads and the OpenMP was executed with the environment variable OMP_NUM_THREADS=3. The results are:

                   Real(s)         User(s)             System(s)             Speed up(%)
OpenMP      25.2             29.2                  1.8                       4
PThread       23.8             28.3     &n bsp;            1.0                       10
Current        26.2             26.0                   0.3                       0

I used an Intel(R) Core(TM)2 Quad CPU with 2.4GHz and RAM of 4GB. The video has 85 seconds.
These two implementations decode the Y, Cb and Cr planes in parallel, that is why I am using OMP_NUM_THREADS=3 and the upper bound gain is 33%, that is, let To be the time spent in decoding a video with the current implementation. Let T1 be a video decoded with the parallel implementation. T1 should be at most 0.66To.


I will use the pthread implementation to try a pipelined version and see if we obtain more gains.
These version will run the functions (c_dec_dc_unpredict_mcu_plane + oc_dec_frags_recon_mcu_plane) and
(oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in parallel. The upper bound for the gain is 60%, that is, let T2 be a video decoded with the pipelined implementation. T2 should be at most 0.4To.

Here is the branch for the OpenMP implementation: http://svn.xiph.org/branches/theora_multithread_decode_omp/
Here is the branch for the PThread implementation: http://svn.xiph.org/branches/theora_multithread_decode_pthread/





Again, sorry about the long time without any feedback.

--
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://lampiao.lsc.ic.unicamp.br/~piga



--
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://lampiao.lsc.ic.unicamp.br/~piga
_______________________________________________
theora-dev mailing list
theora-dev <at> xiph.org
http://lists.xiph.org/mailman/listinfo/theora-dev
Favicon

Re: GSoC - Theora multithread decoder

Leonardo de Paula Rosa Piga wrote:
> Coarse grain functions are the best functions to be parallelize to
> become the overhead worthwhile, but the current implementation has one,
> at most two. The parts that I suggested in my initial plan are fine

The reason the current decoder does this is cache coherency. The idea is
that only a few (16 to 32) rows need to be kept in L1/L2 cache between
each stage of the pipeline, which is a big reason the current decoder is
as fast as it is on high resolution content.

It's possible break this pipeline back up into separate stages that
operate on the entire frame at once (e.g., just make the MCU size the
height of the frame). You lose cache coherency, but get coarse-grained
parallelism. Only testing will determine which is the better strategy.

> the decoding time should be reduced in 33%, but it was just 10% for
> pthread an 5% for openMP.

The chroma coefficients are usually quantized much more coarsely, so
they very likely don't account for a full 33% of the decode time even on
a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the
blocks that are coded means fewer and smaller iDCTs, fewer invocations
of the loop filter, etc.

It's sad that OpenMP didn't do better... I was hoping with the option
available to them to do platform-specific tricks, they could cut down on
the overhead of pthreads, but I guess that stuff's just not "there" yet.

> These version will run the functions (c_dec_dc_unpredict_mcu_plane +
> oc_dec_frags_recon_mcu_plane) and
> (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in
> parallel. The upper bound for the gain is 60%, that is, let T2 be a
> video decoded with the pipelined implementation. T2 should be at most 0.4To.

I think you mean "at least". Let us know what your test results look
like (good or bad)! Keep in mind that, if possible, the same thread that
does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane on a set
of blocks should also be the one to do
oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same
set of blocks (and hopefully the scheduler doesn't muck things up by
moving the thread to a different physical CPU inbetween).
Favicon

Re: GSoC - Theora multithread decoder

Hi Timothy, below some new and good results.

On Mon, Jul 7, 2008 at 1:52 AM, Timothy B. Terriberry <tterribe <at> email.unc.edu> wrote:
Leonardo de Paula Rosa Piga wrote:
> Coarse grain functions are the best functions to be parallelize to
> become the overhead worthwhile, but the current implementation has one,
> at most two. The parts that I suggested in my initial plan are fine

The reason the current decoder does this is cache coherency. The idea is
that only a few (16 to 32) rows need to be kept in L1/L2 cache between
each stage of the pipeline, which is a big reason the current decoder is
as fast as it is on high resolution content.

It's possible break this pipeline back up into separate stages that
operate on the entire frame at once (e.g., just make the MCU size the
height of the frame). You lose cache coherency, but get coarse-grained
parallelism. Only testing will determine which is the better strategy.
You are right! You gave me a great tip. I did some tests for different MCU  size. The MCU size for the current implementation is 8.
For MCU size >= 16, PThread and OpenMP implementations produce the same results, that is, a speedup 13% on average. The time spend to thread communication was reduced.

I plotted three graphs to show these facts
One for Real Time vs MCU size. This graph shows that for MCU size >= 16 PThread and OpenMP implementations are equivalents.
(http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png)

The second graph compares the speedup and prove that for coarse grain functions we can achieve better results.
(http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png)

And to conclude the third graph. It was plotted the system time vs MCU size. For greater MCU size, lower is the system time. Because the thread communication overhead was reduced.



> the decoding time should be reduced in 33%, but it was just 10% for
> pthread an 5% for openMP.

The chroma coefficients are usually quantized much more coarsely, so
they very likely don't account for a full 33% of the decode time even on
a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the
blocks that are coded means fewer and smaller iDCTs, fewer invocations
of the loop filter, etc.

It's sad that OpenMP didn't do better... I was hoping with the option
available to them to do platform-specific tricks, they could cut down on
the overhead of pthreads, but I guess that stuff's just not "there" yet.
The results above show that it is not the case. For coarse grain jobs they are equivalent
 


> These version will run the functions (c_dec_dc_unpredict_mcu_plane +
> oc_dec_frags_recon_mcu_plane) and
> (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in
> parallel. The upper bound for the gain is 60%, that is, let T2 be a
> video decoded with the pipelined implementation. T2 should be at most 0.4To.

I think you mean "at least". Let us know what your test results look
like (good or bad)! Keep in mind that, if possible, the same thread that
does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane on a set
of blocks should also be the one to do
oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same
set of blocks (and hopefully the scheduler doesn't muck things up by
moving the thread to a different physical CPU inbetween).

_______________________________________________
theora-dev mailing list
theora-dev <at> xiph.org
http://lists.xiph.org/mailman/listinfo/theora-dev




--
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://lampiao.lsc.ic.unicamp.br/~piga
_______________________________________________
theora-dev mailing list
theora-dev <at> xiph.org
http://lists.xiph.org/mailman/listinfo/theora-dev
Favicon

Re: GSoC - Theora multithread decoder

I forgot to send the link for the last graph
(http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/systime.png)

On Mon, Jul 14, 2008 at 1:03 AM, Leonardo de Paula Rosa Piga <lpiga <at> terra.com.br> wrote:
Hi Timothy, below some new and good results.

On Mon, Jul 7, 2008 at 1:52 AM, Timothy B. Terriberry <tterribe <at> email.unc.edu> wrote:
Leonardo de Paula Rosa Piga wrote:
> Coarse grain functions are the best functions to be parallelize to
> become the overhead worthwhile, but the current implementation has one,
> at most two. The parts that I suggested in my initial plan are fine

The reason the current decoder does this is cache coherency. The idea is
that only a few (16 to 32) rows need to be kept in L1/L2 cache between
each stage of the pipeline, which is a big reason the current decoder is
as fast as it is on high resolution content.

It's possible break this pipeline back up into separate stages that
operate on the entire frame at once (e.g., just make the MCU size the
height of the frame). You lose cache coherency, but get coarse-grained
parallelism. Only testing will determine which is the better strategy.
You are right! You gave me a great tip. I did some tests for different MCU  size. The MCU size for the current implementation is 8.
For MCU size >= 16, PThread and OpenMP implementations produce the same results, that is, a speedup 13% on average. The time spend to thread communication was reduced.

I plotted three graphs to show these facts
One for Real Time vs MCU size. This graph shows that for MCU size >= 16 PThread and OpenMP implementations are equivalents.
(http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png)

The second graph compares the speedup and prove that for coarse grain functions we can achieve better results.
(http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png)

And to conclude the third graph. It was plotted the system time vs MCU size. For greater MCU size, lower is the system time. Because the thread communication overhead was reduced.



> the decoding time should be reduced in 33%, but it was just 10% for
> pthread an 5% for openMP.

The chroma coefficients are usually quantized much more coarsely, so
they very likely don't account for a full 33% of the decode time even on
a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the
blocks that are coded means fewer and smaller iDCTs, fewer invocations
of the loop filter, etc.

It's sad that OpenMP didn't do better... I was hoping with the option
available to them to do platform-specific tricks, they could cut down on
the overhead of pthreads, but I guess that stuff's just not "there" yet.
The results above show that it is not the case. For coarse grain jobs they are equivalent
 


> These version will run the functions (c_dec_dc_unpredict_mcu_plane +
> oc_dec_frags_recon_mcu_plane) and
> (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in
> parallel. The upper bound for the gain is 60%, that is, let T2 be a
> video decoded with the pipelined implementation. T2 should be at most 0.4To.

I think you mean "at least". Let us know what your test results look
like (good or bad)! Keep in mind that, if possible, the same thread that
does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane on a set
of blocks should also be the one to do
oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same
set of blocks (and hopefully the scheduler doesn't muck things up by
moving the thread to a different physical CPU inbetween).

_______________________________________________
theora-dev mailing list
theora-dev <at> xiph.org
http://lists.xiph.org/mailman/listinfo/theora-dev




--
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://lampiao.lsc.ic.unicamp.br/~piga



--
Leonardo de Paula Rosa Piga
Undergraduate Computer Engineering Student
LSC - IC - UNICAMP
http://lampiao.lsc.ic.unicamp.br/~piga
_______________________________________________
theora-dev mailing list
theora-dev <at> xiph.org
http://lists.xiph.org/mailman/listinfo/theora-dev

Re: GSoC - Theora multithread decoder

That's a good result. Congratulations Leonardo!

From my point of view, the first two charts
http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png
http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png
Shows that as the MCU increases, the OpenMP extra overhead is
amortized and OpenMP becomes as fast as the pthreads implementation.

The last chart
http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/systime.png

Shows that both pthreads and OpenMP overhead decreases as what seems
to be a logarithmic function of the MCU size.

This was a great experiment, and from what I can conclude, the OpenMP
implementation can be as good as the pthread.

Therefore maybe it is worth to work on a OpenMP implementation because
of the easiness of code maintenance.

What you guys think ?

Cheers,
Felipe

On Mon, Jul 14, 2008 at 1:04 AM, Leonardo de Paula Rosa Piga
<lpiga <at> terra.com.br> wrote:
> I forgot to send the link for the last graph
> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/systime.png)
>
> On Mon, Jul 14, 2008 at 1:03 AM, Leonardo de Paula Rosa Piga
> <lpiga <at> terra.com.br> wrote:
>>
>> Hi Timothy, below some new and good results.
>>
>> On Mon, Jul 7, 2008 at 1:52 AM, Timothy B. Terriberry
>> <tterribe <at> email.unc.edu> wrote:
>>>
>>> Leonardo de Paula Rosa Piga wrote:
>>> > Coarse grain functions are the best functions to be parallelize to
>>> > become the overhead worthwhile, but the current implementation has one,
>>> > at most two. The parts that I suggested in my initial plan are fine
>>>
>>> The reason the current decoder does this is cache coherency. The idea is
>>> that only a few (16 to 32) rows need to be kept in L1/L2 cache between
>>> each stage of the pipeline, which is a big reason the current decoder is
>>> as fast as it is on high resolution content.
>>>
>>> It's possible break this pipeline back up into separate stages that
>>> operate on the entire frame at once (e.g., just make the MCU size the
>>> height of the frame). You lose cache coherency, but get coarse-grained
>>> parallelism. Only testing will determine which is the better strategy.
>>
>> You are right! You gave me a great tip. I did some tests for different
>> MCU  size. The MCU size for the current implementation is 8.
>> For MCU size >= 16, PThread and OpenMP implementations produce the same
>> results, that is, a speedup 13% on average. The time spend to thread
>> communication was reduced.
>>
>> I plotted three graphs to show these facts
>> One for Real Time vs MCU size. This graph shows that for MCU size >= 16
>> PThread and OpenMP implementations are equivalents.
>> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png)
>>
>> The second graph compares the speedup and prove that for coarse grain
>> functions we can achieve better results.
>> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png)
>>
>> And to conclude the third graph. It was plotted the system time vs MCU
>> size. For greater MCU size, lower is the system time. Because the thread
>> communication overhead was reduced.
>>
>>>
>>>
>>> > the decoding time should be reduced in 33%, but it was just 10% for
>>> > pthread an 5% for openMP.
>>>
>>> The chroma coefficients are usually quantized much more coarsely, so
>>> they very likely don't account for a full 33% of the decode time even on
>>> a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the
>>> blocks that are coded means fewer and smaller iDCTs, fewer invocations
>>> of the loop filter, etc.
>>>
>>> It's sad that OpenMP didn't do better... I was hoping with the option
>>> available to them to do platform-specific tricks, they could cut down on
>>> the overhead of pthreads, but I guess that stuff's just not "there" yet.
>>
>> The results above show that it is not the case. For coarse grain jobs they
>> are equivalent
>>
>>>
>>>
>>> > These version will run the functions (c_dec_dc_unpredict_mcu_plane +
>>> > oc_dec_frags_recon_mcu_plane) and
>>> > (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in
>>> > parallel. The upper bound for the gain is 60%, that is, let T2 be a
>>> > video decoded with the pipelined implementation. T2 should be at most
>>> > 0.4To.
>>>
>>> I think you mean "at least". Let us know what your test results look
>>> like (good or bad)! Keep in mind that, if possible, the same thread that
>>> does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane on a set
>>> of blocks should also be the one to do
>>> oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same
>>> set of blocks (and hopefully the scheduler doesn't muck things up by
>>> moving the thread to a different physical CPU inbetween).
>>>
>>> _______________________________________________
>>> theora-dev mailing list
>>> theora-dev <at> xiph.org
>>> http://lists.xiph.org/mailman/listinfo/theora-dev
>>>
>>
>>
>>
>> --
>> Leonardo de Paula Rosa Piga
>> Undergraduate Computer Engineering Student
>> LSC - IC - UNICAMP
>> http://lampiao.lsc.ic.unicamp.br/~piga
>
>
> --
> Leonardo de Paula Rosa Piga
> Undergraduate Computer Engineering Student
> LSC - IC - UNICAMP
> http://lampiao.lsc.ic.unicamp.br/~piga
> _______________________________________________
> theora-dev mailing list
> theora-dev <at> xiph.org
> http://lists.xiph.org/mailman/listinfo/theora-dev
>
>

--

-- 
________________________________________
Felipe Portavales Goldstein <portavales at gmail>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://lampiao.lsc.ic.unicamp.br/~portavales/

Gmane