Nicolae Mihalache | 11 Jan 2011 06:45

Using a ByteBuffer instead of a ByteString?

Hello,

I recently started to use GPB, great software! :)

But I have noticed that in Java it is impossible to create a message
containing a "bytes" field without copying some buffers around. For
example, if I have an encoded message of 1MB with a few regular fields
and one big bytes field, decoding the message will make a copy of the
entire buffer instead of keeping a reference to it.

It is even worse when encoding: if I read some data from a file, it does
not seem possible to put it directly into a ByteString, so I first have
to make a byte[], then copy it into the ByteString, and when encoding it
makes yet another byte[].

So my question: is it possible to make an exception to immutability
for the "bytes" fields and use java.nio.ByteBuffer instead of
ByteString?

thanks,
nicolae

-- 
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to protobuf <at> googlegroups.com.
To unsubscribe from this group, send email to protobuf+unsubscribe <at> googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.

Evan Jones | 11 Jan 2011 21:25

Re: Using a ByteBuffer instead of a ByteString?

On Jan 11, 2011, at 0:45 , Nicolae Mihalache wrote:
> But I have noticed that in Java it is impossible to create a message
> containing a "bytes" field without copying some buffers around. For
> example, if I have an encoded message of 1MB with a few regular fields
> and one big bytes field, decoding the message will make a copy of the
> entire buffer instead of keeping a reference to it.

By "decoding" I'm assuming you mean deserializing the message from a  
file or something.

This is a disadvantage, but it makes things much easier: it means the  
buffer used to read data can be recycled for the next message. Without  
this copy, the library would need to do complicated tracking of chunks  
of memory to determine if they are "in use" or not.
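To make the recycling argument concrete, here is a minimal sketch in plain Java. It does not use the protobuf library: the class name RecycledBufferDemo and the helper readInto are hypothetical, and the byte array simply stands in for a parser's reusable read buffer. It shows what happens to a value that only references the buffer, compared with a value that was copied out, once the buffer is reused for the next message.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Protocol Buffers.
final class RecycledBufferDemo {
    // Simulates reading the next message into the one reusable read buffer.
    static void readInto(byte[] buffer, String message) {
        byte[] bytes = message.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(bytes, 0, buffer, 0, bytes.length);
    }

    // Returns {value seen through a reference, value seen through a copy}.
    static String[] run() {
        byte[] readBuffer = new byte[5];

        readInto(readBuffer, "first");
        byte[] referenced = readBuffer;                                // no copy: aliases the buffer
        byte[] copied = Arrays.copyOf(readBuffer, readBuffer.length);  // private copy

        readInto(readBuffer, "other");                                 // buffer recycled for the next message

        return new String[] {
            new String(referenced, StandardCharsets.US_ASCII),
            new String(copied, StandardCharsets.US_ASCII)
        };
    }

    public static void main(String[] args) {
        String[] result = run();
        System.out.println("referenced: " + result[0]);
        System.out.println("copied:     " + result[1]);
    }
}
```

The referenced value silently changes to the second message, while the copied value keeps the first. Avoiding the copy would therefore require tracking which regions of the buffer are still in use.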

However, now that you mention it: in the case of big buffers,  
CodedInputStream.readBytes() gets called, which currently makes 2  
copies of the data (it calls readRawBytes() then calls  
ByteString.copyFrom()). This could probably be "fixed" in  
CodedInputStream.readBytes(), which might improve performance a fair  
bit. I'll put this on my TODO list of things to look at, since I think  
my code does this pretty frequently.

> It is even worse when encoding: if I read some data from a file, it does
> not seem possible to put it directly into a ByteString, so I first have
> to make a byte[], then copy it into the ByteString, and when encoding it
> makes yet another byte[].

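The copy cannot be avoided because it makes the API simpler (thread
safety, don't need to worry about the ByteBuffer being accidentally
changed, etc). The latest version of Protocol Buffers in Subversion
has ByteString.copyFrom(ByteBuffer) which will do what you want
efficiently.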

Nicolae Mihalache | 11 Jan 2011 23:53

Re: Using a ByteBuffer instead of a ByteString?

On Jan 11, 9:25 pm, Evan Jones <ev... <at> MIT.EDU> wrote:
> This is a disadvantage, but it makes things much easier: it means the  
> buffer used to read data can be recycled for the next message. Without  
> this copy, the library would need to do complicated tracking of chunks  
> of memory to determine if they are "in use" or not.
I have read in several places that allocating new objects in Java,
rather than reusing them, is not so bad: the garbage collector is
smart enough to take care of it.

> However, now that you mention it: in the case of big buffers,  
> CodedInputStream.readBytes() gets called, which currently makes 2  
> copies of the data (it calls readRawBytes() then calls  
> ByteString.copyFrom()). This could probably be "fixed" in  
> CodedInputStream.readBytes(), which might improve performance a fair  
> bit. I'll put this on my TODO list of things to look at, since I think  
> my code does this pretty frequently.
OK, thanks.

>
> The copy cannot be avoided because it makes the API simpler (thread-
> safety, don't need to worry about the ByteBuffer being accidentally  
> changed, etc). The latest version of Protocol Buffers in Subversion  
> has ByteString.copyFrom(ByteBuffer) which will do what you want  
> efficiently.
>
I want to avoid copying data as much as possible (I'm aware it will
not be possible to eliminate it altogether).
I thought it wouldn't be so difficult to put an option in a message
definition that will make protoc generate ByteBuffer fields instead of
ByteString.

Kenton Varda | 12 Jan 2011 07:00

Re: Using a ByteBuffer instead of a ByteString?



On Mon, Jan 10, 2011 at 9:45 PM, Nicolae Mihalache <xpromache <at> gmail.com> wrote:
> Hello,
>
> I recently started to use GPB, great software! :)
>
> But I have noticed that in Java it is impossible to create a message
> containing a "bytes" field without copying some buffers around. For
> example, if I have an encoded message of 1MB with a few regular fields
> and one big bytes field, decoding the message will make a copy of the
> entire buffer instead of keeping a reference to it.

We are actually looking at fixing this by allowing ByteStrings to share buffers.
 
> It is even worse when encoding: if I read some data from a file, it does
> not seem possible to put it directly into a ByteString, so I first have
> to make a byte[], then copy it into the ByteString, and when encoding it
> makes yet another byte[].

ByteString provides multiple methods of construction.  One is to copy from a byte array.  Another is to use an OutputStream that writes into a ByteString.  In future versions, we are looking at making it possible to concatenate ByteStrings without a copy.

But yes, if you start with a byte[], and you want a ByteString with the same content, you are going to need to make a copy, because ByteString has to guarantee immutability.
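As an illustration of why the copy is needed, here is a minimal sketch of an immutable byte container in plain Java. ImmutableBytes is a hypothetical class written for this example; it is not ByteString and makes no claim about how the library is implemented. The method names copyFrom and asReadOnlyByteBuffer only echo the ones mentioned in this thread.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for an immutable byte container; not protobuf's ByteString.
final class ImmutableBytes {
    private final byte[] data;

    private ImmutableBytes(byte[] data) {
        this.data = data;
    }

    // Defensive copy: the caller keeps its array and may change it later.
    static ImmutableBytes copyFrom(byte[] source) {
        return new ImmutableBytes(source.clone());
    }

    // Copies the remaining bytes; duplicate() leaves the caller's position untouched
    // (a choice made for this sketch only).
    static ImmutableBytes copyFrom(ByteBuffer source) {
        byte[] copy = new byte[source.remaining()];
        source.duplicate().get(copy);
        return new ImmutableBytes(copy);
    }

    byte byteAt(int index) {
        return data[index];
    }

    int size() {
        return data.length;
    }

    // No copy needed in this direction: the view cannot write to the array,
    // and nobody else holds a reference to it.
    ByteBuffer asReadOnlyByteBuffer() {
        return ByteBuffer.wrap(data).asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        byte[] source = {1, 2, 3};
        ImmutableBytes bytes = ImmutableBytes.copyFrom(source);
        source[0] = 9; // later mutation by the caller
        System.out.println(bytes.byteAt(0));
        System.out.println(bytes.asReadOnlyByteBuffer().isReadOnly());
    }
}
```

If copyFrom kept the caller's array instead of cloning it, the write to source[0] would change the supposedly immutable value.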
 
> So my question: is it possible to make an exception to immutability
> for the "bytes" fields and use java.nio.ByteBuffer instead of
> ByteString?

No, sorry, making any exception to immutability would end up unraveling the whole library.  You can go from ByteString to ByteBuffer without a copy (by calling asReadOnlyByteBuffer()), but you can't go the other way, because there is no way to know given a ByteBuffer pointer whether or not someone might be able to modify it in the future.

Storing ByteBuffer in message objects directly has additional problems.  ByteBuffer is a stateful class -- it maintains a pointer to the current read location, for example.  So a protocol message object with ByteBuffers inside it would be thread-hostile no matter how you look at it.  This just leads to too many problems...
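Both points can be seen with java.nio alone. The sketch below uses a hypothetical class, ByteBufferStateDemo, and only standard ByteBuffer calls. The first method shows that a relative get() moves the buffer's position, so two readers sharing one instance interfere unless each takes its own duplicate(). The second shows that a read-only view still reflects later writes to the backing array, so "read-only" does not mean "immutable".

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; uses only standard java.nio calls.
final class ByteBufferStateDemo {
    // Returns {position of the shared buffer, position of its duplicate}.
    static int[] positions() {
        ByteBuffer shared = ByteBuffer.wrap(new byte[] {10, 20, 30, 40});
        shared.get();                          // relative read: position moves from 0 to 1
        ByteBuffer view = shared.duplicate();  // same content, independent position (starts at 1)
        view.get();                            // only the duplicate's position moves, to 2
        return new int[] {shared.position(), view.position()};
    }

    // Returns the first byte seen through a read-only view after the backing array changed.
    static byte readOnlyIsNotImmutable() {
        byte[] backing = {1, 2, 3};
        ByteBuffer readOnly = ByteBuffer.wrap(backing).asReadOnlyBuffer();
        backing[0] = 99;                       // whoever holds the array can still write to it
        return readOnly.get(0);                // absolute read; sees the new value
    }

    public static void main(String[] args) {
        int[] p = positions();
        System.out.println("shared position: " + p[0] + ", duplicate position: " + p[1]);
        System.out.println("read-only view sees: " + readOnlyIsNotImmutable());
    }
}
```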
