Andrzej Bialecki (JIRA | 12 May 05:01 2012

[jira] [Created] (LUCENE-4050) Change SegmentInfos format to plain text

Andrzej Bialecki  created LUCENE-4050:
-----------------------------------------

             Summary: Change SegmentInfos format to plain text
                 Key: LUCENE-4050
                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/codecs
            Reporter: Andrzej Bialecki 
             Fix For: 4.0

I propose to change the format of the SegmentInfos file (segments_NN) to use plain text instead of the current
binary format.

The SegmentInfos file represents a commit point, and it also declares which codecs were used to write each of
the segments that the commit point consists of. However, this is a chicken and egg situation - in theory the
format of this file is customizable via Codec.getSegmentInfosFormat, but in practice we first have to
discover which codec implementation wrote this file - so SegmentCoreReaders assumes a
certain fixed binary layout of a preamble of this file that contains the codec name... and then the file is
read again, only this time using the right Codec.

This is ugly. Instead I propose to use a simple plain text format, either line oriented properties or JSON,
in such a way that newer versions could easily extend it, and which wouldn't require any special Codec to
read and parse. Consequently we could remove SegmentInfosFormat altogether, and instead add
SegmentInfoFormat (notice the singular) to Codec to read single per-segment SegmentInfo-s in a
codec-specific way. E.g. for the Lucene40 codec we could either add another file or we could extend the .fnm
file (FieldInfos) to also contain this information. 

Then the plain text SegmentInfos would contain just the following information:

Robert Muir (JIRA | 3 Aug 15:35 2012

[jira] [Resolved] (LUCENE-4050) Make segments_NN file codec-independent


     [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-4050.
---------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 4.0)
                   4.0-ALPHA

This was never marked as resolved (the segments_N file is codec-independent now).

> Make segments_NN file codec-independent
> ---------------------------------------
>
>                 Key: LUCENE-4050
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 4.0-ALPHA
>
>
> I propose to change the format of SegmentInfos file (segments_NN) to use plain text instead of the current
> binary format.
> SegmentInfos file represents a commit point, and it also declares what codecs were used for writing each
> of the segments that the commit point consists of. However, this is a chicken and egg situation - in theory

Robert Muir (JIRA | 12 May 08:31 2012

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273858#comment-13273858
] 

Robert Muir commented on LUCENE-4050:
-------------------------------------

I agree this is a total mess. We should really revisit how we handle:

# commit file (in my opinion this should just be a list of segments! only!)
  currently segmentinfos stores a ton of stuff more than this, it stores
  per-segment metadata within this file when it really should not.
# per-segment metadata. In this case we have a lot of confusion with 
  segmentinfo and fieldinfo. It would be great for the codec to have more
  flexibility here, via abstract classes/interfaces+attributes or something
  that ensures it's lossless yet still a codec can add what it needs. Really
  for the most part segmentinfo is basically useless since many values actually
  return "well if you want to know this, then go look at the fieldinfos".
# actual commit strategy. We do a lot of funky stuff like writing fake bogus
  data, seeking backwards, etc. Why not just a normal atomic rename like
  any other computer program on the planet????
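
For reference, the "normal atomic rename" pattern can be sketched as below. This is only an illustration under assumptions: it uses the Java 7 java.nio.file API (which the code base discussed here may not be able to rely on), and the class name, method name and ".tmp" suffix are made up for the example. What happens when the target already exists is implementation specific with ATOMIC_MOVE, and the directory itself is not synced here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not Lucene code: write to a temporary name, then rename.
final class AtomicRenameCommitSketch {

  /** Writes contents to "finalName.tmp" in dir, syncs it, then renames it to finalName. */
  static void commit(Path dir, String finalName, byte[] contents) throws IOException {
    Path tmp = dir.resolve(finalName + ".tmp"); // temporary name is an assumption
    try (FileChannel ch = FileChannel.open(tmp,
        StandardOpenOption.CREATE,
        StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING)) {
      ByteBuffer buf = ByteBuffer.wrap(contents);
      while (buf.hasRemaining()) {
        ch.write(buf);
      }
      ch.force(true); // flush file data and metadata before the rename
    }
    // Throws AtomicMoveNotSupportedException if the filesystem cannot do this atomically.
    Files.move(tmp, dir.resolve(finalName), StandardCopyOption.ATOMIC_MOVE);
  }
}
```

Readers only ever see the final name once the file is complete, which is the whole appeal; the failure mode on Windows is that the move can throw if another process holds the file open.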

                

Michael McCandless (JIRA | 12 May 12:39 2012

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273921#comment-13273921
] 

Michael McCandless commented on LUCENE-4050:
--------------------------------------------

+1 to fully separate (separate files maybe?) the codec-neutral "list
of committed segments" from "the codec-specific details/metadata for
each segment".

Then, a codec can easily store its own stuff in the segment metadata.

And I agree the FieldInfo/SegmentInfo duality is confusing...

Plain text encoding of these files would be really nice but isn't as
important, I think... and will be a fair amount of work (I suspect we
need a JSON or YAML or something that represents lists, maps,
different native types, etc.).  I think this is separate / can come
later.

{quote}
We do a lot of funky stuff like writing fake bogus
data, seeking backwards, etc. Why not just a normal atomic rename like
any other computer program on the planet????
{quote}

In fact Lucene used to use rename to commit the segments file but this
proved problematic on Windows (sometimes the rename would hit "access

Robert Muir (JIRA | 12 May 12:47 2012

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273927#comment-13273927
] 

Robert Muir commented on LUCENE-4050:
-------------------------------------

{quote}
In fact Lucene used to use rename to commit the segments file but this
proved problematic on Windows (sometimes the rename would hit "access
denied" error).
{quote}

Well, problematic at least once right? I don't think it justifies doing
things a strange way.

Surely this is just some problem only on windows 3.1 and java 1.2 or
something and now fixed, since this is how every other linux/cygwin program
(e.g. vi) works.

                

Andrzej Bialecki (JIRA | 12 May 22:14 2012

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274064#comment-13274064
] 

Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

bq. Plain text encoding of these files would be really nice but isn't as important, I think...

Yeah, it could be sufficient if we agreed to separate the "plain list of segments:codec"
from the segmentInfo/fieldInfo parts and push those parts down to the codec-specific formats.

Then we could just use a version number as the first element of this file to allow for extensions in the
future, like e.g. switching to JSON or to some other format du jour.
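
To make the idea concrete, here is a minimal sketch of such a file: a version number on the first line, followed by one "segment:codec" pair per line. Everything in it (the class name, the line-oriented layout, the ':' separator) is an assumption for illustration, not an agreed format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a versioned, codec-independent list of segments.
final class SegmentListSketch {
  static final int VERSION = 1; // first element of the file

  /** Serializes the version line followed by one "segment:codec" line per segment. */
  static String write(Map<String, String> segmentToCodec) {
    StringBuilder sb = new StringBuilder();
    sb.append(VERSION).append('\n');
    for (Map.Entry<String, String> e : segmentToCodec.entrySet()) {
      sb.append(e.getKey()).append(':').append(e.getValue()).append('\n');
    }
    return sb.toString();
  }

  /** Parses the text, rejecting versions this reader does not understand. */
  static Map<String, String> read(String text) {
    String[] lines = text.split("\n");
    int version = Integer.parseInt(lines[0].trim());
    if (version != VERSION) {
      throw new IllegalArgumentException("unsupported version: " + version);
    }
    Map<String, String> result = new LinkedHashMap<>();
    for (int i = 1; i < lines.length; i++) {
      if (lines[i].isEmpty()) {
        continue; // tolerate blank lines
      }
      int sep = lines[i].indexOf(':');
      if (sep < 0) {
        throw new IllegalArgumentException("malformed line: " + lines[i]);
      }
      result.put(lines[i].substring(0, sep), lines[i].substring(sep + 1));
    }
    return result;
  }
}
```

A reader that sees a newer version number can fail fast, or dispatch to a different parser, without needing any Codec to get that far.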

bq. Surely this is just some problem only on windows 3.1 and java 1.2 or something and now fixed, since this is
how every other linux/cygwin program (e.g. vi) works.

I'm not so sure. I know for a fact that Windows doesn't allow renames or deletes of open files, no matter if
it's open by you or by some other process (e.g. user examining the file in Notepad.exe), and IIRC the issue
was that JVM doesn't release OS file handles quickly enough.


Andrzej Bialecki (JIRA | 13 May 16:14 2012

[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274271#comment-13274271
] 

Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

Discussing this further with Robert, it looks like this is a (smaller) part of a larger issue, in that
SegmentInfo+FieldInfo should be made extensible and the process of reading/writing this information
should be *completely codec-specific*. Let's make a separate issue for that part.

And the smaller issue discussed here is to record only the information about a commit point in a *completely
codec-independent, versioned format*, whatever that format is. Let's call it CommitInfo or whatever
other name fits. This part would be written to a file that is separate from the codec-dependent parts.

Regarding two-phase commit and checksums - one reason we have SegmentInfosWriter/Reader was the
AppendingCodec, because we couldn't make it work for append-only filesystems. However, we could change
the two-phase commit implementation to the following:

* write the data to the CommitInfo file
* write a marker indicating "end of data, checksum follows"
* finally, write the checksum

Then the reading code knows that:
* if there's a marker missing then the file is invalid
* if the marker is present then the checksum must be present too
* and the checksum must be correct.

This implementation doesn't require seek back / overwrite so it's supported on any filesystem.
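
A rough sketch of the append-only scheme described above, to show that both sides only ever move forward through the file. The marker value, the use of CRC32, the fixed-size trailer and all names are assumptions made for the example; the reader here locates the marker relative to the end of the file, which is one simple way to apply the three rules.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch, not Lucene's actual format: data, then marker, then checksum.
final class CommitInfoSketch {
  static final int END_MARKER = 0xC0FFEE01;  // arbitrary "end of data, checksum follows" value
  static final int TRAILER_LENGTH = 4 + 8;   // marker (int) + CRC32 value (long)

  /** Produces the file image strictly front to back: no seek, no overwrite. */
  static byte[] write(byte[] payload) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.write(payload);             // 1. the data
    out.writeInt(END_MARKER);       // 2. the marker
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    out.writeLong(crc.getValue());  // 3. the checksum
    out.flush();
    return bytes.toByteArray();
  }

  /** Returns the payload, or throws if the marker or checksum is missing or wrong. */
  static byte[] read(byte[] file) throws IOException {
    if (file.length < TRAILER_LENGTH) {
      throw new IOException("file too short: marker missing");
    }
    int dataLength = file.length - TRAILER_LENGTH;
    ByteBuffer trailer = ByteBuffer.wrap(file, dataLength, TRAILER_LENGTH);
    if (trailer.getInt() != END_MARKER) {
      throw new IOException("marker missing: incomplete commit");
    }
    long expected = trailer.getLong();
    CRC32 crc = new CRC32();
    crc.update(file, 0, dataLength);
    if (crc.getValue() != expected) {
      throw new IOException("checksum mismatch");
    }
    return Arrays.copyOf(file, dataLength);
  }
}
```

A file that was cut off anywhere before the last byte fails either the marker check or the checksum check, so a half-written commit is never mistaken for a valid one.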

Andrzej Bialecki (JIRA | 13 May 16:24 2012

[jira] [Updated] (LUCENE-4050) Make segments_NN file codec-independent


     [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-4050:
--------------------------------------

    Summary: Make segments_NN file codec-independent  (was: Change SegmentInfos format to plain text)

Changing the title to better reflect the scope of this issue.


Andrzej Bialecki (JIRA | 13 May 17:10 2012

[jira] [Updated] (LUCENE-4050) Make segments_NN file codec-independent


     [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated LUCENE-4050:
--------------------------------------

    Issue Type: Bug  (was: Improvement)

It's actually a bug - it's not possible to cleanly extend index format via Codec-s without addressing this issue.


Andrzej Bialecki (JIRA | 15 May 21:51 2012

[jira] [Assigned] (LUCENE-4050) Make segments_NN file codec-independent


     [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned LUCENE-4050:
-----------------------------------------

    Assignee: Robert Muir


Michael McCandless (JIRA | 15 May 23:01 2012

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276204#comment-13276204
] 

Michael McCandless commented on LUCENE-4050:
--------------------------------------------

bq.  However, we could change the two-phase commit implementation to the following:

I think that's a good solution?  It seems important to keep the non-codec-controlled write/read as simple
as possible...

The only small thing we lose is if a disk full is going to strike... today we write the 0s ahead (in
prepareCommit) so that we'll hit disk full during prepareCommit and not commit... but I think the chance
of those 4 bytes hitting the disk full is very low so the simpler code is better...


Andrzej Bialecki (JIRA | 16 May 01:15 2012

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276325#comment-13276325
] 

Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

bq. The only small thing we lose is if a disk full is going to strike... 

I thought about this too - if it's really a big concern we could use the following trick: more than 99% of
filesystems keep data in blocks that are multiples of 512 bytes. We could add filler bytes at the end of the
file so that it comes out to a round multiple of 512 B, and only then append the marker and the checksum. This
way we will know that writing the marker required allocation of a new block, and if it succeeded then writing
the checksum should also succeed.
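
The filler arithmetic is small enough to show. A sketch, assuming the 512-byte block size mentioned above; the class and method names are hypothetical.

```java
// Hypothetical sketch of the padding computation for the block-alignment trick.
final class BlockPaddingSketch {
  static final int BLOCK_SIZE = 512; // assumed filesystem block granularity

  /** Number of filler bytes needed so that dataLength becomes a multiple of BLOCK_SIZE. */
  static int fillerLength(long dataLength) {
    return (int) ((BLOCK_SIZE - dataLength % BLOCK_SIZE) % BLOCK_SIZE);
  }
}
```

For example, 1000 bytes of data need 24 filler bytes to reach 1024, and data that already ends on a block boundary needs none. The marker is then the first thing written into a fresh block.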


Marvin Humphrey (JIRA | 16 May 21:33 2012

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277041#comment-13277041
] 

Marvin Humphrey commented on LUCENE-4050:
-----------------------------------------

Ever considered using hard links instead of renaming?


Michael McCandless (JIRA | 17 May 02:13 2012

[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent


    [
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277281#comment-13277281
] 

Michael McCandless commented on LUCENE-4050:
--------------------------------------------

bq. Ever considered using hard links instead of renaming?

That's a neat option ... but I think it's only in Java 7 that we can create hard links
(java.nio.file.Files.createLink)?  And even then it's an optional operation...
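
For what it's worth, a hard-link based commit with that Java 7 call might look roughly like the sketch below. The class name, method name and the choice to report "unsupported" as a boolean are assumptions for illustration; a real implementation would also need a fallback path and would have to handle filesystems that fail with a plain IOException instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: publish a fully written file under its final name via a hard link.
final class HardLinkCommitSketch {

  /**
   * Links committed to the already written pending file, then removes the pending name.
   * Returns false if the filesystem does not support hard links (optional operation).
   */
  static boolean publish(Path pending, Path committed) throws IOException {
    try {
      // Fails with FileAlreadyExistsException if committed already exists.
      Files.createLink(committed, pending);
    } catch (UnsupportedOperationException e) {
      return false; // caller has to fall back to some other commit strategy
    }
    Files.delete(pending); // the data stays reachable through the new name
    return true;
  }
}
```

Unlike a rename, creating the link never replaces an existing file, so two writers racing for the same segments_N name cannot silently overwrite each other.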


