Scott Gifford | 1 May 2012 05:43

Extracting PDF metadata and exploding pages

Hello,


We're working on an application that needs to shuffle around the pages in a PDF file.

Right now it uses a hodgepodge of different programs to manipulate the PDF, and if possible I'd like to have it just use Ghostscript.  That would simplify dependencies, and also simplify troubleshooting in the event something goes wrong.

First, it uses poppler's pdfinfo to extract metadata from the PDF, like this:

Title:          t10_4C
Creator:        Adobe Illustrator CS4
Producer:       Adobe PDF library 9.00
CreationDate:   Fri Dec 16 18:26:22 2011
ModDate:        Fri Dec 16 18:26:22 2011
Tagged:         no
Pages:          1
Encrypted:      no
Page size:      270 x 162 pts
File size:      955508 bytes
Optimized:      yes
PDF version:    1.4

Next, it splits a multi-page PDF into many single-page PDFs, with "pdftk burst".

After that it uses ghostscript to generate PNG thumbnails of each page.

The user then re-orders the pages in a Web UI using the thumbnails.  Finally, it puts them back together in a different order with ghostscript.

I have not been able to find a reasonable way to extract the PDF metadata with Ghostscript, or to "burst" a multi-page PDF document into many one-page PDF documents (well, I could use -dFirstPage and -dLastPage for every page, but that requires many calls to gs for a big document and is much, much slower than pdftk).
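For reference, the per-page workaround mentioned above could be sketched roughly as follows. This is only an illustration: the helper name `single_page_commands`, the output naming pattern, and the assumption that the page count is already known (for example from the "Pages:" line of pdfinfo) are not part of the original setup. It builds one Ghostscript command line per page, which is exactly why this approach gets slow on big documents.

```python
import subprocess


def single_page_commands(pdf_path, page_count, out_pattern="page-{:04d}.pdf", gs="gs"):
    """Build one pdfwrite command per page using -dFirstPage/-dLastPage.

    out_pattern and the helper name are illustrative assumptions.
    """
    commands = []
    for n in range(1, page_count + 1):
        commands.append([
            gs,
            "-dBATCH", "-dNOPAUSE", "-q",
            "-sDEVICE=pdfwrite",
            f"-dFirstPage={n}",
            f"-dLastPage={n}",
            f"-sOutputFile={out_pattern.format(n)}",
            pdf_path,
        ])
    return commands


def burst(pdf_path, page_count):
    # One full Ghostscript start-up per page: simple, but slow.
    for cmd in single_page_commands(pdf_path, page_count):
        subprocess.run(cmd, check=True)
```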

I would really like to be able to load the PDF file into ghostscript one time, extract the data I need, then convert the pages one at a time to individual PDF files then to PNGs.  Is it possible to drive ghostscript like this, having it do multiple operations on each page?

Thanks for any tips!

----Scott.

_______________________________________________
gs-devel mailing list
gs-devel <at> ghostscript.com
http://ghostscript.com/cgi-bin/mailman/listinfo/gs-devel
SaGS | 2 May 2012 20:58

Re: Extracting PDF metadata and exploding pages

----- Original Message ----- 
From: "Scott Gifford" <sgifford <at> suspectclass.com>
To: <gs-devel <at> ghostscript.com>
Sent: Tuesday, 1 May 2012 06:43
Subject: [gs-devel] Extracting PDF metadata and exploding pages

> ...
> First, it uses poppler's pdfinfo to extract metadata from the PDF, like
> this:
>
> Title:          t10_4C
> Creator:        Adobe Illustrator CS4
> Producer:       Adobe PDF library 9.00
> CreationDate:   Fri Dec 16 18:26:22 2011
> ModDate:        Fri Dec 16 18:26:22 2011
> Tagged:         no
> Pages:          1
> Encrypted:      no
> Page size:      270 x 162 pts
> File size:      955508 bytes
> Optimized:      yes
> PDF version:    1.4

Try Ghostscript's toolbin\pdf_info.ps. It may even be more suitable,
depending on what exact metadata you need. For example, 'Page size' above
is vague: different pages may have different sizes, and there are also
different 'boxes' for each page (MediaBox, CropBox, and others). If some
info you need is not already provided, you can modify pdf_info.ps with only
a little PostScript programming.

Another tool to try is pdftk, see its dump_data command.
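A rough sketch of how an application might call either tool follows. The assumptions are mine: that pdf_info.ps lives at toolbin/pdf_info.ps relative to where you run gs and takes the input file via -sFile=, and that pdftk's dump_data prints InfoKey/InfoValue line pairs plus a NumberOfPages line. Check both against your installed versions. The function names are made up for illustration.

```python
import subprocess


def pdf_info_command(pdf_path, script="toolbin/pdf_info.ps", gs="gs"):
    """Command line for running pdf_info.ps (script path is an assumption)."""
    return [gs, "-dNODISPLAY", "-q", f"-sFile={pdf_path}", script]


def parse_dump_data(text):
    """Parse 'pdftk IN.PDF dump_data' output into a dict.

    Assumes InfoKey/InfoValue pairs and a NumberOfPages line; lines
    without a colon (such as InfoBegin markers) are skipped.
    """
    result = {"Info": {}, "NumberOfPages": None}
    key = None
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if name == "InfoKey":
            key = value
        elif name == "InfoValue" and key is not None:
            result["Info"][key] = value
            key = None
        elif name == "NumberOfPages":
            result["NumberOfPages"] = int(value)
    return result


def dump_data(pdf_path):
    out = subprocess.run(["pdftk", pdf_path, "dump_data"],
                         check=True, capture_output=True, text=True)
    return parse_dump_data(out.stdout)
```

Splitting only on the first colon keeps values such as PDF date strings (which themselves contain a colon) intact.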

>
>
> Next, it splits a multi-page PDF into many single-page PDFs, with "pdftk
> burst".
>
> After that it uses ghostscript to generate PNG thumbnails of each page.

From your description it doesn't seem you *need* those one-page PDFs. 
Convert the original PDF to one-PNG-per-page in one go by using %d in 
Ghostscript's output filename. The %d gets replaced with the page number. If 
you prefer fixed width 0-padded numbers use something like %04d (yes, it's 
just C printf() formatting).
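As a minimal sketch of the one-PNG-per-page run, driven from Python: the png16m device, the 72 dpi resolution, the output name page-%04d.png, and the function names are illustrative choices of mine, not anything your application requires.

```python
import subprocess


def png_per_page_command(pdf_path, out_pattern="page-%04d.png", dpi=72, gs="gs"):
    """Build a single Ghostscript command that writes one PNG per page.

    Ghostscript itself expands the %04d in the output filename to the
    page number; Python passes the pattern through unchanged.
    """
    return [
        gs,
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-q",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def render_thumbnails(pdf_path, **kwargs):
    # Passing a list (no shell) avoids having to escape the % character.
    subprocess.run(png_per_page_command(pdf_path, **kwargs), check=True)
```

If you run the command from a Windows batch file instead, the % has to be doubled (%%04d).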

>
> The user then re-orders the pages in a Web UI using the thumbnails.

OK (that's your app).

> Finally, it puts them back together in a different order with ghostscript.

pdftk is more suitable for this task, out-of-the-box, see its cat command. 
For example 'pdftk IN.PDF cat 3 1 10-5 2 4 11-end output OUT.PDF' shuffles 
the first 10 pages and leaves the rest untouched.
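If the new page order comes out of your Web UI as a list of page numbers, turning it into a pdftk cat command could look something like the sketch below. The helper names and the `rest_from` parameter are hypothetical. The only pdftk features relied on are the ones in the example above: single pages, ascending or descending ranges, and 'end'.

```python
import subprocess


def page_ranges(order):
    """Collapse a page order into pdftk range tokens.

    Runs that step by +1 or -1 become 'a-b'; everything else stays a
    single page number.
    """
    tokens = []
    i = 0
    while i < len(order):
        j = i
        step = 0
        while j + 1 < len(order):
            d = order[j + 1] - order[j]
            if d not in (1, -1) or (step and d != step):
                break
            step = d
            j += 1
        tokens.append(str(order[i]) if i == j else f"{order[i]}-{order[j]}")
        i = j + 1
    return tokens


def pdftk_cat_command(in_pdf, out_pdf, order, rest_from=None):
    """Build 'pdftk IN cat ... output OUT'; rest_from appends 'N-end'."""
    tokens = page_ranges(order)
    if rest_from is not None:
        tokens.append(f"{rest_from}-end")
    return ["pdftk", in_pdf, "cat", *tokens, "output", out_pdf]


def reorder(in_pdf, out_pdf, order, rest_from=None):
    subprocess.run(pdftk_cat_command(in_pdf, out_pdf, order, rest_from), check=True)
```

Calling `pdftk_cat_command("IN.PDF", "OUT.PDF", [3, 1, 10, 9, 8, 7, 6, 5, 2, 4], rest_from=11)` rebuilds the command line from the example above.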

The page reordering can be done using Ghostscript alone, without fully 
interpreting the input file and generating a brand new output PDF, but this 
requires [a lot?] more PostScript programming and knowledge about PDF 
internals. Start with toolbin\pdfinflt.ps. This tool loads the input PDF 
without interpreting it (= without translating it to a series of drawing 
operations) then writes it out with the streams uncompressed. You can do 
some surgery on the PDF Page tree between loading the input and writing the 
output (and there's no need to suppress compressing the streams). A much 
more complex example is lib\pdfopt.ps, this one loads a PDF and writes it 
out linearised ('Web-optimised').

> ...
> I would really like to be able to load the PDF file into ghostscript one
> time, extract the data I need, then convert the pages one at a time to
> individual PDF files then to PNGs.  Is it possible to drive ghostscript
> like this, having it do multiple operations on each page?

Not multiple operations on each page, but I think it's possible to get the
metadata and the one-PNG-per-page output in a single execution of
Ghostscript. I haven't tried it, though; I can imagine ways of doing this,
but none of them is straightforward. In any case I don't think it's worth
the trouble: your only gain is that you start Ghostscript once instead of
twice, and most of the time is spent interpreting the PDF and generating
the PNGs.

> ...
