Federico Leva (Nemo) | 23 Jul 2011 09:45

Removing watermark/footer from JSTOR PDFs

Hello,
you might have heard about 
<http://arstechnica.com/tech-policy/news/2011/07/swartz-supporter-dumps-18592-jstor-docs-on-the-pirate-bay.ars>
We're now going to upload those ~19000 PDFs to the Internet Archive, but 
we need to remove a watermark first. Could you please suggest how to do 
it? Sadly, I don't know anything about PDF manipulation.
We tried pdfimages, which outputs one .pbm per page plus a .ppm (the 
footer/watermark); using ImageMagick to recombine the pages into a PDF 
compressed with LZM produced a file almost 3 times as big as the 
original, so I think it's better to edit the original PDF without 
converting it to other raster formats.
The PDF looks like this: http://p.defau.lt/?8I_tQEf0Q2SZpi9CJx6I8A
Apparently, we need to remove this image:
     /GxMWCL: 18 0 R, 187 x 248
In other PDFs it looks like this: http://p.defau.lt/?I1lqfJPL8ociEfOpvTfPaA
How can I do it?
Thank you,
     Federico

Martin Petricek | 30 Jul 2011 11:45

Re: Removing watermark/footer from JSTOR PDFs

You can try using pdfimages with the -j parameter, which will (if the 
images are stored with JPEG compression) save them as JPEG files, thus 
avoiding the recompression that increases the size.
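
A minimal sketch of this suggestion, assuming the pdfimages tool is 
installed and on the PATH. The helper names (build_pdfimages_command, 
extract_images) and the output layout are made up for illustration. Note 
that -j only affects images stored in DCT (JPEG) format; anything else 
is still written as PBM/PPM.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_root):
    """Return the argument list for `pdfimages -j <pdf> <image-root>`."""
    return ["pdfimages", "-j", str(pdf_path), str(output_root)]


def extract_images(pdf_path, output_dir):
    """Extract the images of one PDF into output_dir (hypothetical helper).

    Files are named <stem>-NNN.<ext> by pdfimages, where <stem> is the
    image root we pass in. Returns the sorted list of files produced.
    """
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")

    pdf_path = Path(pdf_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    root = output_dir / pdf_path.stem
    subprocess.run(build_pdfimages_command(pdf_path, root), check=True)

    prefix = pdf_path.stem + "-"
    return sorted(p for p in output_dir.iterdir() if p.name.startswith(prefix))
```

Calling extract_images("paper.pdf", "out") would then leave the 
extracted files in out/, where the watermark image can be picked out by 
its size.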

Or you can try writing a custom script for pdfedit that looks in the 
content stream. Depending on how the watermark appears in the content 
stream, this may be anything between easy and almost impossible (it 
depends on how hard it is for the script to distinguish the watermark 
images from the images you want to keep :)
But from the dump of the stream it looks like the watermark always has 
the same size, so it should be quite easy.
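
The idea can be sketched in plain Python (this is not a pdfedit script, 
and it uses no PDF library). It assumes the page's content stream has 
already been decompressed into text, that the image XObjects are known 
as a name -> (width, height) mapping, and that "187 x 248" in the dump 
means width x height. The function names are hypothetical.

```python
import re

# Size of the watermark image as reported in the dump quoted in this thread.
WATERMARK_SIZE = (187, 248)


def watermark_names(xobjects, size=WATERMARK_SIZE):
    """Return the names of image XObjects whose (width, height) equals size.

    xobjects maps a resource name without the leading slash, such as
    "GxMWCL", to a (width, height) pair.
    """
    return sorted(name for name, dims in xobjects.items() if tuple(dims) == size)


def strip_draw_operators(content, names):
    """Remove every `/Name Do` operator for the given names from the stream.

    The surrounding `q ... cm ... Q` is left in place; without the Do
    operator it draws nothing. Limitation: a string literal that happens
    to contain "/Name Do" would be altered too.
    """
    if not names:
        return content
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"(?<!\S)/(?:%s)\s+Do(?!\S)" % alternatives)
    return pattern.sub("", content)


# Made-up example: one page image and one watermark of the known size.
example_stream = (
    "q 612 0 0 792 0 0 cm /Im1 Do Q\n"
    "q 187 0 0 248 10 10 cm /GxMWCL Do Q\n"
)
example_xobjects = {"Im1": (2550, 3300), "GxMWCL": (187, 248)}
cleaned = strip_draw_operators(example_stream, watermark_names(example_xobjects))
```

Matching on size rather than on the name is the design choice here, 
because the resource name may differ from one PDF to another while the 
watermark dimensions appear to stay the same.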

Can you send me one of these documents (not to the list but to my 
email), so I can have a look at it without downloading the whole archive?

Martin Petricek


Federico Leva (Nemo) | 30 Jul 2011 11:57

Re: Removing watermark/footer from JSTOR PDFs

Thank you for your reply.

Martin Petricek, 30/07/2011 11:45:
> You can try using pdfimages with the -j parameter, which will (if the
> images are stored with JPEG compression) save them as JPEG files, thus
> avoiding the recompression that increases the size.

Yes, that's what we first thought.

> Or you can try writing a custom script for pdfedit that looks in the
> content stream. Depending on how the watermark appears in the content
> stream, this may be anything between easy and almost impossible (it
> depends on how hard it is for the script to distinguish the watermark
> images from the images you want to keep :)
> But from the dump of the stream it looks like the watermark always has
> the same size, so it should be quite easy.

Quite impossible, at least for me, yes. :-)
In fact, in the meantime they're being uploaded as extracted JPGs: 
<http://www.archive.org/search.php?query=subject%3A%22Philosophical+Transactions+of+the+Royal+Society%22>

> Can you send me one of these documents (not to the list but to my
> email), so I can have a look at it without downloading the whole archive?

I will. Anyway, the last archive in the torrent (11.7z) is only ~200 
KiB. :-)

Federico
