Yuras Shumovich | 21 Oct 02:02 2013

Extract text from PDF file: need testers

Hello,

I just uploaded a new version of the pdf-toolbox suite.

It now supports text extraction; see
http://hackage.haskell.org/package/pdf-toolbox-document-0.0.2.0/docs/Pdf-Toolbox-Document-Page.html#v:pageExtractText

A new library, pdf-toolbox-content, contains low-level tools for text
extraction. For example, one can extract glyphs with their exact
positions. It can be used, e.g., to implement text selection in a PDF
viewer (see screenshots).
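
To illustrate the text-selection idea: once each glyph comes with a
bounding box, selection reduces to a rectangle-overlap test. The sketch
below is only an illustration under stated assumptions. The Rect and
Glyph types and the selectGlyphs and selectedText functions are
hypothetical names made up for this example, not the
pdf-toolbox-content API, and the sample glyphs are hand-written rather
than extracted from a real file.

```haskell
module Main where

-- Hypothetical types for illustration only; NOT the pdf-toolbox-content API.

-- Axis-aligned rectangle: lower-left corner (x0, y0), upper-right (x1, y1).
data Rect = Rect Double Double Double Double
  deriving (Show, Eq)

-- A glyph paired with the box it occupies on the page.
data Glyph = Glyph
  { glyphChar :: Char
  , glyphBox  :: Rect
  } deriving (Show, Eq)

-- True when the two rectangles overlap with non-zero area.
intersects :: Rect -> Rect -> Bool
intersects (Rect ax0 ay0 ax1 ay1) (Rect bx0 by0 bx1 by1) =
  ax0 < bx1 && bx0 < ax1 && ay0 < by1 && by0 < ay1

-- Keep the glyphs whose boxes overlap the selection rectangle,
-- preserving their original order.
selectGlyphs :: Rect -> [Glyph] -> [Glyph]
selectGlyphs sel = filter (intersects sel . glyphBox)

-- The characters of the selected glyphs, in order.
selectedText :: Rect -> [Glyph] -> String
selectedText sel = map glyphChar . selectGlyphs sel

-- Hand-written sample: "Hello" laid out left to right,
-- each glyph 10 units wide and 12 units tall.
sampleGlyphs :: [Glyph]
sampleGlyphs =
  [ Glyph c (Rect x 0 (x + 10) 12)
  | (c, x) <- zip "Hello" [0, 10 ..]
  ]

main :: IO ()
main = putStrLn (selectedText (Rect 5 2 25 10) sampleGlyphs)
```

With the sample data, a selection spanning x = 5 to x = 25 overlaps the
first three glyph boxes, so the program prints "Hel". A real viewer
would also need to handle rotated text and reading order, which this
sketch ignores.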

Is anybody interested in that functionality? I tested it on all the PDF
files in my ~/Downloads, but there are a number of corner cases that are
not handled because I never saw them in the wild. So, if you are
interested, please try it out and report any issues. The easiest way is
to install pdf-toolbox-viewer (not on Hackage, see
https://github.com/Yuras/pdf-toolbox/tree/master/viewer ; it depends on
gtk2hs) and run it with the path to a PDF file as an argument. Or you can
just use the pageExtractText function directly:

import System.IO
import Pdf.Toolbox.Document

main =
  withBinaryFile "input.pdf" ReadMode $ \handle ->
    runPdfWithHandle handle knownFilters $ do
      pdf <- document
      catalog <- documentCatalog pdf
      rootNode <- catalogPageNode catalog

