extracting data from PDF files

Discussion:

Stephen Anthony

2003-09-23 14:11:47 UTC

I am trying to read a PDF file and extract the text and its positional
information. I just spent the past 2 hours searching the web but have not
been able to find anything yet. I guess I could either read the PDF file
directly and extract that info, or convert the PDF file into another, more
friendly format like PS (more friendly?) and then extract the required info
from it. Any ideas on where I could find such software/source code? Whatever
solution I use needs to be incorporated into another software package I am
writing, so standalone conversion packages are not much help unless they are
callable from the command line. A C/C++ solution would be ideal. Please post
any replies here and do not e-mail. TIA.

There's an application for Linux called xpdf, which is a free PDF reader.
You can select areas of the page with a bounding box, and then paste into a
text editor. Whatever text was in the box will be pasted, including the
proper amount of spacing if the text was indented.

You lose any graphics and formatting such as bold, font size, etc., but at
least you can get the text.

Steve

Stephen Anthony

2003-09-23 23:08:04 UTC

Sorry, I didn't read all of your message before :) The following
command-line programs (both ship with Ghostscript) are available for
Linux, and possibly for Cygwin on Windows as well:

1) pdf2ps will convert a PDF file to a PostScript (PS) file
2) ps2ascii will convert the PS file to ASCII text

Note that graphics and formatting will be lost.

Depending on the license of your software, you can get the C/C++
source code for these applications.

Steve
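
Since the original question asks for something callable from C/C++, here is a minimal sketch of driving the two tools above from C++. It assumes a POSIX shell and that pdf2ps and ps2ascii are on the PATH; the helper names (shell_quote, build_pipeline, run_pipeline) are made up for illustration and are not part of either tool.

```cpp
#include <cstdlib>
#include <string>

// Quote a path for a POSIX shell: wrap it in single quotes and
// rewrite each embedded single quote as '\''.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') {
            out += "'\\''";
        } else {
            out += c;
        }
    }
    out += "'";
    return out;
}

// Build the command line: PDF -> PS with pdf2ps, then PS -> text with
// ps2ascii. The intermediate .ps path is chosen by the caller.
std::string build_pipeline(const std::string& pdf,
                           const std::string& ps,
                           const std::string& txt) {
    return "pdf2ps " + shell_quote(pdf) + " " + shell_quote(ps) +
           " && ps2ascii " + shell_quote(ps) + " " + shell_quote(txt);
}

// Run the pipeline through the shell. The value returned by std::system
// is implementation-defined; here 0 is treated as success (an assumption
// that holds on typical POSIX systems).
bool run_pipeline(const std::string& pdf,
                  const std::string& ps,
                  const std::string& txt) {
    const std::string cmd = build_pipeline(pdf, ps, txt);
    return std::system(cmd.c_str()) == 0;
}
```

A call such as run_pipeline("in.pdf", "tmp.ps", "out.txt") would leave the plain text in out.txt. As noted above, this route keeps only the text, not graphics or formatting.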

Press Ctrl-Alt-Del Now

2003-09-24 05:01:28 UTC

Thanks. I have already come across those utilities, plus more, but nothing
yet that gets the positional information (i.e. the x,y coordinates of the
text on the sheet in mm/inches/other). I need to know exactly where on the
page the text is located, with much more precision than the row/column
you'd get from the various to-text conversions. Looks like my best bet
right now is to convert to PS (easy to do) and then read the PS for that
info. It's going to be a bit of work, but I can't find anything that has
already done this, so I guess I will be breaking new ground...