Stephen Anthony
2003-09-23 14:11:47 UTC
I am trying to read a PDF file and extract the text along with its
positional information. I have spent the past 2 hours searching the web
but have not found anything yet. I guess I could either read the PDF file
directly and extract that info, or convert the PDF file into another,
friendlier format like PS (more friendly?) and then extract the required
info from that. Any ideas on where I could find such software/source code?
Whatever solution I use needs to be incorporated into another software
package I am writing, so standalone converters are not much help unless
they are callable from the command line. A C/C++ solution would be ideal.
Please post any replies here and do not e-mail.
TIA.
There's an application for Linux called xpdf, which is a free PDF reader.
You can select areas of the page with a bounding box, and then paste into a
text editor. Whatever text was in the box will be pasted, including the
proper amount of spacing if the text was indented.
You lose any graphics and formatting such as bold, font size, etc., but at
least you can get the text.
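
Since you need something callable from the command line: the xpdf
distribution also includes a command-line tool, pdftotext, and its -layout
option tries to keep the physical layout of the page in the output. Below is
a rough sketch of calling it from C++ through popen(). It assumes a POSIX
system with pdftotext on the PATH; the function names (shell_quote,
build_pdftotext_command, extract_text) are made up for illustration. Note
that this gives you text with spacing preserved, not actual coordinates.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Quote a string for a POSIX shell by wrapping it in single quotes.
// An embedded single quote is written as '\'' (close, escaped quote, reopen).
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Build the command line. "-layout" asks pdftotext to keep the physical
// layout; the trailing "-" sends the text to stdout instead of a file.
std::string build_pdftotext_command(const std::string& pdf_path) {
    return "pdftotext -layout " + shell_quote(pdf_path) + " -";
}

// Run pdftotext and return everything it wrote to stdout.
// Assumption: POSIX popen()/pclose() are available and pdftotext is installed.
std::string extract_text(const std::string& pdf_path) {
    const std::string cmd = build_pdftotext_command(pdf_path);
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("could not start pdftotext");

    std::string text;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) text.append(buf, n);

    // A non-zero status means the shell or pdftotext reported a failure.
    if (pclose(pipe) != 0) throw std::runtime_error("pdftotext failed");
    return text;
}
```

Calling extract_text("report.pdf") (file name made up) would hand you the
page text as one string, which you can then split into lines and columns
yourself.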
Steve