Content extraction from PDF

Moderators: Krom, Grendel

Gekko71
DBB Captain
Posts: 761
Joined: Sun May 27, 2007 2:50 am
Location: Perth

Post by Gekko71 »

I've got a doozy here folks, any help you could offer would be great.

I have been supplied a 478-page PDF-format product catalogue containing over 3,500 different entries (each entry consists of a product name, product category, JPEG image, reference number, bar code and relevant product-feature text).

I need to create (or recreate) a database of all this information so the products can be uploaded to a website and searched online.

I can extract the images okay, but how can I also extract the text / barcode *next to each image* and ensure the image and relevant text are saved together?

I've worked with PDFs for years, but this is a first. Is there an application out there that someone knows of that might do the job, or do I have to find myself a really good programmer and write our own app?
fliptw
DBB DemiGod
Posts: 6459
Joined: Sat Oct 24, 1998 2:01 am
Location: Calgary Alberta Canada

Post by fliptw »

Is the catalogue made by your company?
Duper
DBB Master
Posts: 9214
Joined: Thu Nov 22, 2001 3:01 am
Location: Beaverton, Oregon USA

Post by Duper »

Foxit?

The Adobe suite may be able to do it with Dreamweaver, if you have access to it.

For those who haven't seen Foxit, it's a semi-free program that lets you read and create PDFs. MS Word 2007 has this function as well.
The Lion
DBB Ace
Posts: 197
Joined: Mon Apr 17, 2006 2:13 pm
Location: The Netherlands

Post by The Lion »

Poppler is a PDF rendering library that includes utility programs to
extract text and images from PDFs, among other things. If these
utilities don't work the way you need them to, the Poppler library
may aid a programmer in writing one that does.

Notes: (1) Poppler is copylefted free software (GPL v2). (2) I have
no idea if/how these can be used on Windows.
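
A rough sketch of how those utilities might be scripted, assuming poppler-utils is installed and on the PATH. The function names and the `page-0001` file naming are made up for illustration; `pdftotext` and `pdfimages` are the Poppler utilities themselves. Running one page at a time keeps each page's text and images under the same name prefix, which at least narrows down which text belongs with which image.

```python
from pathlib import Path
import shutil
import subprocess


def page_commands(pdf, out_dir, page):
    """Build the pdftotext and pdfimages commands for a single page.

    The page-NNNN naming scheme is an assumption for this sketch.
    """
    stem = Path(out_dir) / f"page-{page:04d}"
    page_range = ["-f", str(page), "-l", str(page)]
    # -layout asks pdftotext to keep the physical layout of the page
    text_cmd = ["pdftotext", *page_range, "-layout", str(pdf), f"{stem}.txt"]
    # -j asks pdfimages to write JPEG-encoded images as .jpg files
    image_cmd = ["pdfimages", *page_range, "-j", str(pdf), str(stem)]
    return text_cmd, image_cmd


def extract_pages(pdf, out_dir, first, last):
    """Run both tools page by page; needs poppler-utils on the PATH."""
    for tool in ("pdftotext", "pdfimages"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found; install poppler-utils first")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for page in range(first, last + 1):
        for cmd in page_commands(pdf, out_dir, page):
            subprocess.run(cmd, check=True)
```

Something like `extract_pages("catalogue.pdf", "out", 1, 478)` would then fill `out` with one text file and a set of image files per page. It does not pair an image with its neighbouring text inside a page; that still needs position information.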
Duper
DBB Master
Posts: 9214
Joined: Thu Nov 22, 2001 3:01 am
Location: Beaverton, Oregon USA

Post by Duper »

Nice find Lion!
Gekko71
DBB Captain
Posts: 761
Joined: Sun May 27, 2007 2:50 am
Location: Perth

Post by Gekko71 »

fliptw wrote: Is the catalogue made by your company?
No, it was put together by a supplier a few years ago. Whoever created it didn't add any metadata or tags to the PDF; otherwise I could possibly have exported the data to XML.
Gekko71
DBB Captain
Posts: 761
Joined: Sun May 27, 2007 2:50 am
Location: Perth

Post by Gekko71 »

The Lion wrote: Poppler is a PDF rendering library that includes utility programs to
extract text and images from PDFs, among other things. If these
utilities don't work the way you need them to, the Poppler library
may aid a programmer in writing one that does.

Notes: (1) Poppler is copylefted free software (GPL v2). (2) I have
no idea if/how these can be used on Windows.
Very nice find, Lion. Unfortunately there's no Windows executable, and from the README file it looks like it's Linux-only too... which is a right bugger, 'cause it sounds like exactly the kind of thing I was looking for.

Anyone know of a Win32 .exe version, or of another .exe that has comparable features?
Gekko71
DBB Captain
Posts: 761
Joined: Sun May 27, 2007 2:50 am
Location: Perth

Post by Gekko71 »

Duper wrote: Foxit?

The Adobe suite may be able to do it with Dreamweaver, if you have access to it.

For those who haven't seen Foxit, it's a semi-free program that lets you read and create PDFs. MS Word 2007 has this function as well.
Foxit is a nice find, Duper - thank you. Unfortunately it doesn't have the functionality I'm looking for.

I have been able to set up automated bookmarks based around text settings, but again, linking the text and images next to each particular heading and then saving them as a separate entity is proving impossible so far...
Jeff250
DBB Master
Posts: 6539
Joined: Sun Sep 05, 1999 2:01 am
Location: ❄️❄️❄️

Post by Jeff250 »

Download and boot an Ubuntu Live CD. I believe that poppler-utils is installed by default, but otherwise install that package. Then run something like pdftotext or pdftohtml (does xml too). The alternative is to try to compile poppler-utils using cygwin or mingw32 or try to find someone masochistic enough to have already done this and posted the binaries on the interwebs. But it's probably easier to acquire some basic familiarity with Linux at this point.
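
In case it helps, a minimal sketch of the conversion step, assuming `pdftohtml` from poppler-utils is on the PATH. The helper names and the file names are hypothetical; `-xml`, `-f` and `-l` are the pdftohtml options for XML output and first/last page.

```python
import shutil
import subprocess


def pdftohtml_xml_command(pdf, out_base, first=None, last=None):
    """Build a pdftohtml call that should write <out_base>.xml."""
    cmd = ["pdftohtml", "-xml"]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]   # last page to convert
    return cmd + [str(pdf), str(out_base)]


def convert(pdf, out_base, first=None, last=None):
    """Run the conversion and return the expected XML file name."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found; install poppler-utils")
    subprocess.run(pdftohtml_xml_command(pdf, out_base, first, last), check=True)
    return f"{out_base}.xml"
```

Trying it on a handful of pages first, e.g. `convert("catalogue.pdf", "catalogue", first=1, last=5)`, is a cheap way to see whether the output is usable before converting all 478.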
Gekko71
DBB Captain
Posts: 761
Joined: Sun May 27, 2007 2:50 am
Location: Perth

Post by Gekko71 »

Jeff250 wrote: Download and boot an Ubuntu Live CD. I believe that poppler-utils is installed by default, but otherwise install that package. Then run something like pdftotext or pdftohtml (does xml too). The alternative is to try to compile poppler-utils using cygwin or mingw32 or try to find someone masochistic enough to have already done this and posted the binaries on the interwebs. But it's probably easier to acquire some basic familiarity with Linux at this point.
Is it difficult to set up dual boot for 32-bit Windows XP and Ubuntu, Jeff? I have no experience with either dual boot or Linux. I like the idea of that particular combo, but I'm hesitant to do it as I've been having problems with the BIOS on my machine. (The joys of using older hardware...)
Jeff250
DBB Master
Posts: 6539
Joined: Sun Sep 05, 1999 2:01 am
Location: ❄️❄️❄️

Post by Jeff250 »

It is not, but just booting straight from the Live CD is probably easier. When you convert the PDF to xml, just copy the xml to a flash drive or email it to yourself or something like this, and then you can see what you can do with the xml file back in Windows when you reboot (the xml output may not be helpful to what you need to do).
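
As a sketch of what could be done with the xml afterwards: the sample below assumes the output has `page`, `image` and `text` elements carrying `top`, `left`, `width` and `height` attributes, which is what I'd expect from pdftohtml's xml mode, but check it against your actual file. The sample data and function names are invented. The idea is to attach each line of text to the image it sits closest to vertically.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed pdftohtml -xml layout
SAMPLE = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
  <image top="100" left="50" width="200" height="150" src="cat-1_1.jpg"/>
  <text top="110" left="270" width="180" height="14" font="0"><b>Widget A</b></text>
  <text top="130" left="270" width="120" height="14" font="1">Ref 1001</text>
  <image top="400" left="50" width="200" height="150" src="cat-1_2.jpg"/>
  <text top="410" left="270" width="180" height="14" font="0"><b>Widget B</b></text>
  <text top="430" left="270" width="120" height="14" font="1">Ref 1002</text>
</page>
</pdf2xml>"""


def vertical_gap(y, image):
    """Distance from y to the image's vertical span (0 if inside it)."""
    if image["top"] <= y <= image["bottom"]:
        return 0
    return min(abs(y - image["top"]), abs(y - image["bottom"]))


def pair_text_with_images(xml_string):
    """Return one dict per image with the text lines nearest to it."""
    entries = []
    for page in ET.fromstring(xml_string).iter("page"):
        images = []
        for img in page.iter("image"):
            top = int(img.get("top"))
            images.append({
                "page": int(page.get("number")),
                "src": img.get("src"),
                "top": top,
                "bottom": top + int(img.get("height")),
                "lines": [],
            })
        if not images:
            continue  # page without images: nothing to pair
        for text in sorted(page.iter("text"), key=lambda t: int(t.get("top"))):
            content = "".join(text.itertext()).strip()
            if content:
                y = int(text.get("top"))
                nearest = min(images, key=lambda im: vertical_gap(y, im))
                nearest["lines"].append(content)
        entries.extend(images)
    return entries
```

With the sample, `pair_text_with_images(SAMPLE)` gives two entries, each holding an image file name plus the name and reference lines beside it. A catalogue with several columns per page would need the `left` values taken into account as well.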
Gekko71
DBB Captain
Posts: 761
Joined: Sun May 27, 2007 2:50 am
Location: Perth

Post by Gekko71 »

Quick update and overdue thanks - we managed to extract the content and images in a workable format and uploaded them to our new website. The bad news is that I still need to check every entry manually for errors.
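
For anyone facing the same checking job, part of it could probably be automated. This is only a sketch, not what we actually ran: the field names are made up to match the catalogue entries described at the top of the thread, and it only flags empty fields and repeated reference numbers, so a manual pass would still be needed for wrong text.

```python
from collections import Counter

# Hypothetical field names, one per item in each catalogue entry
REQUIRED = ("name", "category", "image", "reference", "barcode", "features")


def find_problems(records):
    """Return (record index, message) pairs for suspicious entries."""
    problems = []
    ref_counts = Counter(str(r.get("reference", "")).strip() for r in records)
    for index, record in enumerate(records):
        for field in REQUIRED:
            if not str(record.get(field, "")).strip():
                problems.append((index, f"missing {field}"))
        ref = str(record.get("reference", "")).strip()
        if ref and ref_counts[ref] > 1:
            problems.append((index, f"duplicate reference {ref}"))
    return problems
```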

Many thanks to all those who contributed ideas. God willing, I won't have to do this again for a long time! :)