Content extraction from PDF

Gekko71 · Post by **Gekko71** » Thu Nov 27, 2008 8:16 pm

I've got a doozy here folks, any help you could offer would be great.

I have been supplied a 478-page PDF-format product catalogue, containing over 3,500 different entries (each entry consists of a product name, product category, JPEG image, reference number, bar code and relevant product-feature text).

I need to create (or recreate) a database of all this information so the products can be uploaded to a website and searched online.

I can extract the images okay, but how can I also extract the text / barcode *next to each image* and ensure the image and relevant text are saved together?

I've worked with PDFs for years, but this is a first. Is there an application out there that someone knows of that might do the job, or do I have to find myself a really good programmer and write our own app?

fliptw · Post by **fliptw** » Thu Nov 27, 2008 10:21 pm

Is the catalogue made by your company?

Duper · Post by **Duper** » Thu Nov 27, 2008 11:32 pm

Foxit?

The Adobe suite may be able to do it with Dreamweaver, if you have access to it.

For those who haven't seen Foxit, it's a semi-free program that lets you read and create PDFs. MS Word 2007 has this function as well.

The Lion · Post by **The Lion** » Fri Nov 28, 2008 6:06 pm

Poppler is a PDF rendering library that includes utility programs to
extract text and images from PDFs, among other things. If these
utilities don't work the way you need them to, the Poppler library
may aid a programmer in writing one that does.

Notes: (1) Poppler is copylefted free software (GPL v2). (2) I have
no idea if/how these can be used on Windows.
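As a rough illustration of the utilities mentioned above, here is a minimal Python sketch that builds command lines for two of them, `pdftotext` and `pdfimages`, and runs them only if they are found on the PATH. The file names (`catalogue.pdf`, `catalogue.txt`, the `img` prefix) and the function names are assumptions made up for this example, not anything from the catalogue in question.

```python
import shutil
import subprocess
from pathlib import Path


def poppler_commands(pdf_path, out_dir):
    """Build command lines for two Poppler utilities (nothing is run here)."""
    out = Path(out_dir)
    return [
        # -layout asks pdftotext to keep the physical layout of the page
        ["pdftotext", "-layout", str(pdf_path), str(out / "catalogue.txt")],
        # -j writes embedded JPEG images out as .jpg files under the prefix
        ["pdfimages", "-j", str(pdf_path), str(out / "img")],
    ]


def run_all(pdf_path, out_dir):
    """Run each command, skipping any tool that is not installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in poppler_commands(pdf_path, out_dir):
        if shutil.which(cmd[0]) is None:
            print(f"{cmd[0]} not found, skipping")
            continue
        subprocess.run(cmd, check=True)
```

Calling `run_all("catalogue.pdf", "out")` would leave the text and images in separate files, so on its own this probably does not solve the problem of keeping each image together with its text.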

Duper · Post by **Duper** » Fri Nov 28, 2008 8:44 pm

Nice find Lion!

Gekko71 · Post by **Gekko71** » Sat Nov 29, 2008 3:59 am

fliptw wrote: Is the catalogue made by your company?

No, it was put together by a supplier a few years ago. Whoever created it didn't create any meta info / tags on the PDF, otherwise I could possibly export the data to XML.

Gekko71 · Post by **Gekko71** » Sat Nov 29, 2008 4:24 am

The Lion wrote: Poppler is a PDF rendering library that includes utility programs to
extract text and images from PDFs, among other things. If these
utilities don't work the way you need them to, the Poppler library
may aid a programmer in writing one that does.

Notes: (1) Poppler is copylefted free software (GPL v2). (2) I have
no idea if/how these can be used on Windows.

Very nice find, Lion. Unfortunately there's no Windows executable, and from the README file it looks like it's Linux-only too... which is a right bugger, 'cause it sounds like exactly the kind of thing I was looking for.

Anyone know of a WIN32 .exe version, or of another .exe that has comparable features?

Gekko71 · Post by **Gekko71** » Sat Nov 29, 2008 4:36 am

Duper wrote: Foxit?

The Adobe suite may be able to do it with Dreamweaver, if you have access to it.

For those who haven't seen Foxit, it's a semi-free program that lets you read and create PDFs. MS Word 2007 has this function as well.

Foxit is a nice find, Duper - thank you. Unfortunately it doesn't have the functionality I'm looking for.

I have been able to set up automated bookmarks based around text settings, but again, linking the text and images next to each particular heading and then saving them as a separate entity is proving impossible so far...

Jeff250 · Post by **Jeff250** » Sat Nov 29, 2008 5:04 am

Download and boot an Ubuntu Live CD. I believe that poppler-utils is installed by default, but otherwise install that package. Then run something like pdftotext or pdftohtml (does xml too). The alternative is to try to compile poppler-utils using cygwin or mingw32 or try to find someone masochistic enough to have already done this and posted the binaries on the interwebs. But it's probably easier to acquire some basic familiarity with Linux at this point.
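For anyone following along, a minimal Python sketch of the `pdftohtml` route might look like the one below. It builds the command line for XML mode and then counts the text and image elements on each page as a quick sanity check. The sample XML is a hand-written approximation of the kind of output `pdftohtml -xml` produces (pages containing `text` and `image` elements with position attributes); the exact attributes can vary between Poppler versions, so treat the shape as an assumption. The file and function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of pdftohtml's XML output
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<image top="100" left="50" width="120" height="120" src="catalogue-1_1.jpg"/>
<text top="110" left="200" width="300" height="16" font="0"><b>Widget A</b></text>
<text top="140" left="200" width="300" height="16" font="1">Ref 1001</text>
</page>
</pdf2xml>"""


def pdftohtml_xml_command(pdf_path, out_base):
    """Command line for pdftohtml in XML mode (nothing is run here)."""
    return ["pdftohtml", "-xml", str(pdf_path), str(out_base)]


def summarise(xml_string):
    """Return (page number, text count, image count) for each page."""
    root = ET.fromstring(xml_string)
    return [
        (page.get("number"), len(page.findall("text")), len(page.findall("image")))
        for page in root.iter("page")
    ]
```

If the image count per page roughly matches the number of products on that page, the XML is probably usable for matching text to images; if it does not, the catalogue may store its images in a way this approach will not handle well.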

Gekko71 · Post by **Gekko71** » Sat Nov 29, 2008 6:34 pm

Jeff250 wrote: Download and boot an Ubuntu Live CD. I believe that poppler-utils is installed by default, but otherwise install that package. Then run something like pdftotext or pdftohtml (does xml too). The alternative is to try to compile poppler-utils using cygwin or mingw32 or try to find someone masochistic enough to have already done this and posted the binaries on the interwebs. But it's probably easier to acquire some basic familiarity with Linux at this point.

Is it difficult to set up dual boot for 32-bit Windows XP and Ubuntu, Jeff? I have no experience with either dual boot or Linux. I like the idea of that particular combo, but I'm hesitant to do it as I've been having problems with the BIOS on my machine. (The joys of using older hardware...)

Jeff250 · Post by **Jeff250** » Sat Nov 29, 2008 7:19 pm

It is not, but just booting straight from the Live CD is probably easier. When you convert the PDF to XML, just copy the XML file to a flash drive or email it to yourself, and then you can see what you can do with it back in Windows when you reboot (though the XML output may not be helpful for what you need to do).
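One possible way to use the XML once it is back on Windows is sketched below in Python: each text run on a page is attached to the image whose vertical centre is closest to it. This is only a guess at a workable heuristic, assuming each product's text sits roughly level with its image, and assuming the XML has `page`, `image` and `text` elements with `top` and `height` attributes. The sample data and function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of the XML output
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<image top="100" left="50" width="120" height="120" src="catalogue-1_1.jpg"/>
<text top="110" left="200" width="300" height="16" font="0"><b>Widget A</b></text>
<text top="140" left="200" width="300" height="16" font="1">Ref 1001</text>
<image top="400" left="50" width="120" height="120" src="catalogue-1_2.jpg"/>
<text top="410" left="200" width="300" height="16" font="0"><b>Widget B</b></text>
</page>
</pdf2xml>"""


def centre(el):
    """Vertical centre of an element, from its top and height attributes."""
    return float(el.get("top", "0")) + float(el.get("height", "0")) / 2


def pair_text_with_images(xml_string):
    """Attach each text run to the vertically nearest image on the same page."""
    entries = []
    root = ET.fromstring(xml_string)
    for page in root.iter("page"):
        images = [
            {"page": page.get("number"), "src": img.get("src"),
             "centre": centre(img), "text": []}
            for img in page.findall("image")
        ]
        if not images:
            continue  # nothing to attach text to on this page
        for t in page.findall("text"):
            c = centre(t)
            nearest = min(images, key=lambda im: abs(im["centre"] - c))
            # itertext() also picks up text wrapped in <b> or <i> tags
            nearest["text"].append("".join(t.itertext()).strip())
        entries.extend(images)
    return entries
```

With the sample above, the first image ends up with "Widget A" and "Ref 1001" and the second with "Widget B". Page headers, footers and multi-column layouts would likely be mis-assigned by a rule this simple, so every entry would still need checking by hand.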

Gekko71 · Post by **Gekko71** » Wed Dec 17, 2008 1:06 am

Quick update and overdue thanks - we managed to extract the content and images in a workable format and uploaded them to our new website - bad news being I still need to check every entry manually for errors.

Many thanks to all those who contributed ideas. God willing, I won't have to do this again for a long time!
