Can you find a string inside of a PDF using cfpdf in ColdFusion?

Report · Apr 09, 2012

Is it possible to determine whether a string, for example "Not Possible", exists in a PDF using CFPDF or another function inside of ColdFusion? If so, any suggestions on how to do this would be appreciated.

Thanks!

Report · Apr 10, 2012

Yes, it is possible. In the following example, the 2 files and the PDF ('myDoc.pdf') are in the same directory.

textFromPDF.cfm

</cfif>

Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>

myDDX.ddx

<?xml version="1.0" encoding="UTF-8"?>

<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">

</DocumentText>

</DDX>
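
A minimal sketch of how the two files typically fit together, not the code from the original post. It assumes the DDX's DocumentText element declares a result named Out1 and a PDF source named Doc1; `Doc1` and the path handling are assumptions, while `Out1`, `myDDXVar`, and `my_PDF_doc_as_text` are the names used elsewhere in this thread.

```cfml
<!--- Sketch only: Doc1 and the use of expandPath() are assumptions --->
<cfset ddxPath = expandPath("myDDX.ddx")>
<cfif isDDX(ddxPath)>
    <!--- Map the DDX source key to the input PDF --->
    <cfset inputStruct = structNew()>
    <cfset inputStruct.Doc1 = expandPath("myDoc.pdf")>

    <!--- Map the DDX result key to the XML file that receives the text --->
    <cfset outputStruct = structNew()>
    <cfset outputStruct.Out1 = expandPath("my_PDF_doc_as_text.xml")>

    <cfpdf action="processddx"
           ddxfile="#ddxPath#"
           inputfiles="#inputStruct#"
           outputfiles="#outputStruct#"
           name="myDDXVar">

    <!--- Shows whether the Out1 result succeeded or failed --->
    <cfoutput>#myDDXVar.Out1#</cfoutput>

    <!--- Read the extracted text only if the output file was actually written --->
    <cfif fileExists(outputStruct.Out1)>
        <cffile action="read" file="#outputStruct.Out1#" variable="my_PDF_doc_as_text">
    </cfif>
</cfif>
```

The keys of the input and output structs have to match the source and result names in the DDX, which is why a mismatch there is worth checking when the result reports a failure.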

Report · Apr 10, 2012

BKBK,

Thanks for your help. You've gotten me off to a great start!

I keep getting a "DDX is invalid" error: "Check for invalid construct or restricted keywords." Is it possible that your DDX is somehow malformed?

Report · Apr 10, 2012

I suspect you made the same mistake I did in the beginning. Note that there is a space before the word coldfusion in:

"http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd"

Report · Apr 10, 2012

Thanks! Actually I did end up eventually finding that space.

I think the issue I'm having is that I'm trying to do this from a binary file. I'm storing my pdfs in a database as blobs. I can successfully read them out of the database but I'm having issues incorporating the binary blob into the example above.

I think it would be something like

But I can't get that to work out. Any thoughts?

Report · Apr 10, 2012

I decided to just write the file locally to see if that would help. It successfully writes the file and I can open the PDF with Adobe Reader.

When I run the code, it never writes the my_PDF_doc_as_text.xml file. myDDXVar keeps reporting a failure.

When I dump myDDXVar it says

failed: 0, Size: 0

Report · Apr 10, 2012

Hmm, challenging construction! What about first writing the PDF to disk? I know it's inefficient, but let us first get a working example.

</cfif>

Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>

Report · Apr 10, 2012

BKBK,

Same result when I write the file to disk.

When I dump myDDXVar it says

failed: 0, Size: 0

Report · Apr 10, 2012

Can you open the newly created PDF? Is its content what you expected?

Report · Apr 10, 2012

Yes. When I open the new PDF, it is exactly what I'm expecting coming out of the DB. I just can't get CF to spit out the text file.

Report · Apr 10, 2012

Could you please show us your code?

Report · Apr 10, 2012

I apologize! I should have done that first!

<br><cfoutput>#myDDXVar.Out1#</cfoutput><br>

</cfif>

Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>

Report · Apr 10, 2012

My DDX file:

<?xml version="1.0" encoding="UTF-8"?>

<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">

</DocumentText>

</DDX>

Report · Apr 10, 2012

Interesting.

I tried using a different PDF: just a random PDF I had on my file system, not one that was coming out of the DB. Using the random PDF, the code worked perfectly.

Using what was coming out of the DB gets me the error. The PDF opens without an issue in Adobe Reader, though.

The random file has

PDF Producer: Adobe PDF Library 9.9

PDF Version: 1.6 (Acrobat 7.x)

The files out of the database have

PDF Producer: iText 2.0.2 (by lowagie.com)

PDF Version: 1.4 (Acrobat 5.x)

I thought maybe it was a version issue. I tried reading the DB file and using cfpdf to write it back to the file system as a different version. Unfortunately, that failed as well.

Report · Apr 10, 2012

By the way, CF9 has an extracttext action on cfpdf!

<cfpdf action="read" source="pdfFile.pdf" name="mypdf">
<cfpdf
    action="extracttext"
    source= "mypdf"
    pages = "*"
    type = "xml"
    destination = "#currentDir#testxml.xml" >

Report · Apr 10, 2012

Paiz wrote:
By the way, CF9 has an extracttext action on cfpdf!
<cfpdf action="read" source="pdfFile.pdf" name="mypdf">
<cfpdf
    action="extracttext"
    source= "mypdf"
    pages = "*"
    type = "xml"
    destination = "#currentDir#testxml.xml" >

Ahhh, there's the kind of efficiency we want! I went with DDX from memory, as I had used it a lot in a project. I honestly didn't think of 'extractText'! Thanks for bringing it in and lightening the load.

However, at least, as I see it, the main problem remains how to go from the byte array from the database to the text file. What about something like this:

<p>Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",txtFromPdf)#</cfoutput></p>
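
The step before that line might look like the sketch below. It is only an illustration under stated assumptions: the datasource `myDSN`, the table `documents`, the column `pdf_blob`, and the id value are all hypothetical, and it takes the "write to disk first" route discussed earlier in the thread rather than passing the byte array to cfpdf directly.

```cfml
<!--- Sketch only: datasource, table, column and id are hypothetical names --->
<!--- BLOB retrieval must be enabled for the datasource in the CF Administrator --->
<cfquery name="getPdf" datasource="myDSN">
    SELECT pdf_blob
    FROM documents
    WHERE id = <cfqueryparam value="1" cfsqltype="cf_sql_integer">
</cfquery>

<!--- Write the byte array to a temporary file so cfpdf can read it from disk --->
<cfset tempPdf = getTempDirectory() & createUUID() & ".pdf">
<cffile action="write" file="#tempPdf#" output="#getPdf.pdf_blob#">

<!--- Extract the text of every page into a variable (CF9 and later) --->
<cfpdf action="extracttext"
       source="#tempPdf#"
       pages="*"
       type="string"
       name="txtFromPdf">

<!--- Remove the temporary copy once the text is in memory --->
<cffile action="delete" file="#tempPdf#">
```

With `name` instead of `destination`, the extracted text stays in memory, so `findNoCase` can search `txtFromPdf` without a second file read.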

Report · Apr 11, 2012

BKBK,

Unfortunately, I think the issue is with the PDF, not the code. We use two different methods to generate the PDFs, depending on what we need. One of those methods is an AFP2PDF process, and it appears those PDFs are somehow corrupted. Adobe Reader opens them just fine, but internally they are somehow corrupted, similar to how a web browser is forgiving and will still display a web page with malformed HTML.

I got the code to work with other pdfs, just not the ones I need it to work on. I need to research how we generate those pdfs.
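
A rough way to sort the working files from the failing ones is sketched below. The path `suspect.pdf` is a hypothetical placeholder, and whether `isPDFFile` actually flags the AFP2PDF output is not known from this thread; the try/catch around the extraction is the more reliable signal.

```cfml
<!--- Sketch only: suspect.pdf is a hypothetical placeholder path --->
<cfset pdfPath = expandPath("suspect.pdf")>

<!--- Basic validity check; it may still return YES for files that fail later --->
<cfoutput>isPDFFile: #isPDFFile(pdfPath)#<br></cfoutput>

<cftry>
    <cfpdf action="extracttext"
           source="#pdfPath#"
           pages="*"
           type="string"
           name="txt">
    <cfoutput>Extracted #len(txt)# characters<br></cfoutput>
    <cfcatch type="any">
        <!--- Report the error instead of halting, so a batch of files can be compared --->
        <cfoutput>Extraction failed: #cfcatch.message#<br></cfoutput>
    </cfcatch>
</cftry>
```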

Thanks again for all your help!

Report · Apr 11, 2012

That must feel like a bit of a downer, after your discovery of the extractText action!
