Is it possible to determine whether a string, for example "Not Possible", exists in a PDF using CFPDF or another function in ColdFusion? If so, any suggestions on how to do this would be appreciated.
Thanks!
Yes, it is possible. In the following example, the two files (textFromPDF.cfm and myDDX.ddx) and the PDF ('myDoc.pdf') are in the same directory.
textFromPDF.cfm
<!--- Convert from PDF to text and search text --->
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cfset ddxfile = "#currentDir#myDDX.ddx">
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1= "#currentDir#myDoc.pdf">
<cfset outputStruct=StructNew()><!--- Coldfusion automatically saves the text as XML file --->
<cfset outputStruct.Out1="#currentDir#my_PDF_doc_as_text.xml">
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="myDDXVar">
<cfif myDDXVar.out1 is "successful"><!--- read the text --->
<cffile action="read" file="#currentDir#my_PDF_doc_as_text.xml" variable="my_PDF_doc_as_text">
</cfif>
Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>
myDDX.ddx
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="Doc1"/>
</DocumentText>
</DDX>
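One caveat about textFromPDF.cfm: the final search line runs even when the DDX step fails, and in that case my_PDF_doc_as_text is never defined. A minimal sketch of a guarded version follows. It assumes the variables myDDXVar and currentDir from the example above; searchText and position are names introduced here for illustration.

```cfml
<!--- Sketch only: reuses myDDXVar and currentDir from textFromPDF.cfm --->
<cfset searchText = "Not Possible">
<cfif structKeyExists(myDDXVar, "Out1") and myDDXVar.Out1 is "successful">
    <!--- The DDX step succeeded, so the XML output file should exist --->
    <cffile action="read" file="#currentDir#my_PDF_doc_as_text.xml" variable="my_PDF_doc_as_text">
    <cfset position = findNoCase(searchText, my_PDF_doc_as_text)>
    <cfoutput>
        <cfif position gt 0>
            Found "#searchText#" at position #position#.
        <cfelse>
            "#searchText#" does not occur in the extracted text.
        </cfif>
    </cfoutput>
<cfelse>
    <!--- The DDX step failed; dump the result to see the reported status --->
    <cfdump var="#myDDXVar#">
</cfif>
```

findNoCase returns 0 when there is no match, so a position greater than 0 means the string was found.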
BKBK,
Thanks for your help. You've gotten me off to a great start!
I keep getting a "DDX is invalid" error ("Check for invalid construct or restricted keywords"). Is it possible that your DDX is somehow malformed?
I suspect you made the same mistake I did in the beginning. Note that there is a space before the word coldfusion in:
"http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd"
Thanks! Actually I did end up eventually finding that space.
I think the issue I'm having is that I'm trying to do this from a binary file. I'm storing my PDFs in a database as BLOBs. I can successfully read them out of the database, but I'm having trouble incorporating the binary BLOB into the example above.
I think it would be something like
<cfset inputStruct.Doc1= "#ToString(query.pdfBinaryVariable)#">
But I can't get that to work out. Any thoughts?
I decided to just write the file locally to see if that would help. It successfully writes the file, and I can open the PDF with Adobe Reader.
When I run the code, it never writes the my_PDF_doc_as_text.xml file. The myDDXVar keeps reporting back failed.
When I dump myDDXVar, it says
failed: 0, Size: 0
Hmm, challenging construction! What about first writing the PDF to disk? I know it's inefficient, but let us first get a working example.
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cfset ddxfile = "#currentDir#myDDX.ddx">
<cffile action="write" file="#currentDir#myNewDoc.pdf" output="#pdfBinaryVariable#" >
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1= "#currentDir#myNewDoc.pdf">
<cfset outputStruct=StructNew()><!--- Coldfusion automatically saves the text as XML file --->
<cfset outputStruct.Out1="#currentDir#my_PDF_doc_as_text.xml">
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="myDDXVar">
<cfif myDDXVar.out1 is "successful"><!--- read the text --->
<cffile action="read" file="#currentDir#my_PDF_doc_as_text.xml" variable="my_PDF_doc_as_text">
<!---<cfdump var="#my_PDF_doc_as_text#">--->
</cfif>
Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>
BKBK,
Same result when I write the file to disk.
When I dump myDDXVar, it says
failed: 0, Size: 0
Can you open the newly created PDF? Is its content what you expected?
Yes. When I open the new PDF, it is exactly what I'm expecting to come out of the DB. I just can't get CF to spit out the text file.
Could you please show us your code?
I apologize! I should have done that first!
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cfset ddxfile = "#currentDir#myDDX.ddx">
<cffile action="write" file="#currentDir#myNewDoc.pdf" output="#query.pdf#" >
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1= "#currentDir#myNewDoc.pdf">
<cfset outputStruct=StructNew()><!--- Coldfusion automatically saves the text as XML file --->
<cfset outputStruct.Out1="#currentDir#my_PDF_doc_as_text.xml">
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="myDDXVar">
<br><cfoutput>#myDDXVar.Out1#</cfoutput><br>
<cfdump var="#myDDXVar#">
<cfif myDDXVar.Out1 is "successful"><!--- read the text --->
<cffile action="read" file="#currentDir#my_PDF_doc_as_text.xml" variable="my_PDF_doc_as_text">
<!---<cfdump var="#my_PDF_doc_as_text#">--->
</cfif>
Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>
My DDX file:
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="Doc1"/>
</DocumentText>
</DDX>
Interesting.
I tried using a different PDF: just a random PDF I had on my file system, not one that was coming out of the DB. Using the random PDF, the code worked perfectly.
Using what was coming out of the DB gets me the error. The PDF opens without an issue in Adobe Reader, though.
The random file has
PDF Producer: Adobe PDF Library 9.9
PDF Version: 1.6 (Acrobat 7.x)
The files out of the database have
PDF Producer: iText 2.0.2 (by lowagie.com)
PDF Version: 1.4 (Acrobat 5.x)
* I thought maybe it was a version issue. I tried reading the DB file and using cfpdf to write it back to the file system as a different version. Unfortunately, that failed as well.
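The producer and version comparison above can also be done from within ColdFusion rather than in Adobe Reader. A minimal sketch follows. It assumes currentDir is set as in the earlier examples and that myNewDoc.pdf is the file written from the database BLOB; pdfPath and pdfInfo are names introduced here for illustration. It only inspects the file and is not guaranteed to reveal why DDX processing fails.

```cfml
<!--- Sketch only: currentDir and myNewDoc.pdf come from the earlier examples --->
<cfset pdfPath = "#currentDir#myNewDoc.pdf">
<cfif isPDFFile(pdfPath)>
    <!--- Read the document properties into a struct and display them --->
    <cfpdf action="getinfo" source="#pdfPath#" name="pdfInfo">
    <cfdump var="#pdfInfo#" label="PDF properties">
<cfelse>
    <p>ColdFusion does not recognize this file as a valid PDF.</p>
</cfif>
```

Running this against both the working PDF and the one from the database gives a side-by-side view of what ColdFusion itself reports for each file.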
By the way, CF9 has an extracttext action on cfpdf!
<cfpdf action="read" source="pdfFile.pdf" name="mypdf">
<cfpdf
action="extracttext"
source= "mypdf"
pages = "*"
type = "xml"
destination = "#currentDir#testxml.xml" >
Paiz wrote:
By the way, CF9 has an extracttext action on cfpdf!
<cfpdf action="read" source="pdfFile.pdf" name="mypdf">
<cfpdf
action="extracttext"
source= "mypdf"
pages = "*"
type = "xml"
destination = "#currentDir#testxml.xml" >
Ahhh, there's the kind of efficiency we want! I went with DDX from memory, as I had used it a lot in a project. I honestly didn't think of 'extractText'! Thanks for bringing it in and lightening the load.
However, as I see it, the main problem remains how to get from the byte array in the database to the text file. What about something like this:
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cffile action="write" file="#currentDir#myNewDoc.pdf" output="#query.pdf#" >
<cfpdf action="extracttext" source="#currentDir#myNewDoc.pdf" name="txtFromPdf">
<p>Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",txtFromPdf)#</cfoutput></p>
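If the extraction step can fail for some PDFs, it may help to wrap it in cftry so the page reports the error instead of stopping. A minimal sketch follows. It assumes the same file name as above, and position is a name introduced here for illustration. The toString() call is a defensive assumption in case the extracted result is not a plain string.

```cfml
<!--- Sketch only: myNewDoc.pdf is the file written from the database BLOB --->
<cfset currentDir = getDirectoryFromPath(expandPath('*.*'))>
<cftry>
    <cfpdf action="extracttext" source="#currentDir#myNewDoc.pdf" name="txtFromPdf">
    <!--- findNoCase returns 0 when the search text is absent --->
    <cfset position = findNoCase("Not Possible", toString(txtFromPdf))>
    <cfoutput><p>Position of search text "Not Possible": #position#</p></cfoutput>
    <cfcatch type="any">
        <!--- Report the failure rather than halting the request --->
        <cfoutput><p>Text extraction failed: #cfcatch.message#</p></cfoutput>
    </cfcatch>
</cftry>
```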
BKBK,
Unfortunately, I think the issue is with the PDF, not the code. We use two different methods to generate the PDFs, depending on what we need. One of those methods is an AFP2PDF process, and it appears those PDFs are somehow corrupted. Adobe Reader opens them just fine, but internally they are somehow corrupted, similar to how a web browser is forgiving and will still display a web page with malformed HTML.
I got the code to work with other PDFs, just not the ones I need it to work on. I need to research how we generate those PDFs.
Thanks again for all your help!
That must feel like a bit of a downer, after your discovery of the extractText action!