Is it possible to determine whether a string, for example "Not Possible", exists in a PDF using CFPDF or another function in ColdFusion? If so, any suggestions on how to do this would be appreciated.
Thanks!
Yes, it is possible. In the following example, the 2 files and the PDF ('myDoc.pdf') are in the same directory.
textFromPDF.cfm
<!--- Convert from PDF to text and search text --->
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cfset ddxfile = "#currentDir#myDDX.ddx">
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1= "#currentDir#myDoc.pdf">
<cfset outputStruct=StructNew()><!--- ColdFusion automatically saves the text as an XML file --->
<cfset outputStruct.Out1="#currentDir#my_PDF_doc_as_text.xml">
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="myDDXVar">
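<cfif myDDXVar.out1 is "successful"><!--- read the text --->
<cffile action="read" file="#currentDir#my_PDF_doc_as_text.xml" variable="my_PDF_doc_as_text">
Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>
<cfelse><!--- my_PDF_doc_as_text is undefined when the DDX step fails, so report the status instead --->
DDX processing result: <cfoutput>#myDDXVar.out1#</cfoutput>
</cfif>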
myDDX.ddx
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="Doc1"/>
</DocumentText>
</DDX>
I suspect you made the same mistake I did in the beginning. Note that there is a space before the word coldfusion in:
"http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd"
Thanks! Actually I did end up eventually finding that space.
I think the issue I'm having is that I'm trying to do this from binary data. I'm storing my PDFs in a database as BLOBs. I can successfully read them out of the database, but I'm having trouble incorporating the binary BLOB into the example above.
I think it would be something like
<cfset inputStruct.Doc1= "#ToString(query.pdfBinaryVariable)#">
But I can't get that to work out. Any thoughts?
I decided to just write the file locally to see if that would help. It successfully writes the file, and I can open the PDF with Adobe Reader.
When I run the code, it never writes the my_PDF_doc_as_text.xml file. The myDDXVar variable keeps reporting "failed".
When I dump myDDXVar, it says
failed: 0, Size: 0
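One way to narrow down a "failed" result like this is to check each input on its own before calling processddx. The following is only a minimal sketch, not a tested fix. It assumes the DDX file and the locally written PDF sit in the same directory as the template; the variable name pdfFile and the file name myDoc.pdf are placeholders to replace with your own. isDDX() and isPDFFile() are built-in ColdFusion functions, and a PDF that opens in Adobe Reader may still be reported as invalid here.

```cfml
<!--- Sketch: validate the DDX file and the PDF separately before running processddx --->
<cfset currentDir = getDirectoryFromPath(expandPath('*.*'))>
<cfset ddxfile = "#currentDir#myDDX.ddx">
<!--- Placeholder path: point this at the PDF you wrote to disk --->
<cfset pdfFile = "#currentDir#myDoc.pdf">
<cfoutput>
DDX file exists: #fileExists(ddxfile)#<br>
<!--- "and" short-circuits, so isDDX() only runs when the file is present --->
DDX file is valid: #(fileExists(ddxfile) and isDDX(ddxfile))#<br>
PDF file exists: #fileExists(pdfFile)#<br>
PDF is valid according to ColdFusion: #(fileExists(pdfFile) and isPDFFile(pdfFile))#<br>
</cfoutput>
```

If the DDX checks pass and only the PDF check fails, that points at the document rather than at the DDX or the path handling.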
Hmm, challenging construction! What about first writing the PDF to disk? I know it's inefficient, but let us first get a working example.
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cfset ddxfile = "#currentDir#myDDX.ddx">
<cffile action="write" file="#currentDir#myNewDoc.pdf" output="#pdfBinaryVariable#" >
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1= "#currentDir#myNewDoc.pdf">
<cfset outputStruct=StructNew()><!--- ColdFusion automatically saves the text as an XML file --->
<cfset outputStruct.Out1="#currentDir#my_PDF_doc_as_text.xml">
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="myDDXVar">
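<cfif myDDXVar.out1 is "successful"><!--- read the text --->
<cffile action="read" file="#currentDir#my_PDF_doc_as_text.xml" variable="my_PDF_doc_as_text">
<!---<cfdump var="#my_PDF_doc_as_text#">--->
Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>
<cfelse><!--- my_PDF_doc_as_text is undefined when the DDX step fails, so report the status instead --->
DDX processing result: <cfoutput>#myDDXVar.out1#</cfoutput>
</cfif>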
I apologize! I should have done that first! Here is what I ran:
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cfset ddxfile = "#currentDir#myDDX.ddx">
<cffile action="write" file="#currentDir#myNewDoc.pdf" output="#query.pdf#" >
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1= "#currentDir#myNewDoc.pdf">
<cfset outputStruct=StructNew()><!--- ColdFusion automatically saves the text as an XML file --->
<cfset outputStruct.Out1="#currentDir#my_PDF_doc_as_text.xml">
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="myDDXVar">
<br><cfoutput>#myDDXVar.Out1#</cfoutput><br>
<cfdump var="#myDDXVar#">
<cfif myDDXVar.Out1 is "successful"><!--- read the text --->
<cffile action="read" file="#currentDir#my_PDF_doc_as_text.xml" variable="my_PDF_doc_as_text">
<!---<cfdump var="#my_PDF_doc_as_text#">--->
Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",my_PDF_doc_as_text)#</cfoutput>
</cfif>
My DDX file:
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="Doc1"/>
</DocumentText>
</DDX>
Interesting.
I tried using a different PDF, just a random PDF I had on my file system, not one that was coming out of the DB. With the random PDF, the code worked perfectly.
With the PDF that comes out of the DB, I get the error. That PDF opens without an issue in Adobe Reader, though.
The random file has
PDF Producer: Adobe PDF Library 9.9
PDF Version: 1.6 (Acrobat 7.x)
The file out of the database has
PDF Producer: iText 2.0.2 (by lowagie.com)
PDF Version: 1.4 (Acrobat 5.x)
* I thought maybe it was a version issue. I tried reading the DB file and using cfpdf to write it back to the file system as a different version. Unfortunately, that failed as well.
Paiz wrote:
By the way, CF9 has an extracttext action on cfpdf!
<cfpdf action="read" source="pdfFile.pdf" name="mypdf">
<cfpdf
action="extracttext"
source= "mypdf"
pages = "*"
type = "xml"
destination = "#currentDir#testxml.xml" >
Ahhh, there's the kind of efficiency we want! I went with DDX from memory, as I had used it a lot in a project. I honestly didn't think of 'extractText'! Thanks for bringing it in and lightening the load.
However, as I see it, the main problem remains how to get from the byte array in the database to the text file. What about something like this:
<cfset currentDir = getDirectoryFromPath(expandpath('*.*'))>
<cffile action="write" file="#currentDir#myNewDoc.pdf" output="#query.pdf#" >
<cfpdf action="extracttext" source="#currentDir#myNewDoc.pdf" name="txtFromPdf">
<p>Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",txtFromPdf)#</cfoutput></p>
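Because the snippet above always writes to the same myNewDoc.pdf, two requests running at the same time could overwrite each other's file. A possible variation, sketched here under assumptions, writes the bytes to a uniquely named file in ColdFusion's temp directory and deletes it afterwards. The variable names pdfBinary and tempPdf are made up for illustration, and the readbinary step only stands in for the database BLOB (query.pdf in the thread) so that the sketch can run without a database.

```cfml
<!--- Sketch: search a PDF held in memory by way of a uniquely named temp file --->
<cfset currentDir = getDirectoryFromPath(expandPath('*.*'))>
<!--- Stand-in for the database BLOB; in the thread this would be query.pdf --->
<cffile action="readbinary" file="#currentDir#myDoc.pdf" variable="pdfBinary">
<!--- getTempDirectory() returns a path with a trailing slash; createUUID() keeps the name unique --->
<cfset tempPdf = "#getTempDirectory()##createUUID()#.pdf">
<cffile action="write" file="#tempPdf#" output="#pdfBinary#">
<cfpdf action="extracttext" source="#tempPdf#" name="txtFromPdf">
<!--- Remove the temp file once the text is in memory (skipped if extraction throws) --->
<cfset fileDelete(tempPdf)>
<p>Position of search text "Not Possible": <cfoutput>#findNoCase("Not Possible",txtFromPdf)#</cfoutput></p>
```

A result of 0 means the string was not found; any other value is the position of the first match in the extracted text.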
BKBK,
Unfortunately, I think the issue is with the PDF, not the code. We use two different methods to generate the PDFs, depending on what we need. One of those methods is an AFP2PDF process, and it appears those PDFs are somehow corrupted. Adobe Reader opens them just fine, but internally they are somehow corrupted. It is similar to how a web browser is forgiving and will still display a web page that contains malformed HTML.
I got the code to work with other PDFs, just not the ones I need it to work on.
I need to research how we generate those PDFs.
Thanks again for all your help!
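For anyone who runs into the same situation, the sketch below shows one way to tell the PDFs ColdFusion can read from the ones it cannot, without the page failing outright. It is an illustration under assumptions rather than a verified fix: it reuses the myNewDoc.pdf path from the earlier examples, the flag name extractionOk is made up, and whether a given damaged PDF raises an exception or simply yields no text has not been confirmed here.

```cfml
<!--- Sketch: attempt text extraction and report a failure instead of halting the page --->
<cfset currentDir = getDirectoryFromPath(expandPath('*.*'))>
<cfset pdfFile = "#currentDir#myNewDoc.pdf">
<cfset txtFromPdf = "">
<cfset extractionOk = false>
<cftry>
    <cfpdf action="extracttext" source="#pdfFile#" name="txtFromPdf">
    <cfset extractionOk = true>
    <cfcatch type="any">
        <!--- cfcatch.message carries ColdFusion's own description of what went wrong --->
        <cfoutput><p>Extraction failed: #htmlEditFormat(cfcatch.message)#</p></cfoutput>
    </cfcatch>
</cftry>
<cfif extractionOk>
    <cfoutput><p>Position of search text "Not Possible": #findNoCase("Not Possible",txtFromPdf)#</p></cfoutput>
</cfif>
```

Run against a batch of files, a check like this could help confirm whether only the AFP2PDF output is affected.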