In theory, yes, but identifying a paragraph in a PDF file is extremely tricky (and sometimes impossible).
I've developed tools for my customers that do similar things (extracting a certain range of words around a matching term, or even a whole sentence), so if you're interested in such a tool, feel free to contact me privately (try6767 at gmail.com) and we can discuss it further.
There are a couple of things you can do along these lines. The first is to set up Acrobat/Reader so that the highlighted text is copied into the comment for the highlight annotation. You'll need to set the corresponding Acrobat Commenting preference; the Preferences dialog is accessed from the Edit menu.
After this you can get all the text by creating a comment summary, or by writing a script to collect all the text into a CSV file that can then be opened in Excel. Moving this data into a CSV file or some other storage location requires Acrobat Pro. The easiest solution is to create the CSV as a file attachment. You can find more info here: https://www.pdfscripting.com/public/ExcelAndAcrobat.cfm
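As a rough illustration of the CSV step, here is a minimal sketch. The helper names (`csvField`, `commentsToCsv`) and the record shape are made up for this example; only the commented-out part at the bottom touches Acrobat, and it assumes the comments already contain the copied text.

```javascript
// Quote one CSV field: wrap it in double quotes and double any embedded quotes.
function csvField(value) {
    var text = String(value === undefined || value === null ? "" : value);
    return '"' + text.replace(/"/g, '""') + '"';
}

// Turn an array of {page, author, contents} records (a hypothetical shape)
// into CSV text with a header row.
function commentsToCsv(records) {
    var lines = [[csvField("Page"), csvField("Author"), csvField("Text")].join(",")];
    for (var i = 0; i < records.length; i++) {
        var r = records[i];
        lines.push([csvField(r.page), csvField(r.author), csvField(r.contents)].join(","));
    }
    return lines.join("\r\n");
}

// Inside Acrobat the records could be gathered from the document's comments
// and the result stored as a file attachment, roughly like this:
//
// var annots = this.getAnnots() || [];   // getAnnots returns null when there are none
// var records = [];
// for (var i = 0; i < annots.length; i++) {
//     records.push({ page: annots[i].page + 1,
//                    author: annots[i].author,
//                    contents: annots[i].contents });
// }
// this.createDataObject("comments.csv", commentsToCsv(records));
```

Every field is quoted so that commas, quotes and line breaks in the comment text don't break the columns when Excel opens the file.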
Thanks, I understand how to do this process manually. What I'm trying to find out is whether I can automate it: search for each term in my list and highlight its surrounding context without doing it by hand, so I can create an export that shows the context in which each term was used.
I have Acrobat XI Pro.
I'm thinking there would be a way to script it somehow using the redact tool while referencing the tags, or objects, of the document, but my JavaScript skills are minimal at best.
Please let me know if anyone else has any ideas.
I'd suggest you pay someone to do this for you but that's what I do for a living and I wouldn't accept the job because, unless you have a very, very, very, limited set of PDF files to search, I'd never be able to create an acceptable deliverable... and I've been at this for 20 years.
That said, a C-based plugin does have access to the structure information and could be used to develop this solution but you'd need to hire someone with those skills. I'd recommend Thom Parker who is on this thread.
Thank you for the info and perspective on what I'm looking to do, it's much appreciated.
Before I head down the bigger programming solution rabbit hole, I want to make sure I explore the capabilities of JavaScript fully.
Do you believe it would be possible to use the redact tool with a partial word and then set the character count to a couple hundred characters? I think this would accomplish something similar.
The saved search function also has some capabilities that could be useful. Is there a way to manipulate that output?
Thanks again, this is super helpful
If you're taking in PDF from random sources, you're going to get random results.
Joel hasn't mentioned the next biggest issue with this, which is performance. You could conceivably write a script that acquires all the words and their positions, and does the necessary analysis to determine correct order and document structure. But besides the fact that this is a horrendously difficult bit of programming, it would take a very long time to run.
I know because I've actually written this type of program before as a C++ plug-in, and it was slow. JS is 100 times slower, literally.
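For reference, the word-gathering step on its own might look roughly like the sketch below. `collectWords` and the record shape are made up for illustration; `numPages`, `getPageNumWords`, `getPageNthWord` and `getPageNthWordQuads` are the Acrobat Doc members it relies on. This only collects the raw material. The ordering and structure analysis, which is the hard part, is not attempted here.

```javascript
// Collect every word in a document together with its page and position.
// `doc` is expected to be an Acrobat Doc object (e.g. `this` in a script).
function collectWords(doc) {
    var words = [];
    for (var p = 0; p < doc.numPages; p++) {
        var count = doc.getPageNumWords(p);
        for (var w = 0; w < count; w++) {
            words.push({
                page: p,                                  // zero-based page number
                index: w,                                 // word index on that page
                text: doc.getPageNthWord(p, w, true),     // true = strip punctuation/whitespace
                quads: doc.getPageNthWordQuads(p, w)      // array of quads, 8 numbers each
            });
        }
    }
    return words;
}
```

Note that this makes two calls into Acrobat for every single word, so on a 1,000 page document the loop body runs hundreds of thousands of times before any analysis starts.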
Thanks for all this info and feedback.
Would anything above change if the documents were in a somewhat structured format, i.e. technical specifications, which are typically generated and organized by spec-writing software to begin with?
Yes. It would help... there'd be fewer heuristics for the page decomposition, but even so it's still a ton of work.
What about this idea:
Search, highlight, reference the markup location in the doc, and create an image at that location that's page width by x number of pixels tall.
These would then be compiled into a new doc, which could then be OCR'd and exported to Excel or used as a PDF?
Thank you all for the input
There are a bunch of problems with the OCR approach which I won't go into.
If your ultimate goal is automation, you don't want to use Acrobat anyway. There are developer toolkits out there that will do page decomposition and convert the PDF drawing instructions into 99.9% correct reading order and/or can read the structure tags to get it 100% correct. They are expensive but if you are doing this manually now, it'll pay for itself in no time.
The Datalogics PDF Java Toolkit is one such library. It's technology from Adobe, marketed by Datalogics.
I am using a version of the search, highlight, action that you put together. (thanks for this by the way)
My question is: would it be possible to easily modify this script to replace the redact annotation with a shape annotation such as a rectangle, and set its size properties? This would be done in lieu of highlighting, obviously, but my thought is that this box could be used to capture the underlying text much like a highlight would, while also capturing the additional context I'm after.
If the box could be located over the searched term and given a size of full page width by X height, it would do exactly what I'm looking for.
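A minimal sketch of the sizing part of that idea follows. The function names (`bandRect`, `squareAnnotProps`) are invented for this example, and it assumes the match rectangle and the page box are in the same coordinate space, which may not hold on rotated pages or pages with an offset crop box. It only computes the rectangle and the property object; whether the resulting box picks up any text is a separate question.

```javascript
// Build a rectangle [xLeft, yBottom, xRight, yTop] that spans the full page
// width and is `height` points tall, centered vertically on the match and
// clamped to the page. Min/max are used so the corner order of the inputs
// does not matter.
function bandRect(matchRect, pageBox, height) {
    var pageLeft = Math.min(pageBox[0], pageBox[2]);
    var pageRight = Math.max(pageBox[0], pageBox[2]);
    var pageBottom = Math.min(pageBox[1], pageBox[3]);
    var pageTop = Math.max(pageBox[1], pageBox[3]);
    var centerY = (matchRect[1] + matchRect[3]) / 2;
    var bottom = Math.max(pageBottom, centerY - height / 2);
    var top = Math.min(pageTop, centerY + height / 2);
    return [pageLeft, bottom, pageRight, top];
}

// Property object for a rectangle ("Square") annotation covering that band.
function squareAnnotProps(page, matchRect, pageBox, height) {
    return { type: "Square", page: page, rect: bandRect(matchRect, pageBox, height) };
}

// Inside Acrobat this could be used along these lines, where `hit` is an
// existing annotation marking the search match:
//
// var box = this.getPageBox("Crop", hit.page);
// this.addAnnot(squareAnnotProps(hit.page, hit.rect, box, 100));
```

For example, a match at `[100, 400, 150, 412]` on a letter-size page with a height of 100 gives a box from y = 356 to y = 456 across the whole page.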
Let me know if anyone thinks this would be a viable option.
That might actually be a viable solution. You could use the annotation as a sort of seed value to define a rectangle that is the width of the page and then a certain y value above and below the annotation, then extract all of the words on the page and detect which ones are within the larger rectangle. You won't necessarily get complete sentences and paragraphs... but you would get the context.
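The detection step described above could be sketched like this. The names (`quadsBounds`, `wordsInBand`) and the `{page, text, quads}` word records are hypothetical; the quads are assumed to be in the format Acrobat's `getPageNthWordQuads` returns (arrays of 8 numbers). A word counts as inside the band when its vertical center falls between the seed annotation's edges plus a margin above and below.

```javascript
// Bounding box [xMin, yMin, xMax, yMax] of a word's quads.
// Each quad is 8 numbers (four x,y corner pairs); corner order is not relied on.
function quadsBounds(quads) {
    var xMin = Infinity, yMin = Infinity, xMax = -Infinity, yMax = -Infinity;
    for (var q = 0; q < quads.length; q++) {
        for (var i = 0; i < 8; i += 2) {
            xMin = Math.min(xMin, quads[q][i]);
            xMax = Math.max(xMax, quads[q][i]);
            yMin = Math.min(yMin, quads[q][i + 1]);
            yMax = Math.max(yMax, quads[q][i + 1]);
        }
    }
    return [xMin, yMin, xMax, yMax];
}

// Return the text of all words on `page` whose vertical center lies within
// `margin` points above or below the seed rectangle. The band is the full
// page width, so x coordinates are not tested.
function wordsInBand(words, page, seedRect, margin) {
    var bandBottom = Math.min(seedRect[1], seedRect[3]) - margin;
    var bandTop = Math.max(seedRect[1], seedRect[3]) + margin;
    var hits = [];
    for (var i = 0; i < words.length; i++) {
        if (words[i].page !== page) continue;
        var b = quadsBounds(words[i].quads);
        var centerY = (b[1] + b[3]) / 2;
        if (centerY >= bandBottom && centerY <= bandTop) hits.push(words[i].text);
    }
    return hits.join(" ");
}
```

The words come out in whatever order they were supplied, which is not guaranteed to be reading order, so multi-column pages in particular may produce jumbled snippets.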
Thanks, is this a terribly difficult thing to do?
I really don't care much about full sentences or paragraphs but rather I just want to be able to "summarize" a 1,000 page pdf into 20-30 pages of "snippets" that reflect where and how my list of search terms are being used in the document. Being that the document will be a technical spec, the context of the use of the term will be easily identified in a very small amount of adjacent text.
It is also easy enough to dig deeper into the anomalies that are very clear in this type of summarized data.
I think the extraction function of the above method is already built into Acrobat's functionality, because it will grab the text within the "box" the same way as highlighting the text does. Once it's annotated this way, there are lots of easy ways to get these annotations into other usable formats such as Excel.
I just have no idea how to replace a highlight annotation with a specific sized box.
You're making a lot of assumptions there, which are not very well based.
Yes, it's fairly easy to take the coordinates of a word and enlarge them to a larger "box", but extracting the text within that box is not that simple at all. There's no relation (or very little relation) between what Acrobat can do on its own and what can be done using a script. That's not to say it's impossible, but it's not a trivial task, and there might be a lot of issues if you go beyond the line where the match was found.
I guess what I'm saying is that the script doesn't have to do any extraction.
For example, if I just draw a rectangle drawing markup on the PDF, the text within it gets populated into the markup annotation itself (as long as the preference is set correctly). In the comments area, all of that text is then visible and part of the markup. From that point on I can just export the markups/annotations to reassemble the summary and create my deliverable as needed.
Am I missing something on that?
Yes. When you create an annotation using a script you can't use that function that automatically copies the selected text into it.
That only works when you create a comment manually.
That is why I've developed a separate script that allows you to do it retroactively:
10-4 I’ll be in touch