1 Reply Latest reply on May 25, 2010 12:23 AM by try67

    How to extract ALL text on a line, including extra white space, from PDF?

    tlibson

      I have written my first scripts to parse through PDF files and locate the beginning and end of the text portions that I wish to copy/extract. For example, one of the scripts reads through PDF files and locates patterns that allow me to copy/extract a lengthy Title that spans several lines on the page. My remaining problem is that the text I wish to copy/extract sometimes has tabs or a lot of white space between the words, and I need to retain ALL of the white space between words. I can successfully use getPageNthWord(j, i, false) along with "Quad coordinate sets" to get one word at a time for all the words on the multiple lines, but it seems to skip over the tabs and extra white space between the words. In other words, I need to extract/copy a full line at a time, including all white space, instead of being limited to one word at a time.

       

      Can anyone help me with an algorithm to extract/copy the text, meaning not just the words but also the tabs and other extra white space characters? Is there some type of regular expression I should be using, or some additional parameter for getPageNthWord that would retain all the extra white space?

       

      Thanks in advance,

      Ted L
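
One possible approach, sketched here under assumptions rather than as a confirmed answer: as far as I know, getPageNthWord has no parameter that returns the original run of tabs or spaces, so the gaps would have to be estimated from the word quads. The helper names rebuildLine and wordExtent and the spaceWidth value are hypothetical and not part of the Acrobat API. The sketch also assumes the words passed in are already in reading order and on the same line, and it can only approximate the spacing with space characters; it cannot tell whether the original gap was a tab.

```javascript
// Hypothetical helper: rebuild one line of text from words and their
// horizontal extents, padding each gap with an estimated number of spaces.
//   words      - array of { text, left, right } in reading order
//   spaceWidth - assumed width of one space, in the same units as the quads
function rebuildLine(words, spaceWidth) {
  var line = "";
  for (var i = 0; i < words.length; i++) {
    if (i > 0) {
      // Horizontal distance between the previous word and this one.
      var gap = words[i].left - words[i - 1].right;
      // Always keep at least one space between words.
      var count = Math.max(1, Math.round(gap / spaceWidth));
      line += new Array(count + 1).join(" ");
    }
    line += words[i].text;
  }
  return line;
}

// Hypothetical glue code: read one word and its extent from a document.
// Uses the smallest and largest x value found in the quads, so it does
// not depend on the order of the corners inside each quad.
function wordExtent(doc, page, index) {
  var quads = doc.getPageNthWordQuads(page, index);
  var xs = [];
  for (var q = 0; q < quads.length; q++) {
    for (var k = 0; k < quads[q].length; k += 2) {
      xs.push(quads[q][k]); // even positions hold the x coordinates
    }
  }
  return {
    // bStrip = false keeps punctuation; trailing white space is trimmed
    // because the gap is rebuilt from the coordinates instead.
    text: doc.getPageNthWord(page, index, false).replace(/\s+$/, ""),
    left: Math.min.apply(null, xs),
    right: Math.max.apply(null, xs)
  };
}
```

Inside Acrobat, the idea would be to call wordExtent(this, j, i) for each word index already identified as belonging to the line, collect the results in an array, and pass that array to rebuildLine. The spaceWidth value has to be tuned by hand for the font and size used in the document.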