4 Replies Latest reply: Feb 5, 2011 7:22 AM by CtDave

    Font don't show after OCR'ed pdf is edited

    asdfabcedasf Community Member

      I am trying to edit an OCR'ed PDF (https://sites.google.com/site/sharedacrobat/data/edit_font_ocr.pdf?attredirects=0&d=1).

       

      For example, "Example 6.1" is incorrectly OCR'ed as "Example 6.I". I selected the TouchUp Text tool and typed "1" (the selected font was Times New Roman). But when I copy the text after leaving the TouchUp Text tool, the new character comes out as something that cannot be displayed.

       

      I'm wondering what the problem is. Is there a way to correct "I" to "1" for this pdf document?

        • 1. Re: Font don't show after OCR'ed pdf is edited
          CtDave CommunityMVP

          Trying to edit the hidden text layer of OCR is very time intensive. "Try" is the operative word.


          Typically, better results come from using Acrobat 8's Formatted Text & Graphics or Acrobat 9/X ClearScan.

           

          Acrobat OCR with Searchable Image or Searchable Image (Exact) provides a text output that uses text rendering mode 3.
          Which is to say, the text is "invisible" or "hidden".
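          Rendering mode 3 is set in a page's content stream by the Tr operator ("3 Tr"). As a rough illustration, the sketch below scans an already-decoded content stream for that operator. The function names and the sample stream are made up for this example, and the regex is a simplification: it does not parse PDF syntax, so a "3 Tr" that happens to appear inside a string literal would also be counted.

```python
import re

# Matches the Tr (set text rendering mode) operator with a single-digit operand.
# Simplification: this is a plain byte search, not a real PDF tokenizer.
TR_OPERATOR = re.compile(rb"(?<![\w.])(\d)\s+Tr\b")


def rendering_modes(content_stream: bytes) -> list[int]:
    """Return every text rendering mode set in a decoded content stream."""
    return [int(m.group(1)) for m in TR_OPERATOR.finditer(content_stream)]


def has_invisible_text(content_stream: bytes) -> bool:
    """True if the stream switches to mode 3 (neither fill nor stroke)."""
    return 3 in rendering_modes(content_stream)


# Hypothetical content stream, roughly what a hidden OCR text run might look like.
SAMPLE_STREAM = b"BT /F1 10 Tf 3 Tr 72 700 Td (Example 6.I) Tj ET"

if __name__ == "__main__":
    print(rendering_modes(SAMPLE_STREAM))
    print(has_invisible_text(SAMPLE_STREAM))
```

          Real page streams are usually compressed, so they would have to be decoded before a scan like this could see the operator.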

          To view the OCR output characters, use Acrobat's Examine Document feature.
          A dialog will be presented that lets you preview the hidden text of the OCR.
          Drag the Examine Document dialog off the PDF page.
          When you preview, the associated dialog window will then also be clear of the page.
          Visually compare the page image with the previewed hidden text.

          As often as not, a "selection" of the hidden text does not really select the characters that correspond to the character images on the page.
          This is why, after your edit, the OCR output behind the image of "6.1" is now 6.⸀l (as displayed in Examine Document's hidden text preview).

          I see that your PDF has many symbols. Many of these were not captured by the OCR (to be expected).

          You can make something of an educated guess about what you select. Make your edit. Save the PDF.
          Use Examine Document > preview hidden text to locate and view.
          A lot of trial and error. Save the working file often using an incremented extension to the filename (like "filename_001" "....002", etc).
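          The incremented-filename habit is easy to automate outside Acrobat. A minimal sketch, assuming a three-digit "_NNN" suffix placed before the extension; the helper name next_working_copy is made up for this example, and it only computes the next name without copying anything.

```python
import re
from pathlib import Path


def next_working_copy(path: Path) -> Path:
    """Return the next incremented name: file.pdf -> file_001.pdf -> file_002.pdf."""
    match = re.fullmatch(r"(.*)_(\d{3,})", path.stem)
    if match:
        # The name already carries a counter, so bump it.
        base, number = match.group(1), int(match.group(2)) + 1
    else:
        # First working copy of an unnumbered file.
        base, number = path.stem, 1
    return path.with_name(f"{base}_{number:03d}{path.suffix}")


if __name__ == "__main__":
    print(next_working_copy(Path("filename.pdf")))
    print(next_working_copy(Path("filename_001.pdf")))
```

          Copying the current file to the returned name before each risky edit (for example with shutil.copy2) leaves a trail of versions to fall back on.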


          The PDF's content is on a single page and is relatively straightforward, so if a 100% accurate rendition is needed, it would not take too long to transcribe it into MS Word. More pages of similar material would take longer, of course, but would still not be too difficult.

          A symbol font set that maps to Unicode would be desirable. When done, create the PDF.
          Much less stress and strain than trying to edit OCR's hidden text layer from Searchable Image or Searchable Image (Exact).

          Especially given that PDF is not a format meant for editing content or layout.


          Be well....


          ~~~~~~~~~~~~~
          fwiw — a copy of the OCR output is below.
          The capture of textual content is not bad (as expected). Symbols are something else (e.g., uppercase beta captured as "J").

           

          138 THEORETICAL STATISTICS [5.2 note that the reduction to the consideration of T is possible because of sufficiency under the alternative hypotheses {J =1= (Jo. lt follows on summing (11) that pr(R = r, T = t ; 'Y, (J) = c(r,t)e'YT+/3t ITO + e'Y+/3Xj) (2) ~u c(r, u)e 'YT+/3u pr(R = r) = ITO + e'Y+/3Xj) , (3) where c(r, t) is the number of distinct subsets of size r which can be formed from {Xl, ... ,Xn} and which sum to t. Formally c(r, t) is the coefficient of ~~ ~~ in the generating function (14) Thus, from (2) and (13), the required conditional distribution for arbitrary (J is pr(T= tlR =r;{J) = c(r, t)e/3t/~uc(r,u)e/3u. (15) lt is now clear that the likelihood ratio for an alternative {J = {JA > {Jo versus {J = {Jo is an increasing function of t and that therefore the one-sided significance level for testing {J = {Jo against {J > {Jo is the upper tail probability (16) where, of course, tmax ~ ~Xj. When (Jo = 0, (15) simplifies and the statistic T is in effect the total of a random sample size r drawn without replacement from the finite population {Xl, ... , Xn}; see also Example 6.⸀l. In particular, it follows that the conditional mean and variance of T under the null hypothesis (J = 0 are and r(n -r) ~(Xj -iJ2 n(n -1) (17) A special case is the two-sample problem in which the first nl observations have, say, pr(Yj = 1) = 01, whereas the second group of n2 = n -nl observations have a corresponding probability O2. This is covered, for example, by taking Xl = ... = xn\ = 1 and
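          Once the OCR text is available as plain text, as in the dump above, digit/letter confusions of the "6.I" kind can at least be located mechanically. The sketch below is only an illustration of that idea: it works on extracted text, not on the hidden layer inside the PDF, and the pattern and function names are assumptions made for this example. It only covers a capital "I" or lowercase "l" directly after a digit and a period, so a stray character between the period and the letter would not be matched.

```python
import re

# A capital I or lowercase l right after "digit." and not followed by a letter,
# e.g. "6.I" or "6.l", which is most likely a misread "1".
NUMBER_CONFUSION = re.compile(r"(?<=\d\.)[Il](?![A-Za-z])")


def find_suspects(text: str) -> list[int]:
    """Return the character offsets of likely misread digits."""
    return [m.start() for m in NUMBER_CONFUSION.finditer(text)]


def fix_suspects(text: str) -> str:
    """Replace each suspect character with the digit 1."""
    return NUMBER_CONFUSION.sub("1", text)


if __name__ == "__main__":
    sample = "see also Example 6.I. lt follows that"
    print(find_suspects(sample))
    print(fix_suspects(sample))
```

          The offsets give a short list of places to inspect in Examine Document's hidden text preview, which beats reading the whole dump by eye.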

          • 2. Re: Font don't show after OCR'ed pdf is edited
            asdfabcedasf Community Member


            I understand that extracting the text into a Word file and editing it there may be useful. But in my particular case, I only need a PDF file with an embedded hidden text layer; I don't need to extract the text into a separate file. I understand that, given Acrobat's currently poor support for editing OCR'ed PDFs, editing such a file is painful.

             

            I think that editing a couple of words in the hidden layer is not an unreasonable thing for users to ask. Unfortunately, Acrobat does not meet my expectations in this respect. I'm wondering if anybody has at least one way to correct "6.I" to "6.1" in the hidden layer of my example PDF file.

            • 3. Re: Font don't show after OCR'ed pdf is edited
              ohgivemeabreak


              I am a new Acrobat Pro X user, and after having shelled out over $100 for this software, I am getting pretty peeved at its limitations. If the software is capable of going through all of its optically "recognized" [sic] characters and showing its self-identified suspects so that users can view and edit them, then it is absolutely capable of showing users the rest of the text that it presumed was correctly identified, so that they can correct ITS many mistakes as well. All of this back and forth (export to a text file, re-edit it and then create a new PDF) is absolutely ridiculous.

               

              • 4. Re: Font don't show after OCR'ed pdf is edited
                CtDave CommunityMVP

                "Suspects" are only associated with Acrobat OCR via "Formatted Text & Graphics" (Acrobat 8) or "ClearScan" (Acrobat 9 or X).

                Either is, essentially, replacing the image of characters with a renderable character. Use of the TouchUp Text tool permits "touchup".

                 

                However, if the end-user directs the software to process OCR via Searchable Image or Searchable Image (Exact) the OCR output is provided as "hidden text". No "suspects".

                 

                Different processes, different outcomes.

                Either can be manipulated; however, either one requires a measure of effort.

                This is due to what PDF, as a file format, is and what it is not.

                 

                Be well...