7 Replies Latest reply on Dec 7, 2018 2:28 PM by piotreba

    Text and Data Mining and PDF

    piotreba Level 1



      From the perspective of "Text and Data Mining" (https://www.rightsdirect.com/text-and-data-mining/), do I need to consider any extra steps when setting up an InDesign document for export to PDF? I ask in relation to preparing scientific publications.


      I was thinking of XML tags, but to be honest I'm not sure how they translate to the final PDF. Is PDF a good format for text and data mining?


      Would be grateful for any hints.




        • 1. Re: Text and Data Mining and PDF
          David W. Goodrich Level 3

          As no one else has answered I'll throw in my inexpert two cents.  PDF is great for reproducing pages of text but that is not the same thing as preserving text or other data.  As Test Screen Name wrote a month ago in another context:


          > By the way, on PDF, there are no rules for rendering text at all. Each character has a position
          > on the page; that's where you see them, and all there is to it. The rules were used by another
          > app, to decide where each character goes. The editing in Acrobat does a near miraculous job
          > of running over the page, guessing where lines, words and paragraphs are, and giving a kind
          > of primitive editor.


          The key concept is "Each character has a position on the page; that's ... all there is to it."  That is really good for allowing accurate reproduction of a composed page.  One modest way to help Acrobat decipher the text in a PDF is to enable "Tagged Text" when exporting from InDesign -- but note that "Tagged Text"  has a special meaning in PDF.


          Good luck!


          • 2. Re: Text and Data Mining and PDF
            piotreba Level 1

            Hi David W. Goodrich, thank you for the contribution. I'm trying to understand this subject better.


            By "tagged text" do you mean embracing the text with a tag like, e.g. "Introduction" or "methods", so that the content is recognized as something particular?


            Whether or not it all comes down to character positions, the text in a PDF can still be extracted as meaningful content (and it is searchable).

            • 3. Re: Text and Data Mining and PDF
              David W. Goodrich Level 3

              I have no firm idea of the definition of "Tagged Text" in Acrobat-speak, so I'm not the one to ask.  All I know is that it isn't anything like "Tagged" as in HTML/XML tags.  Try exporting the same page from ID to PDF with and without Tagging, then copy-and-paste the same passage from each PDF into a text editor (such as Windows Notepad) and compare the results.
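
              If eyeballing the two pasted passages is tedious, the comparison can be scripted. This is only a rough sketch in Python: the two strings stand in for text pasted from the untagged and tagged PDFs, and both the sample text and the name `compare_extractions` are made up for illustration.

```python
import difflib


def compare_extractions(tagged_text: str, untagged_text: str) -> list[str]:
    """Return unified-diff lines showing how two pasted passages differ."""
    return list(
        difflib.unified_diff(
            untagged_text.splitlines(),
            tagged_text.splitlines(),
            fromfile="untagged.pdf",
            tofile="tagged.pdf",
            lineterm="",
        )
    )


# Hypothetical passages, as they might look after copy-and-paste.
untagged = "Intro duction\nText mining of PDF s"
tagged = "Introduction\nText mining of PDFs"

diff_lines = compare_extractions(tagged, untagged)
for line in diff_lines:
    print(line)
```

              Lines starting with "-" appear only in the untagged paste and lines starting with "+" only in the tagged one; an empty result means the two passages were identical.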


              Another way to look at text in a PDF is that there are no word-spaces per se, just distances between characters.  This means that any software trying to extract text from a PDF must guess where the word-boundaries fall, based on the distance between where one character ends and the next begins.  Letter-spacing can be a serious confusion, not to mention tables or formats like bulleted lists.
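
              To make the guessing concrete, here is a toy sketch in Python of how an extractor might split positioned characters into words. It is not how Acrobat or any particular tool works; the `Glyph` record, the `guess_words` function, and the 2-point threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge on the page, in points
    width: float  # advance width, in points


def guess_words(glyphs, gap_threshold=2.0):
    """Start a new word whenever the gap to the previous glyph exceeds the threshold."""
    words = []
    current = ""
    prev_end = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_end is not None and g.x - prev_end > gap_threshold:
            words.append(current)
            current = ""
        current += g.char
        prev_end = g.x + g.width
    if current:
        words.append(current)
    return words


# "to be" set normally: only the word-space leaves a gap wider than 2 pt.
normal = [Glyph("t", 0, 3), Glyph("o", 3, 5), Glyph("b", 11, 5), Glyph("e", 16, 5)]

# The same text letter-spaced: every gap now exceeds the threshold.
tracked = [Glyph("t", 0, 3), Glyph("o", 6, 5), Glyph("b", 17, 5), Glyph("e", 25, 5)]

normal_words = guess_words(normal)
tracked_words = guess_words(tracked)
print(normal_words, tracked_words)
```

              With the normal spacing the guess comes out as two words, but the letter-spaced line falls apart into single characters, which is exactly the kind of confusion described above.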


              Don't get me wrong, PDF is superlative for reproducing composed pages for humans to read: it preserves the spacing, and uses the actual fonts (unless you fail to embed them).  It lets me typeset Chinese, Arabic and alphabetic text all in the same line using the fonts I specify.  But text and data for machine processing or reading may be better off in another format, such as HTML or XML.

              • 4. Re: Text and Data Mining and PDF
                Colin Flashman Adobe Community Professional

                I would imagine that this would work in a similar fashion to creating PDFs intended for accessibility purposes. This page may have more information: https://indesignsecrets.com/creating-accessible-pdfs.php

                • 5. Re: Text and Data Mining and PDF
                  Bevi Chagnon | PubCom Adobe Community Professional

                  In Acrobat-speak, "Tagged Text" means that the content in the PDF has been defined by a set of known tags that are specified by the PDF standard ISO 32000-1.


                  Some of the tags are <P> <H1> <H2> <L> and a couple dozen more. These tags are used to create accessible PDFs that can be used by various assistive technologies for people with disabilities: visual (such as blindness and low vision), mobility (such as paralysis and amputation), and cognitive (such as dyslexia and Asperger's syndrome). (Note, there are many other disabilities that need tagged, accessible PDFs.)  The accessibility tags are assigned through Paragraph Styles, in the Export Tagging section of the style options (the PDF settings are in the lower portion of the dialog).


                  Your example of <Introduction> or <Methods> would be XML tags, which can be anything that the content author wants. The "X" stands for "extensible"; hence, XML uses a custom set of tags.  Use InDesign's Structure and Tags panels to work with XML.
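
                  As a rough illustration of why custom XML tags suit text mining, this Python sketch reads sections straight out of a small XML fragment by tag name. The <Introduction> and <Methods> tags come from the example above; the <Article> wrapper, the sample sentences, and the `sections` function are invented here, and this is not a picture of what InDesign's XML export actually produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using custom, author-defined tags.
SAMPLE = """<Article>
  <Introduction>Text mining needs machine-readable structure.</Introduction>
  <Methods>We exported the same page with and without tags.</Methods>
</Article>"""


def sections(xml_text):
    """Map each top-level tag name to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


found = sections(SAMPLE)
for tag, text in found.items():
    print(f"{tag}: {text}")
```

                  No guessing about character positions is involved: the tag names themselves say which part of the article each piece of text belongs to.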


                  As far as I know, you cannot create a PDF from InDesign with XML tags; if you check the option for "Create Tagged PDF," you will make a PDF with the accessibility tags, not custom XML tags.


                  And just to make this more confusing, "tags" are used when exporting the InDesign file to HTML and EPUB, too. They are very similar to the accessibility tags above, but their accessibility requirements are governed by another international standard, WCAG.

                  • 6. Re: Text and Data Mining and PDF
                    Bevi Chagnon | PubCom Adobe Community Professional

                    The free Lynda.com video about making accessible PDFs (noted above) is very elementary.


                    It takes a helluva lot more to make an accessible PDF than what's covered in the video!


                    If you want to explore this niche of publishing, you'll need a full course of instruction on accessible PDFs, the accessibility standards, and the procedures to follow in InDesign.

                    • 7. Re: Text and Data Mining and PDF
                      piotreba Level 1

                      Thank you guys for your hints. I am now taking a closer look at accessibility options and document structure in InDesign.