As no one else has answered I'll throw in my inexpert two cents. PDF is great for reproducing pages of text but that is not the same thing as preserving text or other data. As Test Screen Name wrote a month ago in another context:
> By the way, on PDF, there are no rules for rendering text at all. Each character has a position
> on the page; that's where you see them, and all there is to it. The rules were used by another
> app, to decide where each character goes. The editing in Acrobat does a near miraculous job
> of running over the page, guessing where lines, words and paragraphs are, and giving a kind
> of primitive editor.
The key concept is "Each character has a position on the page; that's ... all there is to it." That is really good for allowing accurate reproduction of a composed page. One modest way to help Acrobat decipher the text in a PDF is to enable "Tagged Text" when exporting from InDesign -- but note that "Tagged Text" has a special meaning in PDF.
Hi David W. Goodrich, thank you for the contribution. I'm trying to understand this subject better.
By "tagged text", do you mean wrapping the text in a tag such as "Introduction" or "Methods", so that the content is recognized as something particular?
Whether or not it comes down to character positions, the text in a PDF can still be extracted in meaningful form, and it is searchable.
I have no firm idea of the definition for "Tagged Text" in Acrobat-speak, so I'm not the one to ask. All I know is that it isn't anything like "Tagged" as in HTML/XML tags. Try exporting the same page from ID to PDF with and without Tagging and compare the result when you copy-and-paste the same passage from each PDF into a text editor (such as Windows Notepad).
Another way to look at text in a PDF is that there are no word-spaces per se, just distances between characters. This means that any software trying to extract text from a PDF must guess where the word-boundaries fall, based on the distance between where one character ends and the next begins. Letter-spacing can be a serious confusion, not to mention tables or formats like bulleted lists.
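To make the guessing concrete, here is a minimal sketch in Python. It is not how Acrobat or any real extractor works internally; the `Glyph` record, the `layout` helper, and the `gap_ratio` threshold are all made up for illustration. It just shows that a gap-based rule finds the words in normally spaced text and falls apart once letter-spacing is applied.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record: one character and where it sits on a line."""
    char: str
    x: float      # left edge of the character
    width: float  # horizontal space the character occupies


def layout(text, advance=5.0, letter_space=0.0, word_space=3.0):
    """Place characters the way a page might store them.

    A space in the input produces no glyph at all, only a wider gap.
    """
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += word_space
            continue
        glyphs.append(Glyph(ch, x, advance))
        x += advance + letter_space
    return glyphs


def guess_words(glyphs, gap_ratio=0.3):
    """Guess word boundaries: break wherever the gap between two
    neighbouring glyphs exceeds gap_ratio times the previous glyph's width."""
    words, current, prev = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


normal = guess_words(layout("two cents"))
tracked = guess_words(layout("two cents", letter_space=2.0))
print(normal)   # the two words are recovered
print(tracked)  # every letter is mistaken for its own word
```

With these made-up numbers, the letter-spaced line has gaps between letters that already exceed the threshold, so the heuristic splits after every character. A real extractor would use a smarter rule, but it is still guessing.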
Don't get me wrong, PDF is superlative for reproducing composed pages for humans to read: it preserves the spacing, and uses the actual fonts (unless you fail to embed them). It lets me typeset Chinese, Arabic and alphabetic text all in the same line using the fonts I specify. But text and data for machine processing or reading may be better off in another format, such as HTML or XML.
I would imagine that this would work in a similar fashion to creating PDFs intended for accessibility purposes. This page may have more information: https://indesignsecrets.com/creating-accessible-pdfs.php
In Acrobat-speak, "Tagged Text" means that the content in the PDF has been defined by a set of known tags that are specified by the PDF standard ISO 32000-1.
Some of the tags are <P> <H1> <H2> <L> and a couple dozen more. These tags are used to create accessible PDFs that can be used by various assistive technologies for people with disabilities: visual (such as blindness and low vision), mobility (such as paralysis and amputation), and cognitive (such as dyslexia and Asperger's syndrome). (Note, there are many other disabilities that need tagged, accessible PDFs.) In InDesign, the accessibility tags are assigned through Paragraph Styles, in the PDF part of each style's Export Tagging section (the lower portion of the dialog).
Your example of <Introduction> or <Methods> would be XML tags, which can be anything that the content author wants. The "X" stands for "extensible"; hence, XML uses a custom set of tags. Use InDesign's Structure and Tags panels to work with XML.
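As a rough illustration of what "custom tags" means, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The element names (`Article`, `Introduction`, `Methods`) and the sample text are invented for the example, and this is generic XML, not what InDesign itself writes out.

```python
import xml.etree.ElementTree as ET

# Build a tiny document with author-chosen tag names.
article = ET.Element("Article")
intro = ET.SubElement(article, "Introduction")
intro.text = "Why the study was done."
methods = ET.SubElement(article, "Methods")
methods.text = "How the study was done."

xml_text = ET.tostring(article, encoding="unicode")
print(xml_text)

# Because the structure is explicit, a program can ask for a section by name
# instead of guessing from positions on a page.
parsed = ET.fromstring(xml_text)
methods_text = parsed.find("Methods").text
print(methods_text)
```

The point is only that the tag names carry the meaning the author chose, which is different from the fixed set of accessibility tags in a tagged PDF.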
As far as I know, you cannot create a PDF from InDesign with XML tags; if you check the option for "Create Tagged PDF," you will make a PDF with the accessibility tags, not custom XML tags.
And just to make this more confusing, "tags" are also used when exporting the InDesign file to HTML and EPUB. They are very similar to the accessibility tags above, but are instead defined by another international accessibility standard called WCAG.
The free Lynda.com video about making accessible PDFs (noted above) is very elementary.
It takes a helluva lot more to make an accessible PDF than what's covered in the video!
For those who want to explore this niche of publishing, you'll need a full course of instruction on accessible PDFs, the accessibility standards, and the procedures to do in InDesign.
Thank you guys for your hints. I am now taking a closer look at accessibility options and document structure in InDesign.