Text and Data Mining and PDF

Explorer, Dec 02, 2018

Hi,

from the perspective of "Text and Data Mining" (https://www.rightsdirect.com/text-and-data-mining/), do I need to take any extra steps when setting up an InDesign document for export to PDF? I'm asking in relation to preparing scientific publications.

I was thinking of XML tags, but to be honest I'm not sure how they translate to the final PDF. Is PDF a good format for text and data mining?

I would be grateful for any hints.

Peter

Advocate, Dec 03, 2018

As no one else has answered, I'll throw in my inexpert two cents. PDF is great for reproducing pages of text, but that is not the same thing as preserving text or other data. As Test Screen Name wrote a month ago in another context:

> By the way, on PDF, there are no rules for rendering text at all. Each character has a position
> on the page; that's where you see them, and all there is to it. The rules were used by another
> app, to decide where each character goes. The editing in Acrobat does a near miraculous job
> of running over the page, guessing where lines, words and paragraphs are, and giving a kind
> of primitive editor.

The key concept is "Each character has a position on the page; that's ... all there is to it." That is really good for accurate reproduction of a composed page, but it means the file itself does not record where words, lines, or paragraphs begin and end. One modest way to help Acrobat decipher the text in a PDF is to enable "Tagged Text" when exporting from InDesign -- but note that "Tagged Text" has a special meaning in PDF.

Good luck!

David

Explorer, Dec 04, 2018

Hi David W. Goodrich, thank you for the contribution. I'm trying to understand this subject better.

By "tagged text" do you mean wrapping the text in a tag such as "Introduction" or "Methods", so that the content is recognized as something particular?

Whether or not it comes down to character positions, the text in a PDF can still be extracted in meaningful form (and is searchable).

Advocate, Dec 04, 2018

I have no firm idea of the definition of "Tagged Text" in Acrobat-speak, so I'm not the one to ask. All I know is that it isn't anything like "Tagged" as in HTML/XML tags. Try exporting the same page from ID to PDF with and without Tagging, then copy-and-paste the same passage from each PDF into a text editor (such as Windows Notepad) and compare the results.
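
If eyeballing two Notepad windows gets tedious, the comparison can be roughly automated. This is only a minimal sketch in Python using the standard difflib module; the two sample strings are invented for illustration and are not real Acrobat output, so paste in your own text.

```python
import difflib

# Hypothetical clipboard results; replace them with the text you
# actually pasted from the untagged and the tagged PDF.
untagged = "Results The sam ple was heated to 40 C."
tagged = "Results\nThe sample was heated to 40 C."


def similarity(a: str, b: str) -> float:
    """Return a 0..1 score for how alike the two extractions are."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def word_diff(a: str, b: str) -> list:
    """List the words that appear in only one of the two extractions."""
    delta = difflib.ndiff(a.split(), b.split())
    # Keep removals ("- word") and additions ("+ word"); drop the rest.
    return [d for d in delta if d[0] in "+-"]


diff = word_diff(untagged, tagged)
print(f"similarity: {similarity(untagged, tagged):.2f}")
for entry in diff:
    print(entry)
```

Lines starting with "-" are words found only in the first paste, and lines starting with "+" are words found only in the second.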

Another way to look at text in a PDF is that there are no word-spaces per se, just distances between characters. This means that any software trying to extract text from a PDF must guess where the word boundaries fall, based on the distance between where one character ends and the next begins. Letter-spacing can cause serious confusion, not to mention tables or formats like bulleted lists.
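
The guessing can be illustrated with a toy example. This is a simplified sketch in Python, not how Acrobat or any real extractor works: the Glyph class, the layout helper, the fixed character widths, and the gap_threshold value are all assumptions made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character, roughly as a page description sees it."""
    char: str
    x: float      # left edge of the glyph, in points
    width: float  # advance width of the glyph, in points


def guess_words(glyphs, gap_threshold=2.0):
    """Rebuild words from one line of glyphs by looking at the gaps.

    A gap wider than gap_threshold (a hypothetical tuning value) is
    taken as a word space; a narrower gap is taken as spacing inside
    a word.
    """
    words, current, prev_end = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_end is not None and g.x - prev_end > gap_threshold:
            words.append(current)
            current = ""
        current += g.char
        prev_end = g.x + g.width
    if current:
        words.append(current)
    return words


def layout(text, letter_spacing=0.0, char_width=5.0, space_width=3.0):
    """Place characters left to right; spaces only move the position."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width
            continue
        glyphs.append(Glyph(ch, x, char_width))
        x += char_width + letter_spacing
    return glyphs


normal = guess_words(layout("data mining"))
tracked = guess_words(layout("data mining", letter_spacing=2.5))
print(normal)
print(tracked)
```

With no letter-spacing the two words are recovered. With letter-spacing wider than the threshold, every character is mistaken for a separate word, which is the kind of confusion described above.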

Don't get me wrong: PDF is superlative for reproducing composed pages for humans to read. It preserves the spacing and uses the actual fonts (unless you fail to embed them). It lets me typeset Chinese, Arabic and alphabetic text all in the same line using the fonts I specify. But text and data for machine processing or reading may be better off in another format, such as HTML or XML.

Community Expert, Dec 05, 2018

In Acrobat-speak, "Tagged Text" means that the content in the PDF has been defined by a set of known tags that are specified by the PDF standard ISO 32000-1.

Some of the tags are <P>, <H1>, <H2>, <L>, and a couple dozen more. These tags are used to create accessible PDFs that can be used by various assistive technologies for people with disabilities: visual (such as blindness and low vision), mobility (such as paralysis and amputation), and cognitive (such as dyslexia and Asperger's syndrome). (Note: many other disabilities also call for tagged, accessible PDFs.) The accessibility tags are assigned through Paragraph Styles, in the Export Tagging section (the lower portion of the dialogue, for PDF).

Your example of <Introduction> or <Methods> would be XML tags, which can be anything that the content author wants. The "X" stands for "extensible"; hence, XML uses a custom set of tags. Use InDesign's Structure and Tags panels to work with XML.
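
To show why custom XML tags are convenient for machine processing, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The <Article> wrapper and the sample sentences are hypothetical; only the <Introduction> and <Methods> names come from the question above, and this is not a claim about what InDesign exports.

```python
import xml.etree.ElementTree as ET

# Hypothetical article fragment; the tag names are the author's choice.
ARTICLE = """
<Article>
  <Introduction>PDF preserves the look of a page.</Introduction>
  <Methods>We exported the same layout with and without tags.</Methods>
</Article>
"""

root = ET.fromstring(ARTICLE.strip())

# Each section is addressable by name, with no guessing about layout.
sections = {child.tag: child.text for child in root}
print(sections["Methods"])
```

Because each section carries its own name, a mining tool can ask for the methods text directly instead of inferring it from positions on a page.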

As far as I know, you cannot create a PDF from InDesign with XML tags; if you check the option for "Create Tagged PDF," you will make a PDF with the accessibility tags, not custom XML tags.

And just to make this more confusing, "tags" are used when exporting the InDesign file to HTML and EPUB, too. These are very similar to the accessibility tags above, but are instead defined by another international accessibility standard called WCAG.

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |

Community Expert, Dec 05, 2018 (marked as correct answer)

I would imagine that this would work in a similar fashion to creating PDFs intended for accessibility purposes. This page may have more information: https://indesignsecrets.com/creating-accessible-pdfs.php

If the answer wasn't in my post, perhaps it might be on my blog at colecandoo!

Community Expert, Dec 05, 2018

The free Lynda.com video about making accessible PDFs (noted above) is very elementary.

It takes a helluva lot more to make an accessible PDF than what's covered in the video!

If you want to explore this niche of publishing, you'll need a full course of instruction on accessible PDFs, the accessibility standards, and the procedures to follow in InDesign.

|    Bevi Chagnon   |  Designer & Technologist for Accessible Documents
|    Classes & Books for Accessible InDesign, PDFs & MS Office |

Explorer, Dec 07, 2018

Thank you guys for your hints. I am now taking a closer look at accessibility options and document structure in InDesign.
