6 Replies Latest reply on Aug 15, 2014 4:55 PM by [Jongware]

    Issues when importing Word document with endnotes

    sarahh85

      I'm wondering if anyone else has come across this issue. It doesn't happen all the time, but I'm working on a very big document at the moment and it's happened quite a few times.

      When I import the Word file, it sometimes comes up as below, rather than just having the superscript number. It's frustrating having to go through it and make sure I only delete the right text. The correct endnote still appears at the end of the document, but this text is included within the main body.

      Does anyone know why this happens and if there is any way to stop this happening?

       

      One study6Shayna E.</author><author>Margolis, David</author><author>Shardell, Michelle</author><author>Hawkes, William G.</author><author>Miller, Ram R.</author><author>Amr, Sania</author><author>Baumgarten, Mona</author></authors></contributors><auth-address>Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, USA.</auth-address><titles><title>Frequent manual repositioning and incidence of pressure ulcers among bed-bound elderly hip fracture patients</title><secondary-title>Wound Repair and Regeneration</secondary-title></titles><periodical><full-title>Wound Repair and Regeneration</full-title></periodical><pages>10-18</pages><volume>19</volume><number>1</number><dates><year>2011</year></dates><pub-location>United States</pub-location><isbn>1524-475X</isbn><accession-num>21134034. Language: English. Date Created: 20110117. Update Code: 20110204. Publication Type: Journal Article</accession-num><label> GROUPS: repositioning, prevalence,</label><reviewed-item>T Repositioning</reviewed-item><urls><related-urls><url>https://library1.unmc.edu/login?url=http://search.ebscohost.com/login.aspx?direct=true&db=cmedm&AN=21134034&login.asp&site=ehost-live</url></related-urls></urls><electronic-resource-num>10.1111/j.1524-475X.2010.00644.x</electronic-resource-num><remote-database-name>cmedm</remote-database-name><remote-database-provider>EBSCOhost</remote-database-provider></record></Cite></EndNote> investigated the association between repositioning frequency and pressure ulcer incidence. In this study “frequently repositioned” amounted to at least 12 repositions per day over the study period of 21 days.


      Thank you

        • 1. Re: Issues when importing Word document with endnotes
          Peter Spier Most Valuable Participant (Moderator)

          What does it look like in Word?

           

          Have you checked to be sure all tracked changes are accepted or rejected, and then done a Save As to force Word to rewrite the file? Changing format between .doc, .docx and .rtf can sometimes fix odd Word imports, too, but I don't know how that might affect endnotes.

          • 2. Re: Issues when importing Word document with endnotes
            sarahh85 Level 1

            I have checked the Word document and there are no tracked changes in it at all.

            Haven't tried saving it as a different file format, but I will give that a go and see if it makes any difference.

            Here is a screenshot of how that para looks in Word.

            screenshot 1.jpg

             

            And when I click on the superscript 6, it goes to this reference, which is totally different to what was imported into InDesign after that text...

             

            6.     Baharestani MM, Ratliff CR. Pressure ulcers in neonates and children: an NPUAP white paper. Advances in Skin & Wound Care. 2007;20(4):208.

            • 3. Re: Issues when importing Word document with endnotes
              [Jongware] Most Valuable Participant

              What version of InDesign are you using? I have had this problem lots of times with CS4 (and I want to believe it ought to be fixed by now).

               

              The problem lies not in the Word file but in InDesign's Word Import filter. This is not something you (or anyone other than Adobe) can fix. I resigned myself to manually flattening the data fields in the Word file: select all text, press Ctrl+6, and save the file.

              It's not a big deal (I fix up a lot more in Word before expecting a reasonable import of my text) but it does require that you have Microsoft Word. And, of course, since you cannot tell in advance whether a Word file will import correctly or not, you have no choice but to go through this routine for every Word file. (And I do exactly that.)

               

              Longer, somewhat technical* story**

               

              * A slight understatement.

               

              ** In fact it got so long, I'm going to finish it later today. Spoiler Alert: Adobe got it wrong, but it's all the fault of the Americans.

              • 4. Re: Issues when importing Word document with endnotes
                [Jongware] Most Valuable Participant

                As promised,

                 

                The longer, somewhat technical* story

                 

                * This is still somewhat of an understatement.

                 

                The underlying problem is that Microsoft Word is written by Americans, and for Microsoft Windows. The basic encoding of text in Word is "Windows Latin-1" (code page 1252), that is, ye olde "A to Z" plus a somewhat random smattering of accented letters, as well as a couple of useful typographic characters. This basic encoding was designed by Microsoft and is perfect for English texts (well, American English), but not that good to write French, German, or Spanish in, and it simply ignores Czech, Hungarian, or Polish special characters ... not to mention entirely different alphabets (think of Telugu, Tibetan, and Thai -- just three random ones that start with a "T", and there are lots more).

                 

                So, wiser heads invented Unicode, a much larger system that has room for over a million different characters, and continues to grow even as we speak. (Although, famously, they refuse to include the Klingon alphabet.)

                Now instead of re-writing their word processor from the ground up, Microsoft's programmers took a shortcut in implementing Unicode. Word files henceforth consisted of two types of blocks of characters: "regular" (i.e., Good Ole American text) and "Unicode" (a.k.a. The Rest of the World). Each single character in a "regular" block uses only one byte; each single character in a Unicode block uses at least two bytes (and possibly more). A list of all text blocks is stored in the file, indicating for each block its offset inside the file, its length, and of course what type of block it is.
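
                A toy sketch of such a block list, in Python, may make this concrete. The `Piece` class, its field names, and the sample bytes are made up for illustration; the real .doc layout is considerably more involved, and this only shows the idea of mixing one-byte and two-byte runs in one stream.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """One block of text (hypothetical structure, for illustration only)."""
    offset: int      # byte offset of the block inside the stream
    char_count: int  # length in characters, not in bytes
    two_byte: bool   # False: one byte per character, True: UTF-16LE


def read_text(stream: bytes, pieces: list[Piece]) -> str:
    """Decode every block with the encoding its type calls for."""
    out = []
    for p in pieces:
        width = 2 if p.two_byte else 1
        raw = stream[p.offset : p.offset + p.char_count * width]
        out.append(raw.decode("utf-16-le" if p.two_byte else "cp1252"))
    return "".join(out)


# "č" does not exist in code page 1252, so that run must be stored as two-byte text.
stream = (
    "One study ".encode("cp1252")
    + "čaj".encode("utf-16-le")
    + " found".encode("cp1252")
)
pieces = [Piece(0, 10, False), Piece(10, 3, True), Piece(16, 6, False)]
text = read_text(stream, pieces)
```

                Note that the third block starts at byte 16 although only 13 characters precede it; every offset after a two-byte run has to account for that difference.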

                 

                That means that even for a simple instruction such as "please import the plain text", there is a whole lot of to-and-fro calculation going on in the background. You cannot simply ask how long a block of text is: do you want the length in characters or in bytes? The numbers may differ. Adding to the fray, there doesn't seem to be much consensus inside the Word format about which values refer to bytes and which to characters. (See Understanding the Word .doc Binary File Format on Microsoft's own site if you want to know more about that.) In addition, Microsoft churned out lots of different versions of the .doc file format, and they just fibbed (pun intended) the file header to cater for any new additions. Now the documentation is littered with "Obsolete" and "Do Not Use" remarks, and many values are unreliable, obsolete, "may not be up to date", and "may be expanded in future versions".
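
                The bytes-versus-characters question can be shown with a few lines of Python. This is only a sketch under simplifying assumptions: the blocks are taken to be stored back to back, and the `blocks` list and the `byte_offset` helper are hypothetical names, not anything from the Word format itself.

```python
# (character count, is two-byte?) for three consecutive blocks; toy values
blocks = [(10, False), (3, True), (6, False)]

length_in_chars = sum(n for n, _ in blocks)
length_in_bytes = sum(n * (2 if wide else 1) for n, wide in blocks)


def byte_offset(char_pos: int) -> int:
    """Translate a character position into a byte offset."""
    offset = 0
    for n, wide in blocks:
        width = 2 if wide else 1
        if char_pos < n:
            return offset + char_pos * width
        offset += n * width  # skip the whole block, measured in bytes
        char_pos -= n        # ... and in characters
    raise IndexError("character position past end of text")
```

                Here the same text is 19 characters but 22 bytes long, and character 13 lives at byte 16. Any value read from the file has to be interpreted in the right unit before it can be used.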

                 

                How does this tie in to your Endnote Problem? Real "endnotes" have a real simple structure in the file. In place of the actual footnote/endnote marker in your document, there is a one-byte code that basically states "here comes the next note". There is a bit more information stored elsewhere: which note number it is (you can, gasp, select whatever number you want!), how it's supposed to be formatted (again a plus point for Word), and where the actual note text itself is stored. That's for regular endnotes; but yours aren't.

                 

                Another Word feature, employed by the Word extension you are using (confusingly called "EndNote"), is that it can store any data you want inside a Field code. Field codes are something like the Text Variables of InDesign, but they can do far more. Automatic references are field codes; so are equations, page numbers, and hyperlinks.

                A Field code typically contains displayable text (such as a superscript number) as well as hidden data (which can temporarily be made visible inside Word only). The hidden text is what an extension such as EndNote uses to automatically construct a References section at the end of your document. It's the combination of automatic numbering and appearing at the end that makes it look and work just like regular endnotes inside Word.
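
                As a rough illustration of "displayable text plus hidden data": in the binary .doc text stream, as far as I understand the format, a field is bracketed by three control characters (0x13 begin, 0x14 separator, 0x15 end), with the hidden instruction before the separator and the displayed result after it. The Python sketch below strips the hidden part under that assumption. The sample string, including the "ADDIN EN.CITE" instruction and the shortened XML, is made up for the example.

```python
FIELD_BEGIN, FIELD_SEP, FIELD_END = "\x13", "\x14", "\x15"


def visible_text(chars: str) -> str:
    """Keep only what Word would display; drop field instructions."""
    in_instruction = []  # one flag per currently open (possibly nested) field
    out = []
    for ch in chars:
        if ch == FIELD_BEGIN:
            in_instruction.append(True)   # hidden part starts
        elif ch == FIELD_SEP:
            if in_instruction:
                in_instruction[-1] = False  # displayed result starts
        elif ch == FIELD_END:
            if in_instruction:
                in_instruction.pop()
        elif not any(in_instruction):
            out.append(ch)
    return "".join(out)


raw = (
    "One study"
    "\x13 ADDIN EN.CITE <EndNote><Cite>...</Cite></EndNote>"  # hidden data
    "\x146\x15"                                               # displayed "6"
    " investigated"
)
shown = visible_text(raw)
```

                If the reader loses track of where the hidden part ends, everything up to the point where it catches up again is treated as ordinary text.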

                 

                So why does this hidden text suddenly appear? The programmers of InDesign made mistakes in the Word Import filter.

                 

                The blocks of "one-byte/two-byte" text are totally independent of what sort of text or code is inside them, and of what those characters actually mean. This means the entire hidden 'endnote' data may not sit inside a single block.

                 

                In practice, this means that one has to very carefully track which text should appear where, and how it should be read and translated to native InDesign text. Somewhere inside the Word-reading code, they forgot to translate a jump from one to two bytes (or the other way around) in the middle of a field code's text. So suddenly, a block that was rightfully hidden in Word got counted as, and incorporated into, the main run of text; and that is what you see in your post. After that sudden intrusion, the "regular" text continues as usual -- in your case, directly after the closing code "</EndNote>". You can see that the sentence runs on normally when ignoring the <...> trash data:
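
                To be clear, nobody outside Adobe knows what the import filter's code really looks like. The Python sketch below is only a toy model of this class of error, with hypothetical names throughout: the hidden range is recorded in character positions, and the "buggy" reader advances its running position by a block's byte length instead of its character count.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    data: bytes
    two_byte: bool

    def text(self) -> str:
        return self.data.decode("utf-16-le" if self.two_byte else "cp1252")


def import_text(pieces, hidden_start, hidden_end, buggy=False):
    """Drop characters whose position lies in [hidden_start, hidden_end)."""
    out = []
    pos = 0  # running character position across all pieces
    for piece in pieces:
        chars = piece.text()
        for i, ch in enumerate(chars):
            if not (hidden_start <= pos + i < hidden_end):
                out.append(ch)
        # The simulated bug: counting bytes where characters were meant.
        pos += len(piece.data) if buggy else len(chars)
    return "".join(out)


visible_before = "One study6"
hidden = "<Cite><author>Dvořák</author></Cite></EndNote>"
visible_after = " investigated"
hidden_start = len(visible_before)
hidden_end = hidden_start + len(hidden)

# The hidden data is spread over three blocks; the middle one is two-byte text.
pieces = [
    Piece((visible_before + "<Cite><author>").encode("cp1252"), False),
    Piece("Dvořák</author>".encode("utf-16-le"), True),
    Piece(("</Cite></EndNote>" + visible_after).encode("cp1252"), False),
]

good = import_text(pieces, hidden_start, hidden_end)
bad = import_text(pieces, hidden_start, hidden_end, buggy=True)
```

                With the correct bookkeeping the result is "One study6 investigated". With the miscount, the reader believes the hidden range has ended 15 characters too early, and the tail of the hidden data ("Cite></EndNote>") spills into the sentence, after which the regular text carries on as if nothing happened.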

                 

                One study⁶[trash] investigated the association ...

                 

                The superscript "6" is the 'visible' part of the "EndNote" field. The start of the Field code that normally hides the 'hidden' part is read correctly, and so its text is not imported; but the end of that field code is miscalculated and so InDesign happily jumps right into Things That Should Not Be Seen.
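
                Since the leaked data in the original post ends with the closing code "</EndNote>", affected paragraphs are at least easy to locate. The Python sketch below assumes the imported story has been exported to plain text first, one paragraph per list item; `find_leaks` is a hypothetical helper and the sample text is shortened from the post above.

```python
END_TAG = "</EndNote>"


def find_leaks(paragraphs):
    """Return (paragraph number, text that follows the leaked data)."""
    hits = []
    for number, para in enumerate(paragraphs, start=1):
        idx = para.rfind(END_TAG)
        if idx != -1:
            resumed = para[idx + len(END_TAG):].strip()
            hits.append((number, resumed))
    return hits


sample = [
    "A clean paragraph.",
    "One study6Shayna E.</author></record></Cite></EndNote> investigated the association",
]
hits = find_leaks(sample)
```

                This only tells you where the trash ends and the regular sentence resumes. Where it begins still has to be checked by eye against the Word file, because the leak can start in the middle of a record.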

                 


                 

                Other issues caused by this mis-reading can be encountered as well.

                 

                1. InDesign forgets to insert the proper sequence number for a note, and you get a pink "unknown" character instead (the "unknown" character is actually a generic placeholder, and should have been replaced with the correct number).
                2. InDesign loses track of where text fragments should go and so they end up at the end of a document: you see several hard returns at the end of your text, sometimes with one or two characters still attached to them. If you find where they came from by comparing the imported text with the original Word file, you will see that the characters including the hard returns are missing in InDesign.
                3. InDesign may skip a single Bold On or Bold Off code, and after that the Bold attribute is inverted for a while, typically up to the end of the paragraph (Word requires an explicit reset of all text attributes at the start of the next paragraph). "Bold" is the most visible, but this may happen with any text attribute.
                4. InDesign fails to read a certain code (a first asterisked note before the one numbered "1", for example), and everything after that gets shifted by the number of bytes that code ought to have occupied. If ID is able to import the file, one of the most surprising results can be that this note gets placed as a footnote inside the very first footnote. Another symptom is that imported hyperlinks are off by one or more characters, as can clearly be spotted when the hyperlink frame is visible.
                5. ... and of course, InDesign may not import a perfectly formed Word file at all. I hate it when that happens.

                 

                How can I be so sure it's a problem in InDesign and not in the Word file? Well ... as I have often advised in the past for similar problems, re-saving a document inside Word may solve it. That is because on a re-save, the text blocks are cleaned up -- just like a Save As in InDesign can solve lingering random problems. Saving as another file type may also work, because even though the file import filter uses something of a shared code base (there is a lot of similar functionality between reading DOC, RTF, and DOCX), the dirty low-level routines that actually have to deal with the raw bytes are, by definition, coded specifically for each file type. Which in practice means that an error of this kind in one filter may not be there in another.

                 

                With all this knowledge, and a browser bookmark on Microsoft's documentation, I wrote a JavaScript script to read Word files. After some initial problems, I got it to work for plain text (so no notes or tables or auto-numbering), and much to my surprise I ran into similar problems: a relatively small error in my reading code produced the same kind of errors in my own imported text.

                 

                I found out I could fix this in my script, and would have gladly expanded it to read and format an entire Word file if only JavaScript were just a teensy bit faster... Reading a simple file the hard (but correct) way costs about 15 minutes, even on a fast system. So I abandoned this idea and now always clean up my Word files in Word -- 95 out of 100 times this works straight up, and for the remainder, a single glance at the source file is usually enough to spot the problem, which I then solve "manually" (i.e., moving it to the end of the file, or just deleting it).

                • 5. Re: Issues when importing Word document with endnotes
                  David W. Goodrich Level 3

                  Thanks for taking the time to write all this down.  I used to think that the "lost footnotes" problem was "caused" by my using Chinese characters, but then I started seeing weird things with files made with EndNote -- which, last time I looked, seemed not to offer a way to "flatten" their fields.

                   

                  David

                  • 6. Re: Issues when importing Word document with endnotes
                    [Jongware] Most Valuable Participant

                    David, does the default "flatten fields" key Ctrl+6 work for you? As far as I am aware, this should work with any field.

                     

                    I can't check until after the weekend (all my interesting files are at the office), but I'm pretty sure the private EndNote data gets kicked out after a Flatten Fields command, because I have had the OP's problem in the past but can't remember encountering it after I started to rigorously Cleanse Word files.