4 Replies Latest reply on May 29, 2012 3:53 PM by John Hawkinson

    Importing large user dictionary crashes InDesign (word list specifications?)

    Shawn Pyle

      Just purchased a large medical dictionary with ~400K words and InDesign is having some problems importing it into a User Dictionary. Oddly, it appears to import fine but completely crashes when trying to write the dictionary to disk (the import progress bar is complete and I can see the .udc file size increase).

       

      The file I received seems to have some special characters and diacritics (¡rnadÛttir, ≈kerlund, Ângstrom, ÈtagËres') but I know that I can add these to user dictionaries manually when they show up in a text file. Unfortunately, I can't share this file with anyone because it was purchased and we signed a contract restricting it to our own use.

       

      So, the question is: what are the limits/specifications on word lists imported into user dictionaries? Size, character encoding, things to avoid, line break types, etc. The user manual is a bit silent on the specifics, I was unable to find anything conclusive using Google, and a phone call to support told me that an application crash/exception was "not a bug".

       

      Environment

      • OSX 10.6.8
      • InDesign CS5.5

       

       

      My word list text file:

      • encoding: Non-ISO extended-ASCII text, with CRLF line terminators
      • size: 4.8M
      • words: 394705 (one on each line)

       

       

      Things I've tried that produce errors (see below):

      • Importing the full file
      • Splitting files into smaller chunks (200K words, 100K words, 50K words); all seem to error (see below)
        • I can import 10K words per file, but at that rate I'd have 40 UDCs. That doesn't seem like a good idea.
      • Converting to UTF8 encoding (using iconv)
        • This fails for all size files, even 10K words.
      • Converting to the same encoding used by InDesign when exporting user dictionaries to a text file (ISO-8859 text, with CR line terminators). However, I was unable to get OSX to do this conversion (again using iconv). It seems that Non-ISO extended-ASCII should be the same as ISO-8859 (Latin-1) text. See http://en.wikipedia.org/wiki/Extended_ASCII
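The chunking and line-terminator conversion described in the list above can be sketched in Python; iconv converts character encodings only and leaves CRLF line terminators alone. This is a minimal sketch, assuming the source bytes really are ISO-8859-1, which the thread has not established. The helper names `normalize_word_list`, `chunk_words`, and `encode_chunk` are hypothetical, and nothing here is confirmed to be what InDesign expects on import.

```python
from typing import List


def normalize_word_list(raw: bytes, source_encoding: str = "iso-8859-1") -> List[str]:
    """Decode a one-word-per-line list and return the non-empty entries."""
    # Assumption: the caller knows (or is guessing) the source encoding.
    text = raw.decode(source_encoding)
    # Treat CRLF, lone CR, and lone LF all as line breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [line.strip() for line in text.split("\n") if line.strip()]


def chunk_words(words: List[str], size: int) -> List[List[str]]:
    """Split the list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def encode_chunk(words: List[str], encoding: str = "iso-8859-1",
                 newline: str = "\r") -> bytes:
    """Re-encode one chunk, terminating every entry with `newline`."""
    return "".join(word + newline for word in words).encode(encoding)
```

Calling `encode_chunk(chunk, "utf-8", "\n")` instead would produce a UTF-8 file with LF terminators, so the same three helpers cover each combination tried in the list.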

       

       

      Things I haven't tried:

      • Removing special characters, but I know that the user dictionaries can support UTF/special characters. Besides, it would limit the dictionary, so this isn't ideal. Also, when importing a very small file (< ~1000 words) with the special characters, InDesign handles them just fine.

       

       

      Error:

      Process:         Adobe InDesign CS5.5 [50383]

      Path:            /Applications/Adobe InDesign CS5.5/Adobe InDesign CS5.5.app/Contents/MacOS/Adobe InDesign CS5.5

      Identifier:      com.adobe.InDesign

      Version:         7.5.3.333 (7530)

      Code Type:       X86 (Native)

      Parent Process:  launchd [341]

       

      Date/Time:       2012-05-29 09:44:50.791 -0400

      OS Version:      Mac OS X 10.6.8 (10K549)

      Report Version:  6

       

      Interval Since Last Report:          59126 sec

      Crashes Since Last Report:           1

      Per-App Interval Since Last Report:  59061 sec

      Per-App Crashes Since Last Report:   1

      Anonymous UUID:                      723C81B5-3F1C-44B5-B6F3-A81AF6A67834

       

      Exception Type:  EXC_BAD_ACCESS (SIGBUS)

      Exception Codes: KERN_PROTECTION_FAILURE at 0x0000000000000000

      Crashed Thread:  0  Dispatch queue: com.apple.main-thread

       

      Thread 0 Crashed:  Dispatch queue: com.apple.main-thread

      0   ???                                     0xa08726f0 _XHNDL_trapback_instruction + 0

      1   ...inguistic.LinguisticManager          0x210c8a01 prox_cladd + 200

       

      Thread 1:  Dispatch queue: com.apple.libdispatch-manager

      0   libSystem.B.dylib                       0x90428382 kevent + 10

      1   libSystem.B.dylib                       0x90428a9c _dispatch_mgr_invoke + 215

      2   libSystem.B.dylib                       0x90427f59 _dispatch_queue_invoke + 163

      3   libSystem.B.dylib                       0x90427cfe _dispatch_worker_thread2 + 240

      4   libSystem.B.dylib                       0x90427781 _pthread_wqthread + 390

      5   libSystem.B.dylib                       0x904275c6 start_wqthread + 30

        • 1. Re: Importing large user dictionary crashes InDesign (word list specifications?)
          John Hawkinson Level 5

          Well.

          Can you tell us who you purchased the dictionary from? That might help to find people with similar experiences.

           

          Have you contacted the dictionary vendor? Do they claim it should work with InDesign?

          Your crash report tells us it crashes in the prox_cladd() function in the Linguistics plugin. Not all that helpful, but since that suggests it is adding a Proximity (prox_) dictionary, can you set it up to add a user dictionary as a Hunspell dictionary instead?

          Similarly, have you tried CS6? Which has stronger Hunspell defaults.

           

          Things I've tried that produce errors (see below):

          • Importing the full file
          • Splitting files into smaller chunks (200K words, 100K words, 50K words); all seem to error (see below)
            • I can import 10K words per file, but at that rate I'd have 40 UDCs. That doesn't seem like a good idea.
          • Converting to UTF8 encoding (using iconv)
            • This fails for all size files, even 10K words.
          • Converting to the same encoding used by InDesign when exporting user dictionaries to a text file (ISO-8859 text, with CR line terminators). However, I was unable to get OSX to do this conversion (again using iconv). It seems that Non-ISO extended-ASCII should be the same as ISO-8859 (Latin-1) text. See http://en.wikipedia.org/wiki/Extended_ASCII

          On your second bullet, are you sure that it is the size of the file rather than particular entries?

          Your reference to non-iso extended-ASCII doesn't make sense. The file is in some encoding, be it UTF-8, ISO-8859-1, or something. What encoding is it in?
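Both questions above (which entries are suspect, and what the bytes could be) can be narrowed down with a few lines of Python. This is a minimal sketch, not a real encoding detector: it can only say "pure ASCII", "valid UTF-8", or "some 8-bit encoding it cannot name". The function names `classify_bytes` and `non_ascii_entries` are hypothetical.

```python
from typing import List, Tuple


def classify_bytes(data: bytes) -> str:
    """Coarsely classify data as 'ascii', 'utf-8', or 'unknown 8-bit'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict UTF-8 decoding rejects most text written in a
        # single-byte encoding that contains accented letters.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown 8-bit"


def non_ascii_entries(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (line number, raw line) for every line with a byte >= 0x80."""
    lines = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n").split(b"\n")
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if any(byte >= 0x80 for byte in line)
    ]
```

Importing only the entries that `non_ascii_entries` does not flag would be one way to separate "the file is too big" from "particular entries are the problem".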

          When you say converting to utf-8 fails with iconv, you mean iconv fails, or InDesign continues to fail?

           

          If iconv can't convert to utf-8, then it's not too surprising to me that InDesign is unhappy as well.

           

          Does it work better with iconv -c?

          You might find using GNU recode more effective than iconv, but again, we'd need to know more about the original encoding.

           

          I can't imagine InDesign is expecting to read the file in anything other than Unicode or ISO-8859-1, and I'd be really surprised if it was 8859-1 (aka ISO-Latin1).

          • 2. Re: Importing large user dictionary crashes InDesign (word list specifications?)
            Shawn Pyle Level 1

            @John Hawkinson, thanks for your detailed questions!

             

            1. Can you tell us who you purchased the dictionary from?
            2. Have you contacted the dictionary vendor? Do they claim it should work with InDesign?
              • No, but that's a good place to start. I'll see if I can get it in a different encoding (maybe UTF-8), however it's a bit of a shot in the dark as I don't really know what to ask for because the text file specifications aren't specified.
            3. …can you set it up to add a user dictionary as a Hunspell dictionary instead?
              • The way I had been creating .UDC files was through the following process: Create User Dictionary (InDesign menu -> Preferences -> Dictionary -> New User Dictionary) and then Import text file (Edit menu -> Spelling -> User Dictionary… -> Select Target (newly added user dictionary) -> Import…). Looks like the Hunspell/Proximity option is only for the default user dictionary (/Users/username/Library/Application Support/Adobe/Linguistics/Dictionaries/Adobe Custom Dictionary/eng), right?
              • Importing into the default user dictionary (Proximity) crashes (like before) and I have to move the dictionary aside to get InDesign to restart.
              • Importing into the default user dictionary (Hunspell) exhibits the same behavior (prox_cladd() error)
            4. Similarly, have you tried CS6?
              • No, not yet. We have a large number of scripts that are custom tailored to InDesign so we'd have to consider upgrade implications before doing that.
            5. On your second bullet, are you sure that it is the size of the file rather than particular entries?
              • I'm not sure what it might be, although I can import the first 10K entries using the extended-ASCII encoding but not with the UTF-8 encoding. So it's possible it's an encoding/entry problem but I don't know what InDesign requires, hence the question.
            6. Your reference to non-iso extended-ASCII doesn't make sense. The file is in some encoding, be it UTF-8, ISO-8859-1, or something. What encoding is it in?
              • The file command (`file Stedmans2012.txt`) reports this "Stedmans2012.txt: Non-ISO extended-ASCII text, with CRLF line terminators". I've never seen anything like this either. iconv doesn't have an extended-ASCII format (`iconv -l`) that I can tell. Running the following commands fails to create the converted file:
                • `iconv -f ASCII -t ISO-8859-1 Stedmans2012.txt > iso8859.txt`
                • `iconv -f ASCII -t UTF-8 Stedmans2012.txt > utf8.txt`
              • However, the following commands succeed:
                • `iconv -f ISO-8859-1 -t ISO-8859-1 Stedmans2012.txt > iso8859.txt`
                • `iconv -f ISO-8859-1 -t UTF-8 Stedmans2012.txt > utf8.txt`
                • This makes me think that the file is actually in ISO-8859-1.
            7. When you say converting to utf-8 fails with iconv, you mean iconv fails, or InDesign continues to fail?
              • InDesign continues to fail with the newly created/converted files. I've created the UTF-8 file with the command above (assuming file is ISO-8859-1).
            8. Does it work better with iconv -c?
              • Yup, but I'd like to keep what I can and I can get fully converted files without it. Besides, I can create a fully converted ISO-8859-1 version that still doesn't import and crashes InDesign.
            • 3. Re: Importing large user dictionary crashes InDesign (word list specifications?)
              Shawn Pyle Level 1

              At John's suggestion, I downloaded a trial of InDesign CS6 and imported text files in both formats (extended-ASCII and UTF-8) without a problem into the Proximity, Hunspell, and "User Dictionary Only" formats. I did look at the /Users/username/Library/Application Support/Adobe/Linguistics/UserDictionaries/Adobe Custom Dictionary/en_US/added.txt file that was created on import in CS6 and it is indeed a UTF-8 file. I took that file and moved it to CS5.5's dictionary location: /Users/username/Library/Application Support/Adobe/Linguistics/Dictionaries/Adobe Custom Dictionary/ and it did seem to work fine in CS5.5 as I could see all the added words.

               

              I was also able to create a new user dictionary (.udc) and import those formats into it using CS6. I then pointed CS5.5 to the newly created .udc and it seems to be working.

               

              Looks like the import feature for dictionaries in CS5.5 is broken and CS6 fixes those inadequacies.

              • 4. Re: Importing large user dictionary crashes InDesign (word list specifications?)
                John Hawkinson Level 5

                Looks like the import feature for dictionaries in CS5.5 is broken and CS6 fixes those inadequacies.

                Hmm. Well, I'm glad you got it working. That's not quite the path I had envisioned, but it seems like it worked!

                I sort of worry that it may have failed to crash but still not worked properly. It's worth looking a little more closely at some specific words that matter.
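One way to look more closely is to compare the original word list against the list that CS6 wrote out. This is a minimal sketch under stated assumptions: that the exported file (the `added.txt` mentioned earlier in the thread) holds one word per line, and that the source encoding guess is correct. `missing_words` is a hypothetical helper, not part of any Adobe tooling.

```python
import unicodedata
from typing import List, Set


def _entries(raw: bytes, encoding: str) -> Set[str]:
    """Decode a one-word-per-line list into a set of NFC-normalized words."""
    text = raw.decode(encoding).replace("\r\n", "\n").replace("\r", "\n")
    return {
        # NFC makes precomposed and decomposed accents compare equal.
        unicodedata.normalize("NFC", line.strip())
        for line in text.split("\n")
        if line.strip()
    }


def missing_words(source: bytes, source_encoding: str,
                  imported: bytes, imported_encoding: str = "utf-8-sig") -> List[str]:
    """Return source words absent from the imported list, sorted."""
    # "utf-8-sig" reads UTF-8 with or without a leading byte order mark.
    return sorted(_entries(source, source_encoding) - _entries(imported, imported_encoding))
```

An empty result only shows that every word survived as text; it says nothing about whether InDesign actually uses those words when spell-checking, so a manual check of a few accented entries is still worthwhile.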

                 

                Anyhow some replies to your other post:

                On your second bullet, are you sure that it is the size of the file rather than particular entries?

                  • I'm not sure what it might be, although I can import the first 10K entries using the extended-ASCII encoding but not with the UTF-8 encoding. So it's possible it's an encoding/entry problem but I don't know what InDesign requires, hence the question.

                You can be pretty confident that any modern application that doesn't specify wants Unicode (and can probably accept any variant, e.g. UTF-8, UTF-16, etc.).

                 

                Your reference to non-iso extended-ASCII doesn't make sense. The file is in some encoding, be it UTF-8, ISO-8859-1, or something. What encoding is it in?

                  • The file command (`file Stedmans2012.txt`) reports this "Stedmans2012.txt: Non-ISO extended-ASCII text, with CRLF line terminators". I've never seen anything like this either. iconv doesn't have an extended-ASCII format (`iconv -l`) that I can tell. Running the following commands fails to create the converted file:
                    • `iconv -f ASCII -t ISO-8859-1 Stedmans2012.txt > iso8859.txt`
                    • `iconv -f ASCII -t UTF-8 Stedmans2012.txt > utf8.txt`
                  • However, the following commands succeed:
                    • `iconv -f ISO-8859-1 -t ISO-8859-1 Stedmans2012.txt > iso8859.txt`
                    • `iconv -f ISO-8859-1 -t UTF-8 Stedmans2012.txt > utf8.txt`
                    • This makes me think that the file is actually in ISO-8859-1.


                "file" uses heuristics to try to guess at an encoding. There's no way for it to determine whether a file is ISO-8859-1 (Latin1) versus ISO-8859-2 etc. The only way to know would be to find a word in the file that used characters that were in different places in those encodings, and to look that word up in a dictionary in order to determine what the characters should be. But "file" doesn't have a dictionary of all words in the English language, much less the languages that tend to use extended ASCII characters (English tends not to).
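The ambiguity described above is easy to see by decoding the same bytes under several single-byte encodings. This is a small illustrative sketch; `decode_variants` is a hypothetical helper and the sample bytes are made up for the demonstration, not taken from the purchased dictionary.

```python
from typing import Dict, Iterable


def decode_variants(data: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Decode the same bytes under each named encoding."""
    return {name: data.decode(name) for name in encodings}


# 0xC5 followed by plain ASCII letters.
variants = decode_variants(b"\xc5kerlund", ["iso-8859-1", "iso-8859-2", "mac_roman"])
for name, text in variants.items():
    print(f"{name:12} {text}")
```

All three decodes succeed, but the first byte comes out as A-ring under ISO-8859-1, L-acute under ISO-8859-2, and an almost-equal sign under Mac OS Roman. Only knowledge of the intended word tells you which reading is right.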

                 

                Again, as I tried to suggest earlier, there is no such thing as "extended-ASCII format"; that is just file telling you that it doesn't know what it is! If you tell iconv to convert from ASCII, it will just fail whenever it sees an extended ascii character. iconv from Latin1 to Latin1 is a no-op. It should succeed on any file. For instance:

                 

                paul-rand:tmp writer$ iconv -f iso-8859-1 -t iso-8859-1 < /bin/cat > /tmp/c

                paul-rand:tmp writer$ ls -ld /tmp/c

                -rw-r--r--  1 writer  wheel  44272 May 29 18:26 /tmp/c

                paul-rand:tmp writer$ md5 /tmp/c /bin/cat

                MD5 (/tmp/c) = cdefa50d737dfcf8dc57886ea1a758c4

                MD5 (/bin/cat) = cdefa50d737dfcf8dc57886ea1a758c4

                paul-rand:tmp writer$

                 

                Similarly, that a conversion from Latin1 to UTF8 succeeds is no guarantee either. It merely indicates that there is a Unicode character corresponding to every byte in the file, which is no surprise: Latin1 assigns a character to every byte value, and Unicode is a superset of Latin1. You could run that conversion on an arbitrary binary file (again, like /bin/cat) and have it "succeed." You could also convert from Latin2 (iso-8859-2) and have it "succeed."

                 

                It doesn't mean it's valid. You could very well have the wrong characters in words in your dictionary.
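The same point can be reproduced without iconv. The sketch below uses Python's built-in codecs as a stand-in for iconv (an assumption: the two implementations agree for these encodings), and `converts_from` is a hypothetical helper name.

```python
def converts_from(data: bytes, source_encoding: str) -> bool:
    """Return True if data can be converted from source_encoding to UTF-8."""
    try:
        data.decode(source_encoding).encode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every possible byte value once: a stand-in for an arbitrary binary file.
arbitrary = bytes(range(256))

print(converts_from(arbitrary, "ascii"))       # fails on the first byte >= 0x80
print(converts_from(arbitrary, "iso-8859-1"))  # "succeeds" on any input
print(converts_from(arbitrary, "iso-8859-2"))  # also "succeeds" on any input
# Latin1 to Latin1 is the identity on bytes, as in the /bin/cat example.
print(arbitrary.decode("iso-8859-1").encode("iso-8859-1") == arbitrary)
```

A successful conversion therefore says only that the declared source encoding has a character for every byte present, not that the declared encoding is the right one.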

                When you say converting to utf-8 fails with iconv, you mean iconv fails, or InDesign continues to fail?

                  • InDesign continues to fail with the newly created/converted files. I've created the UTF-8 file with the command above (assuming file is ISO-8859-1).

                 

                I guess this means InDesign is unhappy with something in the files. I'm surprised it fails in the same way, though.

                 

                 

                Does it work better with iconv -c?

                  • Yup, but I'd like to keep what I can and I can get fully converted files without it. Besides, I can create a fully converted ISO-8859-1 version that still doesn't import and crashes InDesign.

                The intent of using this was to test/confirm that it was the special characters that were causing the problem. Not to actually be a workable solution.

                 

                I don't understand your "Besides" comment though.

                 

                Looking back at your original message, this is telling:

                The file I received seems to have some special characters and diacritics (¡rnadÛttir, ≈kerlund, Ângstrom, ÈtagËres')

                I'm just an uncouth monoglot American, so the only one of these words I know is Ångström. But I know it's an A-ring (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE), not an A-circumflex (U+00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX). So it sounds like something's already messed up. But maybe it's just your cut-and-paste or your browser?

                 

                Unfortunately, I can't find an example of an encoding that has A-ring in the C2 position. All of UTF-8 and ISO-8859-{1,2,3,4,9,15} have A-circumflex in the C2 position, and the glyph is not available in ISO-8859-{5,6,7,8,13} or Shift_JIS.
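One hypothesis the thread does not test: the four sample words look like single-byte text (ISO-8859-1 or similar) that was displayed as Mac OS Roman before being pasted. The sketch below reverses that guess by turning the displayed characters back into Mac OS Roman bytes and reading those bytes as ISO-8859-1. Both encodings are assumptions about what happened on the poster's machine, and `reinterpret` is a hypothetical helper name.

```python
def reinterpret(text: str, displayed_as: str, actual: str) -> str:
    """Recover text whose bytes were shown under the wrong encoding."""
    # Step 1: recover the original bytes from the mis-displayed characters.
    # Step 2: decode those bytes with the encoding we suspect is correct.
    return text.encode(displayed_as).decode(actual)


for garbled in ["¡rnadÛttir", "≈kerlund", "Ângstrom", "ÈtagËres"]:
    print(garbled, "->", reinterpret(garbled, "mac_roman", "iso-8859-1"))
```

Under this guess the words read as Árnadóttir, Åkerlund, ångstrom, and étagères, which would make the odd A-circumflex a lowercase a-ring (byte 0xE5) rather than a capital in the C2 position. If that is right, the file is consistent with ISO-8859-1 and the garbling happened only when it was viewed; the thread itself does not confirm this.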

                 

                And I don't know what's up with the missing o-diaeresis. I guess a lot of people don't spell Ångström that way...