I have imported MS Word documents with index entries (Cyrillic and Latin alphabets, quotation marks, etc.).
Now I want to generate the index, and I get this alert:
"The index could not be generated.
One or more index entries contain invalid characters.
Please delete any invalid characters from the index entries".
Either Word or InDesign sometimes adds a null character at the end of an index entry. Yup, "sometimes". Not "always", not "never". A null character is invalid in almost every text string. This script strips it from every top-level topic name:
var indexTopics = app.activeDocument.indexes[0].topics;
for (var i = 0; i < indexTopics.length; i++)
    indexTopics[i].name = indexTopics[i].name.replace(String.fromCharCode(0), '');
The way to check if this is the only problem is to run it, then try to regenerate your index.
The script does not check for sub-topics; if you have these and it still doesn't work, the script needs expanding.
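A minimal sketch of what that expansion might look like, assuming each topic exposes its sub-topics through a `topics` collection that supports `length` and index access, just like the top-level list. The function name `cleanTopicNames` is made up for illustration, and I have not tested how InDesign reacts to renaming nested topics this way:

```javascript
// Hypothetical helper: strip null characters from topic names,
// descending into sub-topics as well.
function cleanTopicNames(topics) {
    var changed = 0;
    for (var i = 0; i < topics.length; i++) {
        var topic = topics[i];
        // The /g flag removes every null character, not just the first one
        var cleaned = topic.name.replace(/\u0000/g, '');
        if (cleaned !== topic.name) {
            topic.name = cleaned;
            changed++;
        }
        // Assumption: nested topics are reachable as topic.topics
        if (topic.topics && topic.topics.length > 0) {
            changed += cleanTopicNames(topic.topics);
        }
    }
    return changed; // number of topic names that were modified
}
```

Call it with the same topic list the one-level script loops over; the return value tells you whether anything was actually changed.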
Internally, InDesign stores strings as Unicode. Null (U+0000) is an invalid character in most contexts, but so are others, such as U+FFFE (a noncharacter), U+1FFF (an unassigned code point), and U+D801 (a lone surrogate, that is, only half of a two-unit pair). But it depends on context as well: something like a Hard Return is legal inside text but (surely?) illegal inside an index entry.
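To find out which entries carry such characters, something along these lines could be run over each topic name. It is only a sketch: the function name `findSuspectCodeUnits` is made up, the rule "every control character is suspect in an index entry" is my assumption rather than anything InDesign documents, and unassigned code points such as U+1FFF are not detected because that would need a Unicode data table.

```javascript
// Hypothetical helper: list the code units in a string that look
// suspect for an index entry (controls, noncharacters, lone surrogates).
function findSuspectCodeUnits(text) {
    var hits = [];
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        var reason = null;
        if (code < 0x20 || code === 0x7F) {
            reason = 'control character'; // assumption: never wanted in an entry
        } else if (code === 0xFFFE || code === 0xFFFF ||
                   (code >= 0xFDD0 && code <= 0xFDEF)) {
            reason = 'noncharacter';
        } else if (code >= 0xD800 && code <= 0xDBFF) {
            var next = (i + 1 < text.length) ? text.charCodeAt(i + 1) : 0;
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++; // valid surrogate pair: skip the low half
            } else {
                reason = 'unpaired high surrogate';
            }
        } else if (code >= 0xDC00 && code <= 0xDFFF) {
            reason = 'unpaired low surrogate';
        }
        if (reason) {
            hits.push({ index: i, code: code, reason: reason });
        }
    }
    return hits;
}
```

An empty result means the name passed these particular checks, not that InDesign will necessarily accept it.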
Don't you think that Preflight should check and mark such characters?
I'd rather have InDesign not import invalid characters to begin with!
Thank you Jongware,
I have found the bug.
In Polish typography, compound words need a pair of characters, Discretionary Hyphen and Nonbreaking Hyphen, so that the hyphen is repeated at the left edge of the column. (This is a Polish and Slovak rule.)
This pair worked fine in CS2.
Now (since CS3 or CS4, I do not remember which) the pair is no longer needed; it is enough to choose Polish or Slovak as the language, and the hyphen is duplicated at the left edge of the column on the fly.
But the docx import engine still has the bug. If someone puts such a pair into an MS Word docx file, InDesign CS5.5 can't import it properly into the InDesign document.
I have checked the IDML file of my document.
The reference topic should be Yat-Kha. In MS Word it was Yat[discretionary hyphen and nonbreaking hyphen]Kha,
but inside the referenced topic in the IDML file there is something like this:
ReferencedTopic="u1b3TopicnYat<?AID 001f?><?AID 001e?>Kha"
I do not know what ?AID 001f? and ?AID 001e? are.
But now I know that these are the invalid characters.
That was the problem.
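If the four hex digits inside each `<?AID ...?>` marker are simply the code unit of the character (an assumption on my part; I have not checked this against the IDML documentation), a string copied out of the IDML file can be turned back into the real characters like this. The function name `decodeAidMarkers` is made up for illustration:

```javascript
// Hypothetical helper: replace each <?AID hhhh?> marker with the
// character whose code unit is given by the four hex digits.
function decodeAidMarkers(text) {
    return text.replace(/<\?AID ([0-9A-Fa-f]{4})\?>/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}
```

Under that assumption, `decodeAidMarkers('Yat<?AID 001f?><?AID 001e?>Kha')` yields "Yat", then the characters U+001F and U+001E, then "Kha".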
Well, if you know the value of your invalid characters, you can use my script to remove them from your InDesign index!
Replace the 'replace' line with this:
indexTopics[i].name = indexTopics[i].name.replace('\u001F\u001E', '-');
so both of these characters will be removed and replaced with a regular hyphen.
"AID somethin' somethin'" must be some IDML code to insert special characters as plain text; the "001f" part is what we're interested in. (I checked the same way, only I tested for a null character.)