See pasted text from CSV at http://pastebin.com/mymhugpN
I am using InDesign CS6 (8.0.1)
I created the CSV by downloading it from a Google Spreadsheet as a CSV. I confirm with the Terminal that the character encoding is utf-8 usnig the file command.
I am trying to import a CSV file (utf-8) with Data Merge via the Select Data Source... command with Show Import Options checked. When viewing the Data Source Import Options dialog, I set the following options—Delimiter:Comma, Encoding:Unicode, Platform:Macintosh. I leave Preserve Spaces in Data Source unchecked. It fails to import any variables and produces no error message. I have tried other CSV files as well (created TextEdit, Espresso, etc.) and it seems that InDesign will not import any files if Unicode is specified as the encoding, no matter which other options are specified.
Can anyone else confirm this?
Importing as ACSII works, but obviously does not display my content correctly.
Importing as ACSII works, but obviously does not display my content correctly.
Obviously? What's obvious? What isn't being displayed correctly? I'm not sure that pastebin supports non-ascii characters, but I don't see anything in the paste in a quick look that should be a problem if the file is saved in ascii format.
If you want someone to test the file to see the results we'll need a link to a sample file.
It can be any valid unicode CSV file. One column or seven columns, one row or fifty rows, just alphabetic letters or numbers or unicode-only characters... it doesn't matter. None of them will import if Unicode is selected in the Data Source Import Options dialog.
Use what's in the pastebin link or copy and paste this into a new text file and save it as a utf-8 CSV and test:
Variable 1,Variable 2,Variable 3
(And pastebin does support unicode characters—there are several left- and right-double-quotes in my example.)
I'm not going to claim to be an expert on file encodings, but if I copy your text and save in Notepad (I'm a windows User) as UTF-8 I am presented with the following error:
If I susbsquently open in Excel and remove the added characters before the text in the first cell (Which I presume are the UTF-8 binary tag) I am able to load the file as ASCII and do the merge.
I also tried saving as Unicode text instead of UTF-8 in Notepad, and that works fine. As I say, I'm not expert in file encoding, but I think you may be under a misaprehension that UTF is Unicode encoding, and it simply is not a format that ID will read for Data Merge.
I've had a similar problem with text that was generated on Filemaker Pro for Mac. Drove me INSANE. Was with a former employer and has not been an issue since, but I had a somewhat clumsy workaround:
I haven't really investigated what was going on as the job was always red-hot urgent, and I haven't been able to replicate the fault as I don't have Filemaker Pro. If anyone reading would like to try this, have at it!
Google Drive only supports one format for exporting to CSV. I've tried converting the exported file to utf-8, utf-16—no dice. The best I can come up with is to import a unicode file as ASCII and then Find & Replace all out of range characters that were modified (sometimes lots of different characters, depending on the content).
I had no problem importing your data as either .txt or .csv. Granted I run CS5.5 but am also on a Mac and it went smoothly. I basically copied your data to TextEdit, saved and then imported through Data Merge. To get the CSV I opened that text in Excel and output as comma de-limited CSV.
It looks like your data (in PasteBin) has 2 different encodings, or at least has been created by 2 different software programs or maybe computers. You can see this in the text:
As you can see, there are two different quotation marks going on - the first looks like it's been copied straight from Word and the second looks to have been generated by a text or HTML program. This is also a common problem with websites when people supply text in Word/Excel and copy straight into HTML - those characters get messed up.
Run a Find and Replace on the first set of quotes in that text and the file should load with no problem.
Hi Peter, UltraEdit. Been using it since the mid 1990s for programming and then later, for HTML, PHP and what not.
As regards this issue, UE will preserve the BOM with UTF files if it is set to do so (and not convert them to the OS' format). Which is how I have it set. If one sees the odd characters at the beginning of a file, it is due to conversion to the OS format one is using. Otherwise the BOM is hidden from view.
Down towards the bottom of their page linked below explains the BOM if anyone cares. For UTF files it is necessary that these be preserved.
Take care, Mike
Sorry, been away at a weekend golf tournament. Tried to reply with my phone but up in the mountains I waas cut off on both reply attampts.
I am using CS6.
Your file is a UTF-8 ASCII file. If you want Unicode, it will be a UTF-16 file and as long as you are using Western languages, you really should be using the UTF-8 ASCII.
If you create a UTF-16 format CSV file, you will see that ID automatically determines it is the Unicode import type. Now, they are both Unicode (UTF-8 and UTF-16 that is). But to ID, I think the UTF-16 format(s) are used specifically for non-Western languages, or certainly a mix of Western and non-Western languages.
At least that's what I find.
When I import that utf-8 file as Unicode, nothing imports. When I import it as ASCII, I get various character conversions, such as
“ becomes ‚Äú
” becomes ‚Äù
’ becomes ‚Äô
… becomes ‚Ä¶
– becomes ‚Äì
and so on...
When I import that utf-8 file
Is this the file you uploaded to pastebin? Frankly I do not see those characters listed in post 18 in that file. If the missing characters you list in post 18 are suppose to be in the text, try creating a ZIP file of the text (both the .txt and the source Google Doc spreadsheet, maybe saved as an Excel or other spreadsheet type) and posting it somewhere like dropbox. Then provide a link.
The reason it will not import as Unicode, I believe, is that it is a UTF-8 file and is considered as ASCII for import.
The source data never existed as a .txt file. It was typed directly into a Google Spreadsheet. Here is a link to a zip containing the originally-downloaded .csv from Google Docs and then a utf-16 version that I converted using TextEdit on OS X.
Ok. I think that we aren't getting anywhere quick at my end. I still do not see any upper-range characters in your file(s).
I went to a web site and copied a bunch of characters and made a table. I added dollar signs where their characters were missing. But I think it illustrates that if the file is set correctly, ID automatically picks the proper format for import.
And the result when merged into a new file:
Converting the source records as an ANSI text file, and importing it, ID chooses:
And the resultant merge equals:
I downloaded the sample files, and I get the same character mapping issues. I wonder if this might be font-related. Is the font in Google spreadsheet really unincode compliant, or is it some older font that has Windows -Mac character mapping differences?
Google Doc export of the above here:
It is a UTF-8 ASCII.
Which means if you import it allowing ID to set the import type (Ascii) then it works fine.
Want me to ZIP up a Unicode (UTF-16) version? ID will then automagically pick Unicode for the type and the imports will be identical.
Mike is having some trouble posting in this thread (and I am too), but he sent me a PM with what he wanted to say:
OK. I think I might have a positive answer for you.
I was getting lost in the upper ASCII characters you showed. In your test file I never could see any--a case of not seeing the trees for the forest.
Your quote marks are getting dropped in your test file. Now, this may or may not affect other factors but it does in some further testing. I believe ID has an issue with dropping quote marks even in a plain ASCII file if the marks are at the beginning of a sentence and the file is tab delimited. Call it a bug.
Because of all the commas and quote marks in your simple file, I think you should be exporting from Google Docs' spreadsheet as a tab-delimited file. This exported file has to be opened in a text editor capable of saving it out as a UTF-16 BE (Big Endian) type of file.
Also, I think you are going to have to use proper quote marks throughout, or change them in the exported tab-delimited file. Best to have a correct source, though.
Here is your sample ZIPped up. I think it works properly. But then again, I think I might be bleary-eyed by now.
Take care, Mike
I still don't know what the problem was with posting yesterday. Admin says the error we were seeing usually means the spam filter kicked in because of something in the text, but all I did last night was remove a line of asterisks and it posted just fine, and as far as I know a line of asterisk is not flagged by the filters.
Have been following this thread with some interest as a user who uses Data Merge all of the time. Very rarely do I come across this issue, but when I do it is normally when the job is in a hurry.
I've downloaded the csv from the google file and noted the same thing the OP had noticed, and that was the quotes appearing as the OP had stated in part 18 of this thread.
My solution was to go into Textwrangler on a Mac and save the file as the following:
Then, went back into InDesign, selected "Select Data Source", and (making sure that "Show Import Options" is checked ON) selected the new database. A dialog box should now appear with these options:
and voila! The data should go from bad to good!
You are still not there, Flash.
Compare the right-hand side of your last screen shot to the one below:
See the difference? Look at the raw source again. There are double quotes, a lower and higher ascii quote. The lower ascii quote gets stripped. Download the ZIP at the link in my last message Peter was kind enough to post when I couldn't.
Take care, Mike
Actually. I think the stripping of the outer quote is correct for a .csv file. It's there to inidcate everything inside should be treated as literal, including commas, so that you don't wind up with extra fields.
Not to rain on any parades, but IMHO Pete has it right. The opening and closing quotes of that field aren't meant to be there. I stand by my screenshot.
My situation may also not be the same as the OPs, but I've had issues in the past from what FileMaker Pro calls tab delimted files (.tab) and that is twofold:
Thanks to this thread I have now overcome the latter issue....
...However I do not know if saving as specified in my earlier post will resolve the former issue as I don't have FileMaker Pro. Has anyone experienced this before?
Not for a tab delimited file. I know of no other application that mix and match delimiters. one should be able to have triple setz of quotes with a tab delimited file.
plus under certain circumstances, only the leading double quote is stripped.
i will stand by my assertion that id is goofing up a tab delimted utf-8 merge when double quotes are are present.
Colly—what you mentioned in post #27 fixed my problem, too. Thank you.
However, I still don't understand—what's the problem with the Google Docs-generated utf-8 file? Why does InDesign display the characters that way?
I think there is a bit of mixed communications here so I just want to clear this up.
The problem I am having is SIMILAR, but not IDENTICAL to the OP's issue.
The OP's data from post #20 only had straight quotation marks in the second to last field of each record, presumably because the other fields did not contain commas but the second to last field did; and is to ensure that only items within the straight quotation marks are the data for that field, and not where any commas may be.
I have seen this happen before but can't reliably say what software does/doesn't do this.
To confirm my theory, look at any reccord after the Devotion reference (e.g. Psalm 147:3 for example) and there is a comma to separate the field, then a straight quote. The end of that field does not always have a "double quote" but instead has a straight quote followed by a comma. I think the "double quote" is a red herring.
My dramas were spelled out in post #30 of this thread, and am on the lookout for any test data that can be supplied from FileMaker Pro, preferably exported as a .tab file; and featuring soft-returns in any of the sample data so I can replicate the fault from my old client.
Peter, mine are tab delimited. I.e., I stripped/converted the OP's database and thought I had made it clear above in the post you posted for me. The reason is with all the quotes, double quotes and commas in the narrative text, it is the only means of maintaining them else ID (properly) gets confused as to what to strip and what to retain.
For 15 years, I owned a company that wrote vertical market database publishing software. Mainly for the upper end of the insurance company spectrum (AIG, Zurich, and various entities operating under the Loyds of London umbrealla--which isn't an actual insurance company but that is another story--etc.).
We usually had to use tab delimited merges due to the same issue. I didn't need to jump through the hoops as seen in this thread to produce the merge. Plain ascii files for Word were used in the offices, but mostly used large-scale databases, sometimes using ZIM to interface, sometimes direct for the underrating, quotes and actuary reporting. I know it can be done without stripping the quote marks is the point.
Take care, Mike
But because YOU chose to convert from .csv to tab delimited doesn't mean the quote marks belong in merged data. You chose to preserve marks that inthe normal course of working with .csv are stripped out, so you can't say that preserving the .csv format during any of the translations and having those marks stipped in the final merge is an error.
I tend to use tab-delimited text for files like this, myself, but had they been saved that way origianlly I'm quite confident that the outer quotes would not have been included. Reading the text I see no indication that the entire block is a quotation with nested quotes inside (and most style guides would have nested quotes into single quote marks rather than double).
I bet if you resave your tab-delimited file that preserved the outer quotes as .csv from your text editor and open it again you'll find an additional set of new quotes outside.
It looks like I may not have the behavior for tab-delimited text entirely correct, as far as the adding of quotation marks, but here's a couple of screen shots to illustrate what happens when you save as .csv from Excel. I've never used Google Spreadsheet, but I don't have any reason to think it would behave differentely, or if it did that ID would understand that the file coming from Google as .csv was somehow different from one coming from another application and it should preserve quotes that would otherwise be discarded.
The text in Excel:
Note that there are no surrounding quotation marks.
The .csv opened in Notepad:
Now there are extra quotes surrounding the internal quote and the entire text string. I was surprised to see that Excel added the same quotes in the tab-delimited file I also saved, so it looks identical.
And this is that same .csv file merged into ID (after adding a header row in Notepad):
Extra quotes are stripped out as they should be.
Microsoft explains that if extra quote marks are not desired, one needs to use a macro to prevent Excel from adding them. I have VBA code that does this.
Without input from the OP, we may never know if the quote marks in the various fields are desired or not. In the raw paste, they are there, so I was going by that. I agree that style-wise, I would use italics for proper titles such as the beginning of record 8. I wouldn't quote the artist's description., etc.
Another (possible) means of obviating the ascii vs. unicode, utf-8 vs. utf-16 BE issue(s) might be to convert to XML. This has an added advantage of not needing to stitch the data merge's frames, copy/paste into another story, etc.
It's been a fun exercise and debate. I have not used ID's XML importing (mapping to styles, etc.) before, nor done much data merges with it. My stuff is rather pedestrian.
Take care, Mike
I've run into similar problems. I'm using OpenOffice to create a spreadsheet that I export to CSV, and use that as the data source for an InDesign CS2 document. When exporting from OO, if I use the default UTF-8 setting, and open the CSV file in TextWrangler, it shows as "UTF-8 with no BOM". If I resave it as UTF-8, it fails to import.
The thing is, I have some symbols in the Apple Symbols font that I need to import into my document. When the spreadsheet is exported as UTF-8, the file imports, but the symbols don't print correctly, but they show up correctly in Textwrangler. I found, through trial and error, that if I create the CSV file using the plain "Unicode" setting, then open it up in Textwrangler, it shows up as "Unicode UTF-16". I then import this data source into InDesign CS2, using the options "Unicode" and "PC", and it works perfectly.
I'm on Mac OS X 10.5.8 Intel.