7 Replies Latest reply on Sep 2, 2014 4:20 PM by johnrellis

    How the SDK handles unicode

    johnrellis Most Valuable Participant

      I spent several painful hours learning the following about how the SDK handles unicode characters -- perhaps I've missed where this is documented?  Here's what I learned:

       

      - Lua strings are sequences of 8-bit characters (bytes).

       

      - A unicode ZString is represented as a Lua string containing the UTF-8 encoding of the unicode ZString.  For example, the trademark character (TM) is unicode codepoint 2122 (hex), and the ZString LOC "$$$/unicode/tm=^U+2122" is represented as a Lua string of length three, the UTF-8 encoding of that character (decimal bytes 226 132 162).

       

      - A posting from Adobe employee "escouten" last year said that all SDK APIs treat all Lua strings as UTF-8 encodings of unicode strings.  I've personally observed that with LrView, LrFileUtils, and LrTasks.execute, but haven't checked other APIs.  In particular, a Windows unicode filename will be returned by LrFileUtils as a Lua string encoding the filename in UTF-8.  Passing that filename in a command line to LrTasks.execute works correctly.  (But writing a Windows batch file with a UTF-8 filename won't in general work -- a topic for another day.)
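The trademark example above can be checked in plain Lua, outside Lightroom. This is only a sketch: `utf8Encode` is a hypothetical helper written for illustration, not an SDK function. It encodes a code point by hand and confirms that U+2122 becomes the three bytes 226 132 162.

```lua
-- Hypothetical helper (not part of the SDK): encode one Unicode code point
-- as a UTF-8 byte string, using only arithmetic that works in Lua 5.1.
local function utf8Encode(cp)
    if cp < 0x80 then
        return string.char(cp)
    elseif cp < 0x800 then
        return string.char(
            0xC0 + math.floor(cp / 0x40),
            0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        return string.char(
            0xE0 + math.floor(cp / 0x1000),
            0x80 + math.floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40)
    else
        return string.char(
            0xF0 + math.floor(cp / 0x40000),
            0x80 + math.floor(cp / 0x1000) % 0x40,
            0x80 + math.floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40)
    end
end

local tm = utf8Encode(0x2122)   -- the trademark character
print(#tm, tm:byte(1, -1))      -- length, then the individual byte values
```

Because `#` counts bytes, not characters, the length of `tm` is 3 even though it holds a single character.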

        • 1. Re: How the SDK handles unicode
          areohbee Level 5

          Thanks John,

           

          I took the liberty to add this Pearl Of Wisdom to the lrdevplugin FAQ as well:

           

          https://www.assembla.com/wiki/show/lrdevplugin/Character_Encoding_-_Unicode_UTF-8

           

          Rob

          • 2. Re: How the SDK handles unicode
            DonCristobal Level 1

            Hi John,

             

            I'm struggling a little with this UTF-8 topic currently. I can sympathize with your several painful hours now. :-)

            1) Can you (or somebody else) reproduce the following issue? (Win 8.1, LR 5.6)

            Suppose your photos are stored in a directory whose name contains non-ASCII characters, such as c:\users\username\Pictøäöüש (the last letter being the Hebrew letter shin). (This is kind of my test case after users from Norway and Israel reported problems.)

             

                local picName = selectedPhoto:getRawMetadata ("path")

                outputToLog (picName)

             

            I get the wrong result:

            C:\Users\username\Pictøäöüש\7L6B7931.CR2

            If I use, on the other hand, getFormattedMetadata:

             

            outputToLog (selectedPhoto:getFormattedMetadata ("folderName") .. " and " .. selectedPhoto:getFormattedMetadata ("fileName"))

             

            I get a correct result (but not the full pathname)

            Pictøäöüש and 7L6B7931.CR2

            Going from there, I could probably figure out the full path name (which does not seem to be offered in getFormattedMetadata), but I would like to figure out what's wrong with selectedPhoto:getRawMetadata ("path").

             

            2) The following is more for reference: I cannot seem to pass the previews.db path name to sqlite if that path (the LR catalog path) contains non-ASCII UTF-8 characters.  (Other UTF-8 commands on the command line work well.) chcp 65001 doesn't help. sqlite is supposed to accept UTF-8 characters in the db name, but somehow doesn't (at least my version, which is somewhat older). I have worked around this issue by first cd-ing to the directory and then starting sqlite, i.e. along the lines of "cd <previews-dir> && sqlite3 previews.db". This seems to work so far, though some new issues have come up, and I don't yet know whether they are related to this.
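The "cd first, then open by leaf name" workaround could be assembled roughly as follows. This is a sketch under assumptions: `splitPath` and `sqliteCommandViaCd` are hypothetical names, the `/d` switch (so that `cd` also changes drive) is an addition not mentioned in the post, and inside a plugin `LrPathUtils.parent` and `LrPathUtils.leafName` would probably replace the hand-written split.

```lua
-- Hypothetical helper: split a path at its last separator (\ or /).
local function splitPath(path)
    local dir, leaf = path:match("^(.*)[\\/]([^\\/]+)$")
    return dir, leaf
end

-- Hypothetical helper: build a command line that changes into the database's
-- directory first, so that sqlite3 only ever sees the plain leaf name.
local function sqliteCommandViaCd(dbPath)
    local dir, leaf = splitPath(dbPath)
    if not dir then
        -- No directory part: nothing to cd into.
        return string.format('sqlite3 "%s"', dbPath)
    end
    -- Assumption: "cd /d" is used so a catalog on another drive still works.
    return string.format('cd /d "%s" && sqlite3 "%s"', dir, leaf)
end

print(sqliteCommandViaCd("C:\\Cat\\Cat Previews.lrdata\\previews.db"))
```

The resulting string is what would then be handed to the command-execution API; whether the non-ASCII directory name survives that hand-off is exactly the open question in this thread.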

            • 3. Re: Re: How the SDK handles unicode
              johnrellis Most Valuable Participant

              Re 1): I can't reproduce the problem on LR 5.6 / Windows 8.1.  Here's what photo:getRawMetadata() returns for me:

              [Screenshot: capture1.png]

              When I log the result to a file and then examine it with Sublime 2, I see the expected answer:

              [Screenshot: capture2.png]

              Perhaps the problem you're observing is somewhere between the call to your function outputToLog() and the text editor you're using to examine the log file.  Even in 2014, Unicode is an unnatural act for much software.
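One way to take the text editor out of the picture is to log the raw bytes of the string as hex, which is pure ASCII and displays the same everywhere. A minimal sketch, assuming a hypothetical helper named `hexBytes` (the poster's `outputToLog` would receive its result):

```lua
-- Hypothetical helper: render every byte of a string as two hex digits,
-- separated by spaces, so the log contains only ASCII.
local function hexBytes(s)
    local out = {}
    for i = 1, #s do
        out[#out + 1] = string.format("%02X", s:byte(i))
    end
    return table.concat(out, " ")
end

-- "Pict" followed by the UTF-8 bytes for o-with-stroke (U+00F8) and the
-- Hebrew letter shin (U+05E9), written as escapes so that this source file
-- itself stays ASCII.
local sample = "Pict\195\184\215\169"
print(hexBytes(sample))
```

If the logged bytes for a non-ASCII character form a valid UTF-8 sequence (for example `C3 B8` for ø), the string coming out of the SDK is fine and the mangling happens later, in the viewer.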

              • 4. Re: Re: How the SDK handles unicode
                johnrellis Most Valuable Participant

                Re 2): Though you said you used "chcp 65001", you didn't post an example .bat file, and this does smell like a problem with cmd.exe and its antiquated concept of code pages.  A couple of things to narrow this down:

                 

                - Try another shell, e.g. Cygwin's bash, to invoke sqlite3.exe.  If it runs under that shell, then the problem is related to cmd.exe.

                 

                - Use the Windows 7 "run" command to run the sqlite3 command line.  (On Windows 7, you type "run" into the Start search box; I forget the details of how you do it on Windows 8.)  I don't believe the "run" command uses cmd.exe and thus could avoid its issues with code pages.

                 

                - Rather than opening the database by passing it on the command line, write all the sqlite commands to a temporary file and invoke sqlite3 with:

                 

                sqlite3 < tempfile

                 

                You'll need a newer version of sqlite3 that has the ".open" command.
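The temporary-file approach might look like the sketch below. `buildSqliteScript` is a hypothetical name, and the query is only a placeholder. One assumption to flag: the sqlite3 shell processes backslash escapes inside double-quoted dot-command arguments, so the sketch converts backslashes to forward slashes (which Windows also accepts in paths) before writing the `.open` line.

```lua
-- Hypothetical helper: build the text of a sqlite3 script that opens the
-- database via ".open" instead of taking its path from the command line.
local function buildSqliteScript(dbPath, statements)
    -- Assumption: forward slashes avoid backslash-escape processing in the
    -- double-quoted ".open" argument.
    local portablePath = (dbPath:gsub("\\", "/"))
    local lines = { string.format('.open "%s"', portablePath) }
    for _, sql in ipairs(statements) do
        lines[#lines + 1] = sql
    end
    return table.concat(lines, "\n") .. "\n"
end

local script = buildSqliteScript(
    "C:\\Users\\me\\cat\\previews.db",
    { "SELECT count(*) FROM sqlite_master;" })
print(script)
```

The returned text would be written to the temporary file as raw bytes and then fed to `sqlite3 < tempfile`, so the UTF-8 path never passes through the cmd.exe command line.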

                • 5. Re: Re: How the SDK handles unicode
                  DonCristobal Level 1

                  Hi John,

                  re 1) Many thanks for checking this. It was indeed a problem with the text editor not showing the result correctly; I didn't expect Notepad to not handle UTF-8 by default (Windows is not my native platform and I haven't used it much for a couple of years). Worse, LrDialogs.message gives the wrong output too, which at the time I had taken as confirmation.  I have now checked with Sublime 2 and it looks good. One problem solved...

                  re 2) I'll get back to you later on this.

                  - Chris

                  • 6. Re: Re: How the SDK handles unicode
                    DonCristobal Level 1

                    You mention using a .bat file in the context of chcp 65001. In my earlier unsuccessful attempts, I simply passed "chcp 65001 && sqlite ... ", assuming that this would switch the codepage before passing the sqlite UTF-8 parameter. Now I'm thinking that maybe the command still gets passed with the old codepage, and thus the UTF-8 is mangled. Is this why you are referring to a batch file?

                    • 7. Re: How the SDK handles unicode
                      johnrellis Most Valuable Participant

                      (Quoting DonCristobal:) I simply passed "chcp 65001 && sqlite ... ",

                      Do you mean you passed that string to LrTasks.execute()?  A couple of thoughts:

                       

                      - It may be that the command line is completely parsed before it is executed, so by the time chcp executes, the rest of the command line has already been interpreted as ASCII rather than UTF-8.

                       

                      - I'm not sure how LrTasks.execute() executes its command line, and in particular, whether it is passing a UTF-8 string or an ASCII string to cmd.exe.  In the past, when I've wanted complete control over how a command line gets executed on Windows, I've written a temporary batch file with "chcp 65001" as the first line and then executed that.
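The temporary batch file described above could be generated along these lines. This is a sketch, not the poster's actual code: `buildBatchFile` and `writeFile` are hypothetical names, the `> nul` redirect (to hide chcp's status message) is an addition, and it assumes the standard Lua `io` library is available where the code runs.

```lua
-- Hypothetical helper: batch file text with the code-page switch as its
-- first line, followed by the command to run. Uses CRLF line endings.
local function buildBatchFile(commandLine)
    return table.concat({
        "chcp 65001 > nul",   -- switch cmd.exe to the UTF-8 code page
        commandLine,          -- already UTF-8, since Lua strings are bytes
    }, "\r\n") .. "\r\n"
end

-- Hypothetical helper: write the text as raw bytes. Binary mode prevents
-- newline translation, and no byte order mark is written.
local function writeFile(path, text)
    local f = assert(io.open(path, "wb"))
    f:write(text)
    f:close()
end

local batch = buildBatchFile('sqlite3 "previews.db" < commands.sql')
print(batch)
```

The path of the written batch file would then be what gets executed. Note that a literal `%` in a path has to be doubled inside a batch file, which this sketch does not handle.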