8 Replies Latest reply on Jun 17, 2013 3:04 PM by DFBurns

    UTF-8 and converting to HTML entities

    DFBurns Level 1

      Hoping someone here can shed some light. While handling an export in my plugin, some of the metadata has characters encoded with UTF-8. I'd like to convert those to HTML entities. Anyone have sample code for this? I found this function http://lua-users.org/files/wiki_insecure/users/WalterCruz/htmlentities.lua and tweaked the bottom few lines to this:

       

      return string.gsub( str, "[^a-zA-Z0-9 _]",

          function (v)

              if entities[v] then return entities[v] else return v end

          end)

       

      What I get for a result is the original byte stream passed through without change. Am I missing something with basic Lua syntax or is there a subtlety with Lightroom itself?

       

      Thanks,

      db

        • 1. Re: UTF-8 and converting to HTML entities
          johnrellis Most Valuable Participant

          There are a number of issues with that sample code.  I suggest that you first read this post about how LR Lua handles Unicode characters:

           

          http://forums.adobe.com/message/3251706#3251706

           

          And here's more general information about Lua and Unicode:

           

          http://lua-users.org/wiki/LuaUnicode

           

          Next, make sure your text editor is saving any code file in UTF-8 format.  Otherwise, string literals may not get loaded properly.

           

          The expression:

           

          string.gsub( str, "[^a-zA-Z0-9 _]", function (...

           

          won't work.  The pattern [^a-zA-Z0-9 _] matches a single 8-bit character at a time.  But the Unicode characters that are the keys of the "entities" table are in fact represented as multiple 8-bit characters in a Lua string, so the function only ever sees one byte at a time and the lookup of a multibyte key never succeeds.  For example, the string '£' is actually a Lua string of length 2 (two 8-bit characters):

           

          string.len ('£') => 2

          string.byte ('£', 1, 1) => 194

          string.byte ('£', 2, 2) => 163

           

          I think you'll need to write two calls to string.gsub(), one that replaces Unicode characters whose UTF-8 encoding is 1 byte, and one for those that are multibyte.  This paragraph from the above link suggests how to write those patterns:

           

          Happily UTF-8 is designed so that it is relatively easy to count the number of unicode symbols in a string: simply count the number of octets that are in the ranges 0x00 to 0x7f (inclusive) or 0xC2 to 0xF4 (inclusive). (In decimal, 0-127 and 194-244.) These are the codes which can start a UTF-8 character code. Octets 0xC0, 0xC1 and 0xF5 to 0xFF (192, 193 and 245-255) cannot appear in a conforming UTF-8 sequence; octets in the range 0x80 to 0xBF (128-191) can only appear in the second and subsequent octets of a multi-octet encoding. Remember that you cannot use \0 in a Lua pattern.
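The two-pass idea above can be sketched as follows. This is a minimal sketch, not the code from htmlentities.lua: the table contents and the names entities, asciiEntities and toHtmlEntities are assumptions chosen for illustration. Keys are written with decimal escapes so the example does not depend on how the file is saved.

```lua
-- Hypothetical entity table keyed by UTF-8 byte sequences (decimal escapes).
local entities = {
  ["\194\163"] = "&pound;",  -- U+00A3 pound sign
  ["\194\169"] = "&copy;",   -- U+00A9 copyright sign
  ["\195\169"] = "&eacute;", -- U+00E9 e with acute accent
}

-- Single-byte characters that have special meaning in HTML.
local asciiEntities = {
  ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;", ['"'] = "&quot;",
}

local function toHtmlEntities(str)
  -- Pass 1: one-byte characters. Done first so that the "&" of entities
  -- produced in pass 2 is not escaped a second time.
  str = string.gsub(str, '[&<>"]', asciiEntities)
  -- Pass 2: a lead byte (194-244) followed by its continuation bytes
  -- (128-191) is handed to the callback as one multibyte character.
  str = string.gsub(str, "[\194-\244][\128-\191]*", function(ch)
    return entities[ch] or ch -- unknown characters pass through unchanged
  end)
  return str
end
```

Characters missing from the table, such as the euro sign, are left as their original UTF-8 bytes rather than dropped.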

          • 2. Re: UTF-8 and converting to HTML entities
            DFBurns Level 1

            Thanks, John.

             

            I've done a lot of transcoding work in the past with UTF8; I just wasn't sure what Lightroom's capabilities were. Apparently nothing more than what Lua provides (i.e. zero). The background links you gave were helpful and led to a few code examples. This one seemed best, being the most succinct and clear: https://github.com/alexander-yakushev/awesompd/blob/master/utf8.lua. After including that file in my project and making a small change to my own code, things now work as expected (I use IntelliJ so yes, my Lua files were saving in UTF8). It remains to be seen, though, whether it's better for maintainability to rely on an editor embedding UTF8 characters directly or to translate them all to their decimal equivalents to prevent future issues, i.e. using '\194\169' instead of the copyright symbol right in my code.
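A third option for the maintainability question is to key the table by code point and generate the UTF-8 bytes when the file loads, so neither literal characters nor hand-computed escapes appear in the source. A rough sketch, assuming a hypothetical helper named utf8char (it is not taken from utf8.lua or from the Lightroom SDK) and using only Lua 5.1 syntax:

```lua
-- Hypothetical helper: encode one Unicode code point as a UTF-8 string.
-- Assumes cp is a valid code point in the range 0 to 0x10FFFF.
local function utf8char(cp)
  local floor = math.floor
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(192 + floor(cp / 64), 128 + cp % 64)
  elseif cp < 0x10000 then
    return string.char(224 + floor(cp / 4096),
                       128 + floor(cp / 64) % 64,
                       128 + cp % 64)
  else
    return string.char(240 + floor(cp / 262144),
                       128 + floor(cp / 4096) % 64,
                       128 + floor(cp / 64) % 64,
                       128 + cp % 64)
  end
end

-- Illustrative table built from code points instead of embedded characters.
local entities = {
  [utf8char(0xA3)] = "&pound;",
  [utf8char(0xA9)] = "&copy;",
}
```

The trade-off is a little extra code in exchange for a table that reads as code points, which are easy to check against a Unicode chart.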

             

            It's too bad Lr doesn't include UTF8 functions right in the LrUtils library. Seems useful and essential.

             

            db

            • 3. Re: UTF-8 and converting to HTML entities
              jarnoh Level 1

              Isn't the simplest solution just to export your HTML as UTF-8 and just add meta charset=UTF-8 tag?

              • 4. Re: UTF-8 and converting to HTML entities
                johnrellis Most Valuable Participant

                Great.  That code utf8.lua does indeed look simple, clear, and very useful.

                 

                I agree, it would be better for LR to include more UTF8 functions in LrStringUtils.

                • 5. Re: UTF-8 and converting to HTML entities
                  DFBurns Level 1

                  Ah thank you, I meant LrStringUtils.

                   

                  FWIW, I'd like to hear someone comment on the small function in that code called utf8len(). The code refers to a variable named "chars". I can't tell if that is actually uninitialized (and possibly a bug) or if there's a Lua idiom going on that I don't know about.

                  • 6. Re: UTF-8 and converting to HTML entities
                    DFBurns Level 1

                    Simple but not standards-compliant. Most if not all modern browsers will render UTF-8 documents properly, but that is misleading: as I understand the standard, most Latin-1 characters and special characters like ampersands, copyright, etc. require the use of entity transcription, e.g. &copy;. If it's easy to "do the right thing," then I'll try to do that.

                    • 7. Re: UTF-8 and converting to HTML entities
                      johnrellis Most Valuable Participant

                      I'd like to hear someone comment on the small function in that code called utf8len(). The code refers to a variable named "chars". I can't tell if that is actually uninitialized (and possibly a bug) or if there's a Lua idiom going on that I don't know about.

                      "chars" is definitely uninitialized within that file and never assigned.  But as long as its value remains nil, utf8len() looks correct.

                       

                      Perhaps the code involving "chars" is debugging.  If "chars" is a non-nil number, and if the number of UTF-8 characters in the string is greater than or equal to "chars", the result is the number of string bytes representing the first "chars" UTF-8 characters of the string.  Can't see why that would be useful as written.
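For comparison, a length function without any "chars" parameter can be sketched from the LuaUnicode paragraph quoted earlier in the thread. This is not the code from utf8.lua; countUtf8Chars is a hypothetical name, and the sketch assumes the input is well-formed UTF-8. It counts every byte that is not a continuation byte (128-191), which for well-formed input is the same as counting the bytes that can start a character.

```lua
-- Hypothetical helper: count UTF-8 characters in a well-formed string.
local function countUtf8Chars(s)
  -- string.gsub returns the number of matches as its second result;
  -- the rewritten string itself is discarded.
  local _, count = string.gsub(s, "[^\128-\191]", "")
  return count
end
```

For malformed input the count is not meaningful, since a stray continuation byte is simply skipped.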

                      • 8. Re: UTF-8 and converting to HTML entities
                        DFBurns Level 1

                        Yeah, I can't figure it out. I think this is vestigial and I'm not going to go digging through the history to figure it out. I've stripped that out in my local copy. I also notice that utf8charbytes() assumes a well-formed utf8 string and isn't robust to bogus byte sequences. There's a potential array bounds problem too. Not hard to fix. Thanks for taking a look.
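One way to guard against bogus byte sequences is to check the string before handing it to the character functions. The following is a sketch under stated assumptions, not the fix applied to the local copy of utf8.lua: isWellFormedUtf8 is a hypothetical name, and the check covers lead bytes, continuation bytes and truncation only. It does not reject every overlong three- or four-byte form or encoded surrogates.

```lua
-- Hypothetical helper: structural check of a UTF-8 byte string.
local function isWellFormedUtf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = string.byte(s, i)
    local extra -- number of continuation bytes expected after the lead byte
    if b < 128 then
      extra = 0
    elseif b >= 194 and b <= 223 then
      extra = 1
    elseif b >= 224 and b <= 239 then
      extra = 2
    elseif b >= 240 and b <= 244 then
      extra = 3
    else
      return false -- stray continuation byte or a byte that cannot appear
    end
    if i + extra > n then
      return false -- sequence truncated at the end of the string
    end
    for j = i + 1, i + extra do
      local c = string.byte(s, j)
      if c < 128 or c > 191 then
        return false -- expected a continuation byte
      end
    end
    i = i + extra + 1
  end
  return true
end
```

Checking the bounds before reading the continuation bytes also avoids indexing past the end of the string when the last character is cut short.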

                         

                        db