There are a number of issues with that sample code. I suggest that you first read this post about how LR Lua handles Unicode characters:
And here's more general information about Lua and Unicode:
Next, make sure your text editor is saving any code file in UTF-8 format. Otherwise, string literals may not get loaded properly.
string.gsub( str, "[^a-zA-Z0-9 _]", function (...
won't work. The pattern [^a-zA-Z0-9 _] matches a single 8-bit character, but the Unicode characters that are the keys of the "entities" table are in fact represented as multiple 8-bit characters in a Lua string. For example, the string '£' is actually a Lua string of length 2 (two 8-bit characters):
string.len ('£') => 2
string.byte ('£', 1, 1) => 194
string.byte ('£', 2, 2) => 163
I think you'll need to write two calls to string.gsub(), one that replaces Unicode characters whose UTF-8 encoding is 1 byte, and one for those that are multibyte. This paragraph from the above link suggests how to write those patterns:
Happily UTF-8 is designed so that it is relatively easy to count the number of unicode symbols in a string: simply count the number of octets that are in the ranges 0x00 to 0x7f (inclusive) or 0xC2 to 0xF4 (inclusive). (In decimal, 0-127 and 194-244.) These are the codes which can start a UTF-8 character code. Octets 0xC0, 0xC1 and 0xF5 to 0xFF (192, 193 and 245-255) cannot appear in a conforming UTF-8 sequence; octets in the range 0x80 to 0xBF (128-191) can only appear in the second and subsequent octets of a multi-octet encoding. Remember that you cannot use \0 in a Lua pattern.
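As a rough sketch of how those byte ranges might translate into Lua 5.1 patterns: the first function below counts characters by counting lead bytes, and the second does the two gsub passes. The names utf8Length and escapeEntities and the small entities table are made up for illustration and are not taken from the original code. The ASCII pass runs first so that the '&' in a generated entity isn't escaped a second time.

```lua
-- Hypothetical entity table. Non-ASCII keys are written as decimal byte
-- escapes so the file's encoding cannot change them.
local entities = {
  ["&"] = "&amp;",
  ["<"] = "&lt;",
  [">"] = "&gt;",
  ["\194\163"] = "&pound;",  -- U+00A3 POUND SIGN
  ["\194\169"] = "&copy;",   -- U+00A9 COPYRIGHT SIGN
}

-- Count UTF-8 characters by counting the bytes that can start one:
-- 1-127 (ASCII) and 194-244 (multibyte lead bytes). NUL bytes are not
-- counted, because a Lua 5.1 pattern cannot contain \0.
local function utf8Length(s)
  local _, count = string.gsub(s, "[\1-\127\194-\244]", "")
  return count
end

local function escapeEntities(str)
  -- Pass 1: single-byte (ASCII) characters that need escaping.
  str = string.gsub(str, "[&<>]", entities)
  -- Pass 2: one lead byte (194-244) plus its continuation bytes (128-191).
  str = string.gsub(str, "[\194-\244][\128-\191]*", function(ch)
    return entities[ch]  -- returning nil leaves the character unchanged
  end)
  return str
end

print(utf8Length("a\194\163b"))                  --> 3
print(escapeEntities("\194\163 5 & \194\169"))   --> &pound; 5 &amp; &copy;
```

Characters with no entry in the table (for example 'é', bytes 195 169) pass through untouched, which is probably what you want if the output is served as UTF-8 anyway.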
I've done a lot of transcoding work with UTF-8 in the past; I just wasn't sure what Lightroom's capabilities were. Apparently nothing more than what Lua provides (i.e. zero). The background links you gave were helpful and led to a few code examples. This one seemed the best, being the most succinct and clear: https://github.com/alexander-yakushev/awesompd/blob/master/utf8.lua. After including that file in my project and making a small change to my own code, things now work as expected (I use IntelliJ, so yes, my Lua files were being saved as UTF-8). It remains to be seen, though, whether it's better for maintainability to rely on the editor embedding UTF-8 characters directly, or whether I should translate them all to their decimal byte escapes to prevent future issues, i.e. using '\194\169' instead of putting the copyright symbol right in my code.
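For what it's worth, the escaped form doesn't have to be worked out by hand. A minimal sketch, assuming Lua 5.1 (which has decimal escapes but no \x or \u escapes); toDecimalEscapes is a hypothetical helper name, not something from the SDK or from utf8.lua:

```lua
-- Hypothetical helper: rewrite every non-ASCII byte of s as a three-digit
-- decimal escape, so the result can be pasted into a Lua string literal.
-- Quotes and backslashes already in s are NOT escaped by this sketch.
local function toDecimalEscapes(s)
  local escaped = string.gsub(s, "[\128-\255]", function(byte)
    -- Three digits keep the escape unambiguous when a digit follows it.
    return string.format("\\%03d", string.byte(byte))
  end)
  return escaped
end

-- The copyright sign is the two bytes 194 and 169 in UTF-8.
print(toDecimalEscapes("\194\169 2024"))  --> \194\169 2024
```

Run once over the entity keys, this gives literals that stay the same regardless of how the editor saves the file.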
It's too bad Lr doesn't include UTF8 functions right in the LrUtils library. Seems useful and essential.
Isn't the simplest solution just to export your HTML as UTF-8 and just add meta charset=UTF-8 tag?
Great. That code utf8.lua does indeed look simple, clear, and very useful.
I agree, it would be better for LR to include more UTF8 functions in LrStringUtils.
Ah thank you, I meant LrStringUtils.
FWIW, I'd like to hear someone comment on the small function in that code called utf8len(). The code refers to a variable named "chars". I can't tell if that is actually uninitialized (and possibly a bug) or if there's a Lua idiom going on that I don't know about.
Simple but not standards-compliant. Most if not all modern browsers will render UTF-8 documents without problems, but that is misleading, since the standard for most Latin-1 characters and special characters like ampersands, copyright, etc. requires the use of entity transcription, e.g. &copy;. If it's easy to "do the right thing," then I'll try to do that.
"chars" is definitely uninitialized within that file and never assigned. But as long as its value remains nil, utf8len() looks correct.
Perhaps the code involving "chars" is debugging. If "chars" is a non-nil number, and if the number of UTF-8 characters in the string is greater or equal to "chars", the result is the number of string bytes representing the first "chars" UTF-8 characters of the string. Can't see why that would be useful as written.
Yeah, I can't figure it out. I think this is vestigial and I'm not going to go digging through the history to figure it out. I've stripped that out in my local copy. I also notice that utf8charbytes() assumes a well-formed utf8 string and isn't robust to bogus byte sequences. There's a potential array bounds problem too. Not hard to fix. Thanks for taking a look.
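A sketch of what a more defensive version might look like, for anyone patching their own copy. This is not the code from utf8.lua; utf8CharLength and utf8LengthChecked are made-up names, and the check is only structural (lead byte plus the right number of continuation bytes). It does not reject overlong encodings or surrogates.

```lua
-- Returns the byte length (1-4) of the UTF-8 character starting at byte i
-- of s, or nil if i is past the end or the bytes there are not a
-- structurally valid sequence.
local function utf8CharLength(s, i)
  i = i or 1
  local lead = string.byte(s, i)
  if not lead then return nil end           -- i is out of range
  local len
  if lead <= 127 then len = 1
  elseif lead >= 194 and lead <= 223 then len = 2
  elseif lead >= 224 and lead <= 239 then len = 3
  elseif lead >= 240 and lead <= 244 then len = 4
  else return nil end                        -- continuation or invalid byte
  -- Every following byte must exist and be a continuation byte (128-191).
  for k = i + 1, i + len - 1 do
    local b = string.byte(s, k)
    if not b or b < 128 or b > 191 then return nil end
  end
  return len
end

-- Returns the number of characters in s, or nil plus the byte position of
-- the first malformed sequence.
local function utf8LengthChecked(s)
  local i, count = 1, 0
  while i <= #s do
    local len = utf8CharLength(s, i)
    if not len then return nil, i end
    i = i + len
    count = count + 1
  end
  return count
end

print(utf8LengthChecked("a\194\163\226\130\172"))  --> 3
print(utf8LengthChecked("a\194"))                  --> nil   2
```

Because string.byte returns nothing for an out-of-range index, the truncated case falls out of the same nil check as the bad-byte case, so there is no separate bounds test to forget.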