>> Fair cop. I didn't realise that asc() returned the codepoint rather
>> than the actual character code.
>
> and what would be the difference?

Oh, sorry, whatever the term is (I'm crap with jargon). The value
returned by asc() for those chars was only two bytes (ie: four hex
digits). I didn't realise there was more to it than that, and that the
2-byte value maps to some other THREE-byte value. I need to do some
reading...
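
To make the codepoint-versus-bytes distinction concrete, here is a
minimal Java sketch. The sample character (U+65E5) is an assumption
picked purely for illustration and is not taken from the original post;
the class and method names are likewise made up. It prints the
four-hex-digit code point (the kind of value asc() was reporting) next
to the three bytes the same character occupies once encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    // Hypothetical sample character (U+65E5), not the OP's data.
    public static final String SAMPLE = "\u65e5";

    // The code point of the first character, as four hex digits.
    public static String codePointHex(String s) {
        return String.format("%04X", s.codePointAt(0));
    }

    // The UTF-8 encoding of the string, two hex digits per byte.
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("code point:  U+" + codePointHex(SAMPLE)); // U+65E5
        System.out.println("UTF-8 bytes: " + utf8Hex(SAMPLE));        // E697A5
    }
}
```

The code point is a number identifying the character; the byte sequence
depends on which encoding is used to store it.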

> that's what both cf & java counted as the length.

CHARACTER length, sure. No-one's disputing that. On the other hand,
no-one's asking about it, either.

> by adding a BOM you've already effected the encoding, which may or may
> not match the original. so you still don't know how many bytes were in
> the original string.

[groan]

Yes, that's a reasonable strawman there. I was only putting it in a file
so I could save it and check the number of bytes occupied by the data.

Clearly... CLEARLY... the OP is not asking for a character length of
that string. They've said as much.

I copied and pasted the string from their post, and used it as a
demonstration of how "nine" is not the right answer for the BYTE LENGTH
of that string. Whether the original string was UTF-8, UTF-16 or
special-marmoset-encoding, it almost certainly was NOT in some
fictitious kind of encoding in which each of those particular characters
occupies only one byte, which is what it would take for "nine" to be
the correct answer to the question.
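
As a rough illustration of that point, the sketch below compares
character length with byte length under two real encodings. The sample
string is a stand-in (nine copies of U+3042), NOT the OP's actual
string, and the class and helper names are hypothetical. Under these
assumptions neither encoding gives "nine" as the byte count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {
    // Hypothetical stand-in: nine copies of U+3042, not the OP's string.
    public static final String SAMPLE =
        "\u3042\u3042\u3042\u3042\u3042\u3042\u3042\u3042\u3042";

    // Number of bytes the string occupies in the given encoding.
    public static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // length() counts UTF-16 code units; here that equals the character count.
        System.out.println("chars:    " + SAMPLE.length());                                // 9
        System.out.println("UTF-8:    " + byteLength(SAMPLE, StandardCharsets.UTF_8));     // 27
        System.out.println("UTF-16BE: " + byteLength(SAMPLE, StandardCharsets.UTF_16BE));  // 18
    }
}
```

UTF_16BE is used rather than UTF_16 so that no byte-order mark is added
to the count.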

When I copied those characters from either the web browser or from my
text-based news agent, Notepad identified them (and rendered them
correctly) as UTF-8, so I'm fairly confident they ARE UTF-8. Of course
this could be down to some intermediary encoding (pasting them in to the
original posting, for example, via some encoding-transforming
mechanism), but Occam's Razor suggests the original question was from a
UTF-8 POV.

But maybe we should quit speculating and ask the OP. Unless they've
buggered off in despair of how drawn out all this is getting. For which
I would not blame them.

--
Adam