how to get the string's byte length?

Report · Oct 10, 2007

I have a string and I want to get its byte length. How can I do that?

for example:

<cfoutput>#len('hihi，这是测试')#</cfoutput>

output is 9

I want to get the byte length, which is 14. How can I get it?
Thanks.

Report · Oct 10, 2007

Flashcqxg wrote:
> I want to get the byte length, which is 14. How can I get it?

why do you think that string's length is 14?

<cfprocessingdirective pageencoding="utf-8">
<cfscript>
t="hihi，这是测试";
jText=createObject("java","java.lang.String").init(t);
b=t.getBytes();
writeoutput("#t#
<br>cf len:=#len(t)#
<br>byte len:=#arrayLen(b)#
<br>java string length:=#t.length()#
<br>java string objectlength:=#jText.length()#");
</cfscript>

all methods return "9".

Report · Oct 10, 2007

But the Chinese characters are double-byte ones, so the BYTE length SHOULDN'T be 9, it should be 14: four single-byte characters and five double-byte ones (the comma is a double-byte comma too).

The STRING length might be 9, sure.

The size of the chars can be seen by doing this (code attached).

I have to dash to work, but will look at this some more later on.

--
Adam

Report · Oct 11, 2007

Adam Cameron wrote:
> But the Chinese characters are double-byte ones, so the BYTE length SHOULDN'T

we don't know those are "double-byte" (and it's been so long since i've thought
about bytes for chars, all of this is stale so i'm probably not remembering
everything i should). if they're utf-8 then they're most likely represented by 3
(or maybe 4) bytes. for instance: 好 has a unicode codepoint of U+597D, in utf-8
it's represented as E5 A5 BD, in utf-16 it *is* 2 bytes (59 7D) but then again
*everything* in the BMP is 2 bytes in utf-16 though some "supplemental" chars
need 4 bytes. if it's some windows or whatever codepage then it could be 2-3
bytes per char depending.
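
To make the U+597D example concrete, here is a minimal Java sketch that dumps the bytes one character occupies under two encodings. The class and method names (`CodePointBytes`, `hex`, `utf8Hex`, `utf16Hex`) are made up for illustration and are not from the thread; the character is written as a `\u` escape so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class CodePointBytes {
    // Format a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Bytes of the string when encoded as UTF-8.
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Bytes of the string when encoded as big-endian UTF-16 without a BOM.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }
}
```

For `"\u597D"`, `utf8Hex` gives `E5 A5 BD` and `utf16Hex` gives `59 7D`: one code point, two different byte counts.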

the byte size depends on the encoding which i guess depends somewhat on where
the data's coming from, if say from a db that stores unicode (UCS2) then the
length would be 18 bytes for the *whole* string.

<cfquery datasource="lab" name="test">
SET NOCOUNT ON
DECLARE @t nvarchar(100)
SET @t=N'hihi，这是测试';
SELECT t=@t, l=dataLength(@t)
SET NOCOUNT OFF
</cfquery>

ditto for something that's using utf-16.
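
The 18-byte figure can be checked from the Java side as well. This is only a sketch: `Utf16Length` and `utf16Bytes` are hypothetical names, and it assumes big-endian UTF-16 without a BOM, which for BMP-only text has the same size as UCS-2.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class Utf16Length {
    // Number of bytes the string needs in UTF-16 (big-endian, no BOM).
    // Every BMP character takes 2 bytes; supplementary characters take 4.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

For the 9-character test string this returns 18. Note that Java's plain `UTF_16` charset prepends a 2-byte BOM when encoding, so the same string comes out as 20 bytes there.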

utf-8 is trickier (besides being a valid encoding for some db like mysql) as how
many bytes are needed for each char depends on what's being encoded. a guess
might be codepoints <=127 take 1 byte, 128-2047 take 2 bytes, everything else in
the BMP (<=65,535) need 3 bytes and everything outside the BMP needs 4 bytes.
mind you there's "empty" gaps in these ranges which are getting filled as time
goes by (for example, lanna, what they speak in northern thailand, might get
added w/unicode 6 in the codepoints 6784-6895 range--i cheated & looked that up
just now).
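
Those ranges translate fairly directly into code. The sketch below walks the string by code point and adds up the byte counts; `Utf8Ranges` and `utf8Length` are illustrative names, and the method assumes well-formed input (no unpaired surrogates).

```java
// Hypothetical helper, not part of the original thread.
class Utf8Ranges {
    // UTF-8 byte length computed from the code point ranges:
    // <= 127 -> 1 byte, <= 2047 -> 2, rest of the BMP -> 3, beyond -> 4.
    static int utf8Length(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp <= 0x7F) {
                total += 1;
            } else if (cp <= 0x7FF) {
                total += 2;
            } else if (cp <= 0xFFFF) {
                total += 3;
            } else {
                total += 4;
            }
            // Step over one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return total;
    }
}
```

For the test string this gives 4 × 1 + 5 × 3 = 19, the same as `getBytes` with UTF-8 would report.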

java strings are a lot easier.

Report · Oct 11, 2007

> But the Chinese characters are double-byte ones, so the BYTE length SHOULDN'T
> we don't know those are "double-byte"

Well "not SINGLE byte" then, which was my point.

> if they're utf-8 then they're most likely represented by 3 (or maybe 4) bytes

Fair cop. I didn't realise that asc() returned the codepoint rather than the actual character code.

Whichever way one dresses this up, one can't just tally up the number of characters, and say "there: nine characters: nine bytes". Which is what you did.

If one saves that string to a (UTF-8) file, it's actually 22 bytes long.

Which I take to be:
1-3 BOM
4 h
5 i
6 h
7 i
8-10 ，
11-13 这
14-16 是
17-19 测
20-22 试

I'm guessing that, like me, the OP was counting two bytes for each of the latter five characters; ignoring the BOM which is only relevant if it's in a file, there should be SOME way of getting an answer of "19" for this?
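
The 22 versus 19 split can be reproduced without touching the file system. This is a sketch under the assumption that the editor wrote the standard UTF-8 BOM (EF BB BF); `BomFile` and `fileBytes` are made-up names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class BomFile {
    // The three bytes of the UTF-8 byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Bytes a UTF-8 text file would contain: optional BOM, then the encoded text.
    static byte[] fileBytes(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (withBom) {
            out.write(UTF8_BOM, 0, UTF8_BOM.length);
        }
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }
}
```

With the BOM the test string comes to 22 bytes, without it 19, which matches the breakdown above.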

--
Adam

Report · Oct 11, 2007

> Fair cop. I didn't realise that asc() returned the codepoint rather than the actual character code.

and what would be the difference?

>Whichever way one dresses this up, one can't just tally up the number of characters, and say "there: nine characters:
>nine bytes". Which is what you did.

that's what both cf & java counted as the length. which given no idea about the encoding, the fact that it's a string (& that i haven't thought about something like this in years) makes sense to me.

>1-3 BOM

by adding a BOM you've already affected the encoding, which may or may not match the original. so you still don't know how many bytes were in the original string.

> only relevant if it's in a file, there should be SOME way of getting an answer of "19" for this?

assuming it's utf-8 encoded, i suppose he could use my uBlocks CFC to figure out the exact number of chars in each of the unicode blocks in that string, then multiply by the number of bytes needed for each char. i think it's still up on the cf exchange. otherwise w/out knowing the original encoding you don't know "19" is the correct answer.

Report · Oct 11, 2007

>> Fair cop. I didn't realise that asc() returned the codepoint rather than the
> actual character code.
>
> and what would be the difference?

Oh, sorry, whatever the term is (I'm crap with jargon). The value returned
by asc() for those chars was only two bytes (ie: four hex digits). I
didn't realise there was more to it than that, and that 2-byte value maps
to some other THREE byte value. I need to do some reading...

> that's what both cf & java counted as the length.

CHARACTER length, sure. No-one's disputing that. On the other hand,
no-one's asking about it, either.

> by adding a BOM you've already affected the encoding, which may or may not
> match the original. so you still don't know how many bytes were in the original
> string.

[groan]

Yes, that's a reasonable strawman there. I was only putting it in a file
so I could save it and check the number of bytes occupied by the data.

Clearly... CLEARLY... the OP is not asking for a character length of that
string. They've said as much.

I copy and pasted the string from their post, and used it as a
demonstration of how "nine" is not the right answer for the BYTE LENGTH of
that string. Whether or not the original string was UTF-8, UTF-16 or
special-marmoset-encoding, it almost certainly was NOT in a fictitious kind
of encoding in which each of those particular characters only occupied one
byte each, which would mean that "nine" is the correct answer to the
question.

When I copied those characters from either the web browser or from my
text-based news agent, notepad identified them (and rendered them
correctly) as UTF-8, so I'm fairly confident they ARE UTF-8. Of course
this could be down to some intermediary encoding (pasting them in to the
original posting, for example, via some encoding-transforming mechanism),
but Occam's Razor suggests the original question was from a UTF-8 POV.

But maybe we should quit speculating and ask the OP. Unless they've
buggered off in despair of how drawn out all this is getting. For which I
would not blame them.

--
Adam

Report · Oct 11, 2007

Whilst checking out characters' byte lengths, I found this site: http://www.fileformat.info/info/unicode/char/search.htm, which is good for looking that sort of thing up.

Just FYI.

--
Adam

Report · Oct 11, 2007

you copied & pasted from this forum so of course the encoding is utf-8. if Flashcqxg thinks those chars are 2 bytes each, what would lead you to think the original encoding was utf-8?

Report · Oct 11, 2007

> you copied & pasted from this forum so of course the encoding is utf-8. if Flashcqxg thinks those chars are 2 bytes each, what would lead you to think the original encoding was utf-8?

Oh for goodness sake. YES I KNOW.

I was illustrating the point that simply counting the number of characters
in a string is *not* a way of determining how many bytes it occupies. The
OP's string, Adobe's facsimile of that string, my facsimile of /Adobe's
facsimile/ of that string; *a string*.

There *must* be some way of calling a function thus:

int specialGoodFunction(String s);

Which returns the number of bytes the string (in whatever encoding it is)
occupies. Which is what the OP is actually interested in. Not you finding
straw men to assault for some silly reason.
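
For what it's worth, a Java `String` does not remember the encoding its data originally arrived in, so a function like that has to be told which encoding to measure. A minimal sketch, with the charset name as an explicit parameter (`ByteLength` and `byteLength` are hypothetical names):

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original thread.
class ByteLength {
    // Number of bytes the string occupies when encoded with the named charset.
    // Charset.forName throws if the runtime does not support the name.
    // Characters the charset cannot represent are replaced by its
    // substitute byte sequence, so they still count towards the total.
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }
}
```

For the test string, `"UTF-8"` gives 19 and `"UTF-16BE"` gives 18. On a runtime that ships the GBK charset, `"GBK"` gives 14, which is the number the OP is after.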

--
Adam

Report · Oct 12, 2007

Thank you very much, PaulH and Adam Cameron.
I have tested this string in Sybase SQL Anywhere with this SQL:
select datalength('hihi，这是测试')
The SQL result is: 14
This is what I wanted.

To PaulH:
I tested your code with my test string. The result is:
hihi��ǲ��
cf len:=13
byte len:=13
java string length:=13
java string objectlength:=13

To Adam Cameron:
Your code's result is:
[1] [104]
[2] [105]
[3] [104]
[4] [105]
[5][�][65533]
[6][�][65533]
[7][�][65533]
[8][�][65533]
[9][�][65533]
[10][ǲ][498]
[11][�][65533]
[12][�][65533]
[13][�][65533]

Neither of them gives 14!

Report · Oct 12, 2007

> Your code can not run on my CFMX7.

Sure. Change it so that it does.

;-)

for (i=1; i <= len(t); i++){

=>

for (i=1; i le len(t); i=i+1){

--
Adam

Report · Oct 13, 2007

Thank you, Adam Cameron.

The result is 13, not 14. Why?
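
One plausible explanation for the 13, offered as an assumption rather than a diagnosis: the template was saved in GBK (or GB2312), where the string is 14 bytes, but was read as UTF-8 because of the `pageencoding="utf-8"` directive. Decoding those 14 bytes as UTF-8 yields 13 characters, most of them U+FFFD (65533), with one accidental valid pair (C7 B2) that decodes to U+01F2 (498). That matches the dump posted above. The sketch hard-codes the GBK bytes so it does not need a GBK charset at runtime; `MisdecodedGbk` and `decodeAsUtf8` are made-up names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction, not part of the original thread.
class MisdecodedGbk {
    // The test string as GBK bytes: four ASCII letters, then five
    // two-byte characters (fullwidth comma A3AC, then D5E2 CAC7 B2E2 CAD4).
    static final byte[] GBK_BYTES = {
        0x68, 0x69, 0x68, 0x69,
        (byte) 0xA3, (byte) 0xAC,
        (byte) 0xD5, (byte) 0xE2,
        (byte) 0xCA, (byte) 0xC7,
        (byte) 0xB2, (byte) 0xE2,
        (byte) 0xCA, (byte) 0xD4
    };

    // Decode the GBK bytes with the wrong charset, as a template parser
    // told to expect UTF-8 would. Malformed input becomes U+FFFD.
    static String decodeAsUtf8() {
        return new String(GBK_BYTES, StandardCharsets.UTF_8);
    }
}
```

If that is what happened, the original bytes are already lost by the time `len()` runs, so no length function can recover 14 from that string. Saving the template as UTF-8 (or declaring the encoding it is really saved in) and then asking for the byte length in GBK should give 14.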

Adobe Community
