12 Replies Latest reply on Oct 13, 2007 6:31 AM by Flashcqxg

    how to get the string's byte length?

    Flashcqxg Level 1
      I have a string and I want to get its byte length. How can I do it?

      for example:

      <cfoutput>#len('hihi，这是测试')#</cfoutput>

      output is 9

      I want to get the byte length, which should be 14. How can I get it?
      Thanks.
        • 1. Re: how to get the string's byte length?
          Level 7
          Flashcqxg wrote:
          > I want to get the byte length is 14, how can i get it?

          why do you think that string's length is 14?

          <cfprocessingdirective pageencoding="utf-8">
          <cfscript>
          t="hihi，这是测试";
          jText=createObject("java","java.lang.String").init(t);
          b=t.getBytes();
          writeoutput("#t#
          <br>cf len:=#len(t)#
          <br>byte len:=#arrayLen(b)#
          <br>java string length:=#t.length()#
          <br>java string objectlength:=#jText.length()#");
          </cfscript>

          all methods return "9".
          • 2. Re: how to get the string's byte length?
            Level 1
            But the Chinese characters are double-byte ones, so the BYTE length SHOULDN'T be 9, it should be 14: four single-byte characters and five double-byte ones (the comma is a double-byte comma too).

            The STRING length might be 9, sure.

            The size of the chars can be seen by doing this (code attached).

            I have to dash to work, but will look at this some more later on.

            --
            Adam
            • 3. Re: how to get the string's byte length?
              Level 7
              Adam Cameron wrote:
              > But the Chinese characters are double-byte ones, so the BYTE length SHOULDN'T

              we don't know those are "double-byte" (and it's been so long since i've thought
              about bytes for chars, all of this is stale so i'm probably not remembering
              everything i should). if they're utf-8 then they're most likely represented by 3
              (or maybe 4) bytes. for instance: 好 has a unicode codepoint of U+597D, in utf-8
              it's represented as E5 A5 BD, in utf-16 it *is* 2 bytes (59 7D) but then again
              *everything* in the BMP is 2 bytes in utf-16 though some "supplemental" chars
              need 4 bytes. if it's some windows or whatever codepage then it could be 2-3
              bytes per char depending.
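For reference, a minimal Java sketch of the 好 example above (Java, since CF strings are java.lang.String objects underneath). The class and method names are made up for illustration. It prints the bytes the character occupies in UTF-8 and in UTF-16 (big-endian, no BOM):

```java
import java.nio.charset.StandardCharsets;

class EncodedHao {
    // Formats a byte array as space-separated upper-case hex, e.g. "E5 A5 BD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hex dump of the string encoded as UTF-8.
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Hex dump of the string encoded as UTF-16, big-endian, without a BOM.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        String hao = "\u597D"; // the character U+597D from the post above
        System.out.println(utf8Hex(hao));  // E5 A5 BD
        System.out.println(utf16Hex(hao)); // 59 7D
    }
}
```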

              the byte size depends on the encoding which i guess depends somewhat on where
              the data's coming from, if say from a db that stores unicode (UCS2) then the
              length would be 18 bytes for the *whole* string.

              <cfquery datasource="lab" name="test">
              SET NOCOUNT ON
              DECLARE @t nvarchar(100)
              SET @t=N'hihi，这是测试';
              SELECT t=@t, l=dataLength(@t)
              SET NOCOUNT OFF
              </cfquery>

              ditto for something that's using utf-16.

              utf-8 is trickier (besides being a valid encoding for some db like mysql) as how
              many bytes are needed for each char depends on what's being encoded. a guess
              might be codepoints <=127 take 1 byte, 128-2047 take 2 bytes, everything else in
              the BMP (<=65,535) need 3 bytes and everything outside the BMP needs 4 bytes.
              mind you there's "empty" gaps in these ranges which are getting filled as time
              goes by (for example, lanna, what they speak in northern thailand, might get
              added w/unicode 6 in the codepoints 6784-6895 range--i cheated & looked that up
              just now).
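The ranges guessed at above can be checked with a short Java sketch that adds up the UTF-8 size per code point and compares it with what Java's own encoder produces. The names here are hypothetical, and the sketch assumes a well-formed string (no unpaired surrogates):

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthByRange {
    // Sums the UTF-8 size of each code point using the ranges from the post:
    // <= 127 is 1 byte, 128-2047 is 2, the rest of the BMP is 3, beyond the BMP is 4.
    static int utf8Length(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp <= 0x7F) {
                total += 1;
            } else if (cp <= 0x7FF) {
                total += 2;
            } else if (cp <= 0xFFFF) {
                total += 3;
            } else {
                total += 4;
            }
            // A code point outside the BMP takes two Java chars (a surrogate pair).
            i += Character.charCount(cp);
        }
        return total;
    }

    // Reference value: let the JDK encode the string and count the bytes.
    static int encodedLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The thread's test string, written with escapes: hihi + U+FF0C + four CJK chars.
        String t = "hihi\uFF0C\u8FD9\u662F\u6D4B\u8BD5";
        System.out.println(utf8Length(t));    // 19
        System.out.println(encodedLength(t)); // 19
    }
}
```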

              java strings are a lot easier.





              • 4. Re: how to get the string's byte length?
                Level 1
                > But the Chinese characters are double-byte ones, so the BYTE length SHOULDN'T
                > we don't know those are "double-byte"

                Well "not SINGLE byte" then, which was my point.


                > if they're utf-8 then they're most likely represented by 3 (or maybe 4) bytes

                Fair cop. I didn't realise that asc() returned the codepoint rather than the actual character code.

                Whichever way one dresses this up, one can't just tally up the number of characters, and say "there: nine characters: nine bytes". Which is what you did.

                If one saves that string to a (UTF-8) file, it's actually 22 bytes long.

                Which I take to be:
                1-3 BOM
                4 h
                5 i
                6 h
                7 i
                8-10 ，
                11-13 这
                14-16 是
                17-19 测
                20-22 试

                I'm guessing that, like me, the OP was counting two bytes for each of the latter five characters; ignoring the BOM which is only relevant if it's in a file, there should be SOME way of getting an answer of "19" for this?

                --
                Adam
                • 5. Re: how to get the string's byte length?
                  Level 1
                  Whilst checking out characters' byte lengths, I found this site: http://www.fileformat.info/info/unicode/char/search.htm, which is good for looking that sort of thing up.

                  Just FYI.

                  --
                  Adam
                  • 6. how to get the string's byte length?
                    tooMuchTrouble Level 3
                    > Fair cop. I didn't realise that asc() returned the codepoint rather than the actual character code.

                    and what would be the difference?

                    >Whichever way one dresses this up, one can't just tally up the number of characters, and say "there: nine characters:
                    >nine bytes". Which is what you did.

                    that's what both cf & java counted as the length. which given no idea about the encoding, the fact that it's a string (& that i haven't thought about something like this in years) makes sense to me.

                    >1-3 BOM

                    by adding a BOM you've already affected the encoding, which may or may not match the original. so you still don't know how many bytes were in the original string.

                    > only relevant if it's in a file, there should be SOME way of getting an answer of "19" for this?

                    assuming it's utf-8 encoded, i suppose he could use my uBlocks CFC to figure out the exact number of chars in each of the unicode blocks in that string, then multiply by the number of bytes needed for each char. i think it's still up on the cf exchange. otherwise w/out knowing the original encoding you don't know "19" is the correct answer.
                    • 7. Re: how to get the string's byte length?
                      Level 7
                      >> Fair cop. I didn't realise that asc() returned the codepoint rather than the
                      > actual character code.
                      >
                      > and what would be the difference?

                      Oh, sorry, whatever the term is (I'm crap with jargon). The value returned
                      by asc() for those chars was only two bytes (ie: four hex digits). I
                      didn't realise there was more to it than that, and that 2-byte value maps
                      to some other THREE byte value. I need to do some reading...


                      > that's what both cf & java counted as the length.

                      CHARACTER length, sure. No-one's disputing that. On the other hand,
                      no-one's asking about it, either.


                      > by adding a BOM you've already affected the encoding, which may or may not
                      > match the original. so you still don't know how many bytes were in the original
                      > string.

                      [groan]

                      Yes, that's a reasonable strawman there. I was only putting it in a file
                      so I could save it and check the number of bytes occupied by the data.

                      Clearly... CLEARLY... the OP is not asking for a character length of that
                      string. They've said as much.

                      I copy and pasted the string from their post, and used it as a
                      demonstration of how "nine" is not the right answer for the BYTE LENGTH of
                      that string. Whether or not the original string was UTF-8, UTF-16 or
                      special-marmoset-encoding, it almost certainly was NOT in a fictitious kind
                      of encoding in which each of those particular characters only occupied one
                      byte each, which would mean that "nine" is the correct answer to the
                      question.

                      When I copied those characters from either the web browser or from my
                      text-based news agent, notepad identified them (and rendered them
                      correctly) as UTF-8, so I'm fairly confident they ARE UTF-8. Of course
                      this could be down to some intermediary encoding (pasting them in to the
                      original posting, for example, via some encoding-transforming mechanism),
                      but Occam's Razor suggests the original question was from a UTF-8 POV.

                      But maybe we should quit speculating and ask the OP. Unless they've
                      buggered off in despair of how drawn out all this is getting. For which I
                      would not blame them.

                      --
                      Adam
                      • 8. Re: how to get the string's byte length?
                        tooMuchTrouble Level 3
                        you copied & pasted from this forum so of course the encoding is utf-8. if Flashcqxg thinks those chars are 2 bytes each, what would lead you to think the original encoding was utf-8?
                        • 9. Re: how to get the string's byte length?
                          Level 7
                          > you copied & pasted from this forum so of course the encoding is utf-8. if Flashcqxg thinks those chars are 2 bytes each, what would lead you to think the original encoding was utf-8?

                          Oh for goodness sake. YES I KNOW.

                          I was illustrating the point that simply counting the number of characters
                          in a string is *not* a way of determining how many bytes it occupies. The
                          OP's string, Adobe's facsimile of that string, my facsimile of /Adobe's
                          facsimile/ of that string; *a string*.

                          There *must* be some way of calling a function thus:

                          int specialGoodFunction(String s);

                          Which returns the number of bytes the string (in whatever encoding it is)
                          occupies. Which is what the OP is actually interested in. Not you finding
                          straw men to assault for some silly reason.
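A hedged sketch of what such a function could look like in Java. One caveat: once a string is in memory, Java (and so CF) holds it as UTF-16 and no longer knows what encoding it arrived in, so this version takes the target encoding as a second argument. `byteLength` is a hypothetical name, not an existing CF or Java function:

```java
import java.nio.charset.Charset;

class ByteLength {
    // Returns the number of bytes the string occupies once encoded in the named charset.
    // Characters the charset cannot represent are replaced by the encoder (typically '?').
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // The thread's test string, written with escapes: hihi + U+FF0C + four CJK chars.
        String t = "hihi\uFF0C\u8FD9\u662F\u6D4B\u8BD5";
        System.out.println(byteLength(t, "UTF-8"));      // 19
        System.out.println(byteLength(t, "UTF-16BE"));   // 18
        System.out.println(byteLength(t, "UTF-16"));     // 20 (Java prepends a 2-byte BOM)
        System.out.println(byteLength(t, "ISO-8859-1")); // 9 (each CJK char becomes one '?')
        // GBK is an optional charset, so check for it before using it.
        if (Charset.isSupported("GBK")) {
            System.out.println(byteLength(t, "GBK"));    // 14
        }
    }
}
```

The ISO-8859-1 line suggests one way a bare getBytes() can report 9: if the platform default charset cannot represent the Chinese characters, each one is encoded as a single replacement byte. From CFML the equivalent would presumably be arrayLen(t.getBytes("GBK")), following the getBytes() pattern used earlier in the thread; that is an untested assumption.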

                          --
                          Adam
                          • 10. how to get the string's byte length?
                            Flashcqxg Level 1
                            Thank you very much, PaulH and Adam Cameron.
                            I have tested this string in Sybase SQL Anywhere with this SQL:
                            select datalength('hihi，这是测试')
                            The SQL result is: 14
                            This is what I wanted.

                            To PaulH :
                            I tested your code with my test string; the result is:
                            hihi�����Dz���
                            cf len:=13
                            byte len:=13
                            java string length:=13
                            java string objectlength:=13


                            To Adam Cameron :
                            Your code result is:
                            [1] [104]
                            [2] [105]
                            [3] [104]
                            [4] [105]
                            [5][�][65533]
                            [6][�][65533]
                            [7][�][65533]
                            [8][�][65533]
                            [9][�][65533]
                            [10][Dz][498]
                            [11][�][65533]
                            [12][�][65533]
                            [13][�][65533]

                            Neither of them gets 14!!!!
                            • 11. Re: how to get the string's byte length?
                              Level 7
                              > Your code can not run on my CFMX7.

                              Sure. Change it so that it does.

                              ;-)

                              for (i=1; i <= len(t); i++){

                              =>

                              for (i=1; i le len(t); i=i+1){

                              --
                              Adam
                              • 12. Re: how to get the string's byte length?
                                Flashcqxg Level 1
                                Thank you, Adam Cameron.

                                The result is 13, not 14. Why?
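One plausible explanation, offered as an assumption rather than a diagnosis: if the template file is saved in GBK (14 bytes for this string) but read as UTF-8 because of pageencoding="utf-8", the decoder sees mostly invalid byte sequences. Each invalid byte becomes U+FFFD, while one adjacent pair (C7 B2) happens to form a valid two-byte sequence for U+01F2 (decimal 498). That gives 13 characters, matching the output posted earlier. A small Java sketch with hypothetical names reproduces this:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodedGbk {
    // Encodes the string in its "real" charset, then decodes those bytes as UTF-8,
    // which is what happens when a file's declared encoding does not match its contents.
    static String misdecode(String s, String actualCharset) {
        byte[] raw = s.getBytes(Charset.forName(actualCharset));
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The thread's test string, written with escapes: hihi + U+FF0C + four CJK chars.
        String t = "hihi\uFF0C\u8FD9\u662F\u6D4B\u8BD5";
        // GBK is an optional charset, so check for it before using it.
        if (Charset.isSupported("GBK")) {
            String garbled = misdecode(t, "GBK");
            System.out.println(garbled.length());         // 13
            System.out.println((int) garbled.charAt(9));  // 498
        }
    }
}
```

If that is what happened, len() never saw the original nine characters, so the mismatch would need fixing at the file or pageencoding level before any byte count can come out as 14.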