2 Replies Latest reply on Aug 4, 2012 5:44 PM by Sham HC

    ETX Character in jcr:description property

    giang.phan Level 1

      We aren't sure how, but in a few of our pages using CQ5.4, the jcr:description field (when viewed in CRXDE Lite) contains the ASCII ETX character. This character is not visible in OSX Lion, but it can be seen in Windows and maybe older versions of OSX.

       

      An example of our jcr:description:

      "The Nike USATF Collection is a modern mix  of vintage classics infused with military aesthetic."

       

      When we copy and paste the description into CRXDE or another text box, such as the CRXDE Lite path box or the Chrome address bar, it shows up as:

      "The Nike USATF Collection is a modern mix ^C of vintage classics infused with military aesthetic."

       

      Note the ^C above is where the ETX character is being placed.

       

      Has anyone experienced this issue before? It is currently breaking our Endeca indexing on the affected pages. Thanks.

       

      Edit: The code that is breaking is actually the JCR API which our search team is using to gather the data.

        • 1. Re: ETX Character in jcr:description property
          liorz_adok Level 1

          Hi all,

          I work with Giang and wanted to add a few notes.

           

          The ETX character has an ASCII value of 3.

           

          Pages with an ETX character will break the default XML renditions of the page (regardless of the app that consumes them).


          We have now also run into another control character (VT, ASCII value 11) that breaks the default XML rendition (Sling) of the pages.
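
          For anyone trying to locate these characters: XML 1.0 only allows tab, LF, CR and code points from 0x20 upward (minus surrogates and a couple of non-characters), which is why ETX (3) and VT (11) break the XML rendition. Below is a minimal sketch, in plain Java with no CQ or JCR dependencies, that reports where such characters sit in a string. The class and method names (XmlCharCheck, isXml10Char, findIllegal) are made up for illustration and are not part of any CQ API.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlCharCheck {
    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one "index:U+XXXX" entry for each code point XML 1.0 forbids.
    static List<String> findIllegal(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                hits.add(i + ":" + String.format("U+%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a description with ETX and VT embedded.
        String sample = "modern mix " + (char) 3 + " of vintage " + (char) 11 + " classics";
        System.out.println(findIllegal(sample));
    }
}
```

          Running something like this over the string values returned by the JCR API should show exactly which properties carry the offending characters and at which offsets.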

           

          I think what we are looking for is a solution that strips all invalid characters when users save their changes in the rich text editor or in any of the page properties dialogs (which are plain text).

           

          Lior

          • 2. Re: ETX Character in jcr:description property
            Sham HC Level 7

            Hi Giang/Lior,

             

               I have seen issues where pasting an invisible character, such as a BOM, causes problems. As a best practice for interoperability, I generally recommend that authors not copy directly from other sites or systems into CQ. Instead, paste into a text editor (e.g. TextWrangler) with the encoding set to "UTF-8, no BOM", which filters out BOM characters and the like, and then paste from there into CQ.

             

               In your case you have given two example characters, and looking at them I am guessing you might be passing in ASCII control characters [1].

             

            The possible solutions I can think of are:

            Option1:-       Ask authors to filter out those characters using a text editor, and then update the content by copying from the filtered text. I am sure you will not like this.

             

            Option2:-       Implement a Sling Post Processor or POST operation that validates the input when it is posted. In that logic, check for invalid characters and, if any exist, modify the property to replace them with an empty string. Pseudo code for ETX is at [2]. This would take care of future content updates. For existing content, write a small script to iterate through the repository and clean it. With a post processor you do not have to worry about which component is posting. Hope this makes sense.
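
            To make the cleaning step of Option2 a bit more concrete, here is a rough sketch of the logic only. It assumes the posted properties are available as a plain map of name to string value, which is a stand-in and not the actual Sling or JCR API. PropertyCleaner, clean and cleanAll are hypothetical names. Unlike a blanket control-character pattern, this one keeps tab, LF and CR so multi-line text survives.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class PropertyCleaner {
    // ASCII control characters except tab (0x09), LF (0x0A) and CR (0x0D).
    private static final Pattern UNWANTED =
        Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]");

    // Removes the unwanted control characters from a single value.
    static String clean(String value) {
        return UNWANTED.matcher(value).replaceAll("");
    }

    // Returns only the properties whose value changed after cleaning,
    // so the caller knows which ones need to be written back.
    static Map<String, String> cleanAll(Map<String, String> props) {
        Map<String, String> changed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            String cleaned = clean(e.getValue());
            if (!cleaned.equals(e.getValue())) {
                changed.put(e.getKey(), cleaned);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Hypothetical input: one dirty property, one clean property.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("jcr:description", "a modern mix " + (char) 3 + " of vintage classics");
        props.put("jcr:title", "USATF Collection");
        System.out.println(cleanAll(props).keySet());
    }
}
```

            In a real post processor or cleanup script, the map would be replaced by the node's properties and the changed values written back through the repository API. That wiring is left out here on purpose.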

             

             

            [1]   http://www.w3ctutorial.com/tags/ref_ascii

             

            [2]

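               // Strips ASCII control characters (ETX = 0x03, VT = 0x0B, ...) from args[0].
               // Note: \p{Cntrl} also matches tab, CR and LF; narrow the pattern
               // if those must be kept.
               import java.util.regex.Pattern;

               public class StripControlChars {
                   // The pattern matches control characters.
                   private static final Pattern CONTROL = Pattern.compile("\\p{Cntrl}");

                   public static void main(String[] args) {
                       // Replaces control characters with an empty string.
                       String result = CONTROL.matcher(args[0]).replaceAll("");
                       System.out.println(result);
                   }
               }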

             

            Thanks,

            Sham
