ETX Character in jcr:description property

Jun 22, 2012 2:53 PM

Tags: #error #ascii #cq5.4 #etx

We aren't sure how, but on a few of our pages using CQ5.4, the jcr:description field (when viewed in CRXDE Lite) contains the ASCII ETX character. This character is not visible in OS X Lion, but it can be seen in Windows and maybe in older versions of OS X.

 

An example of our jcr:description:

"The Nike USATF Collection is a modern mix  of vintage classics infused with military aesthetic."

 

When we copy the description and paste it into CRXDE or another text box, such as the CRXDE Lite path box or the Chrome address bar, we get:

"The Nike USATF Collection is a modern mix ^C of vintage classics infused with military aesthetic."

 

Note that the ^C above marks where the ETX character appears.
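
One way to confirm where a hidden character sits is to print the value with control characters escaped. Below is a minimal Java sketch, assuming the description has already been read into a String. The class and method names are made up for illustration.

```java
final class ControlCharReveal {
    // Returns a copy of the input in which every ISO control character
    // (U+0000 to U+001F and U+007F to U+009F) is replaced by a visible
    // backslash-u escape, so that invisible characters show up in logs.
    static String reveal(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c)) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample text with an embedded ETX (U+0003), similar to the description above.
        String description = "a modern mix \u0003 of vintage classics";
        System.out.println(reveal(description));
    }
}
```

Note that tabs and line breaks are escaped too, since they are also control characters.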

 

Has anyone ever experienced this issue before? It is currently breaking our Endeca indexing on the affected pages. Thanks.

 

Edit: What is actually breaking is the JCR API code that our search team uses to gather the data.

 
Replies
  • Aug 2, 2012 3:53 PM   in reply to giang.phan

    Hi all,

    I work with Giang and wanted to add a few notes.

     

    The ETX character has an ASCII value of 3.

     

    Pages with an ETX character will break the default XML renditions of the page (regardless of the app that consumes them).


    We have now also run into another control character (VT, ASCII value 11) that breaks the default XML rendition (Sling) of the pages.
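
    For context, XML 1.0 allows only tab, line feed and carriage return below U+0020, which is consistent with ETX (3) and VT (11) both breaking the XML rendition. A rough Java sketch of a check against the XML 1.0 Char production follows. The class and method names are hypothetical, and this is not taken from CQ or Sling.

```java
final class XmlCharCheck {
    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first character that XML 1.0 does not allow,
    // or -1 if the whole string is safe to serialize.
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

    A check like this could be run over the affected property values to find which pages need cleaning.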

     

    I think what we are looking for is a solution that will clear all possible invalid characters when users save their changes in the rich text editor or in any of the page properties dialogs (the plain-text ones).

     

    Lior

     
  • Aug 4, 2012 5:44 PM   in reply to liorz_adok

    Hi Giang/Lior,

     

       I have seen similar issues where pasting invisible characters such as a BOM causes problems. As a best practice for interoperability, I generally recommend that authors not copy directly from other sites/systems into CQ. Instead, paste into a text editor (e.g. TextWrangler) with the encoding set to "UTF-8, no BOM", which filters out BOM characters and the like, and then paste into CQ from there.

     

       In your case you have given examples of two characters, and looking at them I am guessing you might be passing device control characters [1].

     

    The possible solutions I can think of are:

    Option 1:       Ask authors to filter out those characters using a text editor, and then update the content by copying from the filtered text. I am sure you will not like this.

     

    Option 2:       Implement a Sling post processor or POST operation that validates the supplied input when it is posted. In that logic, check for invalid characters and, if any exist, modify the property to replace them with an empty string. Pseudo code for ETX is at [2]. This takes care of future content updates. For existing content, write a small script to iterate through the repository and clean it. With a post processor you do not have to worry about which component is posting. Hope this makes sense.

     

     

    [1]   http://www.w3ctutorial.com/tags/ref_ascii

     

    [2]

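            import java.util.regex.Pattern;

            // Matches control characters (including ETX, 0x03, and VT, 0x0B) but keeps
            // tab, line feed and carriage return. Plain \p{Cntrl} would also strip
            // tabs and line breaks, which is probably not wanted for text content.
            Pattern p = Pattern.compile("[\\p{Cntrl}&&[^\\t\\n\\r]]");

            // Replaces the matched control characters with an empty string.
            // replaceAll returns the input unchanged when nothing matches,
            // so no separate find() check is needed.
            String result = p.matcher(args[0]).replaceAll("");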
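
    As a sketch of how the cleanup in Option 2 might be applied to a whole set of properties: the example below uses a plain Map as a stand-in for the posted properties. That is an assumption, since a real Sling post processor would work on the JCR node instead. PropertySanitizer is a hypothetical name. This variant keeps tabs and line breaks rather than removing every character matched by \p{Cntrl}.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PropertySanitizer {
    // Control characters other than tab, line feed and carriage return.
    private static final Pattern INVALID =
        Pattern.compile("[\\p{Cntrl}&&[^\\t\\n\\r]]");

    // Removes the matched control characters from a single value.
    static String clean(String value) {
        return INVALID.matcher(value).replaceAll("");
    }

    // Returns a new map in which every String value has been cleaned.
    // Non-String values (numbers, dates, ...) are copied unchanged.
    static Map<String, Object> cleanAll(Map<String, Object> props) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : props.entrySet()) {
            Object v = e.getValue();
            out.put(e.getKey(), v instanceof String ? clean((String) v) : v);
        }
        return out;
    }
}
```

    The same clean method could be reused by the one-off script that walks the existing content.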

     

    Thanks,

    Sham

     