25 Replies Latest reply on Apr 18, 2011 7:33 AM by Laubender

    Filter text.contents (removing special characters)

    Loic.Aigon Adobe Community Professional

      Hi guys,

       

      I want to extract a string from a bunch of text (here a selection for example). This text is xml tagged.

       

      If I do selection[0].contents, it captures the text and all the special characters (XML tags, carriage return). I can tell something is "wrong" because contents.length is greater than expected ("John Smith" should be 10 characters, but contents.length returns 14). I am not really surprised, because I already knew about this behaviour.

       

      So I tried to filter it to remove anything that is not an alphanumeric character, but this is where I fail.

      If I use GREP with contents.match(/[\w]+/g), it's almost perfect. But if the contents include diacritics, this pattern fails to catch them.

      I could include them in the pattern, but it's very likely I would miss a lot of them.

       

      So my question is: how do I extract the pure text from the contents, ensuring I get all the diacritics (if any) but without carrying the special characters along?

       

      TIA Loic

      contents.jpg

        • 1. Re: Filter text.contents (removing special characters)
          Mayhem SWE Level 2

          Rather than trying to extract only the characters you want, how about removing the ones you do not? Something like this perhaps:

           

          contents.replace(RegExp(/\W+/g), "")
          • 2. Re: Filter text.contents (removing special characters)
            Loic.Aigon Adobe Community Professional

            Thx Mayhem for your proposal.

             

            However it fails if string has diacritics. Ex:

             

            "Loïc".replace(RegExp(/\W+/g), "") //Loc

             

            I need Loïc in output.

             

            Thx anyway.

            Loic

            • 3. Re: Filter text.contents (removing special characters)
              Mayhem SWE Level 2

              Ahh, okay... Do the characters you need to keep all fall within Latin-1 (U+0020 to U+00FF)? Something like this, which filters out unwanted character ranges, might be what you need:

               

              replace(RegExp(/[^\x20-\x7E\xA0-\xFF]/g), '')

               

              (I've edited this expression a couple of times, so if you already tried it, copy from above and try again!)
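A minimal sketch of how this whitelist behaves, using a plain string as a stand-in for selection[0].contents. The function name keepLatin1 and the sample strings are illustrative assumptions, not something from the thread; U+FEFF is used as the marker because the character dump later in this thread reports code 65279 for the XML tags. The character class keeps printable ASCII (U+0020 to U+007E) and U+00A0 to U+00FF and drops everything else, so characters above that range, such as œ (U+0153), are removed too:

```javascript
// Hypothetical helper: keep printable ASCII and U+00A0..U+00FF, drop the rest.
function keepLatin1(text) {
    return text.replace(/[^\x20-\x7E\xA0-\xFF]/g, "");
}

// Stand-in for tagged text: U+FEFF marks where an XML tag starts or ends.
var tagged = "\uFEFFLo\u00EFc\uFEFF";
var clean = keepLatin1(tagged);      // "Loïc", length 4

// Limitation: anything above U+00FF is stripped as well.
var lossy = keepLatin1("c\u0153ur"); // "cur", the œ is gone
```

Whether that loss matters depends on the text; for content limited to Western European accented letters the range is probably wide enough.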

              • 4. Re: Filter text.contents (removing special characters)
                Loic.Aigon Adobe Community Professional

                Hi Mayhem,

                 

                That looks great. Loïc comes out fine and the length is OK. I think you gave me the perfect pattern.

                 

                Thx a lot Loic

                • 5. Re: Filter text.contents (removing special characters)
                  John Hawkinson Level 5

                  I feel like there's a better solution to this (I'll post again if I come up with one), but meanwhile, please note that writing this:

                   

                  contents.replace(RegExp(/whatever/), "");
                  

                   

                  is really just a more verbose way of writing this:

                   

                  contents.replace(/whatever/, "");
                  

                   

                  And, if you don't have the language spec in hand, you would think the first form converts /whatever/ to a string ("whatever") and then calls new RegExp("whatever") returning /whatever/ again. Actually ECMA-262/3rd sec. 15.10.3.1 says it can return the regexp unchanged, but why make it more confusing?

                   

                  I think in general it's better to reserve the RegExp() constructor for making regular expressions out of strings...

                  • 6. Re: Filter text.contents (removing special characters)
                    Harbs. Level 6

                    One reason to use a RegExp constructor is to deal with a performance issue in CS5.

                     

                    Constructing a RegExp once and reusing the reference is much less expensive than using RegExp literals in CS5 (which must get constructed each time they're used)...

                     

                    Harbs

                    • 7. Re: Filter text.contents (removing special characters)
                      John Hawkinson Level 5

                      Harbs: The question of reuse is orthogonal to the question of literal versus constructor. You can save a reference either way:

                      var
                        ref1 = /myRE/,
                        ref2 = new RegExp("myRE"),
                        ref3 = RegExp("myRE");
                      
                      for ( ... ) { } // tight loop here
                      

                       

                      But if either one is indeed expensive, you should not be doing BOTH! Using "RegExp(/myRE/)" creates the literal first, and then passes it through the constructor (well, in this case, technically through "The RegExp Constructor Called as a Function," see sec. 15.10.3 of the spec).

                       

                      My point is simple: don't do both. Pick one.

                      • 8. Re: Filter text.contents (removing special characters)
                        Harbs. Level 6

                        Yes. I understood your point. (use RegExp("") rather than RegExp(//))

                         

                        I was making another point.

                         

                        I was pretty sure that using the RegExp constructor (RegExp("abc")) is very different than a literal (/abc/) in terms of performance in CS5.

                         

                        I just ran some tests to double-check my memory, and it turns out I did not remember very well...

                         

                        Here are three tests:

                         

                        Test #1:

                         

                        var regex = /abc/;
                        var string = "abcd";
                        
                        for(i=0;i<100000;i++){
                            string.match(regex);
                        }
                        

                        took about 4.745 sec.

                         

                        Test #2:

                         

                        var regex = RegExp("abc");
                        var string = "abcd";
                        
                        for(i=0;i<100000;i++){
                            string.match(regex);
                        }
                        

                        took about 4.708 sec.

                         

                        Test #3:

                         

                        var string = "abcd";
                        
                        for(i=0;i<100000;i++){
                            string.match(/abc/);
                        }
                        

                        took about 7.509 sec.

                         

                        So the difference between a literal and a RegExp constructor is not the important factor, it's creating the reference and reusing it that's important...

                         

                        Sorry about the confusion...


                        Harbs

                        • 9. Re: Filter text.contents (removing special characters)
                          Mayhem SWE Level 2

                          Meh. I don't care much what others think of my coding style. Obviously I believe my way is easier to read, since there is no syntax coloring for /whatever/ regular expressions but there is for the RegExp keyword. I can guarantee it does not get cast to a string and then back, as regular expressions created from strings cannot set modifiers and the global modifier in the example above does not get lost. Unless Adobe's engineers are doing something dead stupid (which admittedly wouldn't be the first time) there cannot possibly be a noticeable performance penalty.

                          • 10. Re: Filter text.contents (removing special characters)
                            John Hawkinson Level 5

                            We think alike! Thanks! I just got done benchmarking, but I'll post it anyhow.

                             

                            Benchmark       Time
                            Literal         10.811 sec.
                            Constructor     11.469 sec.
                            Both            18.322 sec.
                            Literal*        31.251 sec.
                            Constructor*    35.156 sec.
                            Both*           43.507 sec.

                             

                            Code follows. *-variants use eval with different regexps to defeat any potential optimizations (I don't think there is much optimization though).


                            We are, of course, benchmarking different things. You're benchmarking tests; I'm benchmarking instantiation of the regexp. It's funny that we get different results though. For me, the regexp literal is always faster to create. For you, the faux constructor is faster to use. That makes no sense to me; they should be exactly the same.

                             

                            function repeat(times, it) {
                                var i;     
                                for (i=0; i< times; i++) it(i)
                            }
                            
                            function timeit(name, times, it) {
                                var t0,t1;
                                t0 = new Date().valueOf();
                                repeat(times, it);
                                t1 = new Date().valueOf();
                                $.writeln(name+": "+(t1-t0)/1000+" sec.");
                                return t1-t0;
                            }
                            
                            var count=5e5;
                            timeit("Literal", count, function() { var re = /literal/; });
                            timeit("Constructor", count, function() { var re = new RegExp("literal"); } );
                            timeit("Both", count, function() { var re = new RegExp(/literal/); });
                            
                            timeit("Literal*", count, function(n) { eval('var re = /literal'+n+'/') } );
                            timeit("Constructor*", count, function(n) { eval('var re = new RegExp("literal'+n+'")') } );
                            timeit("Both*", count, function(n) { eval('var re = new RegExp(/literal'+n+'/)') } );
                            0;
                            
                            
                            • 11. Re: Filter text.contents (removing special characters)
                              John Hawkinson Level 5

                              > I can guarantee it does not get cast to a string and then back,

                              > as regular expressions created from strings cannot set modifiers

                              > and the global modifier in the example above does not get lost.

                               

                              Err...well, as I said, it (the faux constructor -- "RegExp()" called as a function, without the new) does not convert to a string. But if you call the actual constructor it does in fact do so. But it extracts the flags from the regexp and reuses them.



                              > Unless Adobe's engineers are doing something dead stupid

                              > (which admittedly wouldn't be the first time) there cannot

                              > possibly be a noticeable performance penalty.

                               

                              Take a look at my numbers. It's not 2x as slow but it is 1.4x as slow, with the real constructor (new).

                               

                              Rerunning with the "faux" constructor (no new), I get 12.371 sec for the fast case (without eval), and 36.019 sec for the eval case.

                              And for the "Both" case with the faux constructor, 11.537 fast and 37.404 with eval.

                              Other numbers all within 100ms of my original benchmark, so I won't repeat them here.

                               

                              But yeah, with the faux constructor it's not appreciably slower though it is slower by epsilon (7%).

                              • 12. Re: Filter text.contents (removing special characters)
                                Harbs. Level 6

                                Mayhem SWE wrote:


                                Obviously I believe my way is easier to read, since there is no syntax coloring for /whatever/ regular expressions but is for the RegExp keyword.

                                I use BBEdit which does have syntax highlighting for RegExp literals...

                                Mayhem SWE wrote:


                                and the global modifier in the example above does not get lost.

                                I'm not sure what you mean.

                                 

                                var regex = RegExp("abc","g");
                                "abcd".replace(regex,"bca");
                                
                                

                                and

                                 

                                "abcd".replace(/abc/g,"bca");

                                 

                                and

                                 

                                var regex = RegExp(/abc/g);
                                "abcd".replace(regex,"bca");
                                

                                 

                                are all functionally equivalent.

                                 

                                Harbs

                                • 13. Re: Filter text.contents (removing special characters)
                                  John Hawkinson Level 5

                                  Not exactly a fair test, since you don't have more than one "abc" in your test string. You'd need "abcdabcd" to test this.

                                   

                                  But Mayhem SWE is arguing that because the modifier does not get lost (i.e. the /g works fine), the RegExp() constructor is not converting the pattern back to a string, because a string has no way to represent a /g without being two strings. But that argument isn't really valid, because of the complexity of what actually goes on. I was trying to avoid quoting the spec, but here we go:

                                   

                                  15.10.4.1 new RegExp(pattern, flags)
                                  If pattern is an object R whose [[Class]] property is "RegExp" and
                                  flags is undefined, then let P be the pattern used to construct R
                                  and let F be the flags used to construct R. If pattern is an 
                                  object R whose [[Class]] property is "RegExp" and flags is not 
                                  undefined, then throw a TypeError exception. Otherwise, let P be 
                                  the empty string if pattern is undefined and ToString(pattern) 
                                  otherwise, and let F be the empty string if flags is undefined 
                                  and ToString(flags) otherwise.
                                  

                                   

                                  It then goes on to explain what happens to F and P to construct the RegExp.
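The distinction in the quoted spec text can be checked directly. This is a small sketch with illustrative variable names; it relies only on standard RegExp behaviour: called as a function with a regexp and no flags, RegExp hands back the same object, while new RegExp builds a new object from that regexp's pattern and flags. It uses the "abcdabcd" string suggested above so that the effect of /g is visible:

```javascript
var original = /abc/g;

// Called as a function (sec. 15.10.3.1): the very same object comes back.
var viaFunction = RegExp(original);
var sameObject = (viaFunction === original);        // true

// Called with new (sec. 15.10.4.1): a new object, pattern and flags carried over.
var viaConstructor = new RegExp(original);
var newObject = (viaConstructor !== original);      // true
var keepsGlobal = viaConstructor.global;            // true

// All three forms replace both occurrences, so the g modifier is not lost.
var a = "abcdabcd".replace(/abc/g, "bca");              // "bcadbcad"
var b = "abcdabcd".replace(RegExp("abc", "g"), "bca");  // "bcadbcad"
var c = "abcdabcd".replace(RegExp(/abc/g), "bca");      // "bcadbcad"
```

So the surviving /g flag does not tell you whether a string conversion happened; the flags are carried over either way.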

                                  • 14. Re: Filter text.contents (removing special characters)
                                    Mayhem SWE Level 2

                                     

                                    var regex = RegExp("abc","g");

                                    Hmm, interesting. The CS3 documentation browser merely says RegExp (pattern): RegExp, nothing about setting modifiers separately...?

                                    • 15. Re: Filter text.contents (removing special characters)
                                      Harbs. Level 6

                                      John Hawkinson wrote:

                                       

                                      Not exactly a fair test, since you don't have more than one "abc" in your test string. You'd need "abcdabcd" to test this.

                                       

                                      It actually was not a test at all...

                                       

                                      I did not feel a need to test what I was writing because I know it to be true. I was simply requesting an explanation -- which you provided. Thanks!

                                       

                                      Harbs

                                      • 16. Re: Filter text.contents (removing special characters)
                                        John Hawkinson Level 5

                                        Yeah, the Adobe documentation on standard JavaScript functions is...incomplete. I'd recommend the MDC documentation. Definitely not w3schools, though, which pops up at the top of google hits (see http://w3fools.com/ for some reasons why not).

                                        • 17. Re: Filter text.contents (removing special characters)
                                          John Hawkinson Level 5

                                          OK, back to the original question.


                                          Loic, what am I doing differently?

                                          johnsmith.png

                                          • 18. Re: Filter text.contents (removing special characters)
                                            [Jongware] Most Valuable Participant

                                            All GREP related fun aside, Loïc, all you need to remove is some very special characters.

                                             

                                            TextChar.h lists the following:

                                             

                                            0x0003 BreakRunInStyle

                                            0x0004 FootnoteMarker

                                            0x0007 IndentToHere

                                            0x0008 RightAlignTab (you might want to convert those to a regular tab, I guess)

                                            0x0016 Table (when it's 'seen' as an inline object)

                                            0x0017 "TableContinued" -- heyheyhey, we have something new here! Wonder when & how this one is gonna pop up.

                                            0x0018 PageNumber (a.k.a. "AutoText")

                                            0x0019 SectionName

                                            0x001a NonRomanSpecialGlyph (you should probably check how this gets used)

                                             

                                            (Then a long list of 'normal' character name definitions. This one comment is fun:

                                             

                                            kTextChar_Ellipse                    = 0x2026;          // Actually, it's "ellipsis"

                                             

                                            The original programmers weren't really typesetters, then!)

                                             

                                            The following are *hugely* important because you must do some special parsing if you encounter them! They are the UTF-16 surrogate ranges, used in pairs to encode Unicode code points above U+FFFF:

                                             

                                             

                                            HighSurrogateStart = 0xD800; // includes private use 0xDB80 - 0xDBFF

                                            HighSurrogateEnd = 0xDBFF;

                                            LowSurrogateStart = 0xDC00;

                                            LowSurrogateEnd = 0xDFFF;

                                             

                                            This one may pop up for anchored objects (I think):

                                            ReplacementCharacter = 0xFFFD; // an incoming character whose value is unrepresentable in Unicode

                                             

                                            And this one dups for your XML marker codes:

                                            ByteOrderingCharacter = 0xFFFE;

                                             

                                            -- I think I got'em all.
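A hedged sketch of the targeted approach: remove only the marker codes listed above instead of whitelisting a character range. The function name and the exact set of codes to drop are assumptions. In particular, U+FEFF is included because the character dump later in this thread reports 65279 (0xFEFF) for the XML markers, and RightAlignTab is turned into a regular tab as suggested in the list. Surrogates are deliberately left alone, so characters above U+FFFF stay intact:

```javascript
// Hypothetical helper; the list of codes comes from the TextChar.h excerpt above.
function stripSpecialMarkers(text) {
    return text
        // 0x0008 RightAlignTab becomes a regular tab
        .replace(/\u0008/g, "\t")
        // 0x0003, 0x0004, 0x0007, 0x0016-0x001A, plus U+FEFF, U+FFFD, U+FFFE
        .replace(/[\u0003\u0004\u0007\u0016-\u001A\uFEFF\uFFFD\uFFFE]/g, "");
}

// Stand-in for tagged text with a page-number marker (0x0018) inside "Smith".
var raw = "\uFEFFJohn\uFEFF \uFEFFS\u0018mith\uFEFF";
var clean = stripSpecialMarkers(raw);    // "John Smith"

// Diacritics and characters above U+00FF are untouched.
var accented = stripSpecialMarkers("\uFEFFLo\u00EFc c\u0153ur\uFEFF"); // "Loïc cœur"
```

In InDesign the input would presumably be app.selection[0].contents rather than a string literal.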

                                            • 19. Re: Filter text.contents (removing special characters)
                                              Loic.Aigon Adobe Community Professional

                                              Hi John,

                                              As far as I can tell (or understand), you are facing the extra characters issue (XML tags). This is all about getting the pure text without extra content.

                                              • 20. Re: Filter text.contents (removing special characters)
                                                Loic.Aigon Adobe Community Professional

                                                Wow Theunis,

                                                 

                                                That looks really great. It's probably better to remove these specific special characters than to target a much wider range of character codes.

                                                 

                                                I will give this a try tomorrow

                                                 

                                                Thanks a lot to all of you guys, you rock!

                                                Loic

                                                • 21. Re: Filter text.contents (removing special characters)
                                                  John Hawkinson Level 5

                                                  But look at my example? I have XML tags but I have no extra characters! Can you show me an example you have that gets extra characters?

                                                   

                                                  There has got to be a better way to do this. But hopefully one that does not involve checking each character individually (performance). Or exporting stories to external files (again, performance). What's the size of the text you need to do this on and the rough number of times you do it?

                                                  • 22. Re: Filter text.contents (removing special characters)
                                                    [Jongware] Most Valuable Participant
                                                    But look at my example? I have XML tags but I have no extra characters!

                                                     

                                                    John, the characters at #0, 5, 7, and 13 cannot be displayed, and thus show 'nothing'. If you display the charCodes, you'll see it's 16#FEFF for those invisible characters.

                                                     

                                                    These semi-invisible codes are a pain, because there are lots of situations where they pop up and cause mischief; for example, in text exports (not visible in a text editor, but the database that imported it choked on them), or when you create a bookmark from them (in Acrobat you see weird "unknown character" blocks).

                                                    • 23. Re: Filter text.contents (removing special characters)
                                                      John Hawkinson Level 5

                                                      *sigh*. You know, I was looking and expecting to see the SpecialCharacter enumerators, but that's not what this is about.

                                                      Sorry for being sloppy.

                                                      I inserted a current-page-number between the 'Sm', and I get this:

                                                       

                                                      s=app.selection[0]; sc=s.contents;
                                                      for (i=0; i<sc.length; i++) print(i+"<"+sc[i].charCodeAt(0)+"> '"+sc[i]+"'  "+s.characters[i].contents);
                                                      0<65279> ''  
                                                      1<74> 'J'  J
                                                      2<111> 'o'  o
                                                      3<104> 'h'  h
                                                      4<110> 'n'  n
                                                      5<65279> ''  
                                                      6<32> ' '   
                                                      7<65279> ''  
                                                      8<83> 'S'  S
                                                      9<24> ' '  1396797550
                                                      10<109> 'm'  m
                                                      11<105> 'i'  i
                                                      12<116> 't'  t
                                                      13<104> 'h'  h
                                                      14<65279> ''  

                                                       

                                                      But I guess these XML things are not the same as the SpecialCharacters enumerators.

                                                       

                                                      It's certainly easy to filter out the 65279 characters, but that's not really sufficient. And I had thought that using anything other than .characters was supposed to save you from these things... But apparently not...
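To illustrate why filtering 65279 alone is not enough, here is a sketch on a plain string that mirrors the dump above. The string literal and the helper name are stand-ins, not InDesign API calls. After the U+FEFF markers are removed, the page-number code 24 is still in the text:

```javascript
// Hypothetical helper: list "index<code>" for every UTF-16 unit in a string.
function dumpCodes(text) {
    var lines = [], i;
    for (i = 0; i < text.length; i++) {
        lines.push(i + "<" + text.charCodeAt(i) + ">");
    }
    return lines;
}

// Same sequence of codes as the dump above.
var contents = "\uFEFFJohn\uFEFF \uFEFFS\u0018mith\uFEFF";

// Drop only the 65279 (U+FEFF) markers.
var filtered = contents.replace(/\uFEFF/g, "");

var codes = dumpCodes(filtered);
// filtered.length is 11, not 10: codes[6] is "6<24>", the page-number marker.
```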

                                                       

                                                      *confused again*

                                                      (And then the Jive forum just ate my post. grr.)

                                                      • 24. Re: Filter text.contents (removing special characters)
                                                        Loic.Aigon Adobe Community Professional

                                                        I remember the first time I faced these "transparent" characters. I was comparing <tag>foo</tag>.contents to foo and it returned false. It didn't make any sense that foo was different from foo until I checked the lengths and got 5 for one and 3 for the other. That is when I realized there were extra characters.

                                                        I remember I made a topic to warn people, because it's really disturbing when you don't know.
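The mismatch described here is easy to reproduce on a plain string. In this sketch the U+FEFF characters stand in for the tag markers (65279 in the dump earlier in the thread), and the variable names are illustrative:

```javascript
// "foo" wrapped in two invisible tag markers.
var tagged = "\uFEFFfoo\uFEFF";
var plain = "foo";

var looksEqual = (tagged === plain);    // false
var taggedLength = tagged.length;       // 5
var plainLength = plain.length;         // 3

// Comparing after removing the markers gives the expected result.
var equalOnceStripped = (tagged.replace(/\uFEFF/g, "") === plain); // true
```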

                                                        • 25. Re: Filter text.contents (removing special characters)
                                                          Laubender Adobe Community Professional & MVP

                                                          @John,
                                                          thank you for that line of code.

                                                           

                                                          I just experimented a bit with that (InDesign CS4 6.0.6 German Version).
                                                          In the case of footnotes I get strange results.

                                                           

                                                          If my selection is a single footnote, $.writeln returns absolutely nothing to the JavaScript console.

                                                          If my selection is a footnote plus an arbitrary character (could be a second footnote), JavaScript console is showing both characters.
                                                          In the case of two footnotes:

                                                           

                                                          0    <4>' '      1399221837
                                                          1    <4>' '      1399221837

                                                           

                                                          In the case of a footnote, it seems there must always be a second character to trigger a result.

                                                           

                                                          Uwe