9 Replies Latest reply on Dec 21, 2016 11:51 AM by Marc Autret

    Positive lookahead not working

    Manan Joshi Level 4

      Hi All,

       

      I am trying to use a positive lookahead in my script, but it does not give me the desired results. The GREP expression does work outside InDesign, however.

      https://regex101.com/r/QbWDyJ/2

      I want to find every + that does not lie between < and >.

      However, when I use it in a script I get blank results. Are there any limitations on using positive lookahead in InDesign?

       

      -Manan

        • 1. Re: Positive lookahead not working
          Marc Autret Level 4

          That should work.

           

          Make sure your findWhat string properly escapes the leading backslash:

           

          app.findGrepPreferences.findWhat = "\\+(?=[^>]*?<)";
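
The escaping point can be checked outside InDesign with a minimal sketch. The variable names are illustrative only, and the stated results assume a standards-compliant JavaScript engine:

```javascript
// String literal: the backslash must be doubled so the regex engine receives "\+".
var asString = "\\+(?=[^>]*?<)";

// RegExp literal: a single backslash is enough.
var asLiteral = /\+(?=[^>]*?<)/;

// Both spellings describe the same pattern source.
var samePattern = (asString === asLiteral.source); // true

// Without the doubling, "\+" collapses to "+" inside the string,
// so the pattern would start with a bare quantifier.
var unescaped = "\+(?=[^>]*?<)";
var lostBackslash = (unescaped.charAt(0) === "+"); // true
```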

           

          @+

          Marc

          • 2. Re: Positive lookahead not working
            Manan Joshi Level 4

            Hi Marc,

             

            Thanks for replying

            I am not using this GREP expression to search within the InDesign document; I am using it to search within a string in my JSX. Sample code is below.

             

            Running the code on Extendscript Toolkit engine or InDesign CC 2015 engine gives the same result.

             

            var input = "<ab+c>he+ll+o<de+f>";

            input.match(/\+(?=[^>]*?<)/g);

             

            This returns null

            Interestingly, if I use "<ab+c>he+ll+o+<de+f>" as the input,

            it does find the + after o
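
For comparison, here is a sketch of what the same expression returns in a standards-compliant JavaScript engine (a browser or Node.js, for instance). The variable names are illustrative, and the commented results do not describe ExtendScript, which returns null for the first input as reported above:

```javascript
var re = /\+(?=[^>]*?<)/g;

// Only the plus signs in "he+ll+o" are followed by a "<" with no ">" in between.
var first = "<ab+c>he+ll+o<de+f>".match(re);
// standards-compliant engine: ["+", "+"]

// The extra "+" right before "<de+f>" is matched as well.
var second = "<ab+c>he+ll+o+<de+f>".match(re);
// standards-compliant engine: ["+", "+", "+"]
```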

             

            Something seems to be missing.

             

            -Manan

            • 3. Re: Positive lookahead not working
              Obi-wan Kenobi Adobe Community Professional

              Aha!

               

              I've tried:

               

              \+(?![^<]*?>)

               

              And it finds the 4 "+"!

               

              (^/)

              • 4. Re: Positive lookahead not working
                Manan Joshi Level 4

                Hi Obi,

                 

                I am trying to find just the + signs inside the word hello, i.e. no + that sits between < and > should be matched. Based on this, I see your results are not correct.

                 

                -Manan

                • 5. Re: Positive lookahead not working
                  Marc Autret Level 4

                  Ah, OK, so you aren't relying on GREP's wisdom. Welcome to ExtendScript's buggy implementation of JavaScript RegExp :-/

                   

                  [ Remember: GREP ≠ ExtendScript RegExp ≄ JavaScript RegExp ]

                   

                  In your case, I suspect the non-greedy operator ? used after the * quantifier MAY NOT WORK in a lookahead sequence. We must check this though. (There are many bugs related to quantifiers and assertions in ExtendScript regular expressions.)

                   

                  @+

                  Marc

                  • 6. Re: Positive lookahead not working
                    Marc Autret Level 4

                    Due to the way ExtendScript RegExp unexpectedly works with the non-greedy operator, you may try this regex instead:

                     

                    /\+(?=[^><+]*(?:[\+<]|$))/g

                     

                    EDIT: But this won't work if you can have multiple '+' in a tag :-/
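
The limitation mentioned in the EDIT can be reproduced with a small sketch. matchIndexes is a hypothetical helper (not an ExtendScript API), and the commented results assume a standards-compliant engine:

```javascript
// Hypothetical helper: collect the start index of every match of a global regex.
function matchIndexes(re, text) {
    var found = [], m;
    re.lastIndex = 0;
    while ((m = re.exec(text)) !== null) {
        found.push(m.index);
    }
    return found;
}

var positiveOnly = /\+(?=[^><+]*(?:[\+<]|$))/g;

// One "+" per tag: only the two plus signs of "he+ll+o" are reported.
var singlePlusTags = matchIndexes(positiveOnly, "<ab+c>he+ll+o<de+f>");
// [8, 11]

// Two "+" in one tag: the first one (index 2) is followed by "b+",
// which satisfies the lookahead, so it is reported by mistake.
var multiPlusTag = matchIndexes(positiveOnly, "<a+b+c>x+y");
// [2, 8]
```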

                     

                    @+

                    Marc

                    • 7. Re: Positive lookahead not working
                      Marc Autret Level 4

                      In case you have patterns of this form, AA+BB+CC<DD+EE+FF>GG+HH+II<JJ+KK>LL+MM etc., I think you need both a negative and a positive lookahead to only capture '+' outside of the tags.

                       

                      Then try this:

                       

                      /\+(?![^<>]*>)(?=[^><+]*(?:[\+<]|$))/g
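
As a quick check of this expression against the sample pattern, here is a sketch with a hypothetical matchIndexes helper. The commented indexes are what a standards-compliant engine yields, and ExtendScript should be tested separately:

```javascript
// Hypothetical helper: collect the start index of every match of a global regex.
function matchIndexes(re, text) {
    var found = [], m;
    re.lastIndex = 0;
    while ((m = re.exec(text)) !== null) {
        found.push(m.index);
    }
    return found;
}

var both = /\+(?![^<>]*>)(?=[^><+]*(?:[\+<]|$))/g;

// The plus signs at 11, 14 and 29 sit inside <...> and are skipped.
var sample = matchIndexes(both, "AA+BB+CC<DD+EE+FF>GG+HH+II<JJ+KK>LL+MM");
// [2, 5, 20, 23, 35]

// Several "+" in one tag no longer produce a false positive.
var multiPlusTag = matchIndexes(both, "<a+b+c>x+y");
// [8]
```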

                       

                      @+

                      Marc

                      • 8. Re: Positive lookahead not working
                        Manan Joshi Level 4

                        Thank you Marc, this seems to be doing the trick. However, I have a few quick questions.

                        • What are the limitations (bugs) of regex in ExtendScript? Are they documented somewhere? If this is something you learned from experience, what in your opinion should be avoided?
                        • The regex you gave seems a bit intimidating to me at first look. It takes me a fair bit of time to come up with a regex. I will try to understand it; if I fail, I will cry for help. I hope it won't be a great inconvenience.

                         

                        Thanks a lot Marc and Obi

                        • 9. Re: Positive lookahead not working
                          Marc Autret Level 4

                          Hi Manan,

                           

                          > What are the limitations of regex in Extendscript, is it documented somewhere?

                           

                          It's hard to summarize, and I don't think an exhaustive report of ExtendScript RegExp issues has been published. Very basically, we know from experience that we can encounter backtracking problems with quantifiers. These may involve the lastIndex property, the greedy vs. non-greedy suffix operators (*?, +?, etc.), and/or lookahead assertions. Some facts, among many others, have been discussed here:

                          Regular Expressions in CS5.5 - something is wrong

                          Indiscripts :: InDesign Scripting Forum: 25 ‘sticky’ posts [roundup]

                          Indiscripts :: InDesign Scripting Forum Roundup #2
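
When a pattern runs into one of these engine issues, one fallback is to avoid RegExp entirely. The following is a sketch of that alternative, not something proposed in the discussions listed above. The function name is hypothetical, and it assumes tags are never nested and every "<" is eventually closed:

```javascript
// Hypothetical regex-free scan: report the index of each "+" found outside <...>.
function plusOutsideTagsNoRegex(text) {
    var found = [], inTag = false, i, ch;
    for (i = 0; i < text.length; i++) {
        ch = text.charAt(i);
        if (ch === "<") {
            inTag = true;          // entering a tag
        } else if (ch === ">") {
            inTag = false;         // leaving a tag
        } else if (ch === "+" && !inTag) {
            found.push(i);         // plus sign in plain text
        }
    }
    return found;
}

var scanned = plusOutsideTagsNoRegex("<ab+c>he+ll+o<de+f>");
// [8, 11]
```

On malformed input (an unclosed "<", for instance) this scan and the lookahead-based expressions may disagree, so the choice depends on what the input can contain.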

                           

                          > The regex you gave, seems to be a bit intimidating to me at first look. (…)

                           

                          Yes, it's not an easy one, and I just realized that I complicated it unnecessarily. A simplified form, /\+(?![^<>]*>)/g, would probably work as well.

                           

                          Anyway, let's try to explain my reasoning with a picture:

                           

                          RegExplain.png

                           

                          First of all, the regex looks for a plus sign, \+, which then needs to satisfy two lookahead assertions to validate the match. Keep in mind that an assertion is not supposed to consume further characters in the string (that is, the inner index of the RegExp engine doesn't move during these validation steps), but the way assertions are designed deeply impacts how the whole regex works.

                           

                          • In red, the NEGATIVE LOOKAHEAD (NL) assertion, (?!pattern), means that, from the current point, the embedded pattern MUST FAIL. In other words, if that pattern is found, then the condition is not satisfied and the plus sign under consideration won't be captured as a match (whatever the other assertion, which won't be tested at all.) Otherwise, the condition is satisfied and the other assertion is tested.

                           

                          • In green, the POSITIVE LOOKAHEAD (PL) assertion, (?=pattern), means that, from the same current point (since the index has not moved), the embedded pattern MUST SUCCEED. If that pattern is found, all is fine and the plus sign under consideration is definitely a match. Otherwise, it is ignored.

                           

                          So those two assertions work as a logical AND: (NL pattern must be KO) AND (PL pattern must be OK.)
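
The logical AND can be spelled out by hand, testing each assertion's pattern against the text that follows a plus sign. This is only a sketch of the reasoning: the names are hypothetical, the patterns are anchored with ^ because they are applied to the remainder of the string, and no lookahead is involved:

```javascript
var NL_PATTERN = /^[^<>]*>/;            // must FAIL on the text after the plus sign
var PL_PATTERN = /^[^><+]*(?:[+<]|$)/;  // must SUCCEED on the text after the plus sign

// Hypothetical helper: index of each "+" for which (NL is KO) AND (PL is OK).
function plusOutsideTags(text) {
    var found = [], i, rest;
    for (i = 0; i < text.length; i++) {
        if (text.charAt(i) !== "+") { continue; }
        rest = text.substring(i + 1);   // same starting point for both tests
        if (!NL_PATTERN.test(rest) && PL_PATTERN.test(rest)) {
            found.push(i);
        }
    }
    return found;
}

var validated = plusOutsideTags("<ab+c>he+ll+o<de+f>");
// [8, 11]
```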

                           

                          Why do we need a negative pattern? To prevent any plus sign nested in a <…> tag from being validated. This is done by testing the pattern [^<>]*>, which means non-markup sign (zero or more times) then a closing mark. This pattern can only succeed from within a tag as it needs to find a form "XXX>" where X is neither a "<" nor a ">".

                           

                          Why should we need a positive pattern? We shouldn't, in fact! In your case, indeed, the previous condition is sufficient to assert that the plus sign under consideration is outside any tag. That's it :-) However, as I had no hint about extra conditions or constraints that your input string may be subject to (I don't know what you are actually doing!), I found it safer to positively define the pattern in which the plus sign is expected. So I used the PL as something of a reinforcement.

                           

                          What does the PL require? It looks for the pattern [^><+]*(?:[+<]|$), which is a complicated syntax for just saying "XXXY", where X is neither ">" nor "<" nor "+", and Y is either "+", or "<", or the end of the string ($). This explicitly describes any suffix string that must follow the plus sign under consideration. But, as already said, this positive lookahead is redundant here since the negative assertion seems to cover all the cases.
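
The redundancy can be checked on sample inputs by comparing the full expression with the negative-only form. This is a sketch with a hypothetical matchIndexes helper, and the commented results assume a standards-compliant engine:

```javascript
// Hypothetical helper: collect the start index of every match of a global regex.
function matchIndexes(re, text) {
    var found = [], m;
    re.lastIndex = 0;
    while ((m = re.exec(text)) !== null) {
        found.push(m.index);
    }
    return found;
}

var full = /\+(?![^<>]*>)(?=[^><+]*(?:[\+<]|$))/g;
var negativeOnly = /\+(?![^<>]*>)/g;

var input = "AA+BB+CC<DD+EE+FF>GG+HH+II<JJ+KK>LL+MM";
var withBoth = matchIndexes(full, input);             // [2, 5, 20, 23, 35]
var withNegative = matchIndexes(negativeOnly, input); // [2, 5, 20, 23, 35]
```

The two agree because the PL can only fail when a ">" comes before any "<" or "+", and in that situation the NL has already rejected the plus sign.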

                           

                          I also made a GIF to illustrate how the regex dynamically works:

                           

                          RegExplain.gif

                           

                          Might be of some use.

                           

                          @+

                          Marc
