Positive lookahead not working

Report · Dec 17, 2016

Hi All,

I am trying to use a positive lookahead in my script, but it does not give me the desired results. The same expression does work outside InDesign:

https://regex101.com/r/QbWDyJ/2

I want to find every + that does not lie between < and >.

However, when I use it in a script I get blank results. Are there any limitations on using positive lookahead in InDesign?

-Manan

Report · Dec 18, 2016

That should work.

Make sure your findWhat string properly escapes the leading backslash:

app.findGrepPreferences.findWhat = "\\+(?=[^>]*?<)";
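To see why the double backslash matters, here is a small sketch in plain JavaScript (no InDesign objects involved; the variable names are made up for illustration). In a string literal, \+ is not a recognized escape sequence, so the backslash is dropped before the pattern ever reaches the GREP engine:

```javascript
// Hypothetical variable names; only the string literals matter here.
var single = "\+(?=[^>]*?<)";   // backslash is dropped: the string starts with "+"
var doubled = "\\+(?=[^>]*?<)"; // backslash is kept: the string starts with "\+"

var singleKeepsBackslash = single.charAt(0) === "\\";   // false
var doubledKeepsBackslash = doubled.charAt(0) === "\\"; // true
```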

@+

Marc

Report · Dec 18, 2016

Hi Marc,

Thanks for replying

I am not using this grep expression to search within the InDesign document; I am using it to search within a string in my JSX. Sample code is below.

Running the code on Extendscript Toolkit engine or InDesign CC 2015 engine gives the same result.

var input = "<ab+c>he+ll+o<de+f>"

input.match(/\+(?=[^>]*?<)/g)

This returns null

An interesting thing: if I use "<ab+c>he+ll+o+<de+f>" as the input,

it does find the + after o.

Something seems to be missing.
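For comparison, this is what a standards-compliant JavaScript engine should return for the same two inputs (Node.js or a browser is an assumption about where one might test it; the variable names are only for illustration). The results in the comments apply to such an engine, not to ExtendScript, which returns null for the first input as described above:

```javascript
var lookahead = /\+(?=[^>]*?<)/g;

// Only the "+" signs followed by a "<" before any ">" qualify.
var first = "<ab+c>he+ll+o<de+f>".match(lookahead);
// Standards-compliant engine: ["+", "+"] (the two inside "he+ll+o")

var second = "<ab+c>he+ll+o+<de+f>".match(lookahead);
// Standards-compliant engine: ["+", "+", "+"]
```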

-Manan

Report · Dec 18, 2016

Aha!

I've tried:

\+(?![^<]*?>)

And it finds the 4 "+"!

(^/)

Report · Dec 18, 2016

Hi Obi,

I am trying to find just the + signs inside the word hello, i.e. no + that lies between < and > should be matched. Based on this, I see your results are not correct.

-Manan

Report · Dec 18, 2016

Ah, OK, so you aren't using InDesign GREP at all. Welcome to ExtendScript's buggy implementation of JavaScript RegExp 😕

[ Remember: GREP ≠ ExtendScript RegExp ≄ JavaScript RegExp ]

In your case, I suspect the non-greedy operator ? used after the * quantifier MAY NOT WORK in a lookahead sequence. We must check this though. (There are many bugs related to quantifiers and assertions in ExtendScript regular expressions.)

@+

Marc

Report · Dec 18, 2016

Because ExtendScript RegExp behaves unexpectedly with the non-greedy operator, you may try this regex instead:

/\+(?=[^><+]*(?:[\+<]|$))/g

EDIT: But this won't work if you can have multiple '+' in a tag 😕
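A quick sketch of that limitation, replacing each match with "|" so the matched positions are visible (the second input and the variable names are made up for illustration; results are for a standards-compliant engine):

```javascript
var positiveOnly = /\+(?=[^><+]*(?:[\+<]|$))/g;

// One "+" per tag: only the "+" signs outside the tags are replaced.
var oneInTag = "<ab+c>he+ll+o<de+f>".replace(positiveOnly, "|");
// "<ab+c>he|ll|o<de+f>"

// Two "+" in the same tag: the first one is followed by "b+",
// which satisfies the lookahead, so it is wrongly matched.
var twoInTag = "<a+b+c>he+ll+o".replace(positiveOnly, "|");
// "<a|b+c>he|ll|o"
```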

@+

Marc

Report · Dec 18, 2016

In case you have patterns of this form, AA+BB+CC<DD+EE+FF>GG+HH+II<JJ+KK>LL+MM etc., I think you need both a negative and a positive lookahead to only capture '+' outside of the tags.

Then try this:

/\+(?![^<>]*>)(?=[^><+]*(?:[\+<]|$))/g
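As a rough check, here is the regex applied to the sample pattern above, again replacing each match with "|" to show which plus signs are caught (variable names are only for illustration; the results are what a standards-compliant engine gives, so it is worth re-running in ExtendScript itself):

```javascript
var outsideTags = /\+(?![^<>]*>)(?=[^><+]*(?:[\+<]|$))/g;
var sample = "AA+BB+CC<DD+EE+FF>GG+HH+II<JJ+KK>LL+MM";

var hits = sample.match(outsideTags);        // 5 matches, all "+"
var marked = sample.replace(outsideTags, "|");
// "AA|BB|CC<DD+EE+FF>GG|HH|II<JJ+KK>LL|MM"
```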

@+

Marc

Report · Dec 20, 2016

Thank you Marc, this seems to be doing the trick. However, I have a few quick questions.

What are the limitations (bugs) of regex in ExtendScript, and are they documented somewhere? If this is something you learned from experience, what in your opinion should be avoided?
The regex you gave seems a bit intimidating to me at first look. It takes me a fair bit of time to come up with a regex. I will try to understand it, and if I fail I will give you a cry for help. I hope it won't be a great inconvenience.

Thanks a lot Marc and Obi

Report · Dec 21, 2016

Hi Manan,

> What are the limitations of regex in Extendscript, is it documented somewhere?

It's hard to summarize, and I don't think an exhaustive report of ExtendScript RegExp issues has been published. Very basically, we know from experience that we can encounter backtracking problems with quantifiers. These may involve the lastIndex property, the greedy vs. non-greedy suffix operators (*?, +?, etc.), and/or lookahead assertions. Some facts, among many others, have been discussed here:

• Regular Expressions in CS5.5 - something is wrong

• Indiscripts :: InDesign Scripting Forum: 25 ‘sticky’ posts [roundup]

• Indiscripts :: InDesign Scripting Forum Roundup #2
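When a regex misbehaves in ExtendScript and the rule is as simple as "characters outside <…>", one option is to skip RegExp entirely and scan the string by hand. This is not something proposed in this thread, only a minimal sketch: the function name findPlusOutsideTags is hypothetical, and the code assumes tags are well formed and never nested.

```javascript
// Hypothetical helper: returns the indexes of every "+" that is not inside <...>.
// Assumes every "<" has a matching ">" and that tags are not nested.
function findPlusOutsideTags(text) {
    var indexes = [];
    var insideTag = false;
    for (var i = 0; i < text.length; i++) {
        var ch = text.charAt(i);
        if (ch === "<") {
            insideTag = true;
        } else if (ch === ">") {
            insideTag = false;
        } else if (ch === "+" && !insideTag) {
            indexes.push(i);
        }
    }
    return indexes;
}

var found = findPlusOutsideTags("<ab+c>he+ll+o<de+f>"); // [8, 11]
```

Since it uses no quantifiers or assertions, this approach does not depend on how the engine handles backtracking.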

> The regex you gave, seems to be a bit intimidating to me at first look. (…)

Yes, it's not an easy one, and I just realized that I complicated it unnecessarily. A simplified form, /\+(?![^<>]*>)/g, would probably work as well.

Anyway, let's try to explain my reasoning with a picture:

First of all, the regex looks for a plus sign \+, then it needs to satisfy two lookahead assertions to validate that match. Keep in mind that an assertion is not supposed to consume further characters in the string (that is, the inner index of the RegExp engine doesn't move during these validation steps), but the way assertions are designed deeply impacts how the whole regex works.

• In red, the NEGATIVE LOOKAHEAD (NL) assertion, (?!pattern), means that, from the current point, the embedded pattern MUST FAIL. In other words, if that pattern is found, then the condition is not satisfied and the plus sign under consideration won't be captured as a match (regardless of the other assertion, which won't be tested at all). Otherwise, the condition is satisfied and the other assertion is tested.

• In green, the POSITIVE LOOKAHEAD (PL) assertion, (?=pattern), means that, from the same current point (since the index has not moved), the embedded pattern MUST SUCCEED. If that pattern is found, all is fine and the plus sign under consideration is definitely a match. Otherwise, it is ignored.

So those two assertions work as a logical AND: (NL pattern must be KO) AND (PL pattern must be OK.)

Why do we need a negative pattern? To prevent any plus sign nested in a <…> tag from being validated. This is done by testing the pattern [^<>]*>, which means "any non-markup character (zero or more times), then a closing mark". This pattern can only succeed from within a tag, as it needs to find a form "XXX>" where X is neither a "<" nor a ">".
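A minimal sketch of the negative pattern at work on the input from earlier in the thread (variable names are only for illustration; results are for a standards-compliant engine):

```javascript
// The inner pattern, anchored at the text that follows a given "+".
var closesTag = /^[^<>]*>/;
var afterPlusInTag = closesTag.test("c>he+ll+o<de+f>"); // true: we are inside <ab+c>
var afterPlusInWord = closesTag.test("ll+o<de+f>");     // false: a "<" comes before any ">"

// The same pattern used as a negative lookahead.
var negativeOnly = /\+(?![^<>]*>)/g;
var shown = "<ab+c>he+ll+o<de+f>".replace(negativeOnly, "|");
// "<ab+c>he|ll|o<de+f>"
```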

Why should we need a positive pattern? We shouldn't, in fact! In your case, indeed, the previous condition is sufficient to assert that the plus sign under consideration is outside any tag. That's it 🙂 However, as I had no hint about extra conditions or constraints that your input string may undergo (I don't know what you are actually doing!), I found it safer to positively define the pattern in which the plus sign is expected. So I used the PL as something of a reinforcement.

What does the PL require? It looks for the pattern [^><+]*(?:[+<]|$), which is a complicated syntax for just saying "XXXY", where X is neither ">" nor "<" nor "+", and Y is either "+", or "<", or the end of the string ($). This explicitly describes any suffix string that must follow the plus sign under consideration. But, as already said, this positive lookahead is useless here since the negative assertion seems to cover all the issues.
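The logical AND of the two assertions can also be emulated by hand, which may help when checking what each one contributes. This is only a sketch: the helper name isPlusOutsideTag is hypothetical, and both patterns are anchored with ^ so that they are tested against the text right after the plus sign, as a lookahead would be.

```javascript
var nlPattern = /^[^<>]*>/;            // NL: must NOT match after the "+"
var plPattern = /^[^><+]*(?:[+<]|$)/;  // PL: must match after the "+"

// Hypothetical helper: applies both conditions to the text following text[plusIndex].
function isPlusOutsideTag(text, plusIndex) {
    var tail = text.substring(plusIndex + 1);
    return !nlPattern.test(tail) && plPattern.test(tail);
}

var source = "<ab+c>he+ll+o<de+f>";
var plusInTag = isPlusOutsideTag(source, 3);  // false: NL pattern matches "c>"
var plusInWord = isPlusOutsideTag(source, 8); // true: NL fails, PL matches "ll+"
```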

I also made a GIF to illustrate how the regex dynamically works:

Might be of some use.

@+

Marc

Adobe Community
