
GREP: using positive lookbehind/positive lookahead

Mar 9, 2013 3:03 AM

Tags: #and #grep #lookbehind #lookahead

I fail to find the correct GREP for a seemingly very easy query: within an "abc" string, find "b" (without "a" or "c"). As simple as this.

 

Here are the details: I have a text with many occurrences of this sequence:

 

~hl xxx

~he xxx

~hd xxx

~hf xxx

~hb xxx

 

That is, five consecutive paragraphs, each beginning with a code consisting of a tilde (~) followed by h[ledfb] and a space, and then any characters follow (here represented by xxx). The same codes appear in other sequences as well.

 

Now in such a sequence (and only in such), the paragraph beginning with ~hf should become the second, so that the sequence is changed to:

 

~hl xxx

~hf xxx

~he xxx

~hd xxx

~hb xxx

 

I am planning to use a combination of Positive Lookbehind / Positive Lookahead search. This should find each paragraph beginning with "~hf ", only if it occurs after the three paragraphs and before the one paragraph mentioned above. I could then copy the match to the clipboard and move it to the correct place. So I was trying to use this grep:

 

(?<=hl [^\~]+\r\~he [^\~]+\r\~hd [^\~]+\r)\~hf [^\~]+\r(?=\~hb[^\~]+\r)

 

The [^\~]+ bits make sure no other code is being matched.

 

For some reason this does not match anything. Why? (If I omit the lookbehind and lookahead bits, it works.)

 

Any help greatly appreciated!

 

Message was edited by: samar02

 
Replies
    Mar 9, 2013 4:10 AM   in reply to samar02

    Let me start by saying what I know about GREP I learned here from Jongware and Peter Kahrel, and from Peter's book, which I highly recommend.

     

    Either one of them may come along with a better solution as soon as I'm done, but in the meantime I think the lookarounds don't work because the text they have to match could be of variable length (each [^\~]+ can match any number of characters). I do have an alternate plan that DOES seem to work, however, and as a bonus it eliminates the need to use the clipboard.
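To illustrate the variable-length point, here is a minimal sketch in Python. Python's re module is not the engine InDesign uses, so this only shows how one common regex engine treats the same idea: it refuses to compile a lookbehind containing +, while a lookbehind of fixed width is accepted. The sample string and the variable names are made up for the demonstration.

```python
import re

# The lookbehind from the question, with the unnecessary tilde escapes removed.
# Each [^~]+ can match any number of characters, so the lookbehind has no fixed width.
variable = r"(?<=~hl [^~]+\r~he [^~]+\r~hd [^~]+\r)~hf [^~]+\r(?=~hb [^~]+\r)"

try:
    re.compile(variable)
    lookbehind_error = None
except re.error as exc:
    # Python rejects the pattern outright instead of silently matching nothing.
    lookbehind_error = str(exc)

# A lookbehind that always covers exactly one character (the preceding return) compiles.
fixed = re.compile(r"(?<=\r)~hf [^~\r]+\r(?=~hb )")
sample = "~hl aaa\r~he bbb\r~hd ccc\r~hf ddd\r~hb eee\r"
match = fixed.search(sample)

print(lookbehind_error)
print(repr(match.group(0)))
```

The fixed-width version no longer checks the three preceding paragraphs, which is why capturing the whole sequence in groups is the more practical route here.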

     

    Use the search expression

    (\~hl [^\~]+)(\~he [^\~]+)(\~hd [^\~]+)(\~hf [^\~]+)(\~hb [^\~]+)

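    and then use a sequence of $1 through $5 in the Change To field to rearrange the order. For example, $1$2$5$3$4 will move the ~hb paragraph after the ~he and before the ~hd; for the order you asked for (~hf in second place), that would be $1$4$2$3$5.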
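The same capture-and-reorder idea can be tried outside InDesign. This is a rough sketch in Python, assuming "\r" stands in for the paragraph return and using made-up paragraph text; note that Python writes the group references as \1 to \5 where InDesign uses $1 to $5.

```python
import re

# Five capture groups, one per paragraph of the sequence.
pattern = re.compile(r"(~hl [^~]+)(~he [^~]+)(~hd [^~]+)(~hf [^~]+)(~hb [^~]+)")

# Hypothetical story text; "\r" plays the role of InDesign's paragraph return.
story = "~hl aaa\r~he bbb\r~hd ccc\r~hf ddd\r~hb eee\r"

# Put group 4 (the ~hf paragraph) directly after group 1 (the ~hl paragraph).
reordered = pattern.sub(r"\1\4\2\3\5", story)

for paragraph in reordered.split("\r"):
    if paragraph:
        print(paragraph)
```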

     

    I thought you would need (?s) at the beginning of the query, the "single line" marker that causes the entire story to be treated as a single paragraph, but that doesn't seem to be the case. Also, spaces and returns are picked up by the [^\~] (not a tilde) class, so you don't need to include them explicitly unless you want to be sure that they are in precise positions, like the space after the ~hl. I left those spaces in, but there is no point in leaving in the \r, as it cannot stop the match unless the return itself is added to the negated class. This leaves the possibility that you could have intervening paragraphs that don't start with a tilde, and as far as the match is concerned they would be considered part of whatever tilde paragraph comes before them.

     

    If you want the query to not match any paragraph that doesn't start with a tilde (in other words, it should fail if there is an intervening paragraph in the list without the prefix), I think you could modify it as follows:

     

    (\~hl [^\~\r]+\r)(\~he [^\~\r]+\r)(\~hd [^\~\r]+\r)(\~hf [^\~\r]+\r)(\~hb [^\~\r]+\r)

    which also will not match if your ~hb paragraph is the last in the story and does not have a trailing return. In the first query it is found, which can lead to combining it with another paragraph (I know this because it happened in my test).
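The difference between the two queries can be checked with a small Python sketch. It assumes "\r" as the paragraph return and uses invented sample text; Python's re is only a stand-in for InDesign's engine, so treat the results as an illustration of the logic rather than of InDesign itself.

```python
import re

# First query: returns are swallowed by the "not a tilde" class.
loose = re.compile(r"(~hl [^~]+)(~he [^~]+)(~hd [^~]+)(~hf [^~]+)(~hb [^~]+)")

# Second query: each group must be exactly one paragraph ending in a return.
strict = re.compile(
    r"(~hl [^~\r]+\r)(~he [^~\r]+\r)(~hd [^~\r]+\r)(~hf [^~\r]+\r)(~hb [^~\r]+\r)"
)

clean = "~hl aaa\r~he bbb\r~hd ccc\r~hf ddd\r~hb eee\r"
intervening = "~hl aaa\rplain text\r~he bbb\r~hd ccc\r~hf ddd\r~hb eee\r"
no_final_return = "~hl aaa\r~he bbb\r~hd ccc\r~hf ddd\r~hb eee"

# Which query matches which sample.
report = {
    "clean": (bool(loose.search(clean)), bool(strict.search(clean))),
    "intervening": (bool(loose.search(intervening)), bool(strict.search(intervening))),
    "no_final_return": (bool(loose.search(no_final_return)), bool(strict.search(no_final_return))),
}

# With the loose query, moving a last paragraph that has no trailing return
# glues it onto the paragraph that now follows it.
merged = loose.sub(r"\1\2\5\3\4", no_final_return)

print(report)
print(repr(merged))
```

In the last case the ~hb paragraph ends up fused with the ~hd paragraph, which matches the paragraph-combining behaviour described above.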

     
    Mar 9, 2013 4:33 AM   in reply to Peter Spier

    I've been thinking about the [^\~\r] that I used, and although it SEEMS to work, I don't understand why it would. Logically a tilde is not a return, and a return is not a tilde and I would think this would go into a meltdown loop and hang or crash, or match everything, essentially becoming the equivalent of .+. Would one of you experts out there care to comment on the inner workings of a class? It seems like the negation includes an implied boolean "and" rather than the implied "or" of a regular class.

     

    Or is it just an accident that it worked?
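On how a negated class behaves: in the regex engines I am sure about, [^\~\r] matches one character that is not in the listed set, that is, a character that is not a tilde AND not a return. That is simply the negation of the ordinary class's "tilde OR return". A small Python sketch of that reading follows (Python's re, not InDesign's engine; the helper name is made up):

```python
import re

negated = re.compile(r"[^~\r]")


def matches_negated(ch: str) -> bool:
    """True if the single character ch is accepted by [^~\\r]."""
    return negated.fullmatch(ch) is not None


# Compare the class against the explicit "not tilde AND not return" test.
checks = {ch: matches_negated(ch) for ch in ["a", " ", "~", "\r"]}
agrees = all(matches_negated(ch) == (ch != "~" and ch != "\r") for ch in checks)

# With + the class consumes characters until it meets either excluded one.
run = re.match(r"[^~\r]+", "xxx yyy\r~he zzz").group(0)

print(checks)
print(agrees)
print(repr(run))
```

So the class neither loops nor degenerates into .+; it stops at the first tilde or return, whichever comes first.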

     
    Mar 9, 2013 4:52 AM   in reply to Peter Spier

    I just checked, and substituting .+? for [^\~\r]+ also works (or seems to), the ? limiting the + to the shortest possible match. I understand from Peter's book, though, that the negative class is faster.
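The two forms are not strictly interchangeable, though. A hedged Python sketch of where they agree and where they differ is below; it uses "\r" for the return and invented text, and it says nothing about speed or about how InDesign's own engine treats the dot.

```python
import re

lazy = re.compile(r"~hf .+?\r")
negated = re.compile(r"~hf [^~\r]+\r")

# On an ordinary paragraph both patterns stop at the first return.
plain = "~hf ddd\r~hb eee\r"
lazy_plain = lazy.search(plain).group(0)
negated_plain = negated.search(plain).group(0)

# If the paragraph text itself contains a tilde, the dot accepts it,
# while the negated class refuses to cross it and the match fails.
with_tilde = "~hf d~d\r~hb eee\r"
lazy_tilde = lazy.search(with_tilde)
negated_tilde = negated.search(with_tilde)

print(repr(lazy_plain), repr(negated_plain))
print(lazy_tilde is not None, negated_tilde is not None)
```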

     
    Mar 9, 2013 6:51 AM   in reply to samar02

    Indeed, styles are a problem in this sort of movement, and beyond my current abilities to work around. Can you use Nested or GREP styles ( or is that what you are doing?) to apply the character styles?

     
