5 Replies Latest reply on Apr 25, 2015 8:04 AM by Brad_Stone

    Regex: since no negative look behind, what is the best way ...

    Brad_Stone

      I have a great Photoshop scripting routine that uses regular expressions to find all of the parts of a string that are surrounded by underlines.

       

      Regex:  /_([\s\S]*?)_/g

       

      Text:  Match on _this_ and also _on this_ and even _on this too_.

       

      ... and life was nice, until my paragraph contained a URL that had underlines in it!

       

      Now, I want to make sure that if I match on an underline, it isn't an underline within a URL.

       

      I know that URLs don't have spaces, so I modified my regular expression to say "when you find a match, look back to see if there is an http without at least one space between it and the match".

       

      Regex: /(?<!http) +_([\s\S]*?)_/g

       

      Text: Match on _this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_.

       

      ... but alas, it appears that this implementation of JavaScript doesn't support negative look behind.

       

      So, can anyone think of an elegant regular expression that matches on parts of a string that are surrounded by underscores, unless they are within a URL?
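      One way to get this effect without look behind is to let the regular expression match URLs first and then throw those matches away. The sketch below is only an illustration, not part of the original routine. It assumes that a URL is a run of letters followed by :// and then non-whitespace characters, and findUnderlined is a made-up name. It does not handle an underlined span that itself contains a URL.

```javascript
// Sketch: the first alternative swallows whole URLs, so their underscores
// can never start a match of the second alternative.
// Assumption: a URL looks like letters + "://" + non-whitespace characters.
function findUnderlined(text) {
    var re = /[a-zA-Z]+:\/\/\S+|_([\s\S]*?)_/g;
    var found = [];
    var m;
    while ((m = re.exec(text)) !== null) {
        // A URL match always starts with a letter, so a leading "_"
        // means the underscore alternative matched.
        if (m[0].charAt(0) === "_") {
            found.push(m[1]);
        }
    }
    return found;
}

var sample = "Match on _this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_.";
var result = findUnderlined(sample);
// result is ["this", "on this", "leave the URL"]
```

      Checking the first character of the match, rather than testing whether the capture group is undefined, is a deliberate choice here: older engines differ in what they report for a group that did not take part in the match.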

       

          - Brad

        • 1. Re: Regex: since no negative look behind, what is the best way ...
          Pedro Cortez Marques Level 3
          var myStr = "_this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_";
          // first remove all links from string, then use your RegExp
          $.writeln(myStr.replace(/\shttp.+?\s/g,'').match(/_([\s\S]*?)_/g));
          
          • 2. Re: Regex: since no negative look behind, what is the best way ...
            Brad_Stone Level 1

            Pedro,

             

              Thanks for taking time to reply.  Your solution is nice and concise.  However, I am wondering if only a regular expression can be used.  Let me provide a few more details.

             

              It turns out that I don't need the characters that are within the underlines.  What I need are their character positions.  Here is the scenario.

             

              The function takes a string and returns [0] the original text; [1] the original text with the underlines removed; [2] the number of times underlines were removed; and [3...] pairs with the start and end positions of the text that used to have underlines around it.

             

            Text into function:

            Match on _this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_.

             

            Function returns:

            returnArray[0] = Match on _this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_.

            returnArray[1] = Match on this and also on this but if http://mysite.com/more_url.with?underscores_then.no.match until after you leave the URL.

            returnArray[2] = 3

            returnArray[3...] = 9,13,23,30,112,125

             

             

              I am not an experienced JavaScript programmer, so at the risk of putting my (inefficient and ugly?) code on display, here is the code that works - except for ignoring underscores in the URL.

             

            function parseLine(textString) {
                    // Will take a textString and find all the words that have _underlines on either side_.
                    // Will then return an array:
                    // [0] = original textString
                    // [1] = textString with the _ removed
                    // [2] = number of replacements made
                    // [a,b ...] = pairs of numbers for the start and end characters where the underlines were
            
            
                    // Regex to find the words between the _underlines_
                    // var myRe = /(?<!http) +_([\s\S]*?)_/g;  <- would work if look behind were supported.
                    var myRe = /_([\s\S]*?)_/g;

                    // Set up the first three indexes in the returned array: original textString, newTextString (without underlines), number of replacements
                    var changeIndex = [textString, "", 0];

                    var myArray;
                    var chopText = textString;
                    var newText = textString; // stays equal to the input when nothing matches
                    var numReplace = 0;

                    // Loop through all the matches
                    // Remove the underlines, count the number of replacements, and record the places in the text where they were.
                    while ((myArray = myRe.exec(chopText)) !== null) {
                        // record begin and end point
                        changeIndex.push(myArray.index, (myArray.index + myArray[1].length));

                        // remove the underlines from this match only
                        // (slicing by position avoids String.replace treating "$" in the matched text specially)
                        newText = chopText.slice(0, myArray.index) + myArray[1] + chopText.slice(myArray.index + myArray[0].length);
                        chopText = newText;

                        // chopText is now two characters shorter, so resume the search right after the text that was just unwrapped
                        myRe.lastIndex = myArray.index + myArray[1].length;

                        // Count the number of replacements
                        numReplace++;

                    }
                    // put the text without the underlines into the index to be returned.
                    changeIndex[1] = newText;
            
                    // put the number of replacements into the index to be returned.
                    changeIndex[2] = numReplace;
            
                    $.writeln(changeIndex);
            
                    return (changeIndex);
            
                }
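            For comparison, here is a sketch of how the same return array could be built while leaving URLs alone. It is not the function from the post: parseLineSkippingUrls is a made-up name, and it assumes a URL is a run of letters, then ://, then non-whitespace characters. The URL alternative is tried first, so underscores inside a URL are copied through unchanged.

```javascript
// Sketch (hypothetical helper): same return shape as parseLine, but URLs are skipped.
// Assumption: a URL looks like letters + "://" + non-whitespace characters.
function parseLineSkippingUrls(textString) {
    var re = /[a-zA-Z]+:\/\/\S+|_([\s\S]*?)_/g;
    var positions = []; // start/end pairs, measured in the text without underlines
    var stripped = "";  // text rebuilt without the underlines
    var last = 0;       // end of the previous match in textString
    var count = 0;
    var m;

    while ((m = re.exec(textString)) !== null) {
        // copy the text between the previous match and this one
        stripped += textString.slice(last, m.index);
        if (m[0].charAt(0) === "_") {
            // underlined span: record where its contents land, then copy them
            positions.push(stripped.length, stripped.length + m[1].length);
            stripped += m[1];
            count++;
        } else {
            // URL: copy it unchanged
            stripped += m[0];
        }
        last = m.index + m[0].length;
    }
    stripped += textString.slice(last); // whatever follows the last match

    return [textString, stripped, count].concat(positions);
}

var parsed = parseLineSkippingUrls("See _this_ at http://a.b/c_d_e and _that_.");
// parsed[1] is "See this at http://a.b/c_d_e and that."
// parsed[2] is 2, and the remaining entries are 4, 8, 33, 37
```

            Because the output string is rebuilt piece by piece, the positions are always measured in the text that has the underlines removed, and the input string is never modified while the regular expression is still scanning it.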
            
            • 3. Re: Regex: since no negative look behind, what is the best way ...
              Pedro Cortez Marques Level 3

              Hope it helps, Brad

               

              var myStr = "_this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_";  
              // Only one RegExp
              $.writeln(myStr.match(/(^_([\s\S]*?)_\s)|(\s_([\s\S]*?)_\s)|(\s_([\s\S]*?)_$)/g).join('\n'));
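              A note on the pattern above: each alternative consumes the whitespace around the underlined group, so the matches include those spaces, and when two groups are separated by a single space the second one is skipped. The sketch below is a possible variant, not Pedro's code. It uses a lookahead for the trailing whitespace, which is part of standard JavaScript regular expressions; that this particular scripting engine handles it correctly is an assumption. findSpaced is a made-up name.

```javascript
// Sketch: require start-of-string or whitespace before the opening underscore,
// but only peek at what follows the closing underscore instead of consuming it.
function findSpaced(text) {
    var re = /(^|\s)_([\s\S]*?)_(?=\s|$)/g;
    var found = [];
    var m;
    while ((m = re.exec(text)) !== null) {
        found.push(m[2]); // the text between the underscores
    }
    return found;
}

var spans = findSpaced("_a_ _b_ and http://x.y/c_d_e then _f g_");
// spans is ["a", "b", "f g"]
```

              Like the original pattern, this still needs whitespace or the end of the string right after the closing underscore.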
              
              • 4. Re: Regex: since no negative look behind, what is the best way ...
                Brad_Stone Level 1

                Pedro,

                 

                Your solution does a good job on the string that I specified.  Thanks!  As I dug in further, I found that there are cases when I also need to deal with _text_text (i.e. no space following the underlined group).  I should have posted a more complete example.

                 

                I couldn't figure out a one-liner Regex, so I wrote a small function.  I will post it in the hopes that it helps others.

                 

                Thanks again for taking the time to reply!

                • 5. Re: Regex: since no negative look behind, what is the best way ...
                  Brad_Stone Level 1

                  I gave up trying to figure out a one-liner Regex, so I wrote a function that can take a string containing zero or more URLs and do a search and replace only within the URL.

                   

                  This looks like a really long function, but if you remove the comments and debug code, it is only about a dozen lines long.  It is written for readability rather than efficiency.

                   

                  I hope that this helps others.

                   

                  
                  function replaceURL(textString, charFind, charRepl) {
                      var debug = true // set to true if you want to see all the steps in the console window
                  
                      // Look for any URLs and replace the {charFind} with {charRepl}
                      // NOTE: take care with {charFind}: a special Regex character must be escaped before it is passed in.
                      // {charRepl} must not be a character that could already be in the URL (otherwise the replacement is not reversible).
                      // The # character is not a special Regex character, but it does mark the start of a URL fragment, so it is only a safe replacement when the URLs contain no fragment.
                      //Special Regex:  \^$.|?*+()[{
                      //Special URL:  $-_.+!*'(),
                  
                      // Finds all {charFind} characters within a URL.
                      // Look for at least one word character \w+ followed by a ://  (e.g. http://, ftp://, etc.)
                      // URLs can't have spaces, so continue through non whitespace characters \S*? until you find the {charFind} followed by any number of non whitespace characters \S*.
                      // Note that we have to double escape the special characters because we first build a string and then the string is converted to Regex, which is the only way to put
                      // a variable like charFind into the Regex.
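                      // [\s\S] is used instead of . for the text before and after the URL so that line breaks in textString are kept.
                      var myReURLString = "([\\s\\S]*?)(\\w+:\\/\\/\\S*?" + charFind + "\\S*)([\\s\\S]*)";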
                      var myReURL = new RegExp(myReURLString);
                      var myReFind = new RegExp(charFind, "g");
                  
                      if (debug) {
                          $.writeln("Searching for " + charFind + " and replacing with " + charRepl + "\n");
                          $.writeln("Regex to find a URL with a {charFind} within the URL: " + myReURL + "\n");
                          $.writeln("Regex to find the {charFind} within the URL: " + myReFind + "\n\n");
                      }
                  
                      // Each pass through the loop will find a URL with the specific character and process it.
                      // Take the textString, and split it into 3 parts: everything before the nth URL, the URL, and everything after the nth URL.
                      // Then do a search through the URL for all instances of charFind and replace with charRepl
                      // Finally, put all three parts back together again.
                      // Repeat until there are no more URLs to process
                      var myParts; // declared here so it does not leak into the global scope
                      while ((myParts = myReURL.exec(textString)) !== null) {
                          // [0] is original textString
                          // [1] is everything before the URL
                          // [2] is the URL
                          // [3] is everything after the URL
                  
                          if (debug) {
                              $.writeln("==== Starting ====\ntextString:\n" + textString + "\n");
                              $.writeln("---- Before ----\n");
                              $.writeln("myParts [1]: \n" + myParts[1] + "\n\n" +
                                  "myParts [2]: \n" + myParts[2] + "\n\n" +
                                  "myParts [3]: \n" + myParts[3] + "\n\n");
                          }
                  
                          // Replace all the {charFind} in myParts[2] with {charRepl}
                          myParts[2] = myParts[2].replace(myReFind, charRepl);
                  
                          if (debug) {
                              $.writeln("---- After ----\n");
                              $.writeln("myParts [1]: \n" + myParts[1] + "\n\n" +
                                  "myParts [2]: \n" + myParts[2] + "\n\n" +
                                  "myParts [3]: \n" + myParts[3] + "\n\n");
                          }
                  
                          // Now put it back together again
                          textString = myParts[1].concat(myParts[2], myParts[3]);
                          if (debug) {
                              $.writeln("textString:\n" + textString + "\n\n");
                          }
                      }
                  
                      return (textString);
                  
                  }
                  
                  var textString = "Lots of text _with underlines_ all _ through the _paragraph, but also a http://url.that.has/underlines_in-it__that.we.have.to.avoid followed _by_ more underlines__. And _here_ is another ftp://url.that_also_has_underlines in it_."
                  
                  var newTextString = replaceURL(textString, "_", "#");
                  alert("\n\nOUT: " + newTextString);
                  // Output is:
                  // Lots of text _with underlines_ all _ through the _paragraph, but also a http://url.that.has/underlines#in-it##that.we.have.to.avoid followed _by_ more underlines__. And _here_ is another ftp://url.that#also#has#underlines in it_.
                  
                  var newerTextString = replaceURL(newTextString, '#', "_");
                  alert("\n\nREVERSED: " + newerTextString);
                  // Output is:
                  // Lots of text _with underlines_ all _ through the _paragraph, but also a http://url.that.has/underlines_in-it__that.we.have.to.avoid followed _by_ more underlines__. And _here_ is another ftp://url.that_also_has_underlines in it_.
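                  For readers who want something shorter, the same masking idea can be sketched with a function passed to String.replace. This is an alternative, not the function above: maskInUrls is a made-up name, it uses the same assumption about what a URL looks like (word characters, then ://, then non-whitespace), and it assumes the scripting engine accepts a function as the replacement argument, as standard JavaScript does.

```javascript
// Sketch (hypothetical helper): replace every charFind inside URL-like runs with charRepl.
// split/join is used for the inner replacement, so charFind never has to be
// escaped as a regular expression.
function maskInUrls(text, charFind, charRepl) {
    return text.replace(/\w+:\/\/\S+/g, function (url) {
        return url.split(charFind).join(charRepl);
    });
}

var text = "Keep _this_ but mask http://a.b/c_d_e here.";
var masked = maskInUrls(text, "_", "#");
// masked is "Keep _this_ but mask http://a.b/c#d#e here."
var restored = maskInUrls(masked, "#", "_");
// restored is identical to text, because the URL contained no # to begin with
```

                  The round trip is only reversible under the same condition as before: the replacement character must not already occur in any of the URLs.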