2 Replies Latest reply on Oct 12, 2009 12:22 PM by msakrejda

    Problem with regular expressions

    michael nieuwenhuizen

      Hello, I'm using a regular expression to translate an external url to an internal link.


          str = str.replace(/<a href="([A-Z]+.*).html">/ig, '<a href="event:\$1">');


      This would - if all goes well - translate <a href="somepage.html"> to <a href="event:somepage">.  Thing is though, if I run the script over this string:


           str = '<a href="somelink.html">link1</a>

                   <a href="anotherlink.html">link2</a>

                   <a href="athirdlink.html">link3</a>';


      the output is quite unexpectedly:


           <a href="event:somelink.html">link1</a>

           <a href="anotherlink.html">link2</a>

           <a href="athirdlink">link3</a>


      All three links have changed differently, and all have translated in the wrong way.  Anyone see what goes wrong?

        • 1. Re: Problem with regular expressions
          UbuntuPenguin Level 4

          Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

            Jamie Zawinski

          • 2. Re: Problem with regular expressions
            msakrejda Level 4

            That's a cute quote, but this is exactly the sort of thing that regular expressions *are* good for.


            The only problem here is that they are a little tricky. Regular expression quantifiers (namely, the '*' in your '.*') are greedy, meaning they will "eat" as many characters as they can while still letting the rest of the expression match. The problem you're seeing is because the parenthetical group is matching everything after 'href="' up to the *last* '.html">' in the string (*not* the first one, as you are expecting), which is why only the first link gains 'event:' and only the last link loses '.html'. The simplest way to fix this is to be more conservative about the exact page names you're going to accept: if none of them have embedded periods (i.e., you do not have any links named "foo.bar.html"), you can just change your RE to


            /<a href="([A-Z]+[^\.]*)\.html">/ig


            That is, page names start with a character in the range [A-Z] (though case insensitive, as you specify later) followed by zero or more "not a period" characters, followed by ".html" (note that you do have to escape the period).
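
To see the difference side by side, here is a minimal sketch in plain JavaScript, on the assumption that its regex literals and String.replace behave like the ActionScript ones used in this thread. The input string and the variable names are made up for illustration, and the three links are placed on one line so that '.' can run across them.

```javascript
// Hypothetical input: three links on a single line.
const input = '<a href="somelink.html">link1</a> ' +
              '<a href="anotherlink.html">link2</a> ' +
              '<a href="athirdlink.html">link3</a>';

// Original pattern: the greedy '.*' runs on to the last '.html">' in the string,
// so the whole line is consumed by a single match.
const greedy = /<a href="([A-Z]+.*).html">/ig;

// Suggested pattern: '[^\.]*' cannot cross a period, so each match
// has to end at its own '.html">'.
const noPeriods = /<a href="([A-Z]+[^\.]*)\.html">/ig;

const greedyResult = input.replace(greedy, '<a href="event:$1">');
const fixedResult = input.replace(noPeriods, '<a href="event:$1">');

console.log(greedyResult);
console.log(fixedResult);
```

With the greedy pattern there is exactly one match, which reproduces the odd output from the question; with the negated character class each link is matched and rewritten separately.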


            Alternately, if Flex REs support lazy matching (which I *think* they do, although I'm not certain), you can just change the RE to


            /<a href="([A-Z]+.*?)\.html">/ig


            Meaning, page names start with [A-Z], followed by as few characters as possible: the '?' after '.*' makes it lazy, so the group stops at the first '.html">' it can reach rather than the last.
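
A quick sketch of the lazy variant, again in plain JavaScript and again assuming the ActionScript regex engine behaves the same way for this pattern. The sample strings and variable names are hypothetical. It also shows the one practical difference from the "no periods" version: the lazy pattern still handles a page name with an embedded period.

```javascript
// Hypothetical inputs, each on a single line.
const input = '<a href="somelink.html">link1</a> <a href="anotherlink.html">link2</a>';
const dotted = '<a href="foo.bar.html">x</a>';

// Lazy pattern: '.*?' grows one character at a time and stops
// as soon as '\.html">' can match.
const lazy = /<a href="([A-Z]+.*?)\.html">/ig;

// "No periods" pattern, for comparison on the dotted name.
const noPeriods = /<a href="([A-Z]+[^\.]*)\.html">/ig;

const lazyResult = input.replace(lazy, '<a href="event:$1">');
const lazyDotted = dotted.replace(lazy, '<a href="event:$1">');
const strictDotted = dotted.replace(noPeriods, '<a href="event:$1">');

console.log(lazyResult);   // both links rewritten separately
console.log(lazyDotted);   // 'foo.bar' is captured as the page name
console.log(strictDotted); // left unchanged, since the name contains a period
```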