5 Replies Latest reply on Jun 14, 2006 3:40 PM by Abram Adams

    RegEx find url **CF5**

    Abram Adams Level 1
      I'm beating my head against a wall with this one. I think I just need a fresh pair of eyes to see where I'm going wrong.

      I'm using regular expressions to verify whether a link exists on an HTML page. The page is outside of my control and will vary in coding format (e.g. a href="http://www.domain.com">test</a>, a class="something" href="http://www.domain.com" target=_blank>test</a>, etc.). The regex I wrote works 90% of the time, but chokes when the anchor text contains formatting tags.

      I first lcase() the file and strip out single and double quotes, so spaces are used to separate attributes. Then I strip out # characters so CF doesn't choke on them. I'm also backreferencing the url and title for validation purposes:
      My regex:

      I believe it is the (.*?) that is the culprit because it only has a problem with nested html tags inside the a tags. Any ideas? I've also tried ((?!:<\s*/a)*) to no avail.
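
The regex from the post is not preserved, so the following is only a minimal sketch of the approach described above, written with Python's `re` module rather than ColdFusion's REFind (an assumption made for the sake of a runnable example). The names `normalize`, `LINK_RE`, and `find_links` are hypothetical. It lowercases the page, drops quotes and # characters, and then uses a negative lookahead applied one character at a time, `(?:(?!<\s*/a).)*`, so that nested formatting tags stay inside the anchor-text group. Note that a lookahead is written `(?!...)`; in `(?!:...)` the colon is treated as a literal character, and repeating the bare assertion consumes nothing.

```python
import re


def normalize(html: str) -> str:
    """Lowercase the page and drop quotes and # characters, as the post describes."""
    return re.sub(r"[\"'#]", "", html.lower())


# Hypothetical pattern, not the one from the original post.
# Group 1 is the URL, group 2 is the raw anchor content.
# (?:(?!<\s*/a).)* consumes one character at a time for as long as a closing
# </a> does not start at the current position, so nested tags are allowed.
LINK_RE = re.compile(
    r"<a\s[^>]*?href=\s*([^\s>]+)[^>]*>((?:(?!<\s*/a).)*)<\s*/a\s*>",
    re.DOTALL,
)


def find_links(html: str) -> list[tuple[str, str]]:
    """Return (url, anchor_html) pairs found in the normalized page."""
    return LINK_RE.findall(normalize(html))


sample = '<A class="x" HREF="http://www.domain.com" target=_blank><b>Test</b> link</A>'
print(find_links(sample))
```

This pattern relies on lookahead support, so it would only be an option on an engine that provides it.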

      Example link that should match but doesn't:
        • 1. Re: RegEx find url
          MikerRoo Level 1
          Just do it in two passes.
          See the attached.
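
The attachment is not preserved in this thread. As a rough illustration only, a two-pass approach could look like the sketch below, written in Python rather than CFML. The names `ANCHOR_RE`, `TAG_RE`, and `extract_links` are hypothetical and are not taken from the attached code. The first pass captures the href value and the raw anchor content; the second pass strips any markup from that content.

```python
import re

# Pass 1: capture the href value (quoted or unquoted) and the raw inner HTML.
ANCHOR_RE = re.compile(
    r"<a\s[^>]*?href\s*=\s*[\"']?\s*([^\"'\s>]+)[^>]*>(.*?)</a\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Pass 2: remove any tags left inside the captured anchor text.
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return (url, plain_text_title) pairs for every anchor in the page."""
    links = []
    for url, inner in ANCHOR_RE.findall(html):
        links.append((url, TAG_RE.sub("", inner).strip()))
    return links


page = '<a class="something" href="http://www.domain.com" target=_blank><b>test</b></a>'
print(extract_links(page))
```

Keeping the markup removal in its own pass means the matching pattern does not have to describe every possible nested tag.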

          • 2. Re: RegEx find url
            Abram Adams Level 1
            Thanks MikerRoo,

            Your script gave the exact same result, however I think I realize what's going on now. The server I'm using is CF5. When I run my script (yours too) on MX7 it works.

            BTW, I don't think there is any need to run two passes to strip the HTML in an attempt to find a match. If the regex doesn't find a match, it wouldn't do any good to strip the HTML off the first result (i.e. sLInkEssentials -> sLinksNoMarkup).

            Are there known issues with regexes on CF5, and are there workarounds?

            • 3. Re: RegEx find url

              The construct (.*?) looks very strange for a regular expression.

              . - match any character
              * - match 0 or more times
              ? - match 0 or 1 times

              Remember that the ? character has special meaning in regular expressions. I think in your case you can leave it out altogether unless you are trying to match an actual question mark, in which case you would escape it "\?"
              • 4. Re: RegEx find url
                MikerRoo Level 1
                If you carefully run the code I supplied you'll see that it does not give the same result as your regex.

                Yours left the title html untouched and also returned extra garbage like target="_blank".

                Anyway, I no longer support CF5 unless someone pays me. You might start a new thread making it clear that you are using CF5.
                • 5. RegEx find url
                  Abram Adams Level 1

                  I strip the HTML before I dump it in the database; however, that is cosmetic only and has zero effect on the actual matching of either regular expression, which is what the problem is.

                  Your regex works very nicely, and I'll be implementing a modified version to do the trick (I can't assume the attribute values are always enclosed in quotes).

                  *no longer support CF5 unless someone pays me...* Must be nice :)

                  Healey_Mark, thanks for the reply.

                  The .*? is a "lazy star": it matches as few characters as possible, stopping as soon as the rest of the expression can match. For example, <a[^>]*>(.*?)</a> would return everything between each pair of A tags (<a></a>).

                  If I used .* instead, the match would be greedy: it would grab everything from that point to the end of the file and then back off only as far as the last </a>.
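
The difference between the lazy and the greedy star can be seen in a small sketch. It uses Python's `re` module as a stand-in (an assumption; the thread itself concerns ColdFusion's regex functions), and the sample string is made up for illustration.

```python
import re

# Made-up sample with two anchors on one line.
html = "<a href=one>first</a> and <a href=two>second</a>"

# Lazy star: stops at the first </a> that lets the pattern match.
lazy = re.findall(r"<a[^>]*>(.*?)</a>", html)

# Greedy star: runs to the end, then backs off to the last </a>.
greedy = re.findall(r"<a[^>]*>(.*)</a>", html)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</a> and <a href=two>second']
```

With the lazy form each anchor is returned separately; with the greedy form both anchors collapse into a single match.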