RegEx find url **CF5**

Participant, Jun 14, 2006

I'm beating my head against a wall with this one. I think I just need a fresh pair of eyes to see where I'm going wrong.

I'm using regular expressions to verify whether a link exists on an HTML page. The page is outside of my control and will vary in coding format (e.g. <a href="http://www.domain.com">test</a>, <a class="something" href="http://www.domain.com" target=_blank>test</a>, etc.). The regex I wrote works 90% of the time, but chokes when the anchor text contains formatting tags.

I first lcase() the file and strip it of single and double quotes, so spaces are what separate attributes. Then I strip out #'s so CF doesn't choke on them. I'm also backreferencing the URL and title for validation purposes:
My regex:
<\s*a[^>]*\s*href=[^>]*(http:\/\/[www\.]*mysite.com[?!\s]*[^>]*)[^>]*>(.*?)<\s*/a\s*>

I believe it is the (.*?) that is the culprit because it only has a problem with nested html tags inside the a tags. Any ideas? I've also tried ((?!:<\s*/a)*) to no avail.
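
For anyone who wants to reproduce this, here is a minimal sketch of the preprocessing and the regex described above. It uses Python's re module purely as a stand-in, so it says nothing about how the CF5 engine handles the same pattern. The sample link and the normalize helper are hypothetical, not taken from the post.

```python
import re

# The regex from the post, copied unchanged.
LINK_RE = re.compile(
    r"<\s*a[^>]*\s*href=[^>]*(http:\/\/[www\.]*mysite.com[?!\s]*[^>]*)[^>]*>(.*?)<\s*/a\s*>"
)

def normalize(html):
    """Lowercase, then drop quotes and # characters, as the post describes."""
    return html.lower().replace('"', "").replace("'", "").replace("#", "")

# Hypothetical sample: anchor text wrapped in a formatting tag.
sample = '<A class="something" HREF="http://www.mysite.com" target=_blank><b>Test</b></A>'

match = LINK_RE.search(normalize(sample))
url, title = match.group(1), match.group(2)
print(url)    # first backreference
print(title)  # second backreference
```

In an engine that supports the lazy quantifier, the nested tag ends up inside the title group. Note that the URL group also picks up whatever attributes follow the address, because the [^>]* inside the parentheses runs to the end of the opening tag.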

Example link that should match but doesn't:

Advisor, Jun 14, 2006

Just do it in two passes.
See the attached.
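
The attachment is not reproduced in this thread. As a rough illustration of what a two-pass approach can look like (this is not the attached code), the sketch below first pulls out the href value and the raw anchor contents, then strips any markup from the contents. It is written in Python as a stand-in; the names LINK_RE, TAG_RE and find_links are hypothetical, and the pattern relies on lazy quantifiers.

```python
import re

# Pass 1: capture the href value (quoted or unquoted) and the raw anchor contents.
LINK_RE = re.compile(
    r"""<\s*a\b[^>]*?\bhref\s*=\s*["']?([^"'\s>]+)[^>]*>(.*?)<\s*/a\s*>""",
    re.IGNORECASE | re.DOTALL,
)
# Pass 2: remove any tags left inside the anchor contents.
TAG_RE = re.compile(r"<[^>]+>")

def find_links(html):
    """Return (url, plain_text_title) pairs for every link found."""
    links = []
    for m in LINK_RE.finditer(html):
        url, inner = m.group(1), m.group(2)
        links.append((url, TAG_RE.sub("", inner).strip()))
    return links

page = '<a class="x" href="http://www.mysite.com" target=_blank><b>test</b> link</a>'
print(find_links(page))
```

Keeping the tag stripping in its own pass means the first pattern never has to care what is nested inside the anchor.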


Participant, Jun 14, 2006

Thanks MikerRoo,

Your script gave the exact same result, however I think I realize what's going on now. The server I'm using is CF5. When I run my script (yours too) on MX7 it works.

BTW, I don't think there is any need to run two passes to strip the HTML in an attempt to find a match. If the regex doesn't find a match, it wouldn't do any good to strip the HTML off the first result (i.e. sLInkEssentials -> sLinksNoMarkup).

Are there known issues and workarounds with regexes on CF5?
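
One possible workaround, assuming the trouble on CF5 comes from the lazy quantifier (an assumption; nothing in this thread confirms the cause), is to avoid the lazy quantifier altogether: split the text at every closing anchor tag first, so that a plain greedy match cannot run past the end of a link. The sketch below shows the idea in Python as a stand-in; the function and pattern names are hypothetical, and it assumes anchors are not nested inside one another.

```python
import re

CLOSE_A = re.compile(r"<\s*/a\s*>", re.IGNORECASE)
# Greedy on purpose: each chunk already ends where the link body ends.
OPEN_A = re.compile(r"<\s*a\s[^>]*>(.*)$", re.IGNORECASE | re.DOTALL)

def anchor_bodies(html):
    """Return the contents of each <a ...>...</a> without using lazy quantifiers."""
    bodies = []
    chunks = CLOSE_A.split(html)
    # The last chunk is whatever follows the final closing tag, so skip it.
    for chunk in chunks[:-1]:
        m = OPEN_A.search(chunk)
        if m:
            bodies.append(m.group(1))
    return bodies

page = "<a href=one><b>first</b></a> and <a href=two>second</a> tail"
print(anchor_bodies(page))
```

The same split-then-match idea should carry over to any engine that offers only greedy quantifiers, though the exact functions available on CF5 would need checking.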

New Here, Jun 14, 2006

SolutionFinder:

The construct (.*?) looks very strange for a regular expression.

. matches any character
* matches 0 or more times
? matches 0 or 1 times

Remember that the ? character has special meaning in regular expressions. I think in your case you can leave it out altogether unless you are trying to match an actual question mark, in which case you would escape it: "\?".

Advisor, Jun 14, 2006

If you carefully run the code I supplied you'll see that it does not give the same result as your regex.

Yours left the title html untouched and also returned extra garbage like target="_blank".

Anyway, I no longer support CF5 unless someone pays me. You might start a new thread making it clear that you are using CF5.

Participant, Jun 14, 2006

MikerRoo,

I strip the HTML before I dump it in the database; however, that is cosmetic only and has zero effect on the actual matching of either regular expression, which is what the problem is.

Your regex works very nicely, and I'll be implementing a modified version to do the trick (I can't assume the attribute values are always enclosed in quotes).

*no longer support CF5 unless someone pays me...* Must be nice :)

Healey_Mark, thanks for the reply.

The .*? makes a "lazy star" match, matching the minimum possible match. This means it will match until the next step in the expression. For example: <a[^>]*>(.*?)</a> would return everything between each pair of A tags (<a></a>).

If I used .* instead, the match would be greedy: it would grab as much as possible from that point, running through to the last </a> it can reach rather than stopping at the first one.
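
The difference between the two quantifiers is easy to see side by side. This is a small sketch in Python's re module, used only as a stand-in for an engine that supports the lazy form; the sample string is made up.

```python
import re

html = "<a href=one>first</a> and <a href=two>second</a>"

# Lazy: stops at the first closing tag after each opening tag.
lazy = re.findall(r"<a[^>]*>(.*?)</a>", html)

# Greedy: runs on to the last closing tag it can reach.
greedy = re.findall(r"<a[^>]*>(.*)</a>", html)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</a> and <a href=two>second']
```

With the lazy form each link is captured on its own; with the greedy form the two links collapse into a single capture.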
