I'm beating my head against a wall with this one. I think I
just need a fresh pair of eyes to see where I'm going wrong.
I'm using regular expressions to verify whether a link exists on an
HTML page. The page is outside of my control and will vary in
coding format (e.g. <a href="http://www.domain.com">test</a>,
<a class="something" href="http://www.domain.com"
target=_blank>test</a>, etc.). The regex I wrote works
90% of the time, but chokes when the anchor text contains
formatting tags.
I first lcase() and strip the file of single and double
quotes, so spaces are used to separate attributes. Then I strip out
#'s so CF doesn't choke on them. I'm also backreferencing the url
and title for validation purposes:
My regex:
<\s*a[^>]*\s*href=[^>]*(http:\/\/[www\.]*mysite.com[?!\s]*[^>]*)[^>]*>(.*?)<\s*/a\s*>
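To make the steps above concrete, here is a rough sketch of the same pipeline in Python rather than CF. The function names normalize and find_link are made up for illustration, and Python's regex engine is not the same as ColdFusion's, so this only shows the preprocessing and the two capture groups. It is not a reproduction of what CF does with the pattern.

```python
import re

# The pattern from the post, unchanged. Group 1 is the url, group 2 is the
# anchor text ("title").
LINK_RE = re.compile(
    r"<\s*a[^>]*\s*href=[^>]*(http:\/\/[www\.]*mysite.com[?!\s]*[^>]*)[^>]*>(.*?)<\s*/a\s*>"
)


def normalize(html: str) -> str:
    """Lowercase the page, then strip single quotes, double quotes and #'s."""
    text = html.lower()
    text = text.replace('"', "").replace("'", "")
    return text.replace("#", "")


def find_link(html: str):
    """Return (url_group, anchor_text) for the first match, or None."""
    match = LINK_RE.search(normalize(html))
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    sample = '<A class="something" HREF="http://www.mysite.com" target=_blank>Test</A>'
    print(normalize(sample))
    print(find_link(sample))
```

Note that in this sketch the url group also swallows whatever attributes follow the href, because of the trailing [^>]* inside the group.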
I believe it is the (.*?) that is the culprit because it only
has a problem with nested html tags inside the a tags. Any ideas?
I've also tried ((?!:<\s*/a)*) to no avail.
Example link that should match but doesn't: