3 Replies Latest reply on Jun 19, 2013 8:52 AM by Balance

    RegEx Help

    Balance Level 1

      Hello,

       

      I have the following function:

       

      <cffunction name="stripHREFs" access="public" returntype="array" output="no" hint="Separate links from a given HTML string and return them as an array">

       

      <cfargument name="html" required="yes">

          <cfset local.startpos = 1>

          <cfset local.list = ArrayNew(1)>

         

          <cfloop condition="local.startpos GREATER THAN 0">

          <cfset local.linkpos = reFindNoCase('<a\b[^>]*>(.*?)</a>',arguments.html,local.startpos,'true')>

       

          <cfif val(local.linkpos.len[1])>

                    <cfset local.startpos = local.linkpos.len[1]+local.linkpos.pos[1]>

                    <cfset local.string = mid(arguments.html,local.linkpos.pos[1],local.linkpos.len[1])>

              <cfset local.hrefpos = reFindNoCase('(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+##]*[\w\-\@?^=%&/~\+##])?',local.string,1,'true')>

          <cfif val(local.hrefpos.pos[1])>

                      <cfset local.this.a = mid(local.string,local.hrefpos.pos[1],local.hrefpos.len[1])>               

                      <cfset local.this.title = reReplacenocase(local.string,'<a\b[^>]*.>',"")>

                      <cfset local.this.title = reReplacenocase(local.this.title,'</a*>',"")>

                      <cfset ArrayAppend(local.list,local.this)>

                      <cfset StructDelete(local,'this')>

                  </cfif>

      <cfelse>

               <cfbreak>

      </cfif>

          </cfloop>

         

      <cfreturn local.list>

      </cffunction>

       

      It works great, except now my client has decided to include links with an additional attribute called "alias".

      The code looks like this: <a href="http://www.acme.com" alias="foo">click me</a>

      How can I pull out the "alias" attribute?

       

      TIA

        • 1. Re: RegEx Help
          cherdt Level 1

          You can include the alias attribute in much the same way you are including the URL and the link title.

           

          For example, to find out if an alias attribute exists, you could add this inside the loop:

          <cfset local.alias = REFind('alias="([^"]+)"',local.string,1,true)>

           

          Then, if it exists, add it to the struct before you append it to the array:

          <cfif val(local.alias.pos[1])>

              <cfset local.this.alias = REReplace(mid(local.string,local.alias.pos[1],local.alias.len[1]),'(alias=)?"','','ALL')>

          </cfif>

           

          The example above makes certain assumptions, e.g. that the alias attribute is in lowercase and that the attribute value is enclosed in double quotes. You may need to adjust it if your client's input does not fit that format.
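
If the input turns out to be less uniform than that, a more tolerant pattern is one option. This is only a sketch: it assumes the value is always quoted (single or double quotes) and contains no quote characters itself, and that local.string holds the anchor tag text from the loop, as above.

```cfml
<!--- Sketch only: tolerate ALIAS/alias, spaces around "=", and either quote style. --->
<!--- Assumes the value is quoted and contains no quote characters itself. --->
<cfset local.alias = REFindNoCase('\balias\s*=\s*["'']([^"'']+)["'']', local.string, 1, true)>

<cfif val(local.alias.pos[1])>
    <!--- pos[2]/len[2] describe the first subexpression, i.e. the value without quotes. --->
    <cfset local.this.alias = mid(local.string, local.alias.pos[2], local.alias.len[2])>
</cfif>
```

Reading the subexpression directly also avoids the second REReplace pass over the matched text.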

          • 2. Re: RegEx Help
            Balance Level 1

            It almost works but it's outputting this:

             

            local.string = <a href="http://www.google.com" alias="my link alias">learn more</a>

             

            local.aliaspos.pos[1] = 33

            local.aliaspos.len[1] = 21

             

            local.this.alias = alias=my link alias

             

            Should be:

            local.this.alias = my link alias

             

             

            Here's the updated code:

             

            <cfloop condition="local.startpos GREATER THAN 0">

                      <cfset local.linkpos = reFindNoCase('<a\b[^>]*>(.*?)</a>',variables.html,local.startpos,'true')>

             

                      <cfif val(local.linkpos.len[1])>

                                <cfset local.startpos = local.linkpos.len[1]+local.linkpos.pos[1]>

                                <cfset local.string = mid(variables.html,local.linkpos.pos[1],local.linkpos.len[1])>

                                <cfoutput>local.string = <xmp>#local.string#</xmp><br></cfoutput>

             

                                <cfset local.hrefpos = reFindNoCase('(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+##]*[\w\-\@?^=%&/~\+##])?',local.string,1,'true')>

             

                                <cfset local.aliaspos = reFind('alias="([^"]+)"',local.string,1,'true')>

                                <cfoutput>

                                          local.aliaspos.pos[1] = #local.aliaspos.pos[1]#<br>

                                          local.aliaspos.len[1] = #local.aliaspos.len[1]#<br>

                                </cfoutput><br>

                                <cfif val(local.hrefpos.pos[1])>

                                          <cfset local.this.href = mid(local.string,local.hrefpos.pos[1],local.hrefpos.len[1])>

                                          <cfif val(local.aliaspos.pos[1])>

                                                    <cfset local.this.alias = reReplaceNoCase(mid(local.string,local.aliaspos.pos[1],local.aliaspos.len[1]),'(a lias=)?"','','all')>

                                                    <cfoutput>

                                                    <p>

                                                    local.this.alias = #local.this.alias#

                                                    </p>

                                                    </cfoutput>

                                          </cfif>

                                          <cfset local.this.title = reReplaceNoCase(local.string,'<a\b[^>]*.>',"")>

                                          <cfset local.this.title = reReplaceNoCase(local.this.title,'</a*>',"")>

                                          <cfset ArrayAppend(local.list,local.this)>

                                          <cfset StructDelete(local,'this')>

                                </cfif>

                                <cfelse>

                                          <cfbreak>

                      </cfif>

            </cfloop>

             

            I think the REFind() needs a little tweaking so local.aliaspos.pos[1] is 31 and not 33.

             

            Thanks
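
For what it's worth, 33 looks like the correct start of the match for the sample string: position 31 is the closing quote of the href value and position 32 is the space before "alias". The leftover "alias=" more likely comes from the cleanup pattern in the reReplaceNoCase() call, which as posted contains a space ('(a lias=)?"') and so can only strip the two quotes. One way to sidestep the cleanup entirely is to read the first subexpression from the reFind() result. A minimal sketch, with local.string hard-coded to the sample from this post:

```cfml
<!--- Sketch only: local.string is hard-coded here to the sample from this thread. --->
<cfset local.string = '<a href="http://www.google.com" alias="my link alias">learn more</a>'>
<cfset local.aliaspos = reFind('alias="([^"]+)"', local.string, 1, true)>

<cfif val(local.aliaspos.pos[1])>
    <!--- pos[1]/len[1] cover the whole match, alias="my link alias" (33 and 21 here). --->
    <!--- pos[2]/len[2] cover only the parenthesised part, so no cleanup is needed. --->
    <cfset local.this.alias = mid(local.string, local.aliaspos.pos[2], local.aliaspos.len[2])>
    <cfoutput>local.this.alias = #local.this.alias#</cfoutput>
</cfif>
```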

            • 3. Re: RegEx Help
              Balance Level 1

              I ended up replacing all this RegEx nonsense with the awesome http://jsoup.org/ library and everything works like a charm! Thanks
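
The final code was not posted, so the following is only an illustration of what a jsoup-based version might look like from CFML, not the poster's actual solution. It assumes the jsoup jar can be loaded by ColdFusion and that the HTML is in arguments.html, as in the original function; local.item, local.el and local.links are names made up for this sketch.

```cfml
<!--- Sketch only: requires the jsoup jar to be loadable by ColdFusion. --->
<cfset local.list = ArrayNew(1)>
<cfset local.doc = createObject("java", "org.jsoup.Jsoup").parse(arguments.html)>
<!--- select("a[href]") returns every anchor element that has an href attribute. --->
<cfset local.links = local.doc.select("a[href]")>

<cfloop from="0" to="#local.links.size() - 1#" index="local.i">
    <cfset local.el = local.links.get(javaCast("int", local.i))>
    <cfset local.item = StructNew()>
    <cfset local.item.a = local.el.attr("href")>
    <!--- attr() returns an empty string when the attribute is absent. --->
    <cfset local.item.alias = local.el.attr("alias")>
    <cfset local.item.title = local.el.text()>
    <cfset ArrayAppend(local.list, local.item)>
</cfloop>
```

Because jsoup parses the markup rather than pattern-matching it, attribute order, quoting style and letter case stop mattering.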