8 Replies Latest reply on Feb 9, 2011 11:11 AM by ufitzi

    Extract email address from html

    emmim44 Level 1


      I am trying to extract an email address from HTML held in a query column. How would I do that?


      I am on CF9.



      Query col1:

      <html><head></head><body>today they emailed about it from (mailto:xxx@hotmail.com) ...hello there and here</body></html>

        • 1. Re: Extract email address from html
          ilssac Level 5

          Regular Expressions are often the tool to use for that kind of string manipulation.


          ColdFusion has the reFind() and reReplace() functions to tap into a large part of the power of Regular Expressions.
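
          As a minimal sketch of the reFind() approach, assuming the address always follows a mailto: prefix as in the sample from the question (the variable names are illustrative, and the set of delimiter characters is an assumption):

          ```cfml
          <cfscript>
          // Sample input copied from the question.
          html = "<html><head></head><body>today they emailed about it from (mailto:xxx@hotmail.com) ...hello there and here</body></html>";

          // With the fourth argument set to true, reFindNoCase() returns a struct
          // holding "pos" and "len" arrays instead of a single position.
          match = reFindNoCase("mailto:([^\s<>()""]+@[^\s<>()""]+)", html, 1, true);

          if (match.pos[1] gt 0) {
              // Element 1 describes the whole match; element 2 the first parenthesised group.
              email = mid(html, match.pos[2], match.len[2]);
              writeOutput(email);
          }
          </cfscript>
          ```

          The pattern deliberately does not try to validate the address; it only takes the run of characters around the @ that follows mailto:.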

          • 2. Re: Extract email address from html
            emmim44 Level 1

            I cannot set up the regular expression. I need a sample.

            • 3. Re: Extract email address from html
              JR "Bob" Dobbs Level 4

              Here are some resources to help get you started using regular expressions:


              The CF documentation

              http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0a38f-7ffb.html


              Tutorial website



              Ben Forta's ColdFusion books have coverage of regular expressions, at least in the CF6,7, and 8 editions that I own.


              • 4. Re: Extract email address from html
                ufitzi Level 1

                Here's a function I wrote for use on some of our CF sites:


                <cffunction access="public" name="isEmailAddressValid" returntype="boolean">
                    <cfargument name="email" type="string" required="yes">

                    <!--- Match the entire argument against the pattern, ignoring case. --->
                    <cfif refindnocase("^([_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.(([a-z]{2,3})|(aero|coop|info|museum|name)))?$",arguments.email) neq 0>
                        <cfreturn true>
                    <cfelse>
                        <cfreturn false>
                    </cfif>
                </cffunction>

                This should get you started. Along with the referenced CF documentation, you should be able to use it to extract an address.

                Good luck!


                • 5. Re: Extract email address from html
                  Adam Cameron. Level 5

                  Argh!  No!


                  God I hate it when people knock together a regex like this and go "Look!  Email address validation!"


                  Before one starts down this road, one should read the RFC (http://tools.ietf.org/html/rfc5322, summarised here: http://en.wikipedia.org/wiki/Email_address).


                  Your own regex fails my spamtrap email address (for example: adam.cameron.signup+adobeforums@gmail.com), because you've forgotten that a + is a legitimate character in the local part of an email address.  Along with a bunch of other completely legit characters.


                  Reading on through the RFC you will realise that ANYTHING is valid in the local part of an email address, provided it's quoted (double-quote being another character your regex doesn't accept).


                  If someone doesn't want to give you their valid email address, they won't.  I can give you adam@notmyaddress.com, and that will pass.  If I do want to give you my address, you should make sure your code will actually accept it!


                  I can understand wanting to make sure the punter doesn't key their email address in incorrectly, but your method doesn't help here.  It'd pass adan@ismyaddress.com, despite the fact that it should be adam@ismyaddress.com.  "Close" is not good enough in these cases.


                  The only sensible way of doing this is to ask them to type it in twice.  This will assist people who don't just roll their eyes and copy and paste what they typed in the first box into the second box, wondering why you're wasting their time.  For those who do, the typo is simply carried over, so it's no help.


                  If you really want to get a person's email address, deprive them of something until they respond to an email that you send them.  At the email address they specified. Because they actually don't mind you having their email address.  This only works if you're not simply trying to harvest email addresses for your own benefit, and not the benefit of your subscribers.


                  Bottom line: email address validation is a mug's game, and one not often played by people who know the rules.



                  • 6. Re: Extract email address from html
                    ufitzi Level 1

                    Listen, congrats on your thesis, man.


                    My function will get him started, you've yet to provide anything to help get the guy going.


                    He's asking about EXTRACTING email addresses from a lengthy string of HTML.

                    Your advice on "entering twice" is moot in this regard.


                    Instead of getting excited about my apparently insufficient regex, why don't you read the original request and try HELPING?

                    • 7. Re: Extract email address from html
                      Adam Cameron. Level 5

                      Oh, don't get your tits in a tangle because I observed a shortcoming in your approach to something.


                      Your technique of using a regex to extract it is sound: I had nothing to add to that part of things, other than - obviously - the regex is too limited to be useful.


                      However, given the sample mark-up, it's going to be difficult to reliably extract the email address by starting from the assumption that a simple-ish pattern can match an email address, because that assumption is bogus.


                      I think in the given situation, if the email address is simply floating around within other text with nothing else to delimit it, then perhaps just extract a pattern that is a run of characters between whitespace chars, eg: \s*(.+@.+)\s* (and pull out the match for the sub-expression), or something along those lines.  It can't really get any more precise than that, and it will possibly throw up some false positives, but at least it won't exclude valid email addresses.
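
                      A hedged sketch of that idea using reMatch() (available since CF8). It swaps the .+ for a negated character class, since a greedy .+ would also run across spaces and tags. extractEmailCandidates is a hypothetical helper name, and the set of delimiter characters is an assumption based on the sample mark-up:

                      ```cfml
                      <cfscript>
                      // Hypothetical helper: returns every run of characters that contains an "@"
                      // and is bounded by whitespace, angle brackets, parentheses or double quotes.
                      function extractEmailCandidates(text) {
                          var candidates = reMatch("[^\s<>()""]+@[^\s<>()""]+", text);
                          var i = 0;
                          for (i = 1; i lte arrayLen(candidates); i = i + 1) {
                              // Drop a leading "mailto:" scheme, as seen in the sample mark-up.
                              candidates[i] = reReplaceNoCase(candidates[i], "^mailto:", "");
                          }
                          return candidates;
                      }

                      // Usage: pass in the HTML string from the query column.
                      // emails = extractEmailCandidates(myQuery.col1);
                      </cfscript>
                      ```

                      This keeps the "don't exclude valid addresses" property at the cost of false positives, so the candidates still need checking by some other means (for example, a confirmation email).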




                      • 8. Re: Extract email address from html
                        ufitzi Level 1

                        Ok, fair enough, I'm defensive of my code.


                        Let's take my original example, your reference to the RFC and the additional allowed characters, and say that the two combined will provide a pretty good start to his problem.


                        Hopefully this discussion proves useful to someone.