14 Replies Latest reply: Aug 2, 2010 4:53 PM by iBabs2

    I have a GREP style question

    iBabs2 Community Member

      Hello,

       

      I am playing with a GREP style in CS4.

      I have a paragraph style called body copy and a character style called email.

      I want every email address to show up in the character style email.

       

      In my GREP style, I have only been able to get the @ symbol and everything after that, but not the type in front of it.

      So, for example, in the email address babs@ibabs.com, I can only figure out how to get the @ibabstraining to change, but not the babs before the @ symbol. So, if I have lots of email addresses to change, they will all have something different after the @ symbol.

       

      What is the best way to handle this one?

       

      Thanks

      babs.....

        • 1. Re: I have a GREP style question
          [Jongware] MVP

          It might be easier than you think. You already have wildcards after the '@' (I think), but you can use them before it as well!

           

          Consider this string: "An example for our budding girl Babs", in which we want to find all words containing "dd". You would need something like 'any number of characters, then "dd", then some more characters'. Now if you use any character (the period, in GREP syntax), as in .+dd.+, it would return the entire string "An example for our budding girl Babs". A bit more than you asked for, so you probably should not use "any character". That's because the spaces themselves also count as 'any' character ... Now you could use something like "[a-z]" (everything from a to z), or "\w" (any "word" character: A to Z, a to z, 0 to 9, the underscore, plus all accented characters) but in this case, you probably just want 'anything that's not a space'. The canonical notation for 'not x' is this:

           

          [^x]

           

          -- where the 'x' may be anything, for example, a single space: [^ ]. Adding a plus after it repeats it as much as possible: [^ ]+ will match anything comprised of non-spaces, and [^ ]+dd[^ ]+ will match anything not-a-space, "dd", more-not-a-spaces.

           

          But wait! It gets Better! Checking for a single space is something that occurs frequently, but you can have lots of different "space" characters: the non-breaking space, hard returns (those are also 'white space' characters), tabs, en-space, em-space, thin space, sixth space ... the list goes on and on. Fortunately, the "Gets Better" part is that GREP has a shorthand notation for 'any kind of white space' -- \s.

           

          So, [^\s]+dd[^\s]+ comes closer -- but that's not even the perfect solution!

           

          Checking for Not A Space is also a frequently occurring thing, and GREP even has a shortcut for that: \S

          So, \S+dd\S+ is all that's necessary to find words containing 'dd', and I bet you can find out how you can translate that into Finding Email Addresses.

           

          You probably can guess now what \D, \W, \L and \U will find too.
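
          The three patterns above can be compared side by side. This is only a sketch using Python's re module as a stand-in for InDesign's GREP engine (an assumption; the two engines are not identical, but ".", "[^ ]" and "\S" behave the same way in both for this example). The tabbed sample string is made up for illustration.

```python
import re

text = "An example for our budding girl Babs"

# "." also matches spaces, so the match spreads over the whole sentence.
any_char = re.findall(r".+dd.+", text)

# "[^ ]" excludes only the plain space character.
not_space = re.findall(r"[^ ]+dd[^ ]+", text)

# "\S" excludes every kind of white space (tabs, line breaks, and so on).
non_white = re.findall(r"\S+dd\S+", text)

# Hypothetical sample with a tab instead of a space after "budding".
tabbed = "our budding\tgirl"
not_space_tabbed = re.findall(r"[^ ]+dd[^ ]+", tabbed)
non_white_tabbed = re.findall(r"\S+dd\S+", tabbed)

print(any_char)          # the whole sentence
print(not_space)         # just the word
print(non_white)         # just the word
print(not_space_tabbed)  # runs across the tab
print(non_white_tabbed)  # stops at the tab
```

          The tabbed sample is where "[^ ]" and "\S" part ways: only "\S" stops at the tab.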

          • 2. Re: I have a GREP style question
            iBabs2 Community Member

            HI Jongware!!

             

            This is great information....there were a few things I played with, but this is the one I finally settled on.

             

            [a-z]+@[a-z]+

             

            love it ;-)

            thanks

            babs

            • 3. Re: I have a GREP style question
              Joel Cherney MVP

              Well, I hope that none of your email addresses have numbers in 'em, iBabs2. Not to mention dashes, periods, or other valid non-letter entities. E.g. joel.cherney@fake-address.com

               

              Actually, one of the first things I learned to do a decade ago when I actually knew how to make good GREP queries was to hunt for email addresses in a large body of text. It's deeper than it looks...
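
              To see what [a-z]+@[a-z]+ actually picks up, here is a small sketch in Python's re module (a stand-in for InDesign's GREP, which is an assumption). The helper name styled_part and the address babs2010@ibabs.com are made up for illustration.

```python
import re

SIMPLE = re.compile(r"[a-z]+@[a-z]+")

def styled_part(address):
    """Return the part of the address the simple pattern would match, or None."""
    match = SIMPLE.search(address)
    return match.group(0) if match else None

for address in ("babs@ibabs.com",
                "joel.cherney@fake-address.com",
                "babs2010@ibabs.com"):  # hypothetical address with digits
    print(address, "->", styled_part(address))
```

              Only lowercase letters directly around the @ are caught: the ".com" is left unstyled, the match stops at periods and hyphens, and an address with digits right before the @ is not matched at all.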

              • 4. Re: I have a GREP style question
                iBabs2 Community Member

                ;-(

                 

                I did not think of that......

                • 5. Re: I have a GREP style question
                  Joel Cherney MVP

                  Fortunately, because it's the traditional tooth-cutting Perl newbie task, the internet is littered with decent greps to find email addresses. For example the first Google result for "email grep" yields

                  [[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*

                   

                  Throw that into Jongware's WhatTheGrep script, and you'll see that... it uses techniques that WhatTheGrep doesn't understand. It may not be a completely valid InDesign grep, but it works to find many varieties of email address, and some invalid ones. The places where it diverges from what I understand to be good InDesign grep query construction are, firstly, that it uses the old-fashioned Posix-style [[:alnum:]], which is roughly equivalent to [A-Za-z0-9] and finds any alphanumeric character, one at a time. Then, in between the first and second closing square brackets, it adds some more characters: a plus sign (inside the brackets the + is a literal character, not a repeat) and an escaped period, underscore, and hyphen - all of which are valid characters in an email address. Then, after the second closing bracket, it has an asterisk - which matches "zero or more times." So, this grep would find

                  joel.cherney@

                  and

                  @notanemail.address

                  which are not valid email addresses. So, I'll change it to

                  [[:alnum:]+\.\_\-]+@[[:alnum:]+\.\_\-]+

                  and now it finds only strings of those characters on either side of an at sign. (The plus after the closing bracket means "one or more times.") It still finds invalid email addresses, like

                  joel.cherney@invalid.address

                  At least it'll find any string with an at sign in the middle, and alphanumeric strings on either side of the at sign, even those interrupted by dashes, underscores, and periods, but not those interrupted by colons, dingbats, or other characters which would make an invalid email address. It still finds

                  joelcherney@invalidaddress

                   

                  so it's still not all the way there. A really useful grep would only find actually valid email addresses - but then we'd have to gin up some way to look only for valid top-level domains (.com, .uk, .ru, et cetera) preceded by a period, and there are too many of those for me to think of a good way to grep for those without getting a decent list of all valid TLDs and then figuring out what minimal query would find 'em. Also, I have to go do some actual work at the moment, so while I'd love to bang my head against this until I have the perfect email-address-finding grep, I have to drop this and move onto formatting some Tibetan.

                   

                  So, yeah, it's deep.
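
                  The difference between the asterisk and the plus can be checked with a short sketch in Python's re module. Python does not support [[:alnum:]], so the class below uses the ASCII equivalent [A-Za-z0-9] mentioned above; that substitution, and the address joel+news@invalid.address, are assumptions for illustration.

```python
import re

# ASCII stand-in for [[:alnum:]+\.\_\-]; the "+" inside the brackets
# is a literal plus sign, not a repeat operator.
CHARS = r"[A-Za-z0-9+._-]"

STAR = re.compile(CHARS + "*@" + CHARS + "*")  # zero or more on each side
PLUS = re.compile(CHARS + "+@" + CHARS + "+")  # one or more on each side

samples = [
    "joel.cherney@",
    "@notanemail.address",
    "joel.cherney@invalid.address",
    "joelcherney@invalidaddress",
    "joel+news@invalid.address",  # hypothetical address with a literal plus
]

results = {s: (bool(STAR.fullmatch(s)), bool(PLUS.fullmatch(s))) for s in samples}

for s, (star_ok, plus_ok) in results.items():
    print(f"{s:32} star={star_ok} plus={plus_ok}")
```

                  With the asterisk every sample matches; with the plus the two half-addresses drop out, while the domain-less joelcherney@invalidaddress still gets through.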

                  • 6. Re: I have a GREP style question
                    Joel Cherney MVP

                    Disclaimer: I'm really rusty. If I've misled you in my explanation, I'm sure that someone <cough> will come along and correct me.

                    • 7. Re: I have a GREP style question
                      [Jongware] MVP

                      I could totally follow it, if that's any consolation.

                       

                      \_ and \- are 'not recognized' because WhatTheGrep checks only for backslash-escaped characters that change the meaning of the following character. Both \_ and \- will find the same character without backslash -- hence, they are superfluous.

                       

                      (The code \- actually does have a use: you can use it to prevent finding everything from a to z in [a\-z]. On the other other hand, though, the writer of your GREP made double sure the hyphen didn't get used as a from..to spec: when the hyphen is at the very start or end inside the group, as in either [-az] or [az-], its magic property is removed and the actual hyphen will be used to match. In your GREP, it's at the very end of the OR-group.)
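
                      The hyphen rules described above are easy to try out. A minimal sketch in Python's re module, which follows the same conventions for hyphens inside square brackets (using Python rather than InDesign is an assumption); the sample string is made up.

```python
import re

sample = "a-mz"

as_range = re.findall(r"[a-z]", sample)   # range a..z: the hyphen is not matched
escaped = re.findall(r"[a\-z]", sample)   # escaped hyphen: only "a", "-" and "z"
at_end = re.findall(r"[az-]", sample)     # hyphen last: same three characters
at_start = re.findall(r"[-az]", sample)   # hyphen first: same again

print(as_range, escaped, at_end, at_start)
```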

                       

                      Personally, I like to prevent finding domain-less e-mail addresses by adding this little titbit after the end:

                       

                      \.[a-z]+

                       

                      because every e-mail address should end in .com, .co.uk, .org, and so on. I've never seen anything else than [a-z] in top-domain extensions.

                       

                      So, Babs, putting it all together, you can use this

                       

                      [[:alnum:]+\.\_\-]+@[[:alnum:]+\.\_\-]+\.\l+

                       

                      to find almost every good mail address and skip almost every bad one. [[:alnum:]] is even smart enough to allow accented characters (which do occur in mail addresses):

                       

                      joel.cherney@invålid.address

                      and the final dot-more letters will make it skip this invalid one:

                      joelcherney@invalidaddress
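
                      For anyone who wants to test the combined expression outside InDesign, here is a rough Python equivalent. Python's re module supports neither [[:alnum:]] nor \l, so this sketch substitutes \w (Unicode letters, digits and the underscore) and [a-z]; both substitutions are assumptions, and the sample sentence is made up.

```python
import re

# \w stands in for [[:alnum:]] plus the underscore; [a-z] stands in for \l.
EMAIL = re.compile(r"[\w+.-]+@[\w+.-]+\.[a-z]+")

def is_candidate(address):
    """True if the whole string matches the combined pattern."""
    return EMAIL.fullmatch(address) is not None

print(is_candidate("joel.cherney@invålid.address"))  # accented letter is accepted
print(is_candidate("joelcherney@invalidaddress"))    # no dot-plus-letters ending

# In running text the pattern is searched rather than matched against a whole string.
sentence = "Write to joel.cherney@fake-address.com or babs@ibabs.com."
found = EMAIL.findall(sentence)
print(found)
```

                      Because the expression must end in a period followed by letters, the full stop that closes the sentence is left out of the last match.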

                      • 8. Re: I have a GREP style question
                        Joel Cherney MVP

                        This is good stuff - I'm keeping this one.

                        Well, now that top-level domains can be internationalized, it won't get all of them. Also, nice catch on "alnum" including extended Latin script; I had no idea. Back when I was using grep on a BSD command line, I really didn't have any Swedish in my life at all. But yeah, your a-z "tidbit" at the end is a great fix, and only lets a few invalid emails slip through:

                        joel.cherney@fake.email

                        But, at some point, you have to make a judgment call: what are the chances that you will have to sort almost-valid but obviously fake email addresses from real ones? If your body of text came from a Web submission form, those chances are high. If you just want to make a grep style so that your company's literature has a predefined email-address character style without manually applying it to each address, then including .info and .uk and .ru in your grep, but excluding .joke and .fake would just be a waste of effort.

                         

                        (Unless you derived joy from playing with regular expressions, of course. But we're a rare breed.)
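
                        For the case where the extra strictness is worth it, one possible approach is an explicit allow-list of endings. The sketch below uses Python's re module with \w standing in for [[:alnum:]] plus underscore; the list of top-level domains is hypothetical and deliberately incomplete, not a real registry.

```python
import re

# Hypothetical, incomplete allow-list; a real one would have to come from
# a maintained list of valid top-level domains.
ALLOWED_TLDS = ("com", "org", "info", "uk", "ru")

STRICT = re.compile(r"[\w+.-]+@[\w+.-]+\.(?:" + "|".join(ALLOWED_TLDS) + r")")

def is_listed(address):
    """True if the address ends in one of the allowed top-level domains."""
    return STRICT.fullmatch(address) is not None

for address in ("joel.cherney@fake-address.com",
                "babs@ibabs.co.uk",
                "joel.cherney@fake.email"):
    print(address, "->", is_listed(address))
```

                        The cost is maintenance: every ending that is missing from the list turns a real address into a miss.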

                        • 9. Re: I have a GREP style question
                          [Jongware] MVP

                          Yup, it'll totally work for regular e-mail addresses. I've used something like this for years on end, to automatically make mail addresses clickable in PDFs. I used to show an alert for each and every one ("Is this okay: bob@something something") until I got tired of never having to press "No".

                           

                          Non-latin top-domain names: well... InDesign's GREP is Unicode aware. [[:alnum:]] is not restricted to just Latin, and the expression will happily match this address:

                           

                          theun@ιονγυαρε.gr

                           

                          (Apparently the stuff after the final dot must still be a..z.)
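
                          The effect of a Unicode-aware character class can be reproduced in Python's re module, where \w is Unicode-aware for text strings. This is only an analogy to InDesign's [[:alnum:]] (an assumption), with [a-z] again standing in for \l.

```python
import re

# Unicode-aware class (\w) versus a plain ASCII class.
UNICODE_AWARE = re.compile(r"[\w+.-]+@[\w+.-]+\.[a-z]+")
ASCII_ONLY = re.compile(r"[A-Za-z0-9+._-]+@[A-Za-z0-9+._-]+\.[a-z]+")

address = "theun@ιονγυαρε.gr"

unicode_hit = UNICODE_AWARE.fullmatch(address) is not None
ascii_hit = ASCII_ONLY.search(address) is not None

print(unicode_hit)  # Greek letters count as word characters
print(ascii_hit)    # nothing ASCII follows the @, so no match at all
```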

                          • 10. Re: I have a GREP style question
                            iBabs2 Community Member

                            Hi Jongware and Joel,

                            Thank you so much for all of this!!!!

                            I have been away for a few days and I have lots of reading and testing here to catch up with ;-)

                            I love it!!!! It's so cool!!!!

                            you guys are the best!

                            babs

                            • 11. Re: I have a GREP style question
                              iBabs2 Community Member

                              Hi Guys,

                              OK....I have been playing with this one and it has worked for everything I have tried and I understand everything, I think, except one thing. Let me make sure I have this right:

                               

                              [[:alnum:]+\.\_\-]+@[[:alnum:]+\.\_\-]+\.\l+

                               

                              OK

                              [alnum] (all capital, lowercase  letters and numbers)

                              + one or more

                              \. any period

                              \_ any underscore

                              \- any hyphen

                              + one or more

                              @ the @ symbol

                              [alnum] (all capital, lowercase  letters and numbers)

                              + one or more

                              \. any period

                              \_ any underscore

                              \- any hyphen

                              + one or more

                               

                              now-what is this last part for? \|+

                              I only know the | as the or symbol???

                              so, what does that last part mean after the last \

                               

                              Also, does the alnum have to be in []'s

                               

                              thanks!!!

                              babs

                              • 12. Re: I have a GREP style question
                                Joel Cherney MVP

                                That was Jongware's clever way to avoid collecting

                                joelc@fakeaddress

                                with the grep query. That character is a lowercase L, so \l+ means "one or more lowercase letters." So, \.\l+ means "a period, followed by one or more lowercase letters."

                                 

                                I might have used \.[A-Za-z]+ which means "a period, followed by one or more letters between a and z, either uppercase or lowercase" if I had spotted that.
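
                                The two endings differ only in whether capitals are accepted. A quick sketch in Python's re module, which has no \l, so [a-z] is used as an ASCII stand-in (an assumption); the "$" is added here only to test the end of each string, and the all-caps address is made up.

```python
import re

LOWER_ONLY = re.compile(r"\.[a-z]+$")      # ASCII stand-in for \.\l+
EITHER_CASE = re.compile(r"\.[A-Za-z]+$")  # the [A-Za-z] alternative

def endings(address):
    """Return (lowercase-only result, either-case result) for the address ending."""
    return (LOWER_ONLY.search(address) is not None,
            EITHER_CASE.search(address) is not None)

for address in ("joelc@fakeaddress",
                "joelc@fakeaddress.com",
                "JOELC@FAKEADDRESS.COM"):  # hypothetical all-caps address
    print(address, "->", endings(address))
```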

                                • 13. Re: I have a GREP style question
                                  Joel Cherney MVP

                                  About the brackets - really, I don't know. I group stuff in brackets because it's how I was taught. For certain, the format of [[:alnum:]] or [[:digit:]] is really old-fashioned, it's just what I learned back in the old days (the mid-90s).

                                  • 14. Re: I have a GREP style question
                                    iBabs2 Community Member

                                    got it ;-)

                                    thanks Joel!!

                                    babs