I have a GREP style question

Jul 30, 2010 1:32 PM

Hello,

 

I am playing with a GREP style in CS4.

I have a paragraph style called body copy and a character style called email.

I want every email address to show up in the character style email.

 

In my GREP style, I have only been able to get the @ symbol and everything after that, but not the type in front of it.

So, for example, in the email address babs@ibabs.com, I can only figure out how to get the @ibabs.com to change, but not the babs before the @ symbol. So, if I have lots of email addresses to change, they will all have something different after the @ symbol.

 

What is the best way to handle this one?

 

Thanks

babs.....

 
Replies
    Jul 30, 2010 1:49 PM   in reply to iBabs2

    It might be easier than you think. You already have wildcards after the '@' (I think), but you can use them before it as well!

     

    Consider this string "An example for our budding girl Babs", and we want to find all words containing "dd". You would need something like "any number of characters, then "dd", then some more characters". Now if you use any character (the period, in GREP syntax), it would return "An example for our budding girl Babs". A bit more than you asked for, so you probably should not use "any character". That's because the spaces themselves also count as 'any' character ... Now you could use something like "[a-z]" (everything from a to z), or "\w" (any "word" character: A to Z, a to z, 0 to 9, the underscore, plus all accented characters) but in this case, you probably just want 'anything that's not a space'. The canonical notation for 'not x' is this:

     

    [^x]

     

    -- where the 'x' may be anything, for example, a single space: [^ ]. Adding a plus after it repeats it as much as possible: [^ ]+ will match anything comprised of non-spaces, and [^ ]+dd[^ ]+ will match anything not-a-space, "dd", more-not-a-spaces.

     

    But wait! It gets Better! Checking for a single space is something that occurs frequently, but you can have lots of different "space" characters: the non-breaking space, hard returns (those are also 'white space' characters), tabs, en-space, em-space, thin space, sixth space .. the list goes on and on. Fortunately, the "Gets Better" part is GREP has a shorthand notation for 'any kind of white space' -- \s.

     

    So, [^\s]+dd[^\s]+ comes closer -- but that's not even the perfect solution!

     

    Checking for Not A Space is also a frequently occurring thing, and GREP even has a shortcut for that: \S

    So, \S+dd\S+ is all that's necessary to find words containing 'dd', and I bet you can find out how you can translate that into Finding Email Addresses.
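For anyone who wants to try the four variants outside InDesign, here is a minimal sketch in Python. It assumes Python's `re` module behaves like InDesign's GREP for these simple patterns; the variable names are made up for illustration.

```python
import re

text = "An example for our budding girl Babs"

# "Any character" also matches spaces, so the whole sentence comes back.
any_char = re.findall(r".+dd.+", text)

# "Not a literal space" on both sides of "dd".
not_space = re.findall(r"[^ ]+dd[^ ]+", text)

# "Not any kind of white space" on both sides of "dd".
not_whitespace = re.findall(r"[^\s]+dd[^\s]+", text)

# The shorthand for "not white space".
shorthand = re.findall(r"\S+dd\S+", text)

print(any_char)        # the entire sentence as one match
print(not_space)       # just the word around "dd"
print(not_whitespace)
print(shorthand)
```

Note that each of the last three patterns needs at least one character on both sides of "dd", so a word that starts or ends with "dd" would be skipped.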

     

    You probably can guess now what \D, \W, \L and \U will find too.

     
    Jul 30, 2010 2:18 PM   in reply to iBabs2

    Well, I hope that none of your email addresses have numbers in 'em, iBabs2. Not to mention dashes, periods, or other valid non-letter entities. E.g. joel.cherney@fake-address.com

     

    Actually, one of the first things I learned to do a decade ago when I actually knew how to make good GREP queries was to hunt for email addresses in a large body of text. It's deeper than it looks...

     
    Aug 2, 2010 10:17 AM   in reply to iBabs2

    Fortunately, because it's the traditional tooth-cutting Perl newbie task, the internet is littered with decent greps to find email addresses. For example the first Google result for "email grep" yields

    [[:alnum:]+\.\_\-]*@[[:alnum:]+\.\_\-]*

     

    Throw that into Jongware's WhatTheGrep script, and you'll see that... it uses techniques that WhatTheGrep doesn't understand. It may not be a completely valid InDesign grep, but it works to find many varieties of email address, and some invalid ones. The places where it diverges from what I understand to be good InDesign grep query construction are, firstly, that it uses the old-fashioned Posix-style [[:alnum:]] which is equivalent to [A-Za-z0-9] which finds any alphanumeric character, one at a time. The + that follows sits inside the outer square brackets, so there it is a literal plus sign (which is allowed in an email address) rather than a repeat instruction. Then, still in between the first and second closing square brackets, it adds some escaped characters - period, underscore, and hyphen - all of which are valid characters in an email address. Then, after the second closing bracket, it has an asterisk - which matches "zero or more times." So, this grep would find

    joel.cherney@

    and

    @notanemail.address

    which are not valid email addresses. So, I'll change it to

    [[:alnum:]+\.\_\-]+@[[:alnum:]+\.\_\-]+

    and now it finds only strings with at least one allowed character on either side of an at sign. (The plus after the closing bracket means "one or more times.") It still finds invalid email addresses, like

    joel.cherney@invalid.address

    but at least it'll find any string with an at sign in the middle, and alphanumeric strings on either side of the at sign, even those interrupted by dashes, underscores, and periods, but not those interrupted by colons, dingbats, or other characters which would make an invalid email address. It still finds

    joelcherney@invalidaddress

     

    so it's still not all the way there. A really useful grep would only find actually valid email addresses - but then we'd have to gin up some way to look only for valid top-level domains (.com, .uk, .ru, et cetera) preceded by a period, and there are too many of those for me to think of a good way to grep for those without getting a decent list of all valid TLDs and then figuring out what minimal query would find 'em. Also, I have to go do some actual work at the moment, so while I'd love to bang my head against this until I have the perfect email-address-finding grep, I have to drop this and move onto formatting some Tibetan.

     

    So, yeah, it's deep.
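The difference between the asterisk and the plus versions can be checked with a rough Python sketch. Python's `re` has no Posix-style [[:alnum:]], so `A-Za-z0-9` stands in for it here (an assumption that drops accented letters), and the helper name `accepted` is made up for this example.

```python
import re

# Stand-in for [[:alnum:]+\.\_\-]: letters, digits, plus, period, underscore, hyphen.
# The hyphen is last in the class, so it is a literal hyphen, not a range.
CHARS = r"[A-Za-z0-9+._-]"

star_version = re.compile(CHARS + "*@" + CHARS + "*")  # zero or more on each side
plus_version = re.compile(CHARS + "+@" + CHARS + "+")  # one or more on each side

samples = [
    "joel.cherney@",
    "@notanemail.address",
    "joel.cherney@invalid.address",
    "joelcherney@invalidaddress",
]

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if pattern.fullmatch(c)]

star_hits = accepted(star_version, samples)
plus_hits = accepted(plus_version, samples)

print(star_hits)  # includes the two one-sided strings
print(plus_hits)  # the one-sided strings are gone
```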

     
    Aug 2, 2010 10:34 AM   in reply to Joel Cherney

    Disclaimer: I'm really rusty. If I've misled you in my explanation, I'm sure that someone <cough> will come along and correct me.

     
    Aug 2, 2010 12:45 PM   in reply to Joel Cherney

    I could totally follow it, if that's any consolation.

     

    \_ and \- are 'not recognized' because WhatTheGrep checks only for backslash-escaped characters that change the meaning of the following character. Both \_ and \- will find the same character without backslash -- hence, they are superfluous.

     

    (The code \- actually does have a use: you can use it to prevent finding everything from a to z in [a\-z]. On the other hand, though, the writer of your GREP made double sure the hyphen didn't get used as a from..to spec: when the hyphen is at the very start or end inside the group, as in either [-az] or [az-], its magic property is removed and the actual hyphen will be used to match. In your GREP, it's at the very end of the OR-group.)
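The hyphen behaviour described above is easy to see in a small Python sketch. It assumes Python's `re` treats hyphens and backslash-escaped punctuation inside a group the same way InDesign does for these cases; the sample strings are invented.

```python
import re

sample = "a-m-z"

# Unescaped hyphen between two characters: a range from a to z.
as_range = re.findall(r"[a-z]", sample)

# Escaped hyphen: only a, the hyphen itself, and z.
escaped = re.findall(r"[a\-z]", sample)

# Hyphen at the very end or very start: also a literal hyphen.
at_end = re.findall(r"[az-]", sample)
at_start = re.findall(r"[-az]", sample)

# \_ and \- with the hyphen last find the same characters as plain _ and -.
with_backslashes = re.findall(r"[\_\-]", "a_b-c")
without_backslashes = re.findall(r"[_-]", "a_b-c")

print(as_range, escaped, at_end, at_start)
print(with_backslashes == without_backslashes)
```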

     

    Personally, I like to prevent finding domain-less e-mail addresses by adding this little titbit after the end:

     

    \.[a-z]+

     

    because every e-mail address should end in .com, .co.uk, .org, and so on. I've never seen anything other than [a-z] in top-domain extensions.

     

    So, Babs, putting it all together, you can use this

     

    [[:alnum:]+\.\_\-]+@[[:alnum:]+\.\_\-]+\.\l+

     

    to find almost every good mail address and skip almost every bad one. [[:alnum:]] is even smart enough to allow accented characters (which do occur in mail addresses):

     

    joel.cherney@invålid.address

    and the final dot-more letters will make it skip this invalid one:

    joelcherney@invalidaddress
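As a rough cross-check outside InDesign, the combined expression can be approximated in Python. This is only a sketch under two assumptions: `\w` (letters, digits and underscore, Unicode-aware in Python 3) stands in for [[:alnum:]] plus the underscore, and `[a-z]` stands in for `\l`. The function name `find_emails` and the sample text are made up.

```python
import re

# Approximation of [[:alnum:]+\.\_\-]+@[[:alnum:]+\.\_\-]+\.\l+
EMAIL = re.compile(r"[\w+.-]+@[\w+.-]+\.[a-z]+")

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL.findall(text)

body_copy = (
    "Write to babs@ibabs.com. "
    "Not an address: joelcherney@invalidaddress. "
    "Accented: joel.cherney@inv\u00e5lid.address"  # \u00e5 is the letter a with ring
)

found = find_emails(body_copy)
print(found)
```

Because the match has to end in a period followed by letters, the full stop after "babs@ibabs.com" is left out of the match, which is what you want when the address ends a sentence.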

     
    Aug 2, 2010 1:04 PM   in reply to [Jongware]

    This is good stuff - I'm keeping this one.

    Well, now that top-level domains can be internationalized, it won't get all of them. Also, nice catch on "alnum" including extended Latin script; I had no idea. Back when I was using grep on a BSD command line, I really didn't have any Swedish in my life at all. But yeah, your a-z "tidbit" at the end is a great fix, and only lets a few invalid emails slip through:

    joel.cherney@fake.email

    But, at some point, you have to make a judgment call: what are the chances that you will have to sort almost-valid but obviously fake email addresses from real ones? If your body of text came from a Web submission form, those chances are high. If you just want to make a grep style so that your company's literature has a predefined email-address character style without manually applying it to each address, then including .info and .uk and .ru in your grep, but excluding .joke and .fake would just be a waste of effort.
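If you did decide the extra effort was worth it, one possible shape for the stricter version is an explicit list of allowed endings. The sketch below is in Python, with `\w` standing in for [[:alnum:]] plus the underscore; the four-entry list is only a placeholder taken from the examples in this thread, not a real list of valid top-level domains.

```python
import re

# Hypothetical short list; a real one would be far longer.
ALLOWED_TLDS = ["com", "info", "uk", "ru"]

tld_group = "|".join(re.escape(tld) for tld in ALLOWED_TLDS)

# \b keeps an allowed ending from matching the start of a longer word.
STRICT_EMAIL = re.compile(r"[\w+.-]+@[\w+.-]+\.(?:" + tld_group + r")\b")

sample = "joel.cherney@fake.email and babs@ibabs.com"
strict_hits = STRICT_EMAIL.findall(sample)

print(strict_hits)
```

This still has gaps (an allowed ending followed by another dot and more text would be cut short rather than rejected), which rather supports the point about diminishing returns.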

     

    (Unless you derive joy from playing with regular expressions, of course. But we're a rare breed.)

     
    Aug 2, 2010 1:20 PM   in reply to Joel Cherney

    Yup, it'll totally work for regular e-mail addresses. I've used something like this for years on end, to automatically make mail addresses clickable in PDFs. I used to show an alert for each and every one ("Is this okay: bob@something something") until I got tired of never having to press "No".

     

    Non-latin top-domain names: well... InDesign's GREP is Unicode aware. [[:alnum:]] is not restricted to just Latin, and the expression will happily match this address:

     

    theun@ιονγυαρε.gr

     

    (Apparently the stuff after the final dot must still be a..z.)
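The same behaviour shows up in a Python approximation, assuming `\w` (which is Unicode-aware in Python 3) as a stand-in for [[:alnum:]] and `[a-z]` for the letters after the final dot. The second address, with a Greek ending, is invented purely to show the a..z restriction.

```python
import re

EMAIL = re.compile(r"[\w+.-]+@[\w+.-]+\.[a-z]+")

greek_domain = "theun@ιονγυαρε.gr"
greek_ending = "theun@ιονγυαρε.ελ"  # made-up address with a non-Latin ending

matches_greek_domain = EMAIL.fullmatch(greek_domain) is not None
matches_greek_ending = EMAIL.fullmatch(greek_ending) is not None

print(matches_greek_domain)  # Greek letters before the final dot are fine
print(matches_greek_ending)  # the ending itself must still be a..z
```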

     
    Aug 2, 2010 4:49 PM   in reply to iBabs2

    That was Jongware's clever way to avoid collecting

    joelc@fakeaddress

    with the grep query. That character is a lowercase L, so \l+ means "one or more lowercase letters." So, \.\l+ means "a period, followed by one or more lowercase letters."

     

    I might have used \.[A-Za-z]+ which means "a period, followed by one or more letters between a and z, either uppercase or lowercase" if I had spotted that.
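To see where the two endings differ, here is a short Python sketch. Python has no `\l`, so `[a-z]` is used as a stand-in (an assumption: it only covers unaccented lowercase letters), and the all-caps address is invented for the test.

```python
import re

LOWER_ENDING = re.compile(r"[\w+.-]+@[\w+.-]+\.[a-z]+")        # like \.\l+
EITHER_CASE_ENDING = re.compile(r"[\w+.-]+@[\w+.-]+\.[A-Za-z]+")  # like \.[A-Za-z]+

shouted = "BABS@IBABS.COM"

lower_matches = LOWER_ENDING.fullmatch(shouted) is not None
either_matches = EITHER_CASE_ENDING.fullmatch(shouted) is not None

print(lower_matches)   # an uppercase ending is not "lowercase letters"
print(either_matches)  # the A-Za-z version accepts it
```

So the choice only matters if some addresses in your text are set in capitals.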

     
    Aug 2, 2010 4:52 PM   in reply to iBabs2

    About the brackets - really, I don't know. I group stuff in brackets because it's how I was taught. For certain, the format of [[:alnum:]] or [[:digit:]] is really old-fashioned, it's just what I learned back in the old days (the mid-90s).

     