Holloweaver

Can somebody please help me create a GREP style to reformat all the website addresses in my layout?

Aug 14, 2012 1:30 AM

I've got this text from a client, full of web addresses. Some end with .com or .nl, and I would like to highlight these addresses in bold and blue. I created a text style for it, but the only result I manage to get is the beginning of the website, 'www.'. I know it's possible, but my skills are lacking in this department...

Can anyone help me get started?

 
Replies
  • Aug 14, 2012 1:56 AM   in reply to Holloweaver

    It would be as easy as

     

    \bwww.+?(\.com|\.nl)\b

     

    were it not for this:

     

    Holloweaver wrote:

     

    some end with .com or .nl

     

    ... so others do not? Well, then the only thing left is to highlight all www's:

     

    \bwww.+?\b

     

    Hm, doesn't work. The \b (Word Break) stops at every hyphen and period. Let's try

     

    \bwww\S+

     

    Nope. It picks up following commas and parentheses. Then let's just include alphanumerics. Oh, and the period. Oh, and the hyphen:

     

    \bwww\.[-.[:alnum:]]+

     

    Aargh! Now it also picks up a period after the URL. Alrighty, then split it up into two parts. The first half may contain the period as well, the second half may not:

    

    \bwww\.[-.[:alnum:]]+[[:alnum:]]

     

    Matching URLs using GREP is actually one of the Dark Arts, and "one GREP to rule them all" is probably just not gonna work.
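
The progression above can be replayed outside InDesign. What follows is only a rough sketch in Python, assuming its `re` module behaves like InDesign's GREP for these simple patterns. Python has no POSIX classes, so `[[:alnum:]]` is written as `A-Za-z0-9`; the sample sentence and the dictionary labels are made up for illustration.

```python
import re

# Made-up sample text with the troublemakers: a hyphen, a comma,
# parentheses and a sentence-ending period right after a URL.
sample = "See www.my-site.nl, or (www.example.com). Also www.test.org."

# The four attempts from the reply, in order. [[:alnum:]] is spelled
# out as A-Za-z0-9 because Python's re lacks POSIX classes.
attempts = {
    "lazy + word break": r"\bwww.+?\b",
    "non-space run": r"\bwww\S+",
    "allowed characters": r"\bwww\.[-.A-Za-z0-9]+",
    "allowed, then alnum": r"\bwww\.[-.A-Za-z0-9]+[A-Za-z0-9]",
}

results = {name: re.findall(pattern, sample) for name, pattern in attempts.items()}

for name, found in results.items():
    print(f"{name:20} {found}")
```

Only the last pattern returns the three bare addresses; the first stops after 'www.', the second drags along the comma and parenthesis, and the third keeps the trailing period.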

     
  • Aug 14, 2012 2:28 AM   in reply to [Jongware]

    Funnily enough, there is a command under Type > Hyperlinks and Cross References > Convert all URLs to Hyperlinks.

     

    It copies this code into the GREP Find area:

     

    (\w:\\[\w\d:\-\\]*)|(([\w\d]\.?\-?)+@([\w\d]+\.)+([\l\u]{2,}))|(((((ht |f)tp(s?))|www|afp|smb|ntp|nfs):\/\/)([\w\d\~\-_]+[\.\/])+[\w\d]{2,}(\ /?[\?\_][\w\d:#\.\$\—\=\%\?\-&\+\^\,]+)?)|([\w\-?]+\.([\w\d]+[\.\/])*[ \l\u]{2,})

     

    Why it's not included in the list of sample GREPs is beyond me?

     
  • Aug 14, 2012 2:56 AM   in reply to Eugene Tyson

    Eugene Tyson wrote:
    Why it's not included in the list of sample GREPs is beyond me?

     

    Does it work for you if you paste it into GREP search? Maybe it's not included in the samples because it's not very good:

     

    [Screenshot: badgrep.PNG]

     

    I tried with CS4, but I would not think there would be a large difference inside GREP itself for newer versions.

     

    (5 min. later)

     

    Hah -- sorry, the forum inserted a couple of spaces. It's better without them, but only by a very small amount. It still breaks a URL at a hyphen. Isn't that a common complaint about this Built-in Feature? It also stops at '?' characters, does not pick up a leading "http://" (despite this being 'called upon' in the GREP!), and picks up random pieces of text such as a DOI number and common phrases such as "prof.dr.".

     

    The one I use every day is way better than this.

     
  • Aug 14, 2012 3:18 AM   in reply to [Jongware]

    Ah, well, then I wonder what that action from the drop-down menu actually does, and why the expression is copied to the GREP fields. Is it doing a GREP search combined with some sort of script function?

     

    I was trying to write a catch-all for an e-mail address (sorry for going off topic), something like

     

    \b.+?@

     

    But that selects everything on the line before the e-mail address as well, including the spaces.

     

    Even though I've instructed it to be at a word boundary?

     

    Why doesn't this work when you include an "@" symbol?

     

    It's very frustrating.
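
One likely explanation, sketched in Python on the assumption that InDesign's GREP also reports the leftmost possible match: `\b` is already satisfied at the start of the first word on the line, and the lazy `.+?` then simply grows until it reaches the "@". The sample line and the stricter alternative pattern are illustrations only, not something taken from InDesign.

```python
import re

# Made-up sample line containing a single e-mail address.
line = "Please write to john.doe@example.com today"

# \b succeeds right before "Please", so the match starts there; the lazy
# quantifier only decides where the match ends, not where it begins.
from_line_start = re.search(r"\b.+?@", line).group()

# Allowing only word characters, periods and hyphens before the "@"
# keeps the match inside the address itself.
local_part = re.search(r"[\w.\-]+@", line).group()

print(repr(from_line_start))
print(repr(local_part))
```

So the word boundary is honoured; it is just honoured at the first place it can be. Limiting which characters may precede the "@" is what keeps the match short.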

     
  • Aug 14, 2012 3:20 AM   in reply to [Jongware]

    A little more on this Built-in Feature.

     

    Whatthegrep's breakdown is 148 lines and includes 11 warnings: the red highlighted characters do not need to be escaped.

     

    [Screenshot: badgrep2.PNG]

     

    (They need a 'warning' because GREP's behavior for unrecognized escape codes is not really defined -- should it fail? should it match literal backslash plus literal character? ID's GREP matches the escaped character only and ignores the backslash.)

     

    ((There is actually one small error in Whatthegrep's breakdown: inside a [bracketed] section, the sequence \- is required to match a literal hyphen. So that's one spurious warning less.))

     

    You can see the black literal text on the second line that attempts to match "http" and "ftp". In addition, you can spot numerous hyphens in the "allowed" parts, so there must be something fundamentally wrong, because InDesign does not match URLs containing these. Let's try to find out why it doesn't work, shall we? (I'm not actually getting paid by Adobe to do this, you know.)

     

    The second major group (parentheses #2 to #5) only matches e-mail addresses, so we can safely remove it and check what's left.

     

    The first major group (parentheses #1) doesn't seem to match anything in my sample document. Perhaps it should match DOS style path names? (Trying, please hold on...) Oh. My. God. ...

     

    Well, viewers, to quote Jeremy Clarkson, on that bombshell it's time to end the show. Maybe I'll revisit this, when I have recovered from that shock.

     
  • Aug 14, 2012 2:59 PM   in reply to [Jongware]

    Got it.

     

    parentheses group #1 (the first | separated OR group): DOS path name, because a link can also refer to a file.

     

    parentheses groups #2-5 (2nd OR group): e-mail address. Note: a "mailto://" prefix does not get matched here.
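
To make the first two OR groups a little more tangible, here is a hedged Python sketch. The patterns are my own translations, not Adobe's code: `\l` and `\u` are approximated as `A-Za-z`, redundant escapes are dropped, and the names `DOS_PATH` and `EMAIL` as well as the sample strings are hypothetical.

```python
import re

# First OR group: a drive letter, a colon, a backslash, then any run of
# word characters, colons, hyphens and backslashes.
DOS_PATH = re.compile(r"\w:\\[\w:\-\\]*")

# Second OR group: the e-mail address. [A-Za-z] stands in for InDesign's
# [\l\u]; \w already covers the digits that \d would add.
EMAIL = re.compile(r"(?:\w\.?-?)+@(?:\w+\.)+[A-Za-z]{2,}")

path_hit = DOS_PATH.search(r"Open C:\Projects\my-layout now").group()
mail_hit = EMAIL.search("mailto://john.doe@example.com").group()

print(path_hit)
print(mail_hit)
```

In this translation the e-mail pattern finds the address but leaves the "mailto://" prefix outside the match, which fits the note above.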

     

    parentheses groups #6-13 (3rd OR group): URL, with a mandatory prefix "http://", "https://", "ftp://", and some more ("smb://", for example, is a Samba server file prefix). This also includes "ftps://", which was unknown to me but it appears it's perfectly valid.

    It does not match the "mailto://" prefix for an e-mail address, but that's a moot point because the URL may not contain a @ character anyway.

     

    This group correctly recognizes URLs that contain one or more hyphens in the base name (it also correctly rejects hyphens in the domain extension).

     

    parentheses groups #14-15 (4th and last OR group): any non-prefixed URL, recognized solely on the basis of "any sequence of word characters, including the hyphen, followed by a period, optionally followed by some more characters (including both period and forward slash but excluding the hyphen), followed by at least two uppercase or lowercase letters".

     

    This last one is responsible for (a) recognizing both "deadline.com" and "keefe-studios.com" as URLs, but (b) also rejecting "www.test-test.com" -- a hyphen in the 'center', fully optional, part.
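
The behaviour of this last group can be approximated in Python. This is a sketch under assumptions: the pattern is my translation of the fourth OR group with `[\l\u]` written as `[A-Za-z]`, and `BARE_URL` and `whole` are hypothetical names used only here.

```python
import re

# Fourth OR group: word characters (plus hyphen and "?"), a period,
# optional "name." or "name/" segments, then at least two letters.
BARE_URL = re.compile(r"[\w\-?]+\.(?:\w+[./])*[A-Za-z]{2,}")


def whole(text: str) -> bool:
    """Return True if the pattern covers the entire string."""
    return BARE_URL.fullmatch(text) is not None


for candidate in ("deadline.com", "keefe-studios.com", "www.test-test.com", "prof.dr"):
    print(f"{candidate:20} {whole(candidate)}")

# A hyphen in the optional middle part splits the address in two.
pieces = BARE_URL.findall("www.test-test.com")
print(pieces)
```

The hyphen is accepted only in the first segment, so "www.test-test.com" is cut at the hyphen, while a fragment like "prof.dr" passes as if it were a URL.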

     

    Interestingly, there are more differences between the more exact 'match only with prefix' and the catch-all 'match about anything'. The former includes among the allowed characters: ':' (which, if I recall correctly, is only valid if a port number follows!), '#', '$', '%', '&', '?' and '+', so it matches a server query as well:

     

    http://www.google.com/search?hl=enl&tbo=d&site=&source=hp&q=allowed+characters+in+url&oq=allowed+characters+in+url
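
For comparison, the prefixed (third) OR group can be tried on that query URL in Python. Again this is only a sketch: the pattern is my translation with the forum-inserted spaces and redundant escapes removed, the dash character from the original class written as `\u2014`, and `PREFIXED_URL` a hypothetical name.

```python
import re

PREFIXED_URL = re.compile(
    r"(?:(?:ht|f)tps?|www|afp|smb|ntp|nfs)://"  # mandatory prefix
    r"(?:[\w~\-]+[./])+\w{2,}"                   # segments ending in . or /
    r"(?:/?[?_][\w:#.$\u2014=%?\-&+^,]+)?"       # optional query part
)

query_url = (
    "http://www.google.com/search?hl=enl&tbo=d&site=&source=hp"
    "&q=allowed+characters+in+url&oq=allowed+characters+in+url"
)

checks = {
    "query URL": PREFIXED_URL.fullmatch(query_url) is not None,
    "hyphen in base name": PREFIXED_URL.fullmatch("http://www.test-test.com") is not None,
    "ftps prefix": PREFIXED_URL.fullmatch("ftps://files.example.com") is not None,
    "no prefix": PREFIXED_URL.fullmatch("www.example.com") is not None,
    "hyphen in extension": PREFIXED_URL.fullmatch("http://example.co-uk") is not None,
}

for label, matched in checks.items():
    print(f"{label:20} {matched}")
```

The first three checks succeed and the last two fail. Note that "www" only counts here when it is followed by "://", so a bare "www.example.com" is left to the fourth group.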

     

    The latter does not include all of these characters, just letters, numbers, the hyphen (but not in all positions), and a period. It's probably because this is an unreliable way to match URLs after all -- hence the number of 'false hits'. These would only increase with even more allowed characters.

     

    Lesson learned: to be assured "Create Hyperlinks Automatically" works as expected, make sure URLs are prefixed with "http://", and mail addresses are not prefixed with "mailto://". I didn't test if the function actually creates a correct e-mail link, but even if it does the prefix is not included in the clickable linked text.

     

    (Also interestingly, it appears a GREP style, fed with this same query, doesn't stumble over the hyphen whereas a GREP find does. I don't think I'm capable of figuring that one out.)

     
  • Sep 11, 2013 5:46 AM   in reply to [Jongware]

    I found this discussion a year too late to join in, but I'll add a couple of points.

     

    • GREP find/replace or GREP styles would only apply styling and still not account for creating a Hyperlink.  That part has to be done by the script.
    • Adobe makes all Hyperlinks by default a Shared Destination.  If you change this, you lose the data in the panel captured by the script or regular expression.
    • Shared Destination is more or less useless when sending data from InCopy or InDesign to a web content management system.  Whereas Email or URL, found in the dropdown of that hyperlink, sticks.

     

    I am after an automated solution in InCopy for a gang of writers who can run a script against their stories and have the links captured and viably placed into the panel.  Viably is key, though.  I don't know how many people are aware of the Shared Destination problem.  Great if you are simply using a flat-file/folder system, not great in the CMS world.

     