1 Reply Latest reply on Apr 17, 2012 12:47 PM by Mary Posner

    GREP to find URLs

    allisonblake Level 1

      Hi everyone,

       

      I picked up this grep from somewhere (thank you whoever you are!) to help me find URLs in my magazine:

       

      (?i)(http|ftp|www)(\S+)(\.\l{2,4})|(\S+)(\.\l{2,4})

       

      It works pretty well. I've tried to read through it and understand exactly what it's searching for, because it finds only part of certain URLs.

       

      It finds these two in full:

      http://www.cdc.gov/hepatitis/HCV/StatisticsHCV.htm

      http://www.iom.edu/Reports/2010/Hepatitis-and-Liver-Cancer-A-National-Strategy-for-Prevent ion-and-Control-of-Hepatitis-B-and-C.aspx

       

      But in this URL:

      http://www.who.int/mediacentre/factsheets/fs164/en
      it only finds http://www.who.int

       

      How can I edit the grep so that it will capture the whole of this URL?

       

      Thanks!

        • 1. Re: GREP to find URLs
          Mary Posner Level 3

          I recommend installing Jongware's "What the GREP" script. It's a great tool for pulling apart GREP strings and telling you exactly what each part of it is doing.

           

          The above expression actually breaks for me on your second example, capturing only "http://www.iom.edu". I think the space after "National-" is what causes the break, and that space is probably not actually in your original text. The next "find" after that picks up this:

           

          Strategy-for-Prevention-and-Control-of-Hepatitis-B-and-C.aspx
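For anyone who wants to reproduce the break outside InDesign, here is a rough Python sketch. It rests on two assumptions: Python's `re` has no `\l` class, so `[a-z]` with `re.IGNORECASE` stands in for InDesign's `\l` under `(?i)`, and the sample string puts the stray space after "National-" as described above. Treat it as an approximation of the InDesign behaviour, not a guarantee of it.

```python
import re

# Assumed Python equivalent of the InDesign expression:
# \l (lowercase letter) is written as [a-z], and (?i) becomes re.IGNORECASE.
PATTERN = re.compile(
    r"(?:http|ftp|www)\S+\.[a-z]{2,4}|\S+\.[a-z]{2,4}",
    re.IGNORECASE,
)

# The URL with a stray space after "National-", as in the forum rendering.
text = (
    "http://www.iom.edu/Reports/2010/Hepatitis-and-Liver-Cancer-A-National- "
    "Strategy-for-Prevention-and-Control-of-Hepatitis-B-and-C.aspx"
)

# \S+ cannot cross the space, so the URL is found as two separate matches.
matches = [m.group(0) for m in PATTERN.finditer(text)]
for found in matches:
    print(found)
```

The first match ends at ".edu" because that is the last "dot plus two to four letters" before the space; the second match is picked up by the alternative after the `|`.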

           

          Is there a pattern to what follows immediately after the URLs? Depending on what the rest of your text looks like, you might be able to get away with truncating the search term to just this:

           

          (?i)(http|ftp|www)(\S+)
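As for why the who.int URL gets cut short: the original expression insists that the match end in a dot followed by two to four letters, and the last place that happens in that URL is ".int", since the path after it contains no dot. The sketch below compares the two expressions in Python, again assuming `[a-z]` with `re.IGNORECASE` as a stand-in for `\l` under `(?i)`. The surrounding sentences in the sample text are made up for illustration.

```python
import re

# Assumed Python equivalents of the two InDesign expressions.
ORIGINAL = re.compile(
    r"(?:http|ftp|www)\S+\.[a-z]{2,4}|\S+\.[a-z]{2,4}",
    re.IGNORECASE,
)
TRUNCATED = re.compile(r"(?:http|ftp|www)\S+", re.IGNORECASE)

url = "http://www.who.int/mediacentre/factsheets/fs164/en"

# Hypothetical running text: URL followed by a space.
text = f"See {url} for details."
original_hit = ORIGINAL.search(text).group(0)    # stops at the last ".xx" to ".xxxx"
truncated_hit = TRUNCATED.search(text).group(0)  # runs to the next whitespace

# Hypothetical running text: URL followed directly by a full stop.
text_with_period = f"Source: {url}."
greedy_hit = TRUNCATED.search(text_with_period).group(0)

print(original_hit)
print(truncated_hit)
print(greedy_hit)
```

The last case is why what follows the URLs matters: the truncated expression takes everything up to the next whitespace, so a full stop or closing bracket sitting directly against the URL is captured along with it.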