6 Replies Latest reply on Nov 30, 2011 8:17 AM by [Jongware]

    [CS3] GREP question

    Tom Tomasko Level 1

      I am using the following code to find all compound words:

       

      (\w\w+)-(\w\w+)

       

      However, if the first word has initials, such as "U.S.-China," it does not find it.

       

      So I changed the code to:

       

      (\w(.?)\w+.?)-(\w\w+)

       

      This works. Now GREP finds any compound word, whether or not there are initials, but when there are no initials it also selects the last letter of the word before the compound. For instance, what is in brackets here is selected:

       

      th[e mid-September]

       

      Why is this happening and how can I fix it?

       

      Thanks,

      Tom

        • 1. Re: [CS3] GREP question
          [Jongware] Most Valuable Participant

          It appears to work because the single period is an "any character" wildcard. Thus, it finds the period in your "U.S.", but in addition it also matches any other character, including a preceding space, as you have seen.

           

          Use \. to escape the regular meaning of the period and make it behave as the literal character.
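
          The difference between the two versions can be checked outside InDesign. The following is a minimal sketch that uses Python's re module as a stand-in for InDesign's GREP engine (an assumption; the two engines are not identical, but they treat the period the same way in these patterns). The helper name first_match is hypothetical and exists only for this illustration.

```python
import re

# Python's re module stands in for InDesign's GREP engine (an assumption).
UNESCAPED = re.compile(r"(\w(.?)\w+.?)-(\w\w+)")    # "." matches any character
ESCAPED = re.compile(r"(\w(\.?)\w+\.?)-(\w\w+)")    # "\." matches only a period


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


for sample in ("the mid-September", "U.S.-China"):
    print(repr(sample), "unescaped:", first_match(UNESCAPED, sample))
    print(repr(sample), "escaped:  ", first_match(ESCAPED, sample))
```

          With the unescaped pattern, the wildcard is free to match the space after "the", which is how the trailing "e" ends up in the selection; the escaped pattern can only start at "mid".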

          • 2. Re: [CS3] GREP question
            Tom Tomasko Level 1

            Thanks Jongware!

             

            That works.

             

            In addition, I made a little error in the code above. It will only find a word with one period in it. So rewritten it should be:

             

            (\w(\.?)\w+\.?)-(\w\w+)

             

            However, I found that the following also works:

             

            ([\w+.*]+)-(\w+)

            • 3. Re: [CS3] GREP question
              [Jongware] Most Valuable Participant

              Tom Tomasko wrote:

               

              [...] However, I found that the following also works:

               

              ([\w+.*]+)-(\w+)

               

              Yes, it will work, and as a side effect ... it will also pick up '+' and '*'!

               

              Characters listed inside [ Character Set brackets ] lose the magical properties they have outside them. \w is still regarded as 'any word character' (0..9, A..Z, a..z, _, and the equivalent in other scripts), but the period no longer matches "any character", only itself. There is no need to grab more characters using either + or * inside the brackets, because a Character Set always matches one single character; you need the '+' right after the closing bracket to get it to repeat.

               

              So all you need is this:

               

              ([\w.]+)-(\w+)
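
              To see the side effect of the extra '+' and '*' inside the brackets, here is a small sketch in Python, whose re module is assumed to treat character sets the same way as InDesign's GREP for these patterns. The sample string "a+b*c-d" and the helper first_match are made up for the illustration.

```python
import re

# Python's re module is used as a stand-in for InDesign's GREP (an assumption).
WITH_EXTRAS = re.compile(r"([\w+.*]+)-(\w+)")  # set also contains "+" and "*"
MINIMAL = re.compile(r"([\w.]+)-(\w+)")        # set holds only \w and "."


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


# Inside the brackets "+" and "*" are ordinary characters, so the first
# pattern swallows them; the second stops at anything that is not a word
# character or a period.
for sample in ("a+b*c-d", "U.S.-China"):
    print(repr(sample), "with extras:", first_match(WITH_EXTRAS, sample))
    print(repr(sample), "minimal:    ", first_match(MINIMAL, sample))
```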

               


              WhatTheGrep's breakdown:

               

              ( Begin Group #1
              [ Inclusion: any character in this group
                \w Any word character (A..Z, a..z, _, 0..9)
                . The character “.”
              ] End Inclusion Group
              + Any character in this group may occur once or more times; longest possible match will be taken
              ) End Group #1
              - Literal character “-”
              ( Begin Group #2
              \w+ Any word character (A..Z, a..z, _, 0..9); may occur once or more times; longest possible match will be taken
              ) End Group #2


              ([\w.]+)-(\w+)
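
              The breakdown above maps directly onto the two captured groups. As a rough illustration, the same expression can be written in Python's verbose regex mode with one comment per part; Python's re module is an assumption here, standing in for InDesign's GREP, and split_compound is a hypothetical helper name.

```python
import re

# Same expression as above, spread out with re.VERBOSE so that each part
# carries its own comment. Python's re is a stand-in for InDesign's GREP.
COMPOUND = re.compile(
    r"""
    (          # begin group 1
      [\w.]+   # one or more word characters or literal periods
    )          # end group 1
    -          # literal hyphen
    (          # begin group 2
      \w+      # one or more word characters
    )          # end group 2
    """,
    re.VERBOSE,
)


def split_compound(text):
    """Return (group 1, group 2) of the first compound found, or None."""
    found = COMPOUND.search(text)
    return (found.group(1), found.group(2)) if found else None


print(split_compound("trade between U.S.-China partners"))
```

              Group 1 holds the part before the hyphen, including any periods, and group 2 holds the part after it.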

              • 4. Re: [CS3] GREP question
                Tom Tomasko Level 1

                Thanks Jongware for that correction and fuller explanation.

                Tom

                • 5. Re: [CS3] GREP question
                  rajsiva

                  Hi, use the GREP I mention below. You will get the correct answer.

                   

                  ([A-z]{1,})

                  • 6. Re: [CS3] GREP question
                    [Jongware] Most Valuable Participant

                    No, it will certainly not result in "the correct answer".

                     

                    Apart from not fulfilling the primary objective, your expression has lots of unexpected -- and probably unwanted -- side effects.
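
                    The post does not list the side effects, but two of them can be shown with a short sketch. It assumes Python's re module and plain ASCII ordering; InDesign's engine may differ in detail. The range A-z runs from code 65 to code 122, so it also covers the six punctuation characters that sit between "Z" and "a", and the expression never looks for a hyphen at all.

```python
import re

# The suggested expression, tested with Python's re module (an assumption;
# InDesign's GREP engine may order characters differently).
SUGGESTED = re.compile(r"([A-z]{1,})")

# Characters that sit between "Z" and "a" in ASCII: [ \ ] ^ _ `
extras = [chr(code) for code in range(ord("Z") + 1, ord("a"))]
matched_extras = [ch for ch in extras if SUGGESTED.fullmatch(ch)]
print("non-letters matched by [A-z]:", matched_extras)

# No hyphen is required, so a compound is never found as one unit;
# the expression simply returns every run of characters in the range.
pieces = SUGGESTED.findall("U.S.-China")
print("pieces found in 'U.S.-China':", pieces)
```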