8 Replies Latest reply on Dec 26, 2015 12:41 PM by JR_Boulay

    Grep question: in a list detect duplicated words with/without accented vowels

    camilo umana Level 1

      In Spanish, some words change meaning depending on whether a vowel is accented. For example:

      Pelo means hair

      Peló means peeled

       

      How can the find duplicates function be used to catch these words in a sequenced list?


      Perhaps the code


      [^ a-z|A-Z]+

       

      finds accented letters, but I could not find a way to integrate it into a query.
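For reference, here is a quick plain-JavaScript check of what that character class matches. It assumes JavaScript's regex engine treats a simple negated class like this one the same way InDesign's GREP does; the sample string is made up for illustration.

```javascript
// Negated class: matches runs of anything that is NOT a space, a-z, '|' or A-Z.
// Inside [...] the '|' is a literal bar, not an "or".
var pattern = /[^ a-z|A-Z]+/g;

// Made-up sample: "Pelo Peló 42" (the accented o is written as an escape).
var sample = 'Pelo Pel\u00F3 42';

// Collect every run of characters that the class accepts.
var hits = sample.match(pattern);
// hits holds two runs: the accented o, and the digits "42"
```

So the class does catch the accented letter, but it catches digits and punctuation as well, and it says nothing about which unaccented word the accented one duplicates.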

        • 1. Re: Grep question: in a list detect duplicated words with/without accented vowels
          Peter Kahrel Adobe Community Professional & MVP

          Camilo -- Look for Pel[[=o=]] and you'll find both Pelo and Peló.

           

          Peter
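A minimal sketch of how such a query could be generated from a word, in plain JavaScript. This is an illustration only, not part of the original reply: `BASE_VOWEL` and `toEquivalencePattern` are hypothetical names, the table covers Spanish vowels only, and the word is assumed to contain no GREP metacharacters.

```javascript
// Hypothetical table: each Spanish vowel (plain or accented) and its base letter.
var BASE_VOWEL = {
  'a': 'a', 'á': 'a',
  'e': 'e', 'é': 'e',
  'i': 'i', 'í': 'i',
  'o': 'o', 'ó': 'o',
  'u': 'u', 'ú': 'u', 'ü': 'u'
};

// Replace every vowel with its equivalence class, e.g. o or ó -> [[=o=]].
// All other characters are copied unchanged.
function toEquivalencePattern (word) {
  var out = '';
  var i, ch, base;
  for (i = 0; i < word.length; i++) {
    ch = word.charAt(i);
    base = BASE_VOWEL[ch.toLowerCase()];
    out += base ? '[[=' + base + '=]]' : ch;
  }
  return out;
}

// Both spellings produce the same query string.
var fromPlain = toEquivalencePattern('Pelo');
var fromAccented = toEquivalencePattern('Peló');
```

Because both spellings yield the same pattern, the query can be built from whichever variant happens to come first in the list.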

          • 2. Re: Grep question: in a list detect duplicated words with/without accented vowels
            camilo umana Level 1

            Peter, hi.

             

            In some long lists the words are not known in advance.

             

            Alma

            Aro

            Aró

            Azul

             

            The idea is to adapt a GREP to isolate the similar ones for spelling routines. Checking visually is possible but terrible.

             

            Merry Christmas!

            • 3. Re: Grep question: in a list detect duplicated words with/without accented vowels
              Peter Kahrel Adobe Community Professional & MVP

              I see. Interesting problem. The (fairly naive) script below marks all items that are duplicates once the accents are stripped away.

              Select a text frame or click somewhere in the list, then run the script. It assumes that the list is set in the Regular font style and that Bold is available.

               

              (function () {
              
                var i, j;
                var story;
                var list;
                var found;
                function neutraliseAccents (s) {
                  s = s.toUpperCase();
                  return s.replace (/[ÁÀÂÄÅĀĄĂÆ]/g, '[[=a=]]').
                    replace (/[ÇĆČĊ]/g, '[[=c=]]').
                    replace (/[ĎĐ]/g, '[[=d=]]').
                    replace (/[ÉÈÊËĘĒĔĖĚ]/g, '[[=e=]]').
                    replace (/[ĢĜĞĠ]/g, '[[=g=]]').
                    replace (/[ĤĦ]/g, '[[=h=]]').
                    replace (/[ÍÌÎÏĪĨĬĮİ]/g, '[[=i=]]').
                    replace (/[Ĵ]/g, '[[=j=]]').
                    replace (/[Ķ]/g, '[[=k=]]').
                    replace (/[ŁĹĻĽ]/g, '[[=l=]]').
                    replace (/[ÑŃŇŅŊ]/g, '[[=n=]]').
                    replace (/[ÓÒÔÖŌŎŐØŒ]/g, '[[=o=]]').
                    replace (/[ŔŘŖ]/g, '[[=r=]]').
                    replace (/[ŚŠŜŞȘß]/g, '[[=s=]]').
                    replace (/[ŢȚŤŦ]/g, '[[=t=]]').
                    replace (/[ÚÙÛÜŮŪŲŨŬŰ]/g, '[[=u=]]').
                    replace (/[Ŵ]/g, '[[=w=]]').
                    replace (/[ŸÝŶ]/g, '[[=y=]]').
                    replace (/[ŹŻŽ]/g, '[[=z=]]');
                }
                function markDuplicates () {
                  app.findGrepPreferences = app.changeGrepPreferences = null;
                  app.findGrepPreferences.fontStyle = 'Regular';
                  story = app.selection[0].parentStory;
                  list = story.contents.split('\r');
                  for (i = 0; i < list.length; i++) {
                    app.findGrepPreferences.findWhat = '(?i)^' + neutraliseAccents (list[i]) + '$';
                    found = story.findGrep();
                    if (found.length > 1) {
                      for (j = 0; j < found.length; j++) {
                        found[j].fontStyle = 'Bold';
                      }
                    }
                  }
                }
                if (app.selection.length > 0 && app.selection[0].hasOwnProperty ('parentStory')) {
                  markDuplicates();
                }
              }());
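To see what the script actually searches for, the pattern construction can be traced outside InDesign. The sketch below is plain JavaScript with a reduced accent table (Spanish vowels plus Ñ, an assumption made to keep it short); `buildFindWhat` is a hypothetical name for the string concatenation done inside markDuplicates.

```javascript
// Reduced accent table (assumption: Spanish vowels plus Ñ only).
function neutraliseAccents (s) {
  return s.toUpperCase().
    replace(/Á/g, '[[=a=]]').
    replace(/É/g, '[[=e=]]').
    replace(/Í/g, '[[=i=]]').
    replace(/Ó/g, '[[=o=]]').
    replace(/[ÚÜ]/g, '[[=u=]]').
    replace(/Ñ/g, '[[=n=]]');
}

// Same concatenation as in markDuplicates: case-insensitive, whole paragraph.
function buildFindWhat (item) {
  return '(?i)^' + neutraliseAccents(item) + '$';
}

var patterns = [buildFindWhat('Aro'), buildFindWhat('Aró')];
// patterns[0] is '(?i)^ARO$'
// patterns[1] is '(?i)^AR[[=o=]]$'
```

Only the accented entry produces an equivalence class. Assuming [[=o=]] matches both o and ó, as described in reply 1, it is the search built from Aró that finds two paragraphs and marks both entries; the search built from Aro presumably finds only itself.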
              
              

               

              Peter

              • 4. Re: Grep question: in a list detect duplicated words with/without accented vowels
                camilo umana Level 1

                Peter, what a superb script.

                Sorry, I was on the Moon, thinking of a little GREP...

                Thanks.

                Yes, fortunately it seems an interesting problem. In French there are also similar duplicated words.

                Happy new year.

                • 5. Re: Grep question: in a list detect duplicated words with/without accented vowels
                  JR_Boulay Adobe Community Professional

                  Hi.

                   

                  Yes yes yes.

                  It looks very interesting for the French language too. I will test it soon.

                   

                  Thank you Peter (and Camilo).

                   

                  Joyeux Noël.

                  • 6. Re: Grep question: in a list detect duplicated words with/without accented vowels
                    Peter Kahrel Adobe Community Professional & MVP

                    In long lists, when speed becomes an issue, (at least) three improvements can be made. The first improvement is to make the neutraliseAccents function language-specific (if possible). For French this version can be used (not sure if I got this completely right but you get the picture):

                     

                    function neutraliseAccents (s) {
                      // French diacritics: à â ç é è ê ë î ï ô ù û ü ÿ
                      return s.toUpperCase().
                        replace (/[ÀÂ]/g, '[[=a=]]').
                        replace (/Ç/g, '[[=c=]]').
                        replace (/[ÉÈÊË]/g, '[[=e=]]').
                        replace (/[ÎÏ]/g, '[[=i=]]').
                        replace (/Ô/g, '[[=o=]]').
                        replace (/[ÙÛÜ]/g, '[[=u=]]').
                        replace (/Ÿ/g, '[[=y=]]');
                    }
                    

                     

                    The second improvement is not to use character classes for single letters, though the speed gain from this is minor.

                    The third improvement, one that will make a big difference, is to precompile all regular expressions. That does make the script a bit harder to change if you want to make it language-specific, but not much. Here is the version that precompiles the regexes (and incorporates some minor fixes):

                     

                    (function () {
                    
                      var re = {
                        A: /[ÁÀÂÄÅĀĄĂÆ]/g,
                        C: /[ÇĆČĊ]/g,
                        D: /[ĎĐ]/g,
                        E: /[ÉÈÊËĘĒĔĖĚ]/g,
                        G: /[ĢĜĞĠ]/g,
                        H: /[ĤĦ]/g,
                        I: /[ÍÌÎÏĪĨĬĮİ]/g,
                        J: /Ĵ/g,
                        K: /Ķ/g,
                        L: /[ŁĹĻĽ]/g,
                        N: /[ÑŃŇŅŊ]/g,
                        O: /[ÓÒÔÖŌŎŐØŒ]/g,
                        R: /[ŔŘŖ]/g,
                        S: /[ŚŠŜŞȘß]/g,
                        T: /[ŢȚŤŦ]/g,
                        U: /[ÚÙÛÜŮŪŲŨŬŰŲ]/g,
                        W: /Ŵ/g,
                        Y: /[ŸÝŶ]/g,
                        Z: /[ŹŻŽ]/g
                      };
                    
                      function neutraliseAccents (s) {
                        return s.toUpperCase().
                          replace (re.A, '[[=a=]]').
                          replace (re.C, '[[=c=]]').
                          replace (re.D, '[[=d=]]').
                          replace (re.E, '[[=e=]]').
                          replace (re.G, '[[=g=]]').
                          replace (re.H, '[[=h=]]').
                          replace (re.I, '[[=i=]]').
                          replace (re.J, '[[=j=]]').
                          replace (re.K, '[[=k=]]').
                          replace (re.L, '[[=l=]]').
                          replace (re.N, '[[=n=]]').
                          replace (re.O, '[[=o=]]').
                          replace (re.R, '[[=r=]]').
                          replace (re.S, '[[=s=]]').
                          replace (re.T, '[[=t=]]').
                          replace (re.U, '[[=u=]]').
                          replace (re.W, '[[=w=]]').
                          replace (re.Y, '[[=y=]]').
                          replace (re.Z, '[[=z=]]');
                      }
                    
                      function markDuplicates () {
                        var i, j;
                        var story;
                        var list;
                        var found;
                    
                        app.findGrepPreferences = null;
                        app.findGrepPreferences.fontStyle = 'Regular';
                        story = app.selection[0].parentStory;
                        list = story.contents.split('\r');
                        for (i = 0; i < list.length; i++) {
                          app.findGrepPreferences.findWhat = '(?i)^' + neutraliseAccents (list[i]) + '$';
                          found = story.findGrep();
                          if (found.length > 1) {
                            for (j = 0; j < found.length; j++) {
                              found[j].fontStyle = 'Bold';
                            }
                          }
                        }
                      }
                    
                      if (app.selection.length > 0 && app.selection[0].hasOwnProperty ('parentStory')) {
                        markDuplicates();
                      }
                    
                    }());
                    

                     

                    A final speed improvement would be to keep track of the words that have already been processed, which could be added to the markDuplicates function.
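One possible way to keep track of processed words, sketched in plain JavaScript outside InDesign: reduce each entry to an accent-neutral key and group entries by that key, so every distinct word is handled once. This is an assumption about how the idea might look, not the script's actual implementation. `neutralKey` and `findDuplicateGroups` are hypothetical names, and the accent table is a reduced Spanish subset.

```javascript
// Hypothetical key function: uppercase and strip accents (Spanish subset only).
function neutralKey (s) {
  return s.toUpperCase().
    replace(/[ÁÀÂÄ]/g, 'A').
    replace(/[ÉÈÊË]/g, 'E').
    replace(/[ÍÌÎÏ]/g, 'I').
    replace(/[ÓÒÔÖ]/g, 'O').
    replace(/[ÚÙÛÜ]/g, 'U').
    replace(/Ñ/g, 'N');
}

// Group entries by neutral key; return only groups with more than one member.
function findDuplicateGroups (list) {
  var groups = {};   // key -> entries seen so far
  var order = [];    // keys in order of first appearance
  var result = [];
  var i, key;
  for (i = 0; i < list.length; i++) {
    key = neutralKey(list[i]);
    if (key === '') continue;             // skip empty paragraphs
    if (!groups.hasOwnProperty(key)) {    // first time this word is seen
      groups[key] = [];
      order.push(key);
    }
    groups[key].push(list[i]);
  }
  for (i = 0; i < order.length; i++) {
    if (groups[order[i]].length > 1) result.push(groups[order[i]]);
  }
  return result;
}

var duplicates = findDuplicateGroups(['Alma', 'Aro', 'Aró', 'Azul']);
// duplicates contains a single group holding 'Aro' and 'Aró'
```

With the groups known in advance, a search would only be needed for entries that belong to a group, instead of one search per paragraph.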

                     

                    Peter

                    • 7. Re: Grep question: in a list detect duplicated words with/without accented vowels
                      camilo umana Level 1

                      Peter,

                       

                      Your first script (24.12) perfectly solved both problems: finding duplicated words that differ only by an accent, and capturing identical words that differ only in case (Gobierno, gobierno). The latter is very useful, because in different contexts a noun may be proper or common, and that is a very sensitive variable in a spelling process.*

                       

                      Identical words differing only in consonants such as n / ñ were also tagged, which is very good, as typos sometimes live there.

                      You cleverly included consonants: in some languages not only vowels are accented.

                       

                      I am sure these scripts will be a must for spelling routines and other tasks.

                       

                      Thanks for your time and generosity.

                       

                      C.

                       

                       

                      *Regarding today's versions (25.12): I ran both on the same piece of text (although only 1800 entries) and could not find any difference between them.

                      • 8. Re: Grep question: in a list detect duplicated words with/without accented vowels
                        JR_Boulay Adobe Community Professional

                        Thank you Peter.

                        I cannot do it now but I will test it next year.

                         

                        Happy new year.