8 Replies Latest reply on Apr 6, 2010 11:27 PM by Harbs.

    Iterating over all text?

    John Hawkinson Level 5

      I'd like to iterate over all the text in a document (inside groups, tables, etc., etc.) and not miss any bizarre corner cases.

       

      I thought I had seen a script from Marc Autret that addressed this, but I couldn't find it (instead, I found the [JS][CS3] Getting Page Number but I thread, which basically goes the other way).

       

      I recently discovered that the method I had been using misses text inside tables. And the version I wrote a while ago initially missed items inside groups. So I'm wondering if someone has a tried-and-true function that does this kind of thing.

       

      Here's what I have -- this works fine without tables. It looks for "@@" in any text frame in the document:

       

      // Assumes doc (the Document), debug (a boolean) and lines (an array
      // that collects the report) are defined elsewhere in the script.
      var i, j;
      for (i = 0; i < doc.pages.length; i++) {
          var p = doc.pages[i];
          for (j = 0; j < p.masterPageItems.length; j++)
              check_at("Master on p." + p.name, p.masterPageItems[j]);
          for (j = 0; j < p.pageItems.length; j++)
              check_at("On p." + p.name, p.pageItems[j]);
      }

      function check_at(name, pi) {
          if (debug) $.writeln(pi.constructor.name + " on " + name);
          // Report a short excerpt around the first "@@" in this item's text.
          if ('contents' in pi && pi.contents.match("@@")) {
              var i = pi.contents.indexOf("@@");
              var s = Math.max(i - 23, 0);
              var e = Math.min(i + 23, s + 37);
              lines.push(name + ":  "
                  + pi.contents.substring(s, e).replace(/\r/g, "\\n"));
          }
          if ('pageItems' in pi) // recurse into groups
              for (var k = 0; k < pi.pageItems.length; k++)
                  check_at(name + "[g]", pi.pageItems[k]);
      }
      

       

      but this fails on tables, because a TextFrame's contents property does not return the contents of a table.

      (I also realized today that the group handling could be ignored if I just used "allPageItems" instead of "pageItems").

       

      Anyhow, I guess I could also iterate over pi.tables and for each one, check .contents.join("\n").match("@@"). Since the contents of a table are an array, that would be joining all the cells together into one string and searching that string.
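A minimal sketch of the idea just described, assuming a frame exposes a string `contents` and a `tables` collection whose members each expose an array-valued `contents`, as this post says. The helper name `frameContains` is hypothetical, and the sample object at the end is a plain stand-in, not a real InDesign object.

```javascript
// Hypothetical helper: true when `needle` occurs in the frame's own text
// or in any cell of any table held by that frame.
function frameContains(frame, needle) {
    var t;
    // The frame's own contents is a string that leaves out table text.
    if (typeof frame.contents === "string" &&
            frame.contents.indexOf(needle) !== -1) {
        return true;
    }
    // Items without a tables collection have nothing more to check.
    if (!frame.tables) {
        return false;
    }
    for (t = 0; t < frame.tables.length; t++) {
        // A table's contents is an array of cell strings: join, search once.
        if (frame.tables[t].contents.join("\n").indexOf(needle) !== -1) {
            return true;
        }
    }
    return false;
}

// Plain-object stand-in for a text frame holding one two-cell table.
var sampleFrame = {
    contents: "intro text",
    tables: [{ contents: ["first cell", "see @@ here"] }]
};
var sampleResult = frameContains(sampleFrame, "@@");
```

Joining with a newline keeps a needle from matching across two neighbouring cells, as long as the needle itself contains no newline.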

       

      But I'm worried this is insufficiently robust? And it certainly is ugly.

       

      Any good experience on this sort of thing? Thanks.

        • 1. Re: Iterating over all text?
          [Jongware] Most Valuable Participant

          If you really want to catch each and every piece of text in your document, you don't have to check all TextFrames. The basic text object is a "Story", so it's sufficient to loop over all stories. This will also catch stuff inside anchored frames, and even on the pasteboard. To differentiate, you'll need something like Marc Autret's routine, checking the parent of the story (it'll be something like Page, Spread, Document, or Character -- for an anchored object --, but do check as this is from memory).

           

          Every single story can contain one or more tables, and these can be accessed immediately, as you found out. But as soon as you have a handle on a table, you can check its Cells array, which is a linear array containing each unique *cell* (including its contents).

           

          This quick sample loops over all stories in your document -- whether anchored or on the pasteboard or elsewhere --, and all tables inside those, pasting together their contents. (And a notable exception is "Footnotes" -- but these are quite similar to tables, except you can have a table inside a footnote but not the other way around. Tables inside footnotes are *not* caught by a Story's tables array.)

           

           

          var doc = app.activeDocument;
          var string = '';
          var st, a, b, s;
          for (st = 0; st < doc.stories.length; st++)
          {
               s = doc.stories[st];
               string += "Story: " + s.contents + "\r";
               for (a = 0; a < s.tables.length; a++)
               {
                    for (b = 0; b < s.tables[a].cells.length; b++)
                         string += "Cell: " + s.tables[a].cells[b].contents + "\r";
               }
          }
          alert(string);
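The footnote exception mentioned above could be handled along these lines. This is only a sketch: it assumes each story exposes a `footnotes` collection, and that each footnote has its own `contents` and `tables`. The function names `collectStoryText` and `appendCells` are hypothetical. It returns an array of labelled strings instead of calling `alert`, so it can be tried on plain objects.

```javascript
// Hypothetical helper: push one labelled line per cell of every table.
function appendCells(tables, label, out) {
    var a, b;
    if (!tables) {
        return;
    }
    for (a = 0; a < tables.length; a++) {
        for (b = 0; b < tables[a].cells.length; b++) {
            out.push(label + tables[a].cells[b].contents);
        }
    }
}

// Hypothetical helper: story text, its table cells, then each footnote
// and the cells of any tables inside that footnote.
function collectStoryText(story) {
    var out = ["Story: " + story.contents];
    var footnotes = story.footnotes || [];
    var f;
    appendCells(story.tables, "Cell: ", out);
    for (f = 0; f < footnotes.length; f++) {
        out.push("Footnote: " + footnotes[f].contents);
        appendCells(footnotes[f].tables, "Footnote cell: ", out);
    }
    return out;
}
```

Inside InDesign this would be called once per entry of `app.activeDocument.stories`, with the returned lines joined for display.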
          

          • 2. Re: Iterating over all text?
            John Hawkinson Level 5

            Yeah, I realize I can iterate over each Story. I'd prefer to know the PageItem that matches, and to be able to report it, potentially select it or focus on it, and to report back (the lines array in my original example) one hit per PageItem, rather than one hit per Story. I imagine wanting to report the position of the enclosing PageItem, etc.

             

            I guess I'd really like to have some methods that are a bit more type-agnostic, that don't rely on knowing that a Story has both contents as well as Tables that themselves have contents or cells that also have contents.

             

            Maybe to recurse over all properties of the object and see if they have a contents sub-property, and if so check that. Though if I just let that run, it would return both the contents of the table (an array) as well as the contents of the cells of the table (strings). I suspect it would also be slow.
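One data-driven middle ground, sketched under assumptions: instead of reflecting over every property, the walker consults a small table of which child collections to descend into. The table `CHILDREN`, the function `findInContents` and the path strings are all hypothetical, and the collection names are just the ones that come up in this thread. Only string-valued `contents` is tested, so a table's array-valued `contents` is skipped and each cell is reported once.

```javascript
// Hypothetical map: for an object reached through a given collection,
// which child collections should be searched next. Extend as needed.
var CHILDREN = {
    root: ["tables", "footnotes", "pageItems"],
    pageItems: ["tables", "footnotes", "pageItems"],
    footnotes: ["tables"],
    tables: ["cells"],
    cells: ["tables"]
};

// Returns the paths of all objects whose string contents include needle.
function findInContents(obj, needle, path, kind, hits) {
    var names = CHILDREN[kind || "root"] || [];
    var n, k, kids;
    hits = hits || [];
    if (typeof obj.contents === "string" &&
            obj.contents.indexOf(needle) !== -1) {
        hits.push(path);
    }
    for (n = 0; n < names.length; n++) {
        kids = obj[names[n]];
        if (!kids) {
            continue; // this object has no such collection
        }
        for (k = 0; k < kids.length; k++) {
            findInContents(kids[k], needle,
                path + "." + names[n] + "[" + k + "]", names[n], hits);
        }
    }
    return hits;
}
```

A new kind of container would then mean one more entry in `CHILDREN` rather than a change to the walking code, at the cost of not being fully type-agnostic.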

             

            Maybe I should reevaluate my priorities, though, and accept Story as a better index to this stuff.

            • 3. Re: Iterating over all text?
              [Jongware] Most Valuable Participant

              Yes, you are correct: to immediately be able to select the frame, you could do a run-by per frame. It's possible to get the actual frame in which some threaded text is displayed, but that doesn't seem to be necessary. (And you'd happily skip overset text as well -- since this has *no* frame.)

               

              So checking the tables inside frames ought to work. A warning ;-) Text in a table that threads into another frame is remarkably reluctant to return its actual 'parent frame' -- more advanced scripters than me have discussed this before, on this very forum.

               

              [looping over "everything"] .. it would return both the contents of the table (an array) as well as the contents of the cells of the table (strings) ..

               

              I don't think there is a special need to loop over 'everything'. I'd have to browse back to your original post (which I can't, courtesy of Jive -- "Thou Shalt Reply Only To The Most Recent Post"), but in essence *every* text on your page has to be in at least one text frame per page. No text frame -> no text. And all tables, in turn, ought to be contained inside the text in that frame. Given a table, you don't have to loop over it and then its "children" objects (cells), you can collect $100, then directly inspect the Cells of that table.

               

              If you need to select the outermost, actual page item that may contain your text (the one that's right smack bang placed on your page, not nested-into-a-table-into-a-footnote-into-an-anchored-object), one way to do so would be to:

               

              1. loop over all text frames on a certain page

              2. using a function, inspect if it, or its tables, anchored objects, etc. contain your text -- this function may use some recursion to step inside objects-in-objects

              3. select the frame from #1 if so.
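The three steps above might be sketched as follows. The names `holdsText` and `framesOnPageContaining` are hypothetical, and the sketch assumes a page exposes a `textFrames` collection and that frames and cells expose `contents`, `tables` and `pageItems` where applicable. Step 3 is reduced to returning the matching frames so that the sketch does not depend on the host application.

```javascript
// Step 2: does this item, or anything nested in it, contain needle?
function holdsText(item, needle) {
    var tables = item.tables || [];
    var nested = item.pageItems || [];
    var t, c, p;
    if (typeof item.contents === "string" &&
            item.contents.indexOf(needle) !== -1) {
        return true;
    }
    for (t = 0; t < tables.length; t++) {
        for (c = 0; c < tables[t].cells.length; c++) {
            // Recursing on the cell also covers tables nested inside it.
            if (holdsText(tables[t].cells[c], needle)) {
                return true;
            }
        }
    }
    for (p = 0; p < nested.length; p++) {
        if (holdsText(nested[p], needle)) {
            return true;
        }
    }
    return false;
}

// Steps 1 and 3: loop over the page's text frames, keep the ones that match.
function framesOnPageContaining(page, needle) {
    var found = [];
    var i;
    for (i = 0; i < page.textFrames.length; i++) {
        if (holdsText(page.textFrames[i], needle)) {
            found.push(page.textFrames[i]);
        }
    }
    return found;
}
```

Inside InDesign the returned frames could then be selected, for example by assigning them to the application's selection.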

              • 4. Re: Iterating over all text?
                Harbs. Level 6

                You can use story and then work back up to get the TextFrame which contains it.

                 

                You can also iterate through doc.stories.everyItem().textContainers.

                 

                Of course, if you have nested stuff, you'll need stuff like: 

                stories.everyItem().footnotes.everyItem().tables.....

                 

                Harbs

                • 5. Re: Iterating over all text?
                  John Hawkinson Level 5

                  Jongware, the idea of looping over everything is to deal with all container-type objects inside a pageItem. So if CS5 adds a contentAwareTextResizing object as a property that contains a contents property, then I won't have to modify my script. I'd prefer to write a script that doesn't have to know that Stories can contain tables that can contain cells that contain contents. I'd much rather know that PageItems can have multiple children, any of which might have contents.

                   

                  Re Jive: surely you use "open in new tab"?

                   

                  Oh, thanks for pointing out that this won't work on footnotes. I don't really care about footnotes, but it would be nice to have a simple and generic solution.

                   

                  Harbs, I have stories that thread through multiple textFrames, so going up to the Story loses that.

                   

                  I guess the short answer is no one has a robust canned snippet that deals with all this stuff. Ah well.

                  • 6. Re: Iterating over all text?
                    Harbs. Level 6

                    Story.textContainers gives you an array of all the text frames which hold the story...

                     

                    As long as you are not dealing with tables or footnotes,

                     

                    doc.stories.everyItem().textContainers should give you every text frame in the doc.

                     

                    Harbs
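A rough sketch of how `textContainers` could keep the per-frame reporting for threaded stories. It assumes, as described above, that each story exposes a `textContainers` array and that each container has its own `contents`. The function name `hitsPerContainer` and the shape of the returned records are hypothetical. Tables are not covered here.

```javascript
// One hit per container (frame) whose own text includes needle.
function hitsPerContainer(stories, needle) {
    var hits = [];
    var s, c, frames;
    for (s = 0; s < stories.length; s++) {
        frames = stories[s].textContainers;
        for (c = 0; c < frames.length; c++) {
            // String() guards against contents that is not a plain string.
            if (String(frames[c].contents).indexOf(needle) !== -1) {
                hits.push({ story: s, container: c, frame: frames[c] });
            }
        }
    }
    return hits;
}
```

Because each container is searched separately, a needle that straddles the break between two threaded frames would be missed by this approach.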

                    • 7. Re: Iterating over all text?
                      John Hawkinson Level 5

                      The whole point here is that yes, I have to deal with tables.

                       

                      Why is Story.textContainers better than Document.pageItems?

                      • 8. Re: Iterating over all text?
                        Harbs. Level 6

                        Because it'll get nested textFrames as well [without the need to filter out allPageItems]...

                         

                        To deal with the tables, you'd need two separate loops.

                         

                        doc.stories.everyItem().tables.everyItem().cells.everyItem() gives you all cells except ones in nested tables, ones in footnotes, header cells and footer cells.

                         

                        HTH,

                        Harbs
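For the nested tables that the chain above leaves out, an explicit recursive loop is one option. This is a sketch under the assumption that a cell, like a story, exposes a `tables` collection holding any tables nested inside it. The name `collectCells` is hypothetical.

```javascript
// Gather every cell of every table in container, descending into
// tables nested inside cells. Cells are returned parent-first.
function collectCells(container, out) {
    var tables = container.tables || [];
    var t, c, cell;
    out = out || [];
    for (t = 0; t < tables.length; t++) {
        for (c = 0; c < tables[t].cells.length; c++) {
            cell = tables[t].cells[c];
            out.push(cell);
            collectCells(cell, out); // tables nested in this cell, if any
        }
    }
    return out;
}
```

The same function could be pointed at a footnote instead of a story, provided footnotes expose `tables` in the same way.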