
SortParagraphs Rules Do Not Follow Standards

Jul 11, 2011 9:41 PM

SortParagraphs seems to use different sets of rules in determining the sorting order of hyphenated words and quotes. I sorted the following in InDesign’s SortParagraphs.applescript and compared the result with Excel and BBEdit. The latter two are the same.

 

InDesign Result:

 

Bacon-Feta Chicken Rolls
Bacon Mushroom Chicken
Cheddar Chicken Spaghetti
Chicken-Pesto Pan Pizza
Chicken and Barley Boiled Dinner
Chicken ‘n’ Chilies Casserole

 

Excel, BBEdit Results:

 

 

Bacon Mushroom Chicken
Bacon-Feta Chicken Rolls
Cheddar Chicken Spaghetti
Chicken ‘n’ Chilies Casserole
Chicken and Barley Boiled Dinner
Chicken-Pesto Pan Pizza

 

I read the guidelines in www.niso.org/publications/tr/tr03.pdf and it appears to me that Excel and BBEdit follow the rules stated in the guidelines.

 

Any comments or suggestions on how to make InDesign sort the way Excel and BBEdit do?

 
Replies
  • Jul 12, 2011 5:59 AM   in reply to jaychow99

    The SortParagraphs.jsx I have (CS4) has a number of settings -- you don't say which one(s) you used. And they are here because they make a BIG difference in the output order ...

     

    If you examine SortParagraphs.jsx, you can see exactly how it prepares the strings before comparing. Any special processing should be done inside the mySortXXXX functions, and you are welcome to suggest any changes, or even implement them however you want. (Now go and try that with Excel and TextWrangler!)

     

    A small word of warning: you may be getting different results than you expected because of unexpected differences in character values. For example, the ' character in "Chicken 'n' Chilies" could be a straight quote in InDesign (in which case it comes before alphanumerics at the same position), or it could be a curly one (whose code is way beyond the alphanumerics). And if you copied the text into TextWrangler and Excel, it might get translated again into either a straight or a curly quote, which would have yet another effect on the sorting order (if any).
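
    To make the point about character values concrete, here is a minimal sketch in plain JavaScript (ES3 style, so it should also run in ExtendScript). The variable names are made up for illustration; only charCodeAt and the < operator are standard.

```javascript
// Code values of the characters discussed above.
var straight = "'";       // U+0027 APOSTROPHE
var curly = "\u2018";     // U+2018 LEFT SINGLE QUOTATION MARK

var codes = {
    space: " ".charCodeAt(0),         // 32
    straight: straight.charCodeAt(0), // 39
    hyphen: "-".charCodeAt(0),        // 45
    a: "a".charCodeAt(0),             // 97
    curly: curly.charCodeAt(0)        // 8216
};

// A plain < comparison looks at these values, so the two spellings of the
// same title land on opposite sides of "Chicken and ...".
var withStraight = "Chicken 'n' Chilies Casserole";
var withCurly = "Chicken \u2018n\u2019 Chilies Casserole";
var reference = "Chicken and Barley Boiled Dinner";

var straightFirst = withStraight < reference;  // true: 39 < 97
var curlyFirst = withCurly < reference;        // false: 8216 > 97
```

    So whether the title sorts before or after "Chicken and ..." depends only on which quote character actually ended up in the text.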

     

    You are correct to observe that SortParagraphs uses "different" rules, rather than "wrong". Sorting order is one of those things you can discuss at length, usually with people you disagree with. And you cannot change the sorting rules of Excel and TextWrangler, or even get to find out what they are -- but you can tinker with the script to get it to sort just as you like.

     
  • Jul 12, 2011 6:08 AM   in reply to [Jongware]

    Oh that PDF you refer to is good! With some very careful programming, I got the integer decimals to sort correctly in this:

     

    0.25 mm

    .300 Vickers machine gun

    .303-inch machine guns

    007 James Bond

    1 2 3 for Christmas

    1, 2, buckle my shoe

    1-4-5 boogie-woogie

    2 kinetic sculptors

    2-phase flow in turbines

    2 x 2 = 5

    3-D scale drawing

    3 point 2 and what goes with it

    3M Company

    3.1416 and all that

    10 stars from the forties

    17 days to better living

    XVIIe & XVIIIe siècles

    XVIIme siècle

    XX century encyclopedia

    20 funny stories

     

    and I think I could revise that code to cater for fractions as well. But those roman numerals ...

     
  • Jul 12, 2011 1:26 PM   in reply to [Jongware]

    Very interesting resource, thanks!!

     

    I also want to mention TCollator, a JS library that you can use in any script:

    http://www.indiscripts.com/post/2010/10/alphabetical-sort-in-javascript-and-indesign

     

    (TCollator does not parse numerals as specified in the NISO's guidelines, but it offers other helpful options—especially when ordering foreign words.)

     

    @+

    Marc

     
  • John Hawkinson
    5,572 posts
    Jun 25, 2009
    Jul 12, 2011 6:11 PM   in reply to jaychow99

    Those guidelines are in no way canonical. In fact, if you read the notes, they explain they couldn't even agree on an ANSI standard, much less an ISO standard.

     

    There are lots of different ways to sort, and lots of knobs to use. If you really care, I suggest you stop looking at the AppleScript sorting script, because it is dependent on how comparisons work in AppleScript, which I don't believe has good support for locales, which are essential for modern sorting.

     

    The JavaScript version is much more amenable to tweaking...I would start there.

    It might even do what you want from the start!

     
  • Jul 12, 2011 9:11 PM   in reply to John Hawkinson

    Fellow Scripters,

     

    The point of the SortParagraphs sample script was just to demonstrate a simple bubble sort in each scripting language. I'll admit that I didn't think too much about standards for sorting when I wrote it! So I expect that there are differences between the AppleScript, JavaScript, and VBScript versions of the script.

     

    If nothing else, it's a better sort than the ExtendScript Toolkit uses for sorting entries in the scripting DOM!:-)

     

    Thanks,

     

    Ole

     
  • Jul 12, 2011 10:30 PM   in reply to Max Dunn

    Hey Ole,

     

    I'm not sure if you're aware, but you're logged in as Max Dunn.

     

    I'm not sure if that's subliminal messaging about where you're working now or what...

    <LOL>

     

    Harbs

     
  • John Hawkinson
    Jul 12, 2011 10:37 PM   in reply to jaychow99

    Have you tried the Javascript version?

     
  • Jul 12, 2011 11:29 PM   in reply to Harbs.

    Hi Harbs,

     

    There, I think I'm myself again.

     

    Anyway, this sample script should probably be revisited at some point. I expect that JavaScripts conforming to some standard or other (never my best thing, conforming to standards...) are available "off the shelf." We should look around for some examples and see what we find.

     

    Thanks,

     

    Ole

     

    PS/edit: heh, it says I have 3 (three) posts. What a newbie I am!

     
  • Jul 13, 2011 1:56 AM   in reply to Olav Kvern

    Olav Kvern wrote:

     

    PS/edit: heh, it says I have 3 (three) posts. What a newbie I am!

     

    Hi Ole! Yup. It's been going downhill fast since you left Adobe. Coincidence?

     
  • Jul 13, 2011 2:03 AM   in reply to jaychow99

    If you disable "Ignore Spaces" the Javascript sorts the exact same way as BBEdit does!!

     

    That is, if you also pay heed to what I stated about the ' character: when straight up, Chicken 'n' Chilies is in the middle, when curly it's one position down. So it seems I was right all along.

     

    Ole, I wasn't aware of any compare differences between Applescript, VBScript, and Javascript. Surely they all sort plain ASCII (/Unicode) strings according to the Unicode values?

     
  • Jul 13, 2011 2:38 AM   in reply to jaychow99

    After a little experimenting, I got this output:

     

    Bacon-Feta Chicken Rolls

    Bacon Mushroom Chicken

    Cheddar Chicken Spaghetti

    Chicken and Barley Boiled Dinner

    Chicken ‘n’ Chilies Casserole

    Chicken-Pesto Pan Pizza

     

    This is still different from your target solution, but on the other hand it satisfies the rules you lay out elsewhere: hyphen gets sorted as space, all non-alphanumeric characters are ignored.

     

    The changes I made in SortParagraphs.jsx were on lines 210-211:

     

     a = a.toLowerCase().replace(/\s\s+/g, ' ').replace(/-/g, ' ').replace(/[^\w]/g, '');
     b = b.toLowerCase().replace(/\s\s+/g, ' ').replace(/-/g, ' ').replace(/[^\w]/g, '');
    

     

    And of course I disabled the "Ignore Spaces" checkbox. (That annoying sorting, by the way, is the same as in ID's built-in Indexing function!)

     
  • Jul 13, 2011 4:53 AM   in reply to [Jongware]

    Correction:

     

     a = a.toLowerCase().replace(/\s+/g, ' ').replace(/-/g, ' ').replace(/[^\w ]/g, '');
     b = b.toLowerCase().replace(/\s+/g, ' ').replace(/-/g, ' ').replace(/[^\w ]/g, '');
     
     
     
    

     

    ... The previous version first replaced all double spaces with a single space, then all dashes with a space, then all not-word characters with nothing. That last step also removed all spaces ...
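
    The difference between the two versions can be seen on a single title. This is only a sketch; the variable names are made up, and the replace chains are copied from the two snippets above.

```javascript
var title = "Chicken-Pesto  Pan Pizza";

// First attempt: [^\w] also matches the space, so every space is stripped.
var first = title.toLowerCase()
    .replace(/\s\s+/g, " ")
    .replace(/-/g, " ")
    .replace(/[^\w]/g, "");

// Corrected: the space is listed inside the negated class, so it survives.
var corrected = title.toLowerCase()
    .replace(/\s+/g, " ")
    .replace(/-/g, " ")
    .replace(/[^\w ]/g, "");

// first     is "chickenpestopanpizza"
// corrected is "chicken pesto pan pizza"
```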

     
  • Jul 13, 2011 7:59 AM   in reply to [Jongware]

    Hi Jongware,

     

    re: "It's been going downhill fast since you left Adobe."

     

    Oh, I don't see that at all--looks like it's thriving, and in good hands. I have a bit more time now, so I hope to stop by more often.

     

    re: "I wasn't aware of any compare differences between Applescript, VBScript,  and Javascript. Surely they all sort plain ASCII (/Unicode) strings  according to the Unicode values?"

     

    I'm not aware of any differences at this point, but I recall AppleScript being slightly different early on when comparing some Unicode values. I don't *know* that the three languages sort text the same way, so I try to be cautious.

     

    Thanks,

     

    Ole

     
  • Jul 14, 2011 2:48 AM   in reply to jaychow99

    That's some good experimenting you did there. But alas, the solution is not so simple.

     

    Did you notice the InDesign sorting is not stable? If you run it again, some items may move around! That's because of an inherent flaw in the sorting algorithm. If you hand it two different strings, you should always get the same result, right? "a" is always sorted before "b".

    Similarly, if you hand it two identical strings, you should always get the same result: when you hand over "a¹" and "a²" (the superscripts are only there to differentiate the two; they are not actually part of the sorted strings) the algorithm should not switch them around. Quite logical; and the reverse is also true: when given "a²" and "a¹" (the reverse), the algorithm also should not switch them. And due to the way sorting works, you cannot know in advance whether two arguments are going to be in the order 1-2 or in the order 2-1. Usually, it's not important either.

     

    So why this longish preamble? Because the comparing routine may FORCE two different strings to end up the same!

     

    Step #1: all strings are forced to lowercase. That means you run into the above problem when comparing "Stuart" and "stuart".

    Step #2: all double spaces are forced into a single one. That means, uh, "Stuart Little" will be the same as "Stuart    Little".

    Step #3: all hyphens are replaced with spaces. Uh. You think of an example.

    Step #4: all non-word characters are removed.

     

    It's this final step that causes the instability in comparing "Chicken 'n' Chilies Casserole" with "Chicken n Chilies Casserole" (and all variants thereof) -- they all get compared as "chicken n chilies casserole", and they all are the same.

     

    A fairly good solution is to check if the strings end up the same after pre-processing, and then progressively undo the changes until they are not. Only if the original strings are actually identical should you report them as equal; otherwise, you should take the original differences into account.

     

    If you use this new mySort function, you get a stable result:

     

    function mySort (a, b) {
     var aa = a.toLowerCase().replace(/\s+/g, ' ').replace(/-/g, ' ').replace(/[^\w ]/g, '');
     var bb = b.toLowerCase().replace(/\s+/g, ' ').replace(/-/g, ' ').replace(/[^\w ]/g, '');
     if (aa == bb)
     {
      aa = a.toLowerCase();
      bb = b.toLowerCase();
     }
     if (aa == bb)
     {
      aa = a;
      bb = b;
     }
     
     if(aa > bb){
      return 1;
     }
     if(aa < bb){
      return -1;
     }
     return 0;
    }
    

     

    but, as always, there is one caveat. Funnily enough, it's the same one I mentioned in my very first post in this thread:

    JavaScript compares one string to another by comparing the ASCII (Unicode) values of the characters they contain, one by one, and the very first pair to differ determines the "result" for the entire string. So "a" < "b" because the value 97 ('a') is less than 98 ('b'), and "A" < "a" for the same reason. Now the problem here is the straight quotes, double and single. Their values are way lower than that of 'n' (the next character they should be compared to), so they always end up at the top of your Quote Comparing list:

     

    Bacon-Feta Chicken Rolls

    Bacon Mushroom Chicken

    Cheddar Chicken Spaghetti

    Chicken and Barley Boiled Dinner

    Chicken "n" Chilies Casserole

    Chicken 'n' Chilies Casserole

    Chicken n Chilies Casserole

    Chicken ‘n’ Chilies Casserole

    Chicken “n” Chilies Casserole

    Chicken-Pesto Pan Pizza

     

    Circumventing this to get your preferred sort order is not a trivial task, I'm afraid.
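
    For anyone who wants to reproduce the list above outside InDesign, here is a self-contained sketch. The comparator is a copy of mySort from this post; the scrambled input array is made up, and it assumes the last quote variant in the list uses curly double quotes.

```javascript
// Copy of the mySort comparator from the post above, so this runs on its own.
function mySort(a, b) {
    var aa = a.toLowerCase().replace(/\s+/g, " ").replace(/-/g, " ").replace(/[^\w ]/g, "");
    var bb = b.toLowerCase().replace(/\s+/g, " ").replace(/-/g, " ").replace(/[^\w ]/g, "");
    if (aa == bb) { aa = a.toLowerCase(); bb = b.toLowerCase(); }
    if (aa == bb) { aa = a; bb = b; }
    if (aa > bb) { return 1; }
    if (aa < bb) { return -1; }
    return 0;
}

// Deliberately scrambled input; the \u escapes stand for the curly quotes.
var titles = [
    "Chicken-Pesto Pan Pizza",
    "Chicken \u201Cn\u201D Chilies Casserole",
    "Chicken n Chilies Casserole",
    "Bacon Mushroom Chicken",
    "Chicken 'n' Chilies Casserole",
    "Cheddar Chicken Spaghetti",
    "Chicken \u2018n\u2019 Chilies Casserole",
    "Chicken and Barley Boiled Dinner",
    "Chicken \"n\" Chilies Casserole",
    "Bacon-Feta Chicken Rolls"
];

// Sort a copy so the original order stays available for comparison.
var sorted = titles.slice(0).sort(mySort);
```

    Because no two distinct strings compare as equal with this comparator, the result does not depend on whether the underlying sort is stable.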

     
  • Jul 14, 2011 3:54 AM   in reply to [Jongware]

    You probably need to weed out the "and" and 'n' as well, so maybe add replace (/\s(n|and)\s/, " ") to the replacements you already have there. But can't your mySort function be rephrased like this?

     

    function mySort (a, b){
        var aa = a.toLowerCase().replace(/\s+/g, ' ').replace(/[^\w ]/g, '').replace (/\s(n|and)\s/, " ");
        var bb = b.toLowerCase().replace(/\s+/g, ' ').replace(/[^\w ]/g, '').replace (/\s(n|and)\s/, " ");
        // A sort comparator must return a negative, zero, or positive number, not a boolean.
        return aa < bb ? -1 : (aa > bb ? 1 : 0);
    }

     

    I didn't quite understand all the comparisons.

     

    Peter

     
  • Jul 14, 2011 4:16 AM   in reply to Peter Kahrel

    Peter, the general idea is that weeding out all the usual suspects may give you the same strings when the originals are not.

     

    Compare, for example, the 'toLowerCase' usage. If you compare "abc" and "DEF", you want "abc" before "DEF", not after it as the Unicode order would prescribe.

    But if you have "Abc" and "abc", you always want the capital version first. Since toLowerCase unifies the strings and makes them the same, the sorting routine won't see any difference and the result is unstable.

     

    So if toLowerCase makes both strings equal, you have to compare the original strings. Only if these are the same, there is 'no difference at all'.
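
    Reduced to just the toLowerCase step, the idea looks roughly like this. It is a sketch; the function name is made up for illustration.

```javascript
// Compare case-insensitively first; fall back to the original strings only
// when lowercasing made them identical.
function compareIgnoringCaseFirst(a, b) {
    var aa = a.toLowerCase();
    var bb = b.toLowerCase();
    if (aa === bb) {
        aa = a;
        bb = b;
    }
    return aa < bb ? -1 : (aa > bb ? 1 : 0);
}

var words = ["abc", "DEF", "Abc"];
words.sort(compareIgnoringCaseFirst);
// words is now ["Abc", "abc", "DEF"]
```

    The function returns 0 only when the two original strings are identical, so the order of the output no longer depends on the order of the input.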

     
  • John Hawkinson
    Jul 14, 2011 4:41 AM   in reply to [Jongware]

    Jongware:

    Did you notice the InDesign sorting is not stable? If you run it again, some items may move around! That's because of an inherent flaw in the sorting algorithm. If you hand it two different strings, you should always get the same result, right? "a" is always sorted before "b".

    I think you mean a flaw in the COMPARISON algorithm, not the sorting algorithm, per se.

     

    I think you are supposed to be able to solve this in JavaScript with localeCompare(), a method inherited from String.prototype, though I'll believe you if you say it does not work here.

     

    I feel like your proposed implementation was a lot more complicated than it should have been. I'm not going to wade in any further on this one, and I know that's horribly unfair...

     

    Circumventing this to get your preferred sort order is not a trivial task, I'm afraid.

     

    Can't we just set LC_COLLATE to something appropriate (hopefully "C") and go? Probably not.

    It may well be advisable to use an external sorting tool. oh, blah.

     
  • Jul 14, 2011 5:05 AM   in reply to John Hawkinson

    John Hawkinson wrote:

     

    Jongware:

    Did you notice the InDesign sorting is not stable? If you run it again, some items may move around! That's because of an inherent flaw in the sorting algorithm. If you hand it two different strings, you should always get the same result, right? "a" is always sorted before "b".

    I think you mean a flaw in the COMPARISON algorithm, not the sorting algorithm, per se.

     

    Duh. Yeah, tried to keep up the typing with the thinking. The sorting algorithm has nothing to do with it, you (= we) are providing the comparison.

     

    I think you are supposed to be able to solve this in JavaScript with localeCompare(), a method inherited from String.prototype, though I'll believe you if you say it does not work here.

     

     

    That should only give differences in sort order for different languages, e.g., the "AE" ligature comes last in Danish but right after the "A" in Norwegian -- or the other way around. Theoretically, it also might have unified the straight/double/curly quote stuff here, but a quick experiment shows It Does Not.
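
    A quick experiment of that kind could look like the sketch below. It is not the exact test that was run; the variable names are made up, and since the outcome of localeCompare depends on the host's locale support, no particular result is assumed.

```javascript
// Probe how localeCompare ranks the quote variants.
var variants = [
    "Chicken 'n' Chilies",            // straight single quotes
    "Chicken \"n\" Chilies",          // straight double quotes
    "Chicken \u2018n\u2019 Chilies",  // curly single quotes
    "Chicken \u201Cn\u201D Chilies"   // curly double quotes
];

var report = [];
for (var i = 0; i < variants.length; i++) {
    for (var j = i + 1; j < variants.length; j++) {
        // A result of 0 would mean the host treats the two spellings as equal.
        report.push(i + " vs " + j + ": " + variants[i].localeCompare(variants[j]));
    }
}
```

    Inspecting report afterwards shows whether any pair of variants comes back as 0 on a given host.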

     

    I feel like your proposed implementation was a lot more complicated than it should have been ...

     

    I see no other solution. The problem is introduced by changing the strings inside the comparison routine; and if they end up 'the same' when they initially were not, you should compare the originals. ¿No?

     
  • Jul 14, 2011 6:53 AM   in reply to [Jongware]

    > But if you have "Abc" and "abc", you always want the capital version first.

     

    True -- hadn't thought of that.

     
  • John Hawkinson
    Jul 23, 2011 1:31 AM   in reply to [Jongware]

    I said, quoting Jongware:

     

    Did you notice the InDesign sorting is not stable? If you run it again, some items may move around! That's because of an inherent flaw in the sorting algorithm. If you hand it two different strings, you should always get the same result, right? "a" is always sorted before "b".

    I think you mean a flaw in the COMPARISON algorithm, not the sorting algorithm, per se.

     

    I'm not sure why I said that, because I was wrong -- or at least incomplete.

     

    If you are going to use a comparison where different characters can compare equally, you need to use a stable sort -- a sort where the relative positions of equally comparing items are maintained before and after the sort.

     

    If you use an unstable sort with comparisons that test equal, then you have a flaw, but it is not in the comparison algorithm. Depending on how you look at it, the flaw is either in the choice of an unstable sort, or in the sorting algorithm (because it is unstable). I would say the former, but I think the argument can be made both ways.

     

    As Jongware rightly observed, InDesign's JavaScript's sort (Array.prototype.sort) is not a stable sort. It is not required to be by the specification (see 15.4.4.11). Curiously, though, most browsers now implement a stable sort.
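
    To illustrate what "stable" means in practice, here is one way to get stable behaviour out of an unstable Array.prototype.sort: tag every item with its original position and use that as the final tie-breaker. This is a swapped-in technique for illustration only, not the merge sort discussed further down, and the names stableSort and ignoreCase are made up.

```javascript
// Make any comparator behave stably by remembering each item's original
// position and using it as the final tie-breaker.
function stableSort(items, compare) {
    var tagged = [];
    var i;
    for (i = 0; i < items.length; i++) {
        tagged.push({ value: items[i], index: i });
    }
    tagged.sort(function (x, y) {
        var result = compare(x.value, y.value);
        return result !== 0 ? result : x.index - y.index;
    });
    var sorted = [];
    for (i = 0; i < tagged.length; i++) {
        sorted.push(tagged[i].value);
    }
    return sorted;
}

// Comparator under which "Stuart" and "stuart" test equal.
function ignoreCase(a, b) {
    var aa = a.toLowerCase();
    var bb = b.toLowerCase();
    return aa < bb ? -1 : (aa > bb ? 1 : 0);
}

var names = ["stuart", "Little", "Stuart", "little"];
var sortedNames = stableSort(names, ignoreCase);
// sortedNames: ["Little", "little", "stuart", "Stuart"]
```

    Items that compare equal come out in the order they went in, whatever the built-in sort does internally.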

     

    Anyhow, back to Jongware and me:

    I feel like your proposed implementation was a lot more complicated than it should have been ...

     

    I see no other solution. The problem is introduced by changing the strings inside the comparison routine; and if they end up 'the same' when they initially were not, you should compare the originals. ¿No?

     

    Well...no. I think the correct answer is to use a stable sort. It is not to try to doctor the comparison method.

    It's pretty well-acknowledged that the correct stable sort to use most of the time is a merge sort.

     

    So probably the correct answer is go find an already-written implementation of merge sort for Javascript, and use that instead.

     

    Of course, it turns out this is a bit trickier than you would like to hope for. The top few Google hits aren't quite optimal. The #2 hit is fairly reputable, literateprograms.org (http://en.literateprograms.org/Merge_sort_%28JavaScript%29), but its implementation doesn't really conform well to JavaScript standards. In particular, it pollutes the global namespace with helper functions, and it isn't conveniently packaged up in one file. It also doesn't define it as Array.prototype.mergesort(), though I suppose that might be a blessing.

     

    Anyhow, so, I'd probably use that, but it's a bit annoying to assemble together (or, at least, it is more work than it should be, IMO).
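
    For reference, a merge sort that avoids those complaints fits in one function. What follows is a minimal sketch, not the literateprograms.org code: the name mergeSort, its (items, compare) signature, and the hyphenAsSpace comparator are assumptions made for this example.

```javascript
// Stable merge sort that returns a new array and keeps its helper private.
function mergeSort(items, compare) {
    // Merge two sorted arrays; on ties the left element goes first,
    // which is what keeps the sort stable.
    function merge(left, right) {
        var merged = [];
        var l = 0;
        var r = 0;
        while (l < left.length && r < right.length) {
            if (compare(left[l], right[r]) <= 0) {
                merged.push(left[l++]);
            } else {
                merged.push(right[r++]);
            }
        }
        while (l < left.length) { merged.push(left[l++]); }
        while (r < right.length) { merged.push(right[r++]); }
        return merged;
    }

    if (items.length < 2) {
        return items.slice(0);
    }
    var middle = Math.floor(items.length / 2);
    return merge(
        mergeSort(items.slice(0, middle), compare),
        mergeSort(items.slice(middle), compare)
    );
}

// Hypothetical comparator: a hyphen counts as a space.
function hyphenAsSpace(a, b) {
    var aa = a.replace(/-/g, " ");
    var bb = b.replace(/-/g, " ");
    return aa < bb ? -1 : (aa > bb ? 1 : 0);
}

var merged = mergeSort(["b", "a-b", "a b", "a-a"], hyphenAsSpace);
// merged: ["a-a", "a-b", "a b", "b"]  ("a-b" stays ahead of "a b")
```

    It is written as a plain function rather than an Array.prototype method, so nothing global is touched apart from the function names themselves.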

     
  • Jul 23, 2011 8:12 AM   in reply to John Hawkinson

    Hey, thanks for getting back on this! Yes, it seems your research supports my intuitive reasoning.

     

    Fortunately, I think most ID scripters can get away with the built-in sort function. Can you come up with a scenario where it would be warranted to use a perfectly stable sort? (I can think of only one: when you have *very* large objects moving around in memory; but my JS background is not sufficient to guess how that could happen.)

     
  • John Hawkinson
    Jul 23, 2011 1:08 PM   in reply to [Jongware]

    Can you come up with a scenario where it would be warranted to use a perfectly stable sort?

    I thought it was exactly this case -- you want to sort a list with a hyphen sorted just like a space.

    So you use:

     

    function myComp(a,b) {
      a=a.replace(/-/g, " ");
      b=b.replace(/-/g, " ");
      return a.localeCompare(b);
    }
    sortedParagraphs = myParagraphs.mergeSort(myComp);
    

     

    (if you don't like localeCompare, you could use "return a<b?-1:(a>b?1:0);")

     

    This is nice and sweet and all except you have to go find the .mergeSort() method elsewhere. C'est la vie.

     

    Am I confused?

     
  • Jul 24, 2011 2:09 PM   in reply to John Hawkinson

    John Hawkinson wrote:

     

    Can you come up with a scenario where it would be warranted to use a perfectly stable sort?

    I thought it was exactly this case -- you want to sort a list with a hyphen sorted just like a space.

    [...]

    Am I confused?

     

    These are two separate issues. Replacing the hyphen with a space introduces another one:

     

    Suppose you have three strings, in this order

    "a b"

    "a-b"

    "a b"

     

    Your comparison routine will treat them the same, and thus will (ideally [*]) not swap them around. They will thus appear in the same order after sorting.

    My previous solution to this is that if strings appear to be the same after this pre-processing, you make sure they are not by comparing the original strings. Using the code I provided before, this will yield the more correct sorting order:

     

    "a b"

    "a b"

    "a-b"

     

    [*] "Ideally", because the original problem with the adjusted SortParagraphs was that even though these strings were treated the same, they still would move around. That's the 'unstability' I was addressing. A stable sort ought to always return the same results, independent of any pre-processing.
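
    The three-string example can be checked with a stripped-down comparator. This is a sketch that keeps only the hyphen step from the earlier code; both function names are made up for illustration.

```javascript
// Hyphen treated as a space, with no tie-break: "a b" and "a-b" are equal.
function hyphenAsSpace(a, b) {
    var aa = a.replace(/-/g, " ");
    var bb = b.replace(/-/g, " ");
    return aa < bb ? -1 : (aa > bb ? 1 : 0);
}

// Same, but fall back to the original strings when pre-processing
// made them identical.
function hyphenAsSpaceThenOriginal(a, b) {
    var aa = a.replace(/-/g, " ");
    var bb = b.replace(/-/g, " ");
    if (aa === bb) {
        aa = a;
        bb = b;
    }
    return aa < bb ? -1 : (aa > bb ? 1 : 0);
}

var tie = hyphenAsSpace("a b", "a-b");  // 0: treated the same

var input = ["a b", "a-b", "a b"];
var output = input.slice(0).sort(hyphenAsSpaceThenOriginal);
// output: ["a b", "a b", "a-b"]  (space, code 32, sorts before hyphen, code 45)
```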

     
  • John Hawkinson
    Jul 25, 2011 8:09 PM   in reply to [Jongware]

    Oh dear, an Unexpected Error has eaten my post. Trying again:

     

    These are two separate issues. Replacing the hyphen with a space introduces another one:
    Your comparison routine will treat them the same, and thus will (ideally [*]) not swap them around. They will thus appear in the same order after sorting. My previous solution to this is that if strings appear to be the same after this pre-processing, you make sure they are not by comparing the original strings. Using the code I provided before, this will yield the more correct sorting order

    Wait! I thought our goal was to treat hyphens like spaces! If that's not true, it changes everything!

     

    If we just want to change the behavior of the lexicographic sort such that the hyphens are sorted after spaces (but before exclamation points), that's a totally different problem.

     

    At first it looks daunting, because the sort function uses the relational operators < and > (or localeCompare, which is equally tricky), which are implemented, according to the spec, like this:

     

    11.8.5 The Abstract Relational Comparison Algorithm
    The comparison x < y, where x and y are values, produces true, false, or undefined (which indicates that
    at least one operand is NaN). Such a comparison is performed as follows:
    1. Call ToPrimitive(x, hint Number).
    2. Call ToPrimitive(y, hint Number).
    3. If Type(Result(1)) is String and Type(Result(2)) is String, go to step 16. (Note that this step differs
    from step 7 in the algorithm for the addition operator + in using and instead of or.)
    16. If Result(2) is a prefix of Result(1), return false. (A string value p is a prefix of string value q if q
    can be the result of concatenating p and some other string r. Note that any string is a prefix of itself,
    because r may be the empty string.)
    17. If Result(1) is a prefix of Result(2), return true.
    18. Let k be the smallest nonnegative integer such that the character at position k within Result(1) is
    different from the character at position k within Result(2). (There must be such a k, for neither string
    is a prefix of the other.)
    19. Let m be the integer that is the code point value for the character at position k within Result(1).
    20. Let n be the integer that is the code point value for the character at position k within Result(2).
    21. If m < n, return true. Otherwise, return false.
    NOTE
    The comparison of strings uses a simple lexicographic ordering on sequences of code point value values.
    There is no attempt to use the more complex, semantically oriented definitions of character or string
    equality and collating order defined in the Unicode specification. Therefore strings that are canonically
    equal according to the Unicode standard could test as unequal. In effect this algorithm assumes that
    both strings are already in normalised form.
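
    Written out as code, steps 16 to 21 for two string operands come to roughly the following. This is a sketch for illustration, not how any engine actually implements it; the function name is made up, and charCodeAt is assumed to give the code values the spec refers to.

```javascript
// Steps 16-21 of the quoted algorithm, for two string operands.
function stringLessThan(x, y) {
    // Step 16: if y is a prefix of x (including x === y), x is not less.
    if (x.substring(0, y.length) === y) {
        return false;
    }
    // Step 17: if x is a prefix of y, x is less.
    if (y.substring(0, x.length) === x) {
        return true;
    }
    // Step 18: find the first position k where the strings differ.
    // Neither string is a prefix of the other, so such a k exists.
    var k = 0;
    while (x.charAt(k) === y.charAt(k)) {
        k++;
    }
    // Steps 19-21: compare the code values at that position.
    return x.charCodeAt(k) < y.charCodeAt(k);
}

var spaceBeforeHyphen = stringLessThan("Bacon Mushroom", "Bacon-Feta");  // true: 32 < 45
var prefixIsLess = stringLessThan("ab", "abc");                          // true
var equalIsNotLess = stringLessThan("abc", "abc");                       // false
```

    The first result is the reason "Bacon Mushroom" lands ahead of "Bacon-Feta" in a plain code-value sort.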

     

    But fortunately there's an easy solution -- err, wait, damn. That is, I thought I had an easy solution, but it was wrong.

    Hmm, back to the drawing board...

     