Eleivana07

Removing extra spaces from a long document

Jul 29, 2012 12:33 AM

Hello

I have seen a number of find/change and GREP formulas to do similar things. I have NO scripting or coding experience and have labored to understand GREP.

So I am a little afraid to use it, as I don't know what all the modifiers refer to. (I do have a printout of some neat GREP cheat sheets, like Mike Witherell's, that I can absorb until I obtain a good reference.)

 

I need something I can copy and paste into either the Find/Change or GREP dialog that will do the following in fewer than 12 steps (hopefully) without doing something catastrophic like removing all of my paragraph marks (which I almost did using someone's GREP expression).

 

  1. No spaces before any comma, period, exclamation mark, question mark, colon, semicolon
  2. One space only after any of those
  3. One space before any opening parenthesis and one space after the closing parenthesis
  4. No space after the opening parenthesis or before the closing parenthesis
  5. Remove any double or extra spaces (en, em, etc.)
  6. Remove any commas before parentheses
  7. Remove any spaces after a paragraph mark

 

I think that's it

 

 

 

I did find this one recently (Maybe Jongware?)

 

[~m~>~f~|~S~s<~/~,~3~4%]{2

 

Which, from my dim understanding, addresses em, en, flush and hair space, nonbreaking space, figure space, third space; not sure of the rest. Really, this is way over my head.

 

 

 

I know this will be a piece of cake for you guys

 

Thanks

 
Replies
    Jul 29, 2012 7:57 AM   in reply to Eleivana07

    I was hoping Jongware would come in with something really elegant (and maybe he still will) but in the meantime, my approach would be to start by eliminating all multiple whitespaces except paragraph returns and forced line breaks. This seems to do it:

     

    Find   (\s)(\p{space_separator}|\t)+ and replace with $1

     

    This will leave the first whitespace and remove any following whitespace up to the point that a line or paragraph break (and not completely tested, but I suppose other sorts of breaks) is encountered, leaving the paragraph or line break intact.  Note that this will destroy tables built with tabs (as opposed to "real" tables) that have multiple tabs between items, and it will not remove a single whitespace before a paragraph or line break.
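For anyone who wants to try the logic outside InDesign first, here is a minimal Python sketch of the query above. It is only an approximation: Python's re module is not InDesign's GREP engine and does not understand \p{space_separator}, so the sketch spells out the Unicode space-separator (Zs) characters by hand. The names ZS, COLLAPSE and collapse_spaces are made up for illustration.

```python
import re

# Assumption: this hand-written class stands in for InDesign's \p{space_separator}.
# It lists the Unicode "Zs" characters (regular, no-break, en, em, thin, hair, ...).
ZS = "[ \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

# Rough equivalent of: Find (\s)(\p{space_separator}|\t)+   Replace with $1
COLLAPSE = re.compile(r"(\s)(?:" + ZS + r"|\t)+")


def collapse_spaces(text: str) -> str:
    """Keep the first whitespace character and drop the spaces/tabs after it."""
    return COLLAPSE.sub(r"\1", text)


# \r plays the role of InDesign's paragraph return here.
sample = "One.  Two. \u2003 Three. \rFour.\r  Five."
print(repr(collapse_spaces(sample)))
```

In this sketch the single space before the first return survives, while the two spaces after the second return are removed, which matches the behaviour described above.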

     

    Next I would remove the whitespace at the ends of paragraphs and the like:

     

    Find (\s)(\n|\r) and replace with $2 seems to do that, and it also seems to leave multiple returns (I don't know if you want to remove those) and to work with other breaks as well (again, not fully tested). The simpler \s$ and replace with nothing removes the first return in a two-return sequence and seems to ignore the other types of breaks completely.
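A rough Python sketch of this second pass, for testing the idea outside InDesign. One deliberate deviation: in Python's re module \s also matches \r and \n, so a literal translation would collapse double returns. To mirror the behaviour reported above, the sketch uses [^\S\r\n] (any whitespace except a return or line feed) for the character being removed. The names TRAILING and strip_before_break are hypothetical.

```python
import re

# One whitespace character (but not a break) directly before \r or \n.
# Assumption: \r stands for a paragraph return and \n for a forced line break.
TRAILING = re.compile(r"[^\S\r\n](\r|\n)")


def strip_before_break(text: str) -> str:
    """Drop a single space or tab sitting right before a break; keep the break."""
    return TRAILING.sub(r"\1", text)


print(repr(strip_before_break("End of paragraph. \rNext paragraph.")))
print(repr(strip_before_break("Two returns stay.\r\rNext paragraph.")))
```

Note that only one whitespace character is removed per break, which is why this pass is meant to run after the multiple spaces have already been collapsed.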

     

    At this point there should not be any multiple whitespaces other than possibly blank paragraphs. If you want to get rid of those, you can run the FindChangeByList script or the built-in Multiple Return to Single Return query in the Find/Change dropdown list.

     

    So now you need to find opening single and double quotes, parentheses, brackets or braces and remove a space after them if it exists:

     

    Find ([\[\{\(~{~[])(\s) and replace with $1

     

    and finally remove any space before your selected punctuation and the closing cases of the items above:

     

    Find (\s)([.,;:!?\)\]\}~}~]]) and replace with $2

     

    The last two queries will probably also work with a look-behind for the first and a look-ahead for the second (putting the classes in the look expressions) and replacing with nothing, but I'm not sure which method is more efficient. The last query could conceivably also miss a space followed by an apostrophe, or mistakenly remove a space before a word that starts with an apostrophe (again, not thoroughly tested), and it ignores straight quotes of any type, as they are ambidextrous and might want space on either side.
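To compare the capture-group and look-behind/look-ahead variants side by side, here is a small Python sketch. It is an approximation under stated assumptions: the curly quote characters stand in for InDesign's ~{ ~[ ~} ~] codes, the question mark is included in the closing class because the original request lists it, and all names (OPEN_CLASS, CLOSE_CLASS, tighten_capture, tighten_lookaround) are invented for the example.

```python
import re

# Assumption: literal curly quotes replace InDesign's ~{ ~[ (opening) and ~} ~] (closing).
OPEN_CLASS = "[" + re.escape("[{(\u201C\u2018") + "]"
CLOSE_CLASS = "[" + re.escape(".,;:!?)]}\u201D\u2019") + "]"


def tighten_capture(text: str) -> str:
    """Capture-group style: match punctuation plus space, put back the punctuation."""
    text = re.sub("(" + OPEN_CLASS + r")\s", r"\1", text)
    return re.sub(r"\s(" + CLOSE_CLASS + ")", r"\1", text)


def tighten_lookaround(text: str) -> str:
    """Lookaround style: match only the space, replace with nothing."""
    text = re.sub("(?<=" + OPEN_CLASS + r")\s", "", text)
    return re.sub(r"\s(?=" + CLOSE_CLASS + ")", "", text)


sample = "( like this ) , or [ this ] ."
print(tighten_capture(sample))
print(tighten_lookaround(sample))
```

Both functions give the same result on this sample. The sketch says nothing about which form is faster inside InDesign.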

     

    Hopefully the forum didn't mess up any of those expressions...

     
    Jul 29, 2012 11:58 AM   in reply to Eleivana07

    \p{space_separator}  (exactly as written) is a comprehensive wildcard for a large variety of spaces. It works like \s, but does not include the linebreaks and paragraph breaks in the found results.

     

    It would be tempting, for example, to use (\s)(\s)+ to find any whitespace followed by any amount of other whitespace, but the \s will also pick up the paragraph breaks, so if you have a space at the end of a paragraph, you lose the paragraph break. The \p{space_separator} won't see that as two whitespaces, so the paragraph is preserved, but you then must go back and remove any spaces before a paragraph break in a second pass (the second query).
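The difference is easy to see in a short Python sketch. This is an illustration only: Python's re module is not InDesign's engine, \r stands in for the paragraph return, and ZS_OR_TAB is a hand-written stand-in for (\p{space_separator}|\t).

```python
import re

# Assumption: Unicode Zs characters plus tab, written out because Python's re
# has no \p{space_separator}.
ZS_OR_TAB = "[ \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

text = "First paragraph. \rSecond paragraph."

# Tempting version: the trailing space and the return count as two whitespaces,
# so the return is swallowed and the paragraphs merge.
tempting = re.sub(r"(\s)(\s)+", r"\1", text)

# Safer version: the return is not a space separator, so nothing matches here.
safer = re.sub(r"(\s)" + ZS_OR_TAB + "+", r"\1", text)

print(repr(tempting))
print(repr(safer))
```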

     

    No need to feel stupid. I had to do a bit of research this morning to come up with that myself. I've never seen it in use before.

     
    Jul 29, 2012 3:27 PM   in reply to Eleivana07

    Peter Kahrel (whose ebook is the source I used this morning, and a reference I highly recommend at only about $10) has a lot of free GREP and scripting aids on his website. Take a look at http://www.kahrel.plus.com/indesign/grep_query_manager.html which will allow you to make a "chain" from this set of queries that you can then run in one step.

     
    Jul 30, 2012 2:09 AM   in reply to Eleivana07

    Peter is too modest, he's doing just great.

     

     

    • No space BEFORE, one space AFTER: period, semicolon, colon, exclamation mark, question mark, CLOSING parenthesis, bracket, brace, single and double quotation marks
    • Find (\s)([.,;:!?\)\]\}~}~]])
    • Replace with $2

     

    • No space AFTER, one space BEFORE: OPENING bracket, brace, parenthesis, double and single quotes
    • Find ([\[\{\(~{~[])(\s)
    • Replace with $1

     

    These remove the space before/after but do not automatically add a space after/before.

     

    In the first case, you could add a space right after the '$2' in the Replace With string, but it already may have a space, in which case you suddenly have two. One alternative is to optionally remove it with the Find string (add it as an optional match) and always add it with the Replace string, but remember that this string will only be found if there is a space preceding it. That way you'd only check the space after in cases where there was a bad space before.

     

    So I propose you add another two find/changes to add the space, only when necessary.

     

    One Space After: find

    ([.,;:!?\)\]\}~}~]])(?!\s)

     

    replace:

    $0 [followed by one single space]

     

    One Space Before: find

    (?<!\s)([\[\{\(~{~[])

     

    replace:

     

    [one single space] $0
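Here is a minimal Python sketch of the "add a space only when it is missing" idea, using a negative lookahead and a negative lookbehind. Assumptions: literal curly quotes stand in for InDesign's ~{ ~[ ~} ~] codes, \g<0> is Python's spelling of $0, the question mark is included because the original request lists it, and the names are hypothetical.

```python
import re

# Assumption: literal curly quotes replace InDesign's quote codes.
CLOSE_CLASS = "[" + re.escape(".,;:!?)]}\u201D\u2019") + "]"
OPEN_CLASS = "[" + re.escape("[{(\u201C\u2018") + "]"


def add_missing_spaces(text: str) -> str:
    # Closing punctuation NOT followed by whitespace: append one space.
    text = re.sub(CLOSE_CLASS + r"(?!\s)", r"\g<0> ", text)
    # Opening punctuation NOT preceded by whitespace: prepend one space.
    return re.sub(r"(?<!\s)" + OPEN_CLASS, r" \g<0>", text)


print(add_missing_spaces("one,two(three)four"))
print(add_missing_spaces("one, two (three) four"))
```

Text that is already spaced correctly passes through unchanged. Note that the lookarounds also succeed at the very start and end of the text, so this sketch adds a space there too.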

     

    Color-coding with my WhatTheGrep might make it just a tad clearer what's going on in that jumble of codes:

    ([.,;:!?\)\]\}~}~]])(?!\s)

    and

    (?<!\s)([\[\{\(~{~[])

    (Orange is lookahead/lookbehind: (?!\s) and (?<!\s). Blue is a regular escaped character: \) \] \} \[ \{ \( and \s. Pink is an InDesign special character: ~} ~] ~{ ~[. Green is normal grouping parentheses: the outer ( ). Lavender is a character group: the [ ] class.)

     
    Jul 30, 2012 2:53 AM   in reply to Eleivana07

    Some more about this notation:

    Eleivana07 wrote:

     

    What is the space separator in the first solution?

     

    Find   (\s)(\p{space_separator}|\t)+ and replace with $1

     

    It's not an underscore, is it?

     

    A funny thing: it doesn't matter. The name of this character group is "Space Separator", but

    1. it is case insensitive (unlike almost all other GREP codes!)

    2. it is separator insensitive! You can use 'space-separator', 'space_separator', 'space separator', and even 'spaceseparator'

     

    It also has a shortcut: "Zs" (which also is case and separator insensitive, so you can use "\p{z-s}" or "\p{zS}"). The simple search string

     

    \p{zs}{2,}

     

    will find any two or more spaces in succession (excluding tabs, though).
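To see which characters the Unicode "Zs" (space separator) category covers, and why tabs fall outside it, here is a small Python sketch. Python's re module has no \p{zs}, so the sketch builds an equivalent character class from the standard unicodedata module. The exact membership depends on the Unicode version bundled with your Python, and the names ZS_CHARS and ZS_RUN are made up for the example.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is "Zs".
ZS_CHARS = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]

# Rough stand-in for \p{zs}{2,}: two or more space separators in a row.
ZS_RUN = re.compile("[" + "".join(re.escape(c) for c in ZS_CHARS) + "]{2,}")

for ch in ZS_CHARS:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

print(bool(ZS_RUN.search("a \u00A0b")))  # space + no-break space: a run
print(bool(ZS_RUN.search("a\t\tb")))     # tabs are category Cc, not Zs
```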

     

    Another freebie is that you can use the same code negated: \P{zs} will match anything not in this set.

     

    There are loads and loads of useful named character groups described in Peter Kahrel's O'Reilly shortcut about GREP.

     
    Jul 30, 2012 5:57 AM   in reply to [Jongware]

    Theun,

     

    I actually tried the short version \p{zs} suggested by Peter K before posting, and it was giving me strange results, returning single spaces and the first character following in a word. I did my testing in CS5.

     

     

    Another point that you didn't bring up about the summary description:

     

    • Remove Whitespace after paragraphs:
    • Find (\s)(\n|\r)
    • Replace with $2

     

    This is actually removing whitespace BEFORE the paragraph breaks or forced line breaks. Space after a paragraph break is actually leading space on the first line of the following paragraph, and the first query would have caught and removed that, since \s recognizes the paragraph break and \p{space-separator} recognizes the other types of space except the tab, which we also included in the "or" statement. So the only type of whitespace left after the paragraph break would have been another break.

     

    I actually left out the last of Theun's (Jongware's) queries on purpose. It would not be unusual to have a parenthetical that should be followed by some punctuation mark, or a quote that ends a paragraph. Granted, adding a space back before the return would be invisible in the output, but we just went to a lot of trouble to remove them, and even more importantly, we removed spaces preceding most punctuation and we definitely don't want to add them back.

     

    Likewise, I can think of plenty of cases where you might be starting a paragraph with one of those punctuation marks (many of them restricted to technical sorts of work, of course), but I'm not sure it's a great idea to blindly add spaces as in his first query. I'd be more inclined to let Spell Check pick up that sort of odd situation and fix them on a case by case basis.

     

    Cheers.

     
    Jul 30, 2012 5:58 AM   in reply to Peter Spier

    And thanks, by the way, for the kind word.

     
    Jul 30, 2012 3:35 PM   in reply to Eleivana07

    Eleivana07 wrote:

     

    Hi Peter,

    So are you saying that the query

    • Find (\s)(\n|\r)
    • Replace with $2

    Is redundant to the first query?

    No. It's necessary (or at least, in my opinion, desirable) in order to remove the extra space that you will occasionally see after the last real character in a paragraph, so it's supplemental, rather than redundant, to pick up the cases that didn't get fixed in order to preserve the paragraph breaks.

     

    In a case where a paragraph ends period space space return, the first query will find the first space after the period, and it will see the second space as extra, but it will ignore the return, so the result will be period space return (the $1 in the change field is always the first space in a group and it is always preserved). In the case where the paragraph already ends period space return, there will be no change because the query does not recognize a group of spaces.

     

    In the query above we are looking specifically at the case of <last non-space character> space return (though we don't look for the <last non-space character>). Because the first query has already removed all but one space every place there are multiples, this query looks specifically for the space/return combination and discards the space ($2 is the return).
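The two cases just described can be traced with a short Python sketch. It is an approximation of the two queries, not InDesign itself: \r stands for the paragraph return, ZS_OR_TAB is a hand-written stand-in for (\p{space_separator}|\t), the second pass uses [^\S\r\n] because Python's \s would also match a return, and the function names are invented.

```python
import re

# Assumption: Unicode Zs characters plus tab, since Python's re lacks \p{...}.
ZS_OR_TAB = "[ \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"


def first_pass(text: str) -> str:
    """Collapse runs of spaces/tabs to the first whitespace character."""
    return re.sub(r"(\s)" + ZS_OR_TAB + "+", r"\1", text)


def second_pass(text: str) -> str:
    """Remove one space or tab sitting directly before a break."""
    return re.sub(r"[^\S\r\n](\r|\n)", r"\1", text)


two_spaces = "End.  \rNext"  # period space space return
one_space = "End. \rNext"    # period space return

for case in (two_spaces, one_space):
    after_first = first_pass(case)
    after_second = second_pass(after_first)
    print(repr(case), "->", repr(after_first), "->", repr(after_second))
```

In both cases the first pass leaves period space return, and only the second pass removes that last space.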

     

    Would this be a fatal error if it didn't run? I would say no, and you didn't actually request the removal of whitespace at the breaks, but you struck me as the sort of person who would want a clean file.

     

    Was that any clearer?

     
    Jul 30, 2012 3:42 PM   in reply to Eleivana07

    Eleivana07 wrote:

     

    I do know that the following are the most important at this step for my book to look finished.

    • No space BEFORE, one space AFTER: period, semicolon, colon, exclamation mark, question mark, CLOSING parenthesis, bracket, brace, single and double quotation marks
    • No space AFTER, one space BEFORE: OPENING bracket, brace, parenthesis, double and single quotes

     

    Would there be a way to write a query so that it only added a space at the correct location ONLY if it did NOT find one?

     

    Curious

    The query that Jongware provided above does exactly that: it adds a space after those punctuation marks if it doesn't see one. But as I said, I don't think this is a good thing to automate. Consider this text:

     

    "(1) GREP is a very powerful tool for automating changes by pattern recognition (but dangerous if misused)."

     

    Adding a space before the first open parenthesis or after the last close would be mistakes, as would be adding a space after the period.
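Running a Python approximation of the two space-adding queries over that sentence shows all three mistakes at once. The sketch assumes literal curly quotes in place of InDesign's quote codes and uses hypothetical names. It is meant to illustrate the concern, not to reproduce InDesign's exact behaviour.

```python
import re

# Assumption: literal curly quotes replace InDesign's ~{ ~[ ~} ~] codes.
CLOSE_CLASS = "[" + re.escape(".,;:!?)]}\u201D\u2019") + "]"
OPEN_CLASS = "[" + re.escape("[{(\u201C\u2018") + "]"


def add_missing_spaces(text: str) -> str:
    """Blindly add a space after closing and before opening punctuation."""
    text = re.sub(CLOSE_CLASS + r"(?!\s)", r"\g<0> ", text)
    return re.sub(r"(?<!\s)" + OPEN_CLASS, r" \g<0>", text)


sentence = (
    "(1) GREP is a very powerful tool for automating changes "
    "by pattern recognition (but dangerous if misused)."
)
result = add_missing_spaces(sentence)
print(repr(result))
```

In this sketch the result starts with a space before "(1)", and ends with "misused) . " (a space after the last closing parenthesis and another after the period).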

     
    Jul 31, 2012 3:05 AM   in reply to Peter Spier

    > Adding a space before the first open parenthesis or after the last close would be mistakes, as would be adding a space after the period.

     

    True, but you could narrow it down, e.g.

     

    Find: \)(?=[\u\l])

    Replace with: )\s

     

    which could be made more precise. And something similar could be done before the opening parenthesis.
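A minimal Python sketch of this narrowed-down query, for comparison with the blunt version. Assumptions: InDesign's [\u\l] (any uppercase or lowercase letter) is approximated by [^\W\d_], which in Python matches any Unicode letter, and the function name is made up.

```python
import re


def space_after_paren_before_letter(text: str) -> str:
    """Add a space after ')' only when a letter follows immediately."""
    # [^\W\d_] = word characters minus digits and underscore, i.e. letters.
    return re.sub(r"\)(?=[^\W\d_])", ") ", text)


sentence = (
    "(1) GREP is a very powerful tool for automating changes "
    "by pattern recognition (but dangerous if misused)."
)
print(space_after_paren_before_letter("(see note)below"))
print(space_after_paren_before_letter(sentence) == sentence)
```

Because the lookahead demands a letter, a closing parenthesis followed by a period, a space, or the end of the text is left alone.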

     

    Another useful addition is to remove all white space at the end of a story, which I don't think is caught by any of the queries mentioned here:

     

    Find: \s+\Z

    Replace with nothing

     

    Unwanted space at the beginning of a story is less likely, and maybe you do want a tab there, but if you need to remove story-initial space you can do it using this:

     

    Find: \A\s+

    Replace with nothing
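In Python the same two anchors exist, so these queries translate almost directly. The sketch below treats one string as one story, which is an assumption about how the text would be fed in. The function name trim_story is hypothetical.

```python
import re


def trim_story(story: str) -> str:
    """Remove whitespace at the very end and the very start of a story."""
    story = re.sub(r"\s+\Z", "", story)  # \Z: only at the end of the whole string
    return re.sub(r"\A\s+", "", story)   # \A: only at the start of the whole string


story = "\t First paragraph.\r Second paragraph. \r\r"
print(repr(trim_story(story)))
```

Whitespace inside the story, including the space after the first return, is left alone because neither anchor matches there.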

     

    Peter

    [thanks for the kind words about the ShortCut!]

     
    Jul 31, 2012 3:23 AM   in reply to Peter Kahrel

    Peter Kahrel wrote:

     

    Find: \)(?=[\u\l])

    Replace with: )\s

     

    Does \s work in the change field? I thought that would be literal there...

     

    I think maybe I'm just not convinced that the probability of a missing space is anywhere near as great as the probability of finding excess multiple spaces, or that automating a 100% foolproof way to add them is worth the effort, or even possible. Much as I think it's a mistake to trust spell checkers to do your proofing, a missing space after a parenthesis is the sort of thing I think would get picked up, just the way a missing space after a period is flagged. I'm a lousy typist, but even I don't tend to miss it when I lose a space, so I guess I'd rather see them on a case-by-case basis. Certainly that can be done with Find/Change, but not if you are scripting the queries, right?

     
    Jul 31, 2012 3:42 AM   in reply to Peter Spier

    > Does \s work in the change field? I thought that would be literal there...

     

    It does, in the same way that \t inserts a tab in your document. It's handy to use \s and \t in the change field in things like forum posts, where you can't see space and tab characters.

     

    > I'd rather see them on a case-by-case basis

     

    I agree. But the challenge to find queries is sometimes irresistible!

     

    > but not if you are scripting the queries, right?

     

    Well, it could, but you'd just be repeating InDesign's Find/Change interface. The GREP editor I scripted is useful for these things (I think, in all immodesty). It highlights all matches in a document in the way that new versions of Word do. So rather than pressing Find all the time, you simply page through the document and you can clearly see all your matches.

     

    Peter

     