D+DD

Finding text with RegEx

Jun 13, 2013 10:25 AM

Tags: #find_string #regular_expression

I want to create an ExtendScript script that replaces the temporary citations inserted from EndNote with the final citations, formatted according to the desired standard. I wrote such utilities for FM 7 and 8 (see daube.ch/docu/fmaker41.html). The first step is to collect the temporary citations from the document and write them to a new document, which will then be exported as RTF (to be handled by EndNote).

Due to the limitations of wildcard search, I use RegEx.

tempCit = GetTempCitation ("Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].");
alert ("tempCit = " + tempCit); 

function GetTempCitation (pgfText) {
     var regex = /(\[\[[^\]]+\]\])/;
     var tempCit = "$1";
     if (pgfText.search(regex) !== -1) {
          return pgfText.replace (regex, tempCit);
     } else {
          return null;
     }
}

Due to lack of documentation (and lack of knowledge), I have modelled this after Rick's "Using a regular expression to convert an image name to a path".

1) The test script does not return the first occurrence of a temp. citation but the full string. Where is my error?

2) How can I make use of the property (?) rightContext to get the other occurrences?

 

Thank you for helping a newcomer.

Klaus

 
Replies
  • Jun 13, 2013 10:58 AM   in reply to D+DD

    Hello Klaus,

     

    By coincidence, I am working on a similar problem today, using regular expressions in ExtendScript. My problem cannot be solved directly, as the regular expression engine in ExtendScript has a serious flaw. Using the "\1", "\2", etc. strings to refer to the matched substrings does NOT work, so a replacement string that reuses some of the matched text has to be built differently. Bummer. The problem has been reported earlier but is located in the ExtendScript Toolkit, not in FrameMaker, and I am not sure how quickly the ESTK development team at Adobe will pick this up.

     

    I have created a workaround, which may also help you.

     

    Instead of using the search method with a regular expression, you can use the match method. This either returns null (in which case there was no match) or an array of matched strings (even if it is only 1 string long). The elements in that array can then easily be replaced by the placeholder you are intending to put in. The trick in finding all matches to a single regular expression is to add the indicator "g" for global search. Try the following and see if that is what you wanted to have.

     

    function GetTempCitation (pgfText) {
        var regex = /(\[\[[^\]]+\]\])/g;
        return pgfText.match (regex);
    }

     

    Note that the alert( ) function shows the full array contents, separated by a comma. If there were no matches, the returned value is null. So if you want to access the separate strings, you will have to loop through the resulting string array, after first testing for a null value.
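The null test and the loop described above could look roughly like the following sketch. The helper name listCitations and the numbered output format are only illustrative assumptions, not part of anyone's actual script.

```javascript
// match() with the g flag returns an array of full matches,
// or null when nothing matches.
function getTempCitations(pgfText) {
    var regex = /\[\[[^\]]+\]\]/g;
    return pgfText.match(regex);
}

// Hypothetical helper: build one numbered line per citation.
function listCitations(pgfText) {
    var matches = getTempCitations(pgfText);
    var lines = [];
    var i;
    if (matches === null) {
        return lines; // no citations found, nothing to loop over
    }
    for (i = 0; i < matches.length; i++) {
        lines.push((i + 1) + ": " + matches[i]);
    }
    return lines;
}

var found = listCitations("Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].");
var none = listCitations("No citations here.");
```

Returning an empty array instead of null from the helper means callers can loop over the result without repeating the null test.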

     

    Good luck

     

    Jang

     
  • Jun 13, 2013 11:22 AM   in reply to D+DD

    Hi Klaus,

     

    Thanks for posting here. With JavaScript regular expressions, you have to use the "g" flag to get it to find more than the first occurrence (g = global).

     

    #target framemaker
    
    var pgfText = "Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].";
    var citations = getCitations (pgfText);
    alert(citations);
    
    function getCitations (pgfText) {
    
        // Regular expression to isolate the citations.
        var regex = /(\[\[[^\]]+\]\])/g;
        // Array to store the citations.
        var citations = [], result;
    
        // Execute the regular expression.
        while (result = regex.exec (pgfText)) {
            // Push the result onto the array.
            citations.push (result[1]);
        }
        // Return the array
        return citations;
    }
    
    

     

    You can see where I have added the g flag at the end of the regular expression. Then I execute the regular expression in a loop; the loop will continue as long as there are matches in the string. For each match, I push the string into the array. When all strings are found, I return the array from the function.
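As a small illustration of why the loop terminates: with the g flag, each call to exec resumes at the regular expression's lastIndex property, and exec returns null (resetting lastIndex to 0) when no further match exists. Without the g flag, exec would start from the beginning every time and a loop like this would never end. The sample text and variable names below are made up for this sketch.

```javascript
var text = "a [[one]] b [[two]] c";
var regex = /\[\[[^\]]+\]\]/g;
var positions = [];
var result;

// Each successful exec() call moves regex.lastIndex to the position
// just past the match, so the next call continues from there.
while ((result = regex.exec(text)) !== null) {
    positions.push(result[0] + " ends at " + regex.lastIndex);
}

// After the final, failed call, lastIndex is back at 0.
var afterLoop = regex.lastIndex;
```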

     

    A couple of notes: I like to use the #target framemaker instruction at the top of my scripts to ensure that they run with the FrameMaker object model instead of the default ExtendScript Toolkit. (In this case, it doesn't matter, because this is all native JavaScript code and not dependent on FrameMaker.)  Also, you should always declare your variables with the var keyword.
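Returning to the two questions in the original post, here is a hedged sketch. For question 1, replace always returns the whole input string with the matched part substituted; since "$1" stands for the match itself, the text comes back unchanged. For question 2, the sketch does not rely on the legacy RegExp.rightContext property (its behavior in ExtendScript is not confirmed here) and instead slices off the remainder by hand. The helper name collectBySlicing is hypothetical.

```javascript
var pgfText = "Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].";
var regex = /(\[\[[^\]]+\]\])/;

// replace() returns the WHOLE string; swapping the match for "$1"
// (the match itself) leaves the text as it was.
var unchanged = pgfText.replace(regex, "$1");

// match() without the g flag returns the first match plus its groups.
var first = pgfText.match(regex);
var firstCitation = (first === null) ? null : first[1];

// Hypothetical helper: after each match, continue with the text that
// follows it (a manual equivalent of the "right context").
function collectBySlicing(text) {
    var citations = [];
    var rest = text;
    var m = rest.match(regex);
    while (m !== null) {
        citations.push(m[1]);
        rest = rest.substring(m.index + m[0].length);
        m = rest.match(regex);
    }
    return citations;
}

var all = collectBySlicing(pgfText);
```

In practice the g flag with exec or match is simpler; the slicing variant is shown only because it mirrors what the rightContext idea was aiming at.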

     

    Please let me know if there are any questions or comments.

     

    -- Rick

     
  • Jun 13, 2013 11:27 AM   in reply to 4everJang

    Hi Jang,

     

    Can you send me an example that shows this bug? I would like to verify it myself. Thanks.

     

    Rick

     
  • Jun 13, 2013 11:36 AM   in reply to frameexpert

    Actually, now that I see Jang's method, it is simpler for this case. If you need the contents of capturing groups for every match, however, you have to use my method, because match with the g flag returns only the full matches. For example, let's say that you wanted to capture the citations without the brackets. Then, you would use this:

     

    var pgfText = "Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].";
    var citations = getCitations (pgfText);
    alert(citations);
    
    function getCitations (pgfText) {
    
        // Regular expression to isolate the citations.
        var regex = /\[\[([^\]]+)\]\]/g;
        // Array to store the citations.
        var citations = [], result;
        
        // Execute the regular expression.
        while (result = regex.exec (pgfText)) {
            // Push the result onto the array.
            citations.push (result[1]);
        }
        // Return the array
        return citations;
    }
    

     

    Notice that I moved the parentheses to exclude the enclosing square brackets.

     

    Also note that if you don't need to capture any subparts of the string, you don't need the parentheses at all in your regular expression.
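The difference between the two approaches can be seen side by side in this short sketch, which reuses the sample sentence from the thread. The variable names fullMatches and inner are chosen here for illustration only.

```javascript
var pgfText = "Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].";
var regex = /\[\[([^\]]+)\]\]/g;

// With the g flag, match() returns the full matches only;
// the contents of the capturing group are not included.
var fullMatches = pgfText.match(regex);

// exec() exposes the capturing groups of each match in turn.
var inner = [];
var result;
regex.lastIndex = 0; // start from the beginning regardless of earlier calls
while ((result = regex.exec(pgfText)) !== null) {
    inner.push(result[1]);
}
```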

     

    Rick

     
  • Jun 13, 2013 11:36 AM   in reply to frameexpert

    Hi Rick,

     

    Anything that uses the "\1" etc operators in the replace method does not give anything but whitespaces where the match results should be. I have in fact tried to reproduce the example from Adobe's Javascript Tools Guide that was written for the Creative Suite 5. Page 26 shows the following:

     

    In a replace operation, you can use the captured regions of a match in the replacement expression by using the placeholders \1 through \9, where \1 refers to the first captured region, \2 to the second, and so on.

    For example, if the search string is Fred\([1-9]\)XXX and the replace string is Sam\1YYY, when applied to Fred2XXX the search generates Sam2YYY.

     

    Well, it doesn't. Look for extendscript regular expressions in the general Adobe forum and you will find at least two posts mentioning this as a bug in ESTK.

     

    Ciao

     

    Jang

     
  • Jun 13, 2013 11:56 AM   in reply to 4everJang

    Hi Klaus and Rick,

     

    As you are looking for particular strings, which always have double sets of square brackets around them, it is dead easy to remove those characters from the resulting match strings. That is what I do in my workaround, where I have to remove redundant single bracket pairs around parts of an expression.

     

    sResult = sMatches[i].substring( 2, sMatches[i].length - 2 );
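Put into a complete function, the substring approach might look like the sketch below. The function name getBareCitations is an assumption made for this example; sMatches follows the naming in the line above.

```javascript
function getBareCitations(pgfText) {
    var sMatches = pgfText.match(/\[\[[^\]]+\]\]/g);
    var bare = [];
    var i;
    if (sMatches === null) {
        return bare; // no citations in this text
    }
    for (i = 0; i < sMatches.length; i++) {
        // Drop the leading "[[" and the trailing "]]".
        bare.push(sMatches[i].substring(2, sMatches[i].length - 2));
    }
    return bare;
}

var bare = getBareCitations("Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].");
```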

     

    A more elegant way of keeping the delimiters out of the match would involve using both the lookbehind and lookahead features of regular expressions, but lookbehind seems to be unsupported in JavaScript.

     

    Ciao

     

    Jang

     
  • Jun 13, 2013 12:09 PM   in reply to 4everJang

    Hi Jang, It looks like they are confusing the \1 find operator with the $1 operator. In your example, you would use this for the replacement and it works:

     

    var regex = /Fred([0-9])XXX/;
    var string = "Fred2XXX";
    
    alert (string.replace(regex, "Sam$1YYY"));
    

     

    The \1, \2, etc. operators are used on the find side. For example, let's say I have this:

     

    var string = "<p>This is a paragraph with <em>emphasized</em> text.</p>";
    var regex = /<[^>]+>.+?<\/[^>]+>/;
    alert (string.match (regex)[0]);
    

     

    The alert gives me this:

     

    <p>This is a paragraph with <em>emphasized</em>

     

    I don't get a matching open and close tag. To ensure that my close tag matches the open tag, I need to capture the open tag and use the \1 in the close tag to get a matched set.

     

    var string = "<p>This is a paragraph with <em>emphasized</em> text.</p>";
    var regex = /<([^>]+)>.+?<\/\1>/;
    alert (string.match (regex)[0]);
    

     

    Now I get this

     

    <p>This is a paragraph with <em>emphasized</em> text.</p>

     

    which is what I want.
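Applied to the citations from the original question, a $1 replacement combined with the g flag might look like this sketch. The parenthesized output format is purely hypothetical and is not the format EndNote produces.

```javascript
var pgfText = "Hello fans [[Dante, #712]] and another one [[DuçanÌsídõrâ, #312]].";

// $1 in the replacement string stands for the first capturing group;
// the g flag applies the replacement to every citation, not just the first.
var rewritten = pgfText.replace(/\[\[([^\]]+)\]\]/g, "($1)");
```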

     

    -- Rick

     
  • Jun 13, 2013 12:15 PM   in reply to 4everJang

    Hi Jang, You are correct that JavaScript regular expressions in ExtendScript don't have lookbehind. But in this case, it is not necessary anyway. Just use this:

     

    var string = "[Content with square brackets]";
    var regex = /\[([^\]]+)\]/;
    // Run the match once and reuse the result.
    var result = string.match (regex);

    if (result !== null) {
        alert (result[1]);
    }
    

     

    Rick

     
  • Jun 13, 2013 12:23 PM   in reply to frameexpert

    Hi Rick,

     

    Yes, now that you have found the missing information about using $1 etc instead of \1 in the replacement result, this opens up the path that I previously thought was blocked due to the supposedly missing functionality. With the \1 \2 etc operators and their $1 $2 etc counterparts in the replacement strings, I do not need the look behind anymore.

     

    Thanks

     

    Jang

     
  • Jun 17, 2013 12:16 PM   in reply to 4everJang

    Dear Rick and Jang,

    Thank you for your help and advice during my absence from the discussion...

    It takes me quite a big step forward, because you have already discussed the issue of replacement, which will come up sooner or later in this project.

     