Hi there!
I am porting a bit of code from a Java code base and I am really stuck at a point where I need to strip all non-letters (Unicode) from a string. In general I am looking for some kind of Flex support that mimics regexp expressions such as:
\p{L} or \p{Letter}: any kind of letter from any language (see http://www.regular-expressions.info/unicode.html)
and then use String.replace(/\P{L}/g,"") (the negated form, which matches everything that is not a letter) or something similar.
If there is no such regexp facility I could still loop through the string character by character and check each character with something like isLetter(), as Java and several other languages provide.
Just to make clear what I am searching for, I will give a short example:
If we take a unicode string containing "this is a unicode [@@Русский@@] multilangual string containing some cyrillic letters and //]] 1234 numbers"
it should strip out the @@, the brackets and the //, so basically everything BUT the letters.
If anyone knows a decent solution I would appreciate any help!
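Not a definitive answer, but the character-by-character idea mentioned above could be sketched as follows. ActionScript has no built-in isLetter(), so this sketch assumes a rough substitute: a character counts as a letter when its upper-case and lower-case forms differ. The class name CasedLetterFilter and the helpers isCasedLetter and keepCasedLetters are made up for illustration. The obvious limitation is that letters without a case distinction (for example Chinese, Arabic or Hebrew) are dropped as well, so this only helps for scripts such as Latin, Cyrillic and Greek.

```actionscript
package {
    public class CasedLetterFilter {

        // Hypothetical helper (not part of the Flex API): treats a character
        // as a letter when its upper-case and lower-case forms differ.
        // Assumption: caseless letters are NOT recognised by this test.
        public static function isCasedLetter(ch:String):Boolean {
            return ch.toLowerCase() != ch.toUpperCase();
        }

        // Copies only the characters that pass isCasedLetter() into the result.
        public static function keepCasedLetters(input:String):String {
            var output:String = "";
            for (var i:int = 0; i < input.length; i++) {
                var ch:String = input.charAt(i);
                if (isCasedLetter(ch)) {
                    output += ch;
                }
            }
            return output;
        }
    }
}
```

A call would look like CasedLetterFilter.keepCasedLetters(myString). Digits, spaces and punctuation have identical upper-case and lower-case forms, so they fail the test and are removed.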
I cannot find anything about it in the documentation. The only thing I found about Flex RegExp is the \w pattern (which matches word characters: letters, digits and the underscore) and \W (which matches everything else).
The problem is that those have no Unicode support (or rather: I have not found it yet; there might be some undocumented patterns or character classes to use).
As far as I know they only match against character classes like [a-zA-Z0-9_]. I need something more general that supports non-English letters, which the Unicode standard defines as a character property (and which can be matched in other languages using regexp patterns like \p{L}).
It might be possible to do it with a range, but what I mean is a higher-level regexp feature called "character properties".
Before I try to explain, take a look at the page http://www.regular-expressions.info/unicode.html under the heading "Unicode Character Properties".
It is supported by Java and .NET, but it is missing from the Flex RegExp class.
For my particular needs I want to match against \P{L} (which means all non-letter Unicode characters) in order to remove them from my string (such as copyright signs, smileys and other stuff people tend to use in their nicknames and chat messages).
Thanks for your hint, but could you tell me where exactly I have to request such a new feature?
Ok, now I am messing around with regexp ranges and unicode. I cannot say that it is working either.
I am trying to strip all non-cyrillic characters from a string.
To accomplish it, I would like to remove all characters but a specified range using a regexp such as : /[^\u0400-\u050f]/g
I encountered a really strange behaviour in flex though, which can be reproduced by this small example:
var myString : String = "HelloWorld";
// first replace (all non-cyrillic characters)
trace("first replace: " + myString.replace(/[^\u0400-\u050f]/g,""));
// second replace (all non-cyrillic characters + 1)
trace("second replace: " + myString.replace(/[^\u0400-\u0510]/g,""));
As you can see, the regexp only differs in the upper bound of the character class (actually the difference is only a single character).
The outcome of both replaces should be that they return an empty string, since there is no cyrillic character in myString.
But unexpectedly, the output is this:
first replace:
second replace: HelloWorld
which indicates that regexp and Unicode are not working together as expected.
Can anyone reproduce this problem?
I am pretty stuck on this one ...
The only solution I found was a workaround using a whitelist of all the Unicode characters I wanted. You can find the ranges for particular languages and scripts on Wikipedia: http://en.wikipedia.org/wiki/Unicode_block
Apply it to a string by checking the character code of each character in the source string against the whitelist and removing the character (i.e. not copying it to the result string) if it is missing.
Thanks for the workaround, it works. Since I only have short strings it's ok to loop through them, but I can't imagine how long it would take on a long text.
Here's my code if it's of any help to someone:
private function cleanup(input:String):String {
    var output:String = "";
    var hexValue:uint; // character code of the current character
    for (var k:int = 0; k < input.length; k++) {
        hexValue = input.charCodeAt(k);
        // Keep the character only if its code falls inside a whitelisted range.
        // Note: the first range is very broad and also lets some non-letters
        // through (for example the multiplication sign at 0x00D7).
        if ((hexValue >= 0x00C0 && hexValue <= 0xD7FF) ||
            (hexValue >= 0xF900 && hexValue <= 0xFDCF) ||
            (hexValue >= 0xFDF0 && hexValue <= 0xFFEF) ||
            (hexValue >= 0x0041 && hexValue <= 0x005A) || // A-Z
            (hexValue >= 0x0061 && hexValue <= 0x007A) || // a-z
            (hexValue >= 0x0030 && hexValue <= 0x0039)    // 0-9
        ) {
            output = output + input.charAt(k);
        }
    }
    return output;
}
I really wish I could do that with a regex...
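In case it helps anyone adapting the cleanup function above: here is a minimal sketch of a table-driven variant of the same whitelist workaround, so that adding a script only means adding a row. The class name LetterFilter, the constant LETTER_RANGES and the functions keepLetters and inRanges are made up for illustration. The ranges listed are an assumption covering only basic Latin, accented Latin and Cyrillic letters, taken from the Unicode block charts; extend them for whatever languages you need.

```actionscript
package {
    public class LetterFilter {

        // Hypothetical whitelist of [first, last] character codes.
        // Assumption: only Latin and Cyrillic letters are wanted; digits,
        // punctuation and symbols are deliberately left out.
        private static const LETTER_RANGES:Array = [
            [0x0041, 0x005A], // A-Z
            [0x0061, 0x007A], // a-z
            [0x00C0, 0x00D6], // Latin-1 letters before the multiplication sign
            [0x00D8, 0x00F6], // Latin-1 letters before the division sign
            [0x00F8, 0x024F], // rest of Latin-1, Latin Extended-A and -B
            [0x0400, 0x0481], // Cyrillic letters
            [0x048A, 0x052F]  // Cyrillic (continued) and Cyrillic Supplement
        ];

        // Returns a copy of input containing only whitelisted characters.
        public static function keepLetters(input:String):String {
            var kept:Array = [];
            for (var i:int = 0; i < input.length; i++) {
                if (inRanges(input.charCodeAt(i))) {
                    kept.push(input.charAt(i));
                }
            }
            return kept.join("");
        }

        // True when code lies inside any of the whitelisted ranges.
        private static function inRanges(code:Number):Boolean {
            for each (var range:Array in LETTER_RANGES) {
                if (code >= range[0] && code <= range[1]) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

Usage would be LetterFilter.keepLetters(myString). Collecting the characters in an Array and joining once at the end is a design choice rather than a measured optimisation; whether it is actually faster than repeated string concatenation on long texts would need to be profiled.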