I am parsing a WordPress blog and I am getting weird characters in the descriptions:
Â
“
�
etc.
It looks like it's replacing quotes, double quotes and the like. Any advice on how to fix this issue?
I am using ColdFusion 8.
Those are probably so-called "smart quotes" pasted in from a Word document or some such. The only way I have ever been able to deal with them is by find-and-replace with dumb quotes (" and ').
Tried that, but I can't get replace to recognize the chars.
Did you try doing the find with these ASCII values?
chr(145), chr(146), chr(147), chr(148), chr(151) and then replace with ', ', ", ", and — respectively?
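A minimal sketch of that find-and-replace, assuming the feed text has already been decoded correctly. ColdFusion strings are Unicode, so a correctly decoded curly quote has a code point in the 8216 to 8221 range rather than the Windows-1252 byte values 145 to 148, which may be why a replace on chr(145) finds nothing. The function name plainQuotes is hypothetical.

```cfml
<cfscript>
// Hypothetical helper: replaces typographic punctuation with plain ASCII.
// Assumes the input was decoded correctly, so curly quotes are Unicode
// code points (8216-8221), not the Windows-1252 byte values (145-148).
function plainQuotes(text) {
    var result = arguments.text;
    result = replace(result, chr(8216), "'", "all");  // left single quote
    result = replace(result, chr(8217), "'", "all");  // right single quote
    result = replace(result, chr(8220), '"', "all");  // left double quote
    result = replace(result, chr(8221), '"', "all");  // right double quote
    result = replace(result, chr(8212), "--", "all"); // em dash
    return result;
}
</cfscript>
```

If the text is already garbled into sequences such as â€œ, this replace will not match either, and the decoding step needs fixing first.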
I am parsing a WordPress Blog and I am getting weird characters in the descriptions?
Â
“
�
etc?
What does the raw XML look like?
Is it a case of CF8 munging it, or is the feed just munged?
--
Adam
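One way to answer that question, sketched under assumptions: fetch the feed as raw bytes and decode the same bytes under two candidate encodings. The URL is a placeholder, feedResponse is a hypothetical variable name, and the sketch assumes fileContent comes back as a binary object when getAsBinary="yes" is set.

```cfml
<!--- Fetch the feed as bytes so no charset is applied during the request. --->
<cfhttp url="http://example.com/feed/" method="get"
        getAsBinary="yes" result="feedResponse">

<!--- Decode the identical bytes under two candidate encodings. --->
<cfset asUtf8   = charsetEncode(feedResponse.fileContent, "utf-8")>
<cfset asCp1252 = charsetEncode(feedResponse.fileContent, "windows-1252")>

<!--- Show the first 500 characters of each decoding side by side. --->
<cfoutput>
<pre>#htmlEditFormat(left(asUtf8, 500))#</pre>
<pre>#htmlEditFormat(left(asCp1252, 500))#</pre>
</cfoutput>
```

If the first block reads cleanly and the second shows the weird characters, the feed itself is probably fine and the problem lies in how it is being decoded. If both look wrong, the feed is more likely munged at the source.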
Okay, so Adam's gone from "jingoistically" to "munged" in the space of a week?
Okay, so Adam's gone from "jingoistically" to "munged" in the space of a week?
I am very very hungover.
I suppose I could say my head is munged. More-so than usual, I mean.
--
Adam
Ah that would explain it. I reckon I'm about to *start* drinking, having discovered that the VMware forums I have to sign up to also use this shanty Godforsaken Jive software.
Owain North wrote:
shanty Godforsaken Jive software.
In my opinion, this forum, a Jive web site, just got better.
The problem is encoding. You seem to be using an encoding like Windows 1252. The weird characters should revert to readable characters when you convert to UTF-8 encoding.
Three suggestions on correcting this:
1) Make sure the XML declaration has UTF-8 encoding, thus
<?xml version="1.0" encoding="UTF-8" ?>
or 2) Place the following tag at the top of the page
<cfprocessingdirective pageEncoding="UTF-8">
or 3) To correct the encoding just for rendering the output in the browser, place this tag at the top of the page
<cfcontent type="text/xml; charset=UTF-8">
(alternatively, <cfcontent type="application/xhtml+xml; charset=UTF-8">)
Edited: 3rd suggestion added
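A small sketch of how characters like these can arise, assuming the feed is UTF-8 and is being read as Windows-1252. It uses only charsetDecode (string to bytes) and charsetEncode (bytes to string); the variable names are hypothetical.

```cfml
<!--- A left curly quote (U+201C) is three bytes in UTF-8: E2 80 9C. --->
<cfset quoteBytes = charsetDecode(chr(8220), "utf-8")>

<!--- Reading those bytes as Windows-1252 yields three separate characters. --->
<cfset garbled = charsetEncode(quoteBytes, "windows-1252")>

<!--- Reversing the mistake restores the original single character. --->
<cfset repaired = charsetEncode(charsetDecode(garbled, "windows-1252"), "utf-8")>

<!--- Compare the lengths of the garbled and repaired strings. --->
<cfoutput>#len(garbled)# #len(repaired)#</cfoutput>
```

The same mechanism turns a non-breaking space (UTF-8 bytes C2 A0) into Â followed by a space. The closing curly quote (U+201D) ends in byte 9D, which Windows-1252 leaves undefined, so a decoder may substitute the replacement character; that would be one explanation for the � in the question, and it means this round trip cannot repair every character.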
or 2) Place the following tag at the top of the page <cfprocessingdirective pageEncoding="UTF-8">
This is a compiler directive, and only pertinent if there's UTF-8-encoded text in the CFM file. It's not relevant in this situation.
--
Adam
Adam Cameron. wrote:
or 2) Place the following tag at the top of the page <cfprocessingdirective pageEncoding="UTF-8">
This is a compiler directive, and only pertinent if there's UTF-8-encoded text in the CFM file. It's not relevant in this situation.
What you say is correct. However, I was after something else, which is relevant to this situation.
ColdFusion's encoding is by default UTF-8. However, since the output contains unreadable characters, the effective encoding isn't UTF-8 (as I said, it is likely Windows 1252). This could mean that ColdFusion guessed the encoding from the byte-order-mark. If so, then using the cfprocessingdirective as suggested would force an error, giving us more information.
This could mean that ColdFusion guessed the encoding from the byte-order-mark. If so, then using the cfprocessingdirective as suggested would force an error, giving us more information.
No, it wouldn't. Like I said, that's a COMPILER directive. It's only relevant at compile time. Not runtime.
It has no bearing on anything other than how the CFM file is compiled. That's it.
--
Adam
Adam Cameron. wrote:
This could mean that ColdFusion guessed the encoding from the byte-order-mark. If so, then using the cfprocessingdirective as suggested would force an error, giving us more information.
No, it wouldn't. Like I said, that's a COMPILER directive. It's only relevant at compile time. Not runtime.
It has no bearing on anything other than how the CFM file is compiled. That's it.
The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.
Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.
The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.
Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.
The stuff that's encoded "wrong" is the XML feed, right? So that's an external file. Without the code being posted, we don't know how the XML is getting to the CFM code processing it (it could be CFHTTP, xmlParse(), CFFEED, etc), but one can tell just from the URL and what the OP says: the XML is not in his CFM file.
CFPROCESSINGDIRECTIVE only works on CFM files (well: and CFCs... but CF "source code files"). And only works on the specific source code within the file the tag is within. And only does anything at compile time. It has no impact on how the CF code then handles encoding of any data it encounters. And never has had.
Also worthy of note is that the CF compiler does not pay any attention to BOMs, so one needs to specify this tag if the file has UTF-8-encoded text within it.
All CFPROCESSINGDIRECTIVE does is to say to the compiler "treat this CFM file's contents as UTF-8 (etc) encoded, rather than plain old ASCII when you compile this file, please". This is so that if - in your source code - you have UTF-8-encoded text, it doesn't get munged by the compiler. The tag is not itself compiled; there is nothing in the resultant class file (which is what gets executed at runtime) to indicate anything about encoding (ie: CFPROCESSINGDIRECTIVE does not indicate to take any action at runtime, because it's not there at runtime).
If you have this code:
<cfprocessingdirective pageencoding="UTF-8">
<cfset foo = "bar">
And this code:
<cfset foo = "bar">
in separate files, the resultant class files after the code is compiled are identical (other than timestamps and internal references to file names). And there is no reference to encoding in there at all. All there is is the wherewithal to set foo to "bar".
So... deep breath... CFPROCESSINGDIRECTIVE is irrelevant to this situation.
I'm not going to say it again 😉
--
Adam
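If the feed is being fetched at runtime, as described above, the encoding would be set where the XML is read rather than in the source file. A hedged sketch, assuming CFHTTP is the transport (the thread does not say which mechanism is actually in use): the URL is a placeholder, feedResponse and feedXml are hypothetical names, and the effect of the charset attribute on response decoding is worth verifying on CF8.

```cfml
<!--- Placeholder URL; charset names the encoding to assume for the feed. --->
<cfhttp url="http://example.com/feed/" method="get"
        charset="utf-8" result="feedResponse">

<!--- Parse the response body. No cfprocessingdirective is involved,
      because the feed text never appears in the CFM source file. --->
<cfset feedXml = xmlParse(feedResponse.fileContent)>

<!--- Print the root element name as a quick sanity check. --->
<cfoutput>#htmlEditFormat(feedXml.xmlRoot.xmlName)#</cfoutput>
```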
Adam Cameron. wrote:
The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.
Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.
The stuff that's encoded "wrong" is the XML feed, right?
No, not necessarily. That's just one of the possibilities we have to rule out. That one happens at runtime.
However, a CFM, CFC, or a page included in it, might introduce a byte-order-mark. That could make ColdFusion interpret the encoding as, for example, Windows 1252, even though you, the developer, know for sure that ColdFusion's default encoding is UTF-8.
One way I can think of to find this out is to put the cfprocessingdirective tag (with UTF-8 pageEncoding) at the very beginning of the page. If you get an error, then you will know the page likely has a conflicting encoding resulting from a byte-order-mark. This one can happen at compile-time.
Adam Cameron. wrote:
Also worthy of note is that the CF compiler does not pay any attention to BOMs, ...
A bold statement that needs correction nevertheless. The ColdFusion compiler does pay attention to the byte-order-mark, if there is one. In fact, using the cfprocessingdirective tag to set the encoding is roughly equivalent to setting the byte-order-mark!
Adam Cameron. wrote:
The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.
Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.
If you have this code:
<cfprocessingdirective pageencoding="UTF-8">
<cfset foo = "bar">
And this code:
<cfset foo = "bar">
in separate files, the resultant class files after the code is compiled are identical (other than timestamps and internal references to file names). And there is no reference to encoding in there at all. All there is is the wherewithal to set foo to "bar".
All very well, but besides my point. I'll use your example to illustrate.
Case 1:
[Non-UTF-8 Byte Order Mark at beginning of page]
<cfprocessingdirective pageencoding="UTF-8">
<cfset foo = "bar">
Case 2:
[Non-UTF-8 Byte Order Mark at beginning of page]
<cfset foo = "bar">
I am saying that ColdFusion will produce an error in Case 1, but not in Case 2.
The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.
Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.
If you have this code:
<cfprocessingdirective pageencoding="UTF-8">
<cfset foo = "bar">
And this code:
<cfset foo = "bar">
in separate files, the resultant class files after the code is compiled are identical (other than timestamps and internal references to file names). And there is no reference to encoding in there at all. All there is is the wherewithal to set foo to "bar".
All very well, but besides my point. I'll use your example to illustrate.
Case 1:
[Non-UTF-8 Byte Order Mark at beginning of page]
<cfprocessingdirective pageencoding="UTF-8">
<cfset foo = "bar">
Case 2:
[Non-UTF-8 Byte Order Mark at beginning of page]
<cfset foo = "bar">
I am saying that ColdFusion will produce an error in Case 1, but not in Case 2.
Sorry, you're quite correct here. What I should have said is that the BOM is ignored by CF insofar as working out whether to compile for UTF-8 or not. That is, simply having a UTF-8 BOM in a file should be enough for the compiler to work it out, but it isn't. One still needs the CFPROCESSINGDIRECTIVE to tell the compiler to do it correctly. It's odd that it'll work out that it shouldn't do it if the BOM doesn't match the pageencoding value, but it won't then work out what it should be doing on that basis.
--
Adam
Adam Cameron. wrote:
It's odd that it [ColdFusion] will work out that it shouldn't do it if the BOM doesn't match the pageencoding value, but it won't then work out what it should be doing on that basis.
Perhaps because there are many more types of encoding than ColdFusion can process. Just a guess.