
xmlParse() is generating weird characters?

Engaged, Sep 27, 2011

At: http://www.icontrolwebstudio.com/RSSFeedImport.cfm?rssfeedurl=http://steveholtonline.org/feed/&webid...

I am parsing a WordPress Blog and I am getting weird characters in the descriptions?

Â

“

�

etc?

It looks like it's replacing quotes, double quotes and the like. Any advice on how to fix this issue?

I am using ColdFusion 8

Advocate, Sep 27, 2011

Those are probably so-called "smart quotes" pasted in from a Word document or some such.  The only way I have ever been able to deal with them is by find-and-replace with dumb quotes (" and ').

Engaged, Sep 27, 2011

Tried that, but I can't get replace to recognize the chars.

Advocate, Sep 27, 2011

Did you try doing the find with these character codes (they are Windows-1252 values rather than true ASCII)?

chr(145), chr(146), chr(147), chr(148), chr(151) and then replace with ', ', ", " and a dash respectively?
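
A minimal sketch of that find-and-replace, offered as an illustration rather than a confirmed fix. It assumes the feed text has already been decoded correctly into a ColdFusion string; `description` is a hypothetical variable name. One assumption differs from the codes listed above: chr() in ColdFusion takes Unicode code points, so correctly decoded smart quotes and the em dash would be chr(8216), chr(8217), chr(8220), chr(8221) and chr(8212), not the Windows-1252 byte values 145 to 151.

```cfm
<!--- `description` is a hypothetical variable holding one item's text --->
<cfset description = "He said " & chr(8220) & "hello" & chr(8221) & " and it" & chr(8217) & "s fine " & chr(8212) & " really">

<!--- Replace typographic punctuation with plain ASCII equivalents --->
<cfset cleaned = replace(description, chr(8216), "'", "all")>
<cfset cleaned = replace(cleaned, chr(8217), "'", "all")>
<cfset cleaned = replace(cleaned, chr(8220), '"', "all")>
<cfset cleaned = replace(cleaned, chr(8221), '"', "all")>
<cfset cleaned = replace(cleaned, chr(8212), "-", "all")>

<cfoutput>#htmlEditFormat(cleaned)#</cfoutput>
```

If the text was decoded with the wrong charset, each smart quote arrives as two or three separate characters (the "â€œ" pattern), and none of these replaces will match. In that case the decoding needs fixing first.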


LEGEND, Sep 28, 2011

At: http://www.icontrolwebstudio.com/RSSFeedImport.cfm?rssfeedurl=http://steveholtonline.org/feed/&webi...

I am parsing a WordPress Blog and I am getting weird characters in the descriptions?

Â

“

�

etc?

What does the raw XML look like?

Is it a case of CF8 munging it, or is the feed just munged?
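
One way to answer that question is sketched below, under stated assumptions: `feedUrl` is a placeholder and `feedResponse` is a hypothetical result name. The idea is to fetch the feed as raw bytes and decode the same bytes two ways. If the UTF-8 decoding reads cleanly and the Windows-1252 decoding shows the "Â" and "â€œ" patterns, the feed itself is probably fine and the damage happens when it is decoded.

```cfm
<!--- feedUrl is a placeholder; substitute the real feed address --->
<cfset feedUrl = "http://example.com/feed/">

<!--- Fetch the feed as raw bytes so no charset conversion is applied yet --->
<cfhttp url="#feedUrl#" method="get" getasbinary="yes" result="feedResponse">

<!--- charsetEncode() turns binary data into a string using the named encoding --->
<cfset asUtf8 = charsetEncode(feedResponse.fileContent, "utf-8")>
<cfset as1252 = charsetEncode(feedResponse.fileContent, "windows-1252")>

<cfoutput>
<p>Content-Type: #htmlEditFormat(feedResponse.mimeType)#, charset: #htmlEditFormat(feedResponse.charset)#</p>
<pre>#htmlEditFormat(left(asUtf8, 500))#</pre>
<pre>#htmlEditFormat(left(as1252, 500))#</pre>
</cfoutput>
```

The first 500 characters normally include the XML declaration, so the encoding the feed claims for itself is visible too.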

--

Adam

Guide, Sep 28, 2011

Okay, so Adam's gone from "jingoistically" to "munged" in the space of a week?

LEGEND, Sep 28, 2011

Okay, so Adam's gone from "jingoistically" to "munged" in the space of a week?

I am very very hungover.

I suppose I could say my head is munged.  More-so than usual, I mean.

--

Adam

Guide, Sep 28, 2011

Ah that would explain it. I reckon I'm about to *start* drinking, having discovered that the VMware forums I have to sign up to also use this shanty Godforsaken Jive software.

Community Expert, Sep 28, 2011

Owain North wrote:

shanty Godforsaken Jive software.

In my opinion, this forum, a Jive web site, just got better.

Community Expert, Sep 28, 2011

The problem is encoding. The text looks like UTF-8 that is being read as a single-byte encoding such as Windows-1252. The weird characters should revert to readable characters once the data is treated as UTF-8 throughout.

Three suggestions on correcting this:

1) Make sure the XML declaration has UTF-8 encoding, thus

<?xml version="1.0" encoding="UTF-8" ?>

or 2) Place the following tag at the top of the page

<cfprocessingdirective pageEncoding="UTF-8">

or 3) To correct the encoding just for rendering the output in the browser, place this tag at the top of the page

<cfcontent type="text/xml; charset=UTF-8"> (alternatively,

<cfcontent type="application/xhtml+xml; charset=UTF-8"> )

Edited: 3rd suggestion added
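
For illustration only, here is a sketch of fetching and parsing the feed with the encoding stated explicitly at each step. It assumes the feed really is UTF-8 and follows the usual RSS 2.0 layout (`/rss/channel/item` with a `description` child). `feedUrl` is a placeholder, and `feedResponse`, `feedXml` and `item` are hypothetical names.

```cfm
<!--- feedUrl is a placeholder; substitute the real feed address --->
<cfset feedUrl = "http://example.com/feed/">

<!--- Ask cfhttp to treat the response body as UTF-8 --->
<cfhttp url="#feedUrl#" method="get" charset="utf-8" result="feedResponse">

<!--- Parse the decoded string into an XML document object --->
<cfset feedXml = xmlParse(feedResponse.fileContent)>

<!--- Tell the browser the page is UTF-8 as well --->
<cfcontent type="text/html; charset=utf-8">

<cfoutput>
<cfloop array="#xmlSearch(feedXml, '/rss/channel/item')#" index="item">
  <p>#htmlEditFormat(item.description.xmlText)#</p>
</cfloop>
</cfoutput>
```

If the output is still garbled with this in place, the feed may not actually be UTF-8, or the bytes may have been damaged before they reached ColdFusion.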

LEGEND, Sep 28, 2011

or 2) Place the following tag at the top of the page

<cfprocessingdirective pageEncoding="UTF-8">

This is a compiler directive, and only pertinent if there's UTF-8-encoded text in the CFM file.  It's not relevant in this situation.

--

Adam

Community Expert, Sep 28, 2011

Adam Cameron. wrote:

or 2) Place the following tag at the top of the page

<cfprocessingdirective pageEncoding="UTF-8">

This is a compiler directive, and only pertinent if there's UTF-8-encoded text in the CFM file.  It's not relevant in this situation.

What you say is correct. However, I was after something else, which is relevant to this situation.

ColdFusion's encoding is by default UTF-8. However, since the output contains unreadable characters, it means the effective encoding isn't UTF-8 (As I said, it is likely Windows 1252). This could mean that ColdFusion guessed the encoding from the byte-order-mark. If so, then using the cfprocessingdirective as suggested would force an error, giving us more information.  

LEGEND, Sep 28, 2011

This could mean that ColdFusion guessed the encoding from the byte-order-mark. If so, then using the cfprocessingdirective as suggested would force an error, giving us more information.  

No, it wouldn't.  Like I said, that's a COMPILER directive.  It's only relevant at compile time.  Not runtime.

It has no bearing on anything other than how the CFM file is compiled.  That's it.

--

Adam

Community Expert, Sep 28, 2011

Adam Cameron. wrote:

This could mean that ColdFusion guessed the encoding from the byte-order-mark. If so, then using the cfprocessingdirective as suggested would force an error, giving us more information.  

No, it wouldn't.  Like I said, that's a COMPILER directive.  It's only relevant at compile time.  Not runtime.

It has no bearing on anything other than how the CFM file is compiled.  That's it.

The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.

Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.

LEGEND, Sep 28, 2011

The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.

Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.

The stuff that's encoded "wrong" is the XML feed, right?  So that's an external file.  Without the code being posted, we don't know how the XML is getting to the CFM code processing it (it could be CFHTTP, xmlParse(), CFFEED, etc), but one can tell just from the URL and what the OP says: the XML is not in his CFM file.

CFPROCESSINGDIRECTIVE only works on CFM files (well: and CFCs... but CF "source code files").  And only works on the specific source code within the file the tag is within.  And only does anything at compile time.  It has no impact on how the CF code then handles encoding of any data it encounters.  And never has had.

Also worthy of note is that the CF compiler does not pay any attention to BOMs, so one needs to specify this tag if the file has UTF-8-encoded text within it.

All CFPROCESSINGDIRECTIVE does is to say to the compiler "treat this CFM file's contents as UTF-8 (etc) encoded, rather than plain old ASCII when you compile this file, please".  This is so that if - in your source code - you have UTF-8-encoded text, it doesn't get munged by the compiler.  The tag is not itself compiled; there is nothing in the resultant class file (which is what gets executed at runtime) to indicate anything about encoding (ie: CFPROCESSINGDIRECTIVE does not indicate to take any action at runtime, because it's not there at runtime).

If you have this code:

<cfprocessingdirective pageencoding="UTF-8">

<cfset foo = "bar">

And this code:

<cfset foo = "bar">

in separate files, the resultant class files after the code is compiled are identical (other than timestamps and internal references to file names).  And there is no reference to encoding in there at all.  All there is is the wherewithal to set foo to "bar".

So... deep breath... CFPROCESSINGDIRECTIVE is irrelevant to this situation.

I'm not going to say it again 😉

--

Adam

Community Expert, Sep 29, 2011

Adam Cameron. wrote:

The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.

Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.

The stuff that's encoded "wrong" is the XML feed, right?

No, not necessarily. That's just one of the possibilities we have to rule out. That one happens at runtime.

However, a CFM, CFC, or a page included in it, might introduce a byte-order-mark. That could make ColdFusion interpret the encoding as, for example, Windows 1252. Even though you the developer know for sure that ColdFusion's default encoding is UTF-8.

One way I can think of to find this out is to put the cfprocessingdirective tag (with UTF-8 pageEncoding) at the very beginning of the page. When you get an error, then you will know the page likely has a conflicting encoding resulting from a byte-order-mark. This one can happen at compile-time.

Community Expert, Sep 29, 2011

Adam Cameron. wrote:

Also worthy of note is that the CF compiler does not pay any attention to BOMs, ... 

A bold statement that needs correction nevertheless. The ColdFusion compiler does pay attention to the byte-order-mark, if there is one. In fact, using the cfprocessingdirective tag to set the encoding is roughly equivalent to setting the byte-order-mark!

Community Expert, Sep 29, 2011

Adam Cameron. wrote:

The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.

Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.

If you have this code:

<cfprocessingdirective pageencoding="UTF-8">

<cfset foo = "bar">

And this code:

<cfset foo = "bar">

in separate files, the resultant class files after the code is compiled are identical (other than timestamps and internal references to file names).  And there is no reference to encoding in there at all.  All there is is the wherewithal to set foo to "bar".

All very well, but besides my point. I'll use your example to illustrate.

Case 1:

[Non-UTF-8 Byte Order Mark at beginning of page]
<cfprocessingdirective pageencoding="UTF-8">
<cfset foo = "bar">

Case 2:

[Non-UTF-8 Byte Order Mark at beginning of page]
<cfset foo = "bar">

I am saying that ColdFusion will produce an error in Case 1, but not in Case 2.

LEGEND, Sep 29, 2011

The distinction between compile-time and runtime is irrelevant to my argument. ColdFusion is known to guess the encoding from the byte-order-mark (if it encounters one when parsing a page or document). That will override ColdFusion's own default UTF-8 encoding.

Suppose it guesses Windows 1252. Then if you use the cfprocessingdirective with UTF-8, the result will be an error (I had automatically used the word 'exception', but replaced it with 'error' to avoid the suggestion of runtime!). What I have said is well-known (to this forum, too). Unless you're saying the behaviour of the cfprocessingdirective tag has recently changed.

If you have this code:

<cfprocessingdirective pageencoding="UTF-8">

<cfset foo = "bar">

And this code:

<cfset foo = "bar">

in separate files, the resultant class files after the code is compiled are identical (other than timestamps and internal references to file names).  And there is no reference to encoding in there at all.  All there is is the wherewithal to set foo to "bar".

All very well, but besides my point. I'll use your example to illustrate.

Case 1:

[Non-UTF-8 Byte Order Mark at beginning of page]
<cfprocessingdirective pageencoding="UTF-8">
<cfset foo = "bar">

Case 2:

[Non-UTF-8 Byte Order Mark at beginning of page]
<cfset foo = "bar">

I am saying that ColdFusion will produce an error in Case 1, but not in Case 2.

Sorry, you're quite correct here.  What I should have said is that the BOM is ignored by CF insofar as working out whether to compile for UTF-8 or not.  IE: simply having a UTF-8 BOM in a file should be enough to have the compiler work it out, but it doesn't.  One still needs the CFPROCESSINGDIRECTIVE to tell the compiler to do it correctly.  It's odd that it'll work out that it shouldn't do it if the BOM doesn't match the pageencoding value, but it won't then work out what it should be doing on that basis.

--

Adam

Community Expert, Sep 29, 2011

Adam Cameron. wrote:

It's odd that it[ColdFusion]'ll work out that it shouldn't do it if the BOM doesn't match the pageencoding value, but it won't then work out what it should be doing on that basis.

Perhaps because there are many more types of encoding than ColdFusion can process. Just a guess.
