invalid character in XML

Guest
May 01, 2009

Hi,

I'm trying to parse this public comment feed, which had been working until recently -

http://www.ntia.doc.gov/broadbandgrants/btopcomments.xml

- when I started getting this error -

"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."

I'm trying to strip out the character using this replace command, but it doesn't seem to be working.

XMLText = rereplace(XMLText, chr(14),"","ALL");

Any ideas?

thanks!

TOPICS
Advanced techniques

Advocate
May 01, 2009

Was that a dummy URL you posted?  Because when I pull up the document you linked to, I get the following HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=windows-1252"></HEAD>
<BODY></BODY></HTML>

Guest
May 01, 2009

That's the real feed, though it seems to have stopped responding. Hmmmm. Thankfully I've got a copy of it locally; I'll attach it to this post.

Guest
May 01, 2009

A link to the feed can be found in the top right corner of this page:

http://www.ntia.doc.gov/broadbandgrants/comments.cfm

Advocate
May 02, 2009

Hi, Wingo,

I was perusing the XML file this morning after a bit of Googling and running some failed tests yesterday afternoon (I was able to get the same error you got on both CF 8.0.1 and Railo 3.1). I think I might see the problem.

I stripped out all of the Base64 code from the <content> elements in the document (I first downloaded the XML file via CFHTTP and saved the contents to a local file) and it worked. I was able to read and parse the XML and then output the XML to the browser. No errors or glitches.

I spent a little time trying to determine if it was a particular Base64 section but it seemed like I needed to remove all of them. Admittedly, I kind of got lost in the XML when trying to remove the Base64 sections and add them back in to test various combinations and such!

The only thing that bugs me about this is that I wonder if you can really replace the offending character(s). If these "problematic" characters are in a section that's Base64, can you change them (i.e., find/replace) and still get the right content when the Base64 data is converted to its "real" format?

It might be worth spending some more time with the static XML version to see if there really are just one or two sections of Base64 content causing the issue (and not all of the Base64 sections).
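
One way to reason about the find/replace question: Base64 output uses only the characters A-Z, a-z, 0-9, +, / and = (plus optional line breaks), so a control character like the one in the error message cannot be part of valid encoded data. The sketch below uses a made-up sample string, not data from the feed. It checks an encoded value for characters outside that alphabet and shows that removing an inserted stray character restores the original payload. It assumes the stray character was added to the data rather than substituted for a legitimate character.

```cfml
<cfscript>
// Hypothetical sample payload; the feed's real attachments are not reproduced here.
original = "public interest";
encoded  = ToBase64(original);

// Simulate damage by inserting a control character (decimal 20) after position 4.
damaged = Insert(Chr(20), encoded, 4);

// REFind returns 0 when nothing outside the Base64 alphabet is present.
cleanPos   = REFind("[^A-Za-z0-9+/=]", encoded);
damagedPos = REFind("[^A-Za-z0-9+/=]", damaged);

// Removing the stray character gives back text that decodes to the original.
repaired = Replace(damaged, Chr(20), "", "ALL");
decoded  = ToString(ToBinary(repaired));
matches  = (Compare(decoded, original) EQ 0);
</cfscript>
<cfoutput>
Clean value flagged at: #cleanPos#<br />
Damaged value flagged at: #damagedPos#<br />
Round trip matches: #matches#
</cfoutput>
```

If the same check flags characters inside a real attachment block, that block is a likely candidate for the parse failure.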

Guest
May 03, 2009

Hmmm. Thanks for the thoughtful insight/exploration.

So if I removed the base64 sections, everything should work properly? Seems all of the base64 sections begin with "Content-Transfer-Encoding: base64" and end with the "</content>" closing tag. I'll try removing them and see if I can't get the feed working again - and report back.

Many thanks!

Guest
May 03, 2009

I seem to have gotten it, though admittedly all of the attachments are being removed... which is less than desirable.

Here is the code I'm using.

<cfhttp url="http://www.ntia.doc.gov/broadbandgrants/btopcomments.xml" resolveurl="no" path="/mypath/" />

<cffile action="read" file="/mypath/btopcomments.xml" variable="XMLText" charset="utf-8">

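<!--- Remove each attachment block, from its Content-Type header up to (but not including) the closing </content> tag. --->
<cfloop from="1" to="2000" index="i">
    <cfset start64 = find("Content-Type: application/", XMLText, 1)>
    <!--- Stop when no attachment blocks remain. --->
    <cfif start64 IS 0>
        <cfbreak>
    </cfif>

    <cfset end64 = find("</content>", XMLText, start64)>
    <!--- Stop if a block has no closing tag; otherwise length64 would be negative. --->
    <cfif end64 IS 0>
        <cfbreak>
    </cfif>
    <cfset length64 = end64 - start64>

    <cfoutput>
    #start64# - #end64# -- #length64#<br />
    </cfoutput>

    <cfset XMLText = RemoveChars(XMLText, start64, length64)>
</cfloop>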

<cffile action="write" file="/mypath/btopcomments-clean.xml" output="#XMLText#" charset="utf-8">

Advocate
May 03, 2009

EDIT: You can ignore this one, I think! I was typing my response and didn't see your recent post until after I submitted. I'll leave it in case the curly quote details help at all.

Hi, Wingo,

I was able to get the feed to parse without the Base64 sections but I think of that more as a trouble-shooting step than a solution because I wanted to see if there was an odd character (or more) in these elements.

A skim this morning of the ATOM documentation related to Base64 (if I recall from reading the feed, it was an ATOM feed and not an RSS feed) at http://www.atomenabled.org/developers/syndication/atom-format-spec.php indicates there should not be any issues with Base64 content in the feed.

I also tested the feed with a PHP ATOM parser this morning. I just wanted to see if another server-side language could do it. Nope! PHP threw an invalid character at line 1243. Of course, the line number PHP/ColdFusion see does not necessarily match what we see when we copy the XML into a new file.
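
Since the parser's line number may not match what an editor shows, one option is to locate the character directly in the text that was read in. The following is a minimal sketch with an invented sample string standing in for the feed; the variable names are illustrative, and it assumes the character from the ColdFusion error (0x14, decimal 20) is the one being hunted.

```cfml
<cfscript>
// Hypothetical sample text standing in for the downloaded feed.
XMLText = "<feed>" & Chr(10) & "<title>ok</title>" & Chr(10)
        & "<content>bad" & Chr(20) & "data</content>" & Chr(10) & "</feed>";

// Position of the first occurrence of the character named in the error.
pos = Find(Chr(20), XMLText);
lineNo = 0;
context = "";

if (pos GT 0) {
    // Count the line feeds before the match to get a 1-based line number.
    prefix = Left(XMLText, pos);
    lineNo = Len(prefix) - Len(Replace(prefix, Chr(10), "", "ALL")) + 1;
    // Grab some surrounding text, with the control character made visible.
    context = Replace(Mid(XMLText, Max(1, pos - 20), 40), Chr(20), "[0x14]", "ALL");
}
</cfscript>
<cfoutput>Position #pos#, line #lineNo#: #HTMLEditFormat(context)#</cfoutput>
```

Printing the surrounding text makes it easier to tell whether the character sits inside an attachment block or in ordinary comment text.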

After testing the feed with PHP, I downloaded the feed XML again (in case it was updated since I messed with it the other day) and started to check in the area around this line number. I saw something of potential interest: a lot of instances of what I think are curly quotes in the document.

Search for the text "inherently competing definitions of". Immediately following the word "of " are two funny symbols and then the words 'public interest'. It looks almost like someone had curly quotes around these words and, I believe, curly quotes can cause issues in XML parsing. A document search for this 'character' revealed many of them, all in places that look like a curly quote would go (they were typically in long blocks of text).

Sorry not to be of more help but maybe this (the curly quotes) will get things rolling...

Community Expert
May 03, 2009

"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."
I'm trying to strip out the character using this replace command, but it doesn't seem to be working.
XMLText = rereplace(XMLText, chr(14),"","ALL");

The hexadecimal 0x14 corresponds to decimal 20, so the character to remove is chr(20), not chr(14).
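
A minimal sketch of that correction follows, using an invented sample string in place of the feed. It converts the hexadecimal value from the error message to decimal with InputBaseN and removes that character. The loop afterwards is an optional extra, not something suggested in the thread: it strips the other control characters that XML 1.0 disallows while keeping tab, line feed and carriage return.

```cfml
<cfscript>
// Hypothetical sample text containing the offending character.
XMLText = "public" & Chr(20) & " interest";

// The error reports hexadecimal 14; InputBaseN converts it to decimal 20.
badCode = InputBaseN("14", 16);
XMLText = Replace(XMLText, Chr(badCode), "", "ALL");

// Optionally strip the other control characters XML 1.0 disallows,
// keeping tab (9), line feed (10) and carriage return (13).
for (code = 1; code LTE 31; code = code + 1) {
    if (code NEQ 9 AND code NEQ 10 AND code NEQ 13) {
        XMLText = Replace(XMLText, Chr(code), "", "ALL");
    }
}
</cfscript>
<cfoutput>#XMLText#</cfoutput>
```

Plain Replace is used instead of REReplace because the target is a literal character, so no regular expression is needed.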

Advisor
May 05, 2009

It looks like you have a UTF-8 encoded XML file that is trying to include some characters from a different character set, specifically curly quotes. CF does not do a good job of converting text from non-Unicode to Unicode forms.

See this blog entry from Ray Camden for info and a possible fix.
http://www.coldfusionjedi.com/index.cfm/2006/11/2/xmlFormat-and-Microsofts-Funky-Characters
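
As a rough sketch of the kind of cleanup being described (this is not the code from the linked blog entry), the following replaces Unicode curly quotes with plain ASCII quotes before parsing. The sample string and variable names are invented, and the exact set of characters that needs handling in the feed is an assumption. If the quotes arrived as raw bytes from another character set, the file would also need to be read with the matching charset first; that step is not shown.

```cfml
<cfscript>
// Hypothetical sample text with curly double and single quotes.
sample = "definitions of " & Chr(8220) & "public interest" & Chr(8221)
       & " and the agency" & Chr(8217) & "s role";

cleaned = sample;
// Left and right double quotation marks (U+201C, U+201D).
cleaned = Replace(cleaned, Chr(8220), '"', "ALL");
cleaned = Replace(cleaned, Chr(8221), '"', "ALL");
// Left and right single quotation marks (U+2018, U+2019).
cleaned = Replace(cleaned, Chr(8216), "'", "ALL");
cleaned = Replace(cleaned, Chr(8217), "'", "ALL");
</cfscript>
<cfoutput>#HTMLEditFormat(cleaned)#</cfoutput>
```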
