invalid character in XML

Report · May 01, 2009

Hi,

I'm trying to parse this public comment feed, which had been working until recently -

http://www.ntia.doc.gov/broadbandgrants/btopcomments.xml

- when I started getting this error -

"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."

I'm trying to strip out the character using this replace command, but it doesn't seem to be working.

XMLText = rereplace(XMLText, chr(14),"","ALL");

Any ideas?

thanks!

Report · May 01, 2009

Was that a dummy URL you posted? Because when I pull up the document you linked to, I get the following HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=windows-1252"></HEAD>
<BODY></BODY></HTML>

Report · May 01, 2009

That's the real feed, though it seems to have stopped responding. Hmmmm. Thankfully I've got a copy of it locally. I'll attach it to this post.

Report · May 01, 2009

A link to the feed can be found in the top right corner of this page -

http://www.ntia.doc.gov/broadbandgrants/comments.cfm

Report · May 02, 2009

Hi, Wingo,

I was perusing the XML file this morning after a bit of Googling and running some failed tests yesterday afternoon (I was able to get the same error you got on both CF 8.0.1 and Railo 3.1). I think I might see the problem.

I stripped out all of the Base64 code from the <content> elements in the document (I first downloaded the XML file via CFHTTP and saved the contents to a local file) and it worked. I was able to read and parse the XML and then output the XML to the browser. No errors or glitches.

I spent a little time trying to determine if it was a particular Base64 section but it seemed like I needed to remove all of them. Admittedly, I kind of got lost in the XML when trying to remove the Base64 sections and add them back in to test various combinations and such!

The only thing that bugs me about this is that I wonder if you can really replace the offending character(s). If these "problematic" characters are in a section that's Base64, can you change them (i.e., find/replace) and still get the right content when the Base64 data is converted to its "real" format?

Might be worth spending some more time with the static XML version to see if there really is just one (or two) sections of Base64 content that are causing the issue (and not all of the Base64 sections).
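
One way to check the Base64 question is a small round-trip test. This is only a sketch under assumptions: the sample text and variable names are made up for illustration, and it covers just the case where a stray control character was inserted into otherwise intact Base64. If the character replaced a real Base64 character, the data is already damaged and stripping it cannot restore it.

```cfml
<cfscript>
// Hypothetical text standing in for one attachment payload.
original = "inherently competing definitions of public interest";
encoded  = toBase64(original);

// Simulate a stray 0x14 (decimal 20) landing inside the Base64 text.
damaged = insert(chr(20), encoded, 10);

// The Base64 alphabet is A-Z, a-z, 0-9, "+", "/" and "=",
// so chr(20) can never be part of the real encoded data.
cleaned = replace(damaged, chr(20), "", "ALL");

// Decode and compare with the text we started from.
decoded  = toString(toBinary(cleaned));
isIntact = (compare(decoded, original) EQ 0);
writeOutput(isIntact);
</cfscript>
```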

Report · May 03, 2009

Hmmm. Thanks for the thoughtful insight/exploration.

So if I removed the base64 sections, everything should work properly? Seems all of the base64 sections begin with "Content-Transfer-Encoding: base64" and end with the "</content>" closing tag. I'll try removing them and see if I can't get the feed working again - and report back.

Many thanks!

Report · May 03, 2009

I seem to have gotten it, though admittedly all of the attachments are being removed... which is less than desirable.

Here is the code I'm using.

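<!--- Remove each attachment body, from its Content-Type header up to (not including) the closing content tag. --->
<cfloop from="1" to="2000" index="i">
    <cfset start64 = find("Content-Type: application/", XMLText, 1)>
    <cfif start64 IS 0>
        <!--- No attachments left. --->
        <cfbreak>
    </cfif>

    <cfset end64 = find("</content>", XMLText, start64)>
    <cfif end64 IS 0>
        <!--- Unterminated section: stop rather than pass a negative count to RemoveChars. --->
        <cfbreak>
    </cfif>
    <cfset length64 = end64 - start64>

    <!--- Debug output: start, end, and length of the span being removed. --->
    <cfoutput>
    #start64# - #end64# -- #length64#<br />
    </cfoutput>

    <cfset XMLText = RemoveChars(XMLText, start64, length64)>
</cfloop>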

Report · May 03, 2009

EDIT: You can ignore this one, I think! I was typing my response and didn't see your recent post until after I submitted. I'll leave it in case the curly quote details help at all.

Hi, Wingo,

I was able to get the feed to parse without the Base64 sections, but I think of that more as a troubleshooting step than a solution, because I wanted to see if there was an odd character (or more) in these elements.

A skim this morning of the ATOM documentation related to Base64 (if I recall from reading the feed, it was an ATOM feed and not an RSS feed) at http://www.atomenabled.org/developers/syndication/atom-format-spec.php indicates there should not be any issues with Base64 content in the feed.

I also tested the feed with a PHP ATOM parser this morning. I just wanted to see if another server-side language could do it. Nope! PHP threw an invalid character error at line 1243. Of course, the line number PHP/ColdFusion see does not necessarily match what we see when we copy the XML into a new file.

After testing the feed with PHP, I downloaded the feed XML again (in case it was updated since I messed with it the other day) and started to check in the area around this line number. I saw something of potential interest: a lot of instances of what I think are curly quotes in the document.

Search for the text "inherently competing definitions of". Immediately following the word "of " are two funny symbols and then the words 'public interest'. It looks almost like someone had curly quotes around these words and, I believe, curly quotes can cause issues in XML parsing. A document search for this 'character' revealed many of them, all in places that look like a curly quote would go (they were typically in long blocks of text).

Sorry not to be of more help but maybe this (the curly quotes) will get things rolling...

Report · May 03, 2009

"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."
I'm trying to strip out the character using this replace command, but it doesn't seem to be working.
XMLText = rereplace(XMLText, chr(14),"","ALL");

The hexadecimal 0x14 corresponds to decimal 20, so the character to strip is chr(20), not chr(14).
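
A minimal sketch of the corrected call. The sample string is hypothetical and stands in for the downloaded feed text. The loop is an assumption beyond what the error message reports: it also removes the other control characters below 0x20 that XML 1.0 does not allow, in case 0x14 is not the only offender.

```cfml
<cfscript>
// Hypothetical sample standing in for the downloaded feed text.
XMLText = "<content>definitions of " & chr(20) & "public interest</content>";

// 0x14 is hexadecimal; inputBaseN turns "14" in base 16 into decimal 20.
badCode = inputBaseN("14", 16);

// A plain replace is enough for one literal character; no regex is needed.
XMLText = replace(XMLText, chr(badCode), "", "ALL");

// Assumption: also drop the remaining control characters below 0x20.
// Tab (9), line feed (10) and carriage return (13) are legal in XML 1.0.
for (i = 1; i LTE 31; i = i + 1) {
    if (i NEQ 9 AND i NEQ 10 AND i NEQ 13) {
        XMLText = replace(XMLText, chr(i), "", "ALL");
    }
}

// Parse the cleaned text.
doc = xmlParse(XMLText);
</cfscript>
```

Looping over chr() values is slower than a single regular expression on a large feed, but it avoids any doubt about which escape sequences the regex engine accepts.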

Report · May 05, 2009

It looks like you have a UTF-8 encoded XML file that is trying to include some characters from a different character set, specifically curly quotes. CF does not do a good job of converting text from non-Unicode to Unicode forms.

See this blog entry from Ray Camden for info and a possible fix.
http://www.coldfusionjedi.com/index.cfm/2006/11/2/xmlFormat-and-Microsofts-Funky-Characters
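
A rough sketch of the kind of substitution this suggests. It is not the code from the linked entry; the sample string is hypothetical, and the list of characters is an assumption that covers only quotes, both as Unicode code points and as raw Windows-1252 values.

```cfml
<cfscript>
// Hypothetical sample: curly quotes around a phrase, as described in the thread.
XMLText = "<content>definitions of " & chr(8220) & "public interest" & chr(8221) & "</content>";

// Unicode curly quotes become straight quotes.
XMLText = replace(XMLText, chr(8220), '"', "ALL"); // U+201C left double quote
XMLText = replace(XMLText, chr(8221), '"', "ALL"); // U+201D right double quote
XMLText = replace(XMLText, chr(8216), "'", "ALL"); // U+2018 left single quote
XMLText = replace(XMLText, chr(8217), "'", "ALL"); // U+2019 right single quote

// The same quotes as raw Windows-1252 values (0x91 to 0x94),
// in case the text was never converted to Unicode.
XMLText = replace(XMLText, chr(145), "'", "ALL");
XMLText = replace(XMLText, chr(146), "'", "ALL");
XMLText = replace(XMLText, chr(147), '"', "ALL");
XMLText = replace(XMLText, chr(148), '"', "ALL");

doc = xmlParse(XMLText);
</cfscript>
```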

Adobe Community
