Parsing XML

New Here, Jan 14, 2009

I'm a bit of a noob at parsing XML with ColdFusion and could use some help with an issue.

I'm trying to parse the following XML file:
http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml

and I get the following error:

Character conversion error: "Illegal ASCII character, 0xe9" (line number may be too low).
The error occurred on line 9.

I can post my code if that would be helpful, but it works fine when parsing a different XML file.

Any ideas?

LEGEND, Jan 14, 2009

Sam_Ham wrote:
> I'm a bit of a noob at parsing XML with ColdFusion and could use some help
> with an issue.
>
> I'm trying to parse the following XML file:
> http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml
>

Well, the first thing I can tell you is that Firefox says this is not a
well-formed XML file, and as such it can't be processed as XML.

If Firefox can't do it, I doubt that ColdFusion can.

Advocate, Jan 14, 2009

The DTD of that XML doc is:
http://www.bbc.co.uk/travelnews/xml/tpegml_en/tpegML.dtd
Which also points to:
tpegMLDataTypes.dtd
locML.dtd
ptiML.dtd
rtmML.dtd

rtmML.dtd gets its entity definitions from:
rtmML.ent

This is where I stopped digging. Your entities are all defined in .ent files that you'll have to pull down.

New Here, Jan 14, 2009

Thanks Kronin,

You'll have to excuse my ignorance with XML.

How do I pull these entities down and use them?

LEGEND, Jan 14, 2009

Sam_Ham wrote:
> Thanks Kronin,
>
> You'll have to excuse my ignorance with XML.
>
> How do I pull these entities down and use them?

Well, you pull them down the same as any other Web resource: just use
the DTD URL.

Now I am interested to see more about actually using them. I've never
had to use a DTD with my ColdFusion XML processing and would like to see
something about how this is done.

LEGEND, Jan 14, 2009

> Character conversion error: "Illegal ASCII character, 0xe9" (line number may
> be too low).
> The error occurred on line 9.

The error message pretty much tells you what's wrong. There's a 0xE9
byte in the doc, which isn't a legal character in the encoding the parser
is using (ASCII, going by the message). It's not a well-formed XML
doc, so you can't treat it as one.
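
To make the byte-level point concrete, here is a minimal Java sketch (not code from this thread). It assumes, purely for illustration, that the 0xE9 byte came from ISO-8859-1 text, where it encodes "é"; the feed's actual encoding is not stated anywhere in the thread. The class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ByteE9Demo {
    // Decode bytes strictly as US-ASCII; return null if any byte is not valid ASCII.
    static String decodeAsciiStrict(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null; // this is the situation the parser is complaining about
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data: "caf" followed by the byte 0xE9.
        byte[] bytes = {'c', 'a', 'f', (byte) 0xE9};
        // Strict ASCII decoding rejects the input.
        System.out.println(decodeAsciiStrict(bytes));
        // The same bytes decode cleanly once read as ISO-8859-1 (assumed encoding).
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

If the feed really is in a non-ASCII encoding, reading it with that charset before parsing may be enough to get past this particular error.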

You should probably do two things:
1) if the parse fails for reasons like this, catch the exception in the UI
(or wherever appropriate) and put a warning message in along the lines of
"sorry, the traffic service is not currently available".
2) get in touch with the Beeb and tell them their developers need a lesson
in creating XML docs.
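
Point 1 could look roughly like the Java sketch below. It is only an illustration of the catch-and-warn idea, not anything posted in the thread: the class name, the method name, and the decision to return the root element's name on success are all assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class FeedGuard {
    static final String WARNING = "Sorry, the traffic service is not currently available.";

    // Return the root element's name if the feed parses, or a friendly warning if it does not.
    static String rootNameOrWarning(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Any parse failure (bad encoding, undefined entity, ...) ends up here.
            return WARNING;
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNameOrWarning("<tpeg_document/>"));
        // An undeclared entity makes the document not well formed.
        System.out.println(rootNameOrWarning("<tpeg_document>&rtm31_4;</tpeg_document>"));
    }
}
```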

--
Adam

LEGEND, Jan 14, 2009

Sam_Ham wrote:
> I'm a bit of a noob at parsing XML with ColdFusion and could use some help
> with an issue.
>
> I'm trying to parse the following XML file:
> http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml
>

Firefox is complaining about undefined entities in the file. Looking at
the code I see things like: "&rtm31_4;" and "&loc41_30;". These look
like custom entities, and the XML file has no entity definition section
to define them. I believe that would usually look
something like:

<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
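
As a rough Java illustration of how such declarations let a parser resolve a custom entity, the sketch below declares one entity in the document's internal DTD subset and reads it back. The entity name is taken from the feed, but the replacement text "placeholder text" is invented for the example; the real definitions would come from the BBC's own files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EntityDemo {
    // Parse an XML string and return the text content of its root element.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // The ENTITY line gives &rtm31_4; a (hypothetical) replacement text.
        String xml =
              "<!DOCTYPE msg [\n"
            + "  <!ENTITY rtm31_4 \"placeholder text\">\n"
            + "]>\n"
            + "<msg>&rtm31_4;</msg>";
        System.out.println(parseText(xml));
    }
}
```

Without the declaration, the same document is rejected, which matches what Firefox reports for the feed.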

New Here, Jan 14, 2009

Some good points there.

I understand that the XML document is not well formed. I have already e-mailed the BBC about this, hoping they will address the issues.

Despite the issues, the XML file should still be usable, as there are web apps already using it to mash up data into Google Maps etc.

I'm still trying to find a workaround.

There must be a way?

LEGEND, Jan 14, 2009

Sam_Ham wrote:
>
> I'm still trying to find a workaround.
>
> There must be a way?
>

Well, it is just a text file. There is nothing preventing you from
processing it as plain text. You can either use regex or similar
string manipulation techniques to extract the desired information, or use
the string techniques to repair the XML and then process it as an XML file.
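
One way the "repair with string techniques" idea might be sketched in Java: escape every custom entity reference so that it survives as literal text (for example "&rtm31_4;" becomes "&amp;rtm31_4;"), leaving XML's five predefined entities and numeric references alone. This is an assumption about what a useful repair would be, not something proposed in the thread, and the class and method names are made up.

```java
import java.util.regex.Pattern;

public class EntityEscaper {
    // Matches &name; unless name is one of XML's five predefined entities.
    // Numeric references such as &#233; do not match because '#' cannot start a name here.
    private static final Pattern CUSTOM_ENTITY =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos);)([A-Za-z_][A-Za-z0-9_.-]*);");

    // Turn each custom entity reference into escaped literal text.
    static String escapeCustomEntities(String xml) {
        return CUSTOM_ENTITY.matcher(xml).replaceAll("&amp;$1;");
    }

    public static void main(String[] args) {
        String broken = "<msg code=\"&rtm31_4;\">&loc41_30; &amp; &#233;</msg>";
        System.out.println(escapeCustomEntities(broken));
    }
}
```

After this step the entity names are still visible in the parsed text, so they could be looked up later against whatever definitions are available. The DOCTYPE line would still need separate handling.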

Advocate, Jan 14, 2009

When trying to do this:
<cfset myXML = XmlParse("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml",true,"http://www.bbc.co.uk/travelnew...
I got this error:
Recursive entity reference "%tpegMLDataTypes". (Reference path: %tpegMLDataTypes -> %tpegMLDataTypes -> %tpegMLDataTypes)

So CF doesn't like this at all. To simplify the DTD, I pulled it all down and put it into one file (replacing the ENTITY lines that pull in the other files with the contents of those files).
For example, I changed this:
<!ENTITY % tpegMLDataTypes PUBLIC "-//EBU//DTD tpegML data types//EN" "tpegMLDataTypes.dtd">
%tpegMLDataTypes;
to this
<!-- ENTITY % tpegMLDataTypes PUBLIC "-//EBU//DTD tpegML data types//EN" "tpegMLDataTypes.dtd" -->
<!--============================================================-->
<!-- tpegML TPEG Traffic and Travel Information Common Data Types DTD release version -->
<!-- PUBLIC "-//EBU//DTD tpegML data types//EN" -->
<!--============================================================-->
<!-- time: Time in UTC, should be in the format of "YYYY-MM-DDThh:mm:ssZ". -->
<!ENTITY % time "CDATA">
<!-- intunti: Integer Unsigned Tiny, range 0..255 -->
<!ENTITY % intunti "CDATA">
<!-- intsiti: Integer Signed Tiny, range -128..127 -->
<!ENTITY % intsiti "CDATA">
<!-- intunli: Integer Unsigned Little, range 0..65535 -->
<!ENTITY % intunli "CDATA">
<!-- intsili: Integer Signed Little, range -32768..32767 -->
<!ENTITY % intsili "CDATA">
<!-- intunlo: Integer Unsigned Long, range 0..4294967295 -->
<!ENTITY % intunlo "CDATA">
<!-- intsilo: Integer Signed Long, range -2147483648..2147483647 -->
<!ENTITY % intsilo "CDATA">
<!-- numag: Integer from 0 to 3000000 (limited subset of these numbers as defined in TPEG Part 2 - SSF -->
<!ENTITY % numag "CDATA">
<!-- short_string: String of up to 255 characters. -->
<!ENTITY % short_string "CDATA">
<!-- long_string: String of up to 65535 characters. -->
<!ENTITY % long_string "CDATA">
<!-- day_mask: Can select one or more days of the week to indicate repetition.
if (selector = 00000000) : no day selected
if (selector = 0xxxxxx1) : every Sunday
if (selector = 0xxxxx1x) : every Monday
if (selector = 0xxxx1xx) : every Tuesday
if (selector = 0xxx1xxx) : every Wednesday
if (selector = 0xx1xxxx) : every Thursday
if (selector = 0x1xxxxx) : every Friday
if (selector = 01xxxxxx) : every Saturday
-->
<!ENTITY % day_mask "CDATA">

You can get that file here: http://www.hubbach.com/tpegML.dtd
I will delete this file at some point, so don't write your code to use my file. Pull it down onto your system and use it locally. You might have to update this file if the BBC ever changes their DTD or entities.

Once you do that, this will work:
<cfset myXML = XmlParse("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml",true,"http://www.myserver.com/tpegML...

Note that it takes quite a while. My guess is that CF uses a DOM parser rather than a SAX parser. If you wanted to speed this up, you could probably use a Java SAX XML parser.

Advocate, Jan 14, 2009

Oh, and I tested this all on ColdFusion 8.

New Here, Jan 14, 2009

Now we're getting somewhere...

I saved the XML file locally and removed line 2:

<!DOCTYPE tpeg_document PUBLIC "-//EBU/tpegML/EN" "http://www.bbc.co.uk/travelnews/xml/tpegml_en/tpegML.dtd"[

The XML file now parses!

Now I need to work out how to handle the DTD from the live feed?

Advocate, Jan 14, 2009

Sam, don't just remove the DTD declaration. All of those entity declarations in the ___.ent file are needed to make any sense out of the file. If you remove the DTD line from the XML file, none of those entities will be resolved.

New Here, Jan 14, 2009

Thanks Kronin,

I posted this before I noticed your last post... thanks very much for all the help.

I'm going to have to give this a go tomorrow; it's getting late.

I'll keep this topic updated 🙂

New Here, Jan 15, 2009

Kronin,

Unfortunately I can't use your method because I'm using ColdFusion 6, where XmlParse only accepts two parameters:

XmlParse(xmlString [, caseSensitive ] )

No luck!

LEGEND, Jan 14, 2009

Kronin555 wrote:
>
> Note that it takes quite a while. My guess is that CF uses a DOM parser versus
> a SAX parser. If you wanted to speed this up, you could probably use a Java SAX
> XML parser.
>

I wonder whether, with a Java SAX XML parser, one could just use the
DTDs directly rather than pulling them down and concatenating them, as
one apparently has to for ColdFusion.
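
On that question, SAX does at least let the caller intercept DTD lookups through an entity resolver. The Java sketch below is untested against the BBC's DTDs and says nothing about whether the recursion error would go away. It only shows the mechanism: the parser asks for the DTD named in the DOCTYPE, and the handler serves a local copy instead. The DTD URL, the one-line local DTD, and the replacement text are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxLocalDtdDemo {
    // Hypothetical stand-in for a DTD that has been saved locally.
    static final String LOCAL_DTD = "<!ENTITY rtm31_4 \"placeholder text\">";

    static class Collector extends DefaultHandler {
        final List<String> requested = new ArrayList<>();
        final List<String> elements = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Record what the parser asked for, then serve the local copy
            // instead of fetching systemId over the network.
            requested.add(systemId);
            return new InputSource(new StringReader(LOCAL_DTD));
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Stream the document through SAX and return what the handler collected.
    static Collector parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            Collector handler = new Collector();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc SYSTEM \"http://example.invalid/feed.dtd\">"
                   + "<doc><item>&rtm31_4;</item></doc>";
        Collector result = parse(xml);
        System.out.println(result.elements + " " + result.text);
    }
}
```

Because SAX hands over events as it reads, nothing forces the whole document into memory, which is the usual reason it is suggested for large feeds.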

New Here, Jan 15, 2009

I've done a bit of research, and it would make sense to parse the XML file using SAX rather than ColdFusion's DOM parser, because of the size of the file and the processing speed.

I've had a go at doing this, but with little success. Because of my inexperience with XML, I feel like I could be going down the wrong route.

There is no real documentation online about using ColdFusion with SAX.

Does anybody have any knowledge on this?

Advocate, Jan 15, 2009

I borrowed some code from here:
http://www.javacommerce.com/displaypage.jsp?name=saxparser1.sql&id=18232

Here's your ColdFusion code:
<cfset myHandler = CreateObject("Java","MyHandler")>
<cfset myHandler.init()>
<cfset xmlcontent = myHandler.parseXmlToString("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml")>
<cfset xmldoc = xmlparse(xmlcontent)>

And here's the MyHandler.java source. I have no idea what version of Java you're on, since you're still on ColdFusion 6, so I have no idea whether this will compile for you. It runs fine for me on ColdFusion 8 with Java 1.5.
