17 Replies Latest reply on Jan 15, 2009 8:53 AM by Kronin555

    Parsing XML

    Sam_Ham
      I'm a bit of a noob at parsing XML with ColdFusion and I could use some help with an issue.

      I'm trying to parse the following XML file:
      http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml

      and I get the following error:

      Character conversion error: "Illegal ASCII character, 0xe9" (line number may be too low).
      The error occurred on line 9.

      I can post my code if that would help. It works fine when parsing a different XML file.

      Any ideas?
        • 1. Re: Parsing XML
          Level 7
          Sam_Ham wrote:
          > I'm a bit of a noob with parsing XML with coldfusion and I could use some help
          > with an issue.
          >
          > I'm trying to parse the following XML file;
          > http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml
          >

          Well, the first thing I can tell you is that Firefox says this is not
          a well-formed XML file, and as such it can't be processed as XML.

          If Firefox can't do it, I doubt that ColdFusion can.

          • 2. Re: Parsing XML
            Level 7
            > Character conversion error: "Illegal ASCII character, 0xe9" (line number may
            > be too low).
            > The error occurred on line 9.

            The error message pretty much tells you what's wrong. There's a 0xE9
            byte in the doc, which is illegal in the ASCII encoding the parser is
            assuming. As far as the parser is concerned it's not a well-formed XML
            doc, so you can't treat it as one.
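To make the error message concrete: 0xE9 is "é" in ISO-8859-1 (Latin-1), but it is not a valid ASCII byte and not a valid UTF-8 sequence on its own. Whether the BBC feed is actually Latin-1 is an assumption here, not something verified. A minimal Java sketch (the class and method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the same byte fares under three charsets.
final class ByteE9Demo {
    // Decode strictly: report bad input instead of silently substituting a
    // replacement character.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // True if the bytes are valid in the given charset.
    static boolean decodes(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(decodes(sample, StandardCharsets.US_ASCII));   // false
        System.out.println(decodes(sample, StandardCharsets.UTF_8));      // false
        System.out.println(decodes(sample, StandardCharsets.ISO_8859_1)); // true
    }
}
```

If the feed really is Latin-1, reading it with that charset before handing it to the parser may be enough to get past this particular error.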

            You should probably do two things:
            1) if the parse fails for reasons like this, catch the exception in the UI
            (or wherever appropriate) and put a warning message in along the lines of
            "sorry, the traffic service is not currently available").
            2) get in touch with the Beeb and tell them their developers need a lesson
            in creating XML docs.
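Point 1 could look something like this in Java. It's only a sketch: the class name, method name and the "Parsed root element" text are invented for illustration, and in ColdFusion the equivalent would be a try/catch around the parse call.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: parse the feed, fall back to a friendly message on failure.
final class TrafficFeedGuard {
    static final String FALLBACK = "Sorry, the traffic service is not currently available.";

    static String describe(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return "Parsed root element: " + doc.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Not well-formed (or unreadable): show the warning instead of a stack trace.
            return FALLBACK;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<feed/>"));
        System.out.println(describe("<feed>&undeclared;</feed>"));
    }
}
```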

            --
            Adam
            • 3. Re: Parsing XML
              Level 7
              Sam_Ham wrote:
              > I'm a bit of a noob with parsing XML with coldfusion and I could use some help
              > with an issue.
              >
              > I'm trying to parse the following XML file;
              > http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml
              >

              Firefox is complaining about undefined entities in the file. Looking at
              the code I see things like: "&rtm31_4;" and "&loc41_30;". These look
              like custom entities, and the XML file has no entity declaration section
              to define them. I believe that would usually look something like:

              <!ENTITY nbsp "&#160;">
              <!ENTITY copy "&#169;">
              • 4. Re: Parsing XML
                Sam_Ham Level 1
                Some good points there.

                I understand that the XML document is not well formed. I have already e-mailed the BBC about this, hoping they will address the issues.

                Despite the issues, the XML file should still be usable, as there are web apps already using it to mash up data into Google Maps etc.

                I'm still trying to find a workaround.

                There must be a way?
                • 5. Re: Parsing XML
                  Level 7
                  Sam_Ham wrote:
                  >
                  > I'm still trying to find a workaround.
                  >
                  > There must be a way?
                  >

                  Well, it is just a text file. There is nothing preventing you from
                  processing it as plain text. You can either use regex or similar
                  string manipulation techniques to extract the desired information, or use
                  the string techniques to repair the XML and then process it as an XML file.
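As a rough illustration of the "repair it as text" route, here is a Java sketch that rewrites every custom entity reference into a bracketed placeholder so a strict parser no longer trips over it. The class name and the placeholder format are made up, and note the obvious cost: the meaning behind each entity is lost, you only keep its name.

```java
import java.util.regex.Pattern;

// Hypothetical helper: neutralise undeclared entity references before parsing.
final class EntityScrubber {
    // Matches &name; unless name is one of XML's five predefined entities.
    // Numeric references such as &#160; are left alone because '#' cannot
    // start the name.
    private static final Pattern CUSTOM_ENTITY =
            Pattern.compile("&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_][A-Za-z0-9_.-]*);");

    static String scrub(String xml) {
        // "&rtm31_4;" becomes "[rtm31_4]"
        return CUSTOM_ENTITY.matcher(xml).replaceAll("[$1]");
    }

    public static void main(String[] args) {
        System.out.println(scrub("<m>&rtm31_4; near &loc41_30; &amp; more</m>"));
        // prints: <m>[rtm31_4] near [loc41_30] &amp; more</m>
    }
}
```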
                  • 6. Re: Parsing XML
                    Kronin555 Level 1
                    The DTD of that XML doc is:
                    http://www.bbc.co.uk/travelnews/xml/tpegml_en/tpegML.dtd
                    Which also points to:
                    tpegMLDataTypes.dtd
                    locML.dtd
                    ptiML.dtd
                    rtmML.dtd

                    rtmML.dtd gets its entity definitions from:
                    rtmML.ent

                    This is where I stopped digging. Your entities are all defined in .ent files that you'll have to pull down.
                    • 7. Re: Parsing XML
                      Sam_Ham Level 1
                      Thanks Kronin,

                      You'll have to excuse my ignorance with XML.

                      How do I pull these entities down and use them?
                      • 8. Re: Parsing XML
                        Level 7
                        Sam_Ham wrote:
                        > Thanks Kronin,
                        >
                        > You'll have to excuse my ignorance with XML.
                        >
                        > How do I pull these entities down and use them?

                        Well, you pull them down the same as any other web resource; just use
                        the DTD URL.
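For what it's worth, "pulling it down" can be as small as this in Java. A sketch only: the class and method names are invented, and you would call it once per DTD or .ent file you need.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: save a remote DTD or .ent file to local disk.
final class DtdFetcher {
    static Path fetch(URL source, Path target) {
        try (InputStream in = source.openStream()) {
            // Overwrite any earlier copy so a re-run refreshes the file.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage: java DtdFetcher <url> <target-file>
    public static void main(String[] args) throws Exception {
        Path saved = fetch(URI.create(args[0]).toURL(), Paths.get(args[1]));
        System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```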

                        Now I am interested to see more about actually using them. I've never
                        had to use a DTD with my ColdFusion XML processing and would like to see
                        something about how this is done.

                        • 9. Re: Parsing XML
                          Kronin555 Level 1
                          When trying to do this:
                          <cfset myXML = XmlParse("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml",true,"http://www.bbc.co.uk/travelnews/xml/tpegml_en/tpegML.dtd")>
                          I got this error:
                          Recursive entity reference "%tpegMLDataTypes". (Reference path: %tpegMLDataTypes -> %tpegMLDataTypes -> %tpegMLDataTypes)

                          So CF doesn't like this at all. To simplify the DTD, I pulled it all down and put it into one file (replacing the ENTITY lines that pull in the other files with the file contents themselves).
                          As an example, I changed this:
                          <!ENTITY % tpegMLDataTypes PUBLIC "-//EBU//DTD tpegML data types//EN" "tpegMLDataTypes.dtd">
                          %tpegMLDataTypes;
                          to this
                          <!-- ENTITY % tpegMLDataTypes PUBLIC "-//EBU//DTD tpegML data types//EN" "tpegMLDataTypes.dtd" -->
                          <!--============================================================-->
                          <!-- tpegML TPEG Traffic and Travel Information Common Data Types DTD release version -->
                          <!-- PUBLIC "-//EBU//DTD tpegML data types//EN" -->
                          <!--============================================================-->
                          <!-- time: Time in UTC, should be in the format of "YYYY-MM-DDThh:mm:ssZ". -->
                          <!ENTITY % time "CDATA">
                          <!-- intunti: Integer Unsigned Tiny, range 0..255 -->
                          <!ENTITY % intunti "CDATA">
                          <!-- intsiti: Integer Signed Tiny, range -128..127 -->
                          <!ENTITY % intsiti "CDATA">
                          <!-- intunli: Integer Unsigned Little, range 0..65535 -->
                          <!ENTITY % intunli "CDATA">
                          <!-- intsili: Integer Signed Little, range -32768..32767 -->
                          <!ENTITY % intsili "CDATA">
                          <!-- intunlo: Integer Unsigned Long, range 0..4294967295 -->
                          <!ENTITY % intunlo "CDATA">
                          <!-- intsilo: Integer Signed Long, range -2146483648..2147483647 -->
                          <!ENTITY % intsilo "CDATA">
                          <!-- numag: Integer from 0 to 3000000 (limited subset of these numbers as defined in TPEG Part 2 - SSF -->
                          <!ENTITY % numag "CDATA">
                          <!-- short_string: String of up to 255 characters. -->
                          <!ENTITY % short_string "CDATA">
                          <!-- long_string: String of up to 65535 characters. -->
                          <!ENTITY % long_string "CDATA">
                          <!-- day_mask:Can select one or more days of the week to indicate repetition.
                          if (selector = 00000000) : no day selected
                          if (selector = 0xxxxxx1) : every Sunday
                          if (selector = 0xxxxx1x) : every Monday
                          if (selector = 0xxxx1xx) : every Tuesday
                          if (selector = 0xxx1xxx) : every Wednesday
                          if (selector = 0xx1xxxx) : every Thursday
                          if (selector = 0x1xxxxx) : every Friday
                          if (selector = 01xxxxxx) : every Saturday
                          -->
                          <!ENTITY % day_mask "CDATA">

                          You can get that file here: http://www.hubbach.com/tpegML.dtd
                          I will delete this file at some point, so don't write your code to use my file. Pull it down onto your system and use it locally. You might have to update this file if the BBC ever changes their DTD or entities.
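If you end up doing the parsing in Java rather than through XmlParse, another way to "use it locally" is an EntityResolver that hands the parser your own copy whenever it asks for the remote DTD. This is a sketch under assumptions: the class name, the example.invalid URL and the entity value "roadworks" are all made up, and the local copies are held in a map of strings only to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: serve DTDs from local copies instead of the network.
final class LocalDtdDemo {
    // localCopies maps a system ID (the DTD URL in the DOCTYPE) to DTD text.
    static String rootText(String xml, Map<String, String> localCopies) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver((publicId, systemId) -> {
                String dtd = localCopies.get(systemId);
                if (dtd == null) {
                    return null; // null tells the parser to fetch the system ID itself
                }
                InputSource local = new InputSource(new StringReader(dtd));
                local.setSystemId(systemId);
                return local;
            });
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String dtdUrl = "http://example.invalid/feed.dtd";   // placeholder URL
        String dtd = "<!ENTITY rtm31_4 \"roadworks\">";      // made-up entity value
        String xml = "<!DOCTYPE msg SYSTEM \"" + dtdUrl + "\"><msg>&rtm31_4;</msg>";
        System.out.println(rootText(xml, Collections.singletonMap(dtdUrl, dtd)));
    }
}
```

In real use the resolver would open the downloaded files from disk, one entry per DTD and .ent file.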

                          Once you do that, this will work:
                          <cfset myXML = XmlParse("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml",true,"http://www.myserver.com/tpegML.dtd")>

                          Note that it takes quite a while. My guess is that CF uses a DOM parser versus a SAX parser. If you wanted to speed this up, you could probably use a Java SAX XML parser.
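To show the difference in style: a SAX parser streams through the document and calls you back per element instead of building the whole tree in memory. A minimal Java sketch that just counts elements by name (the class name and the "tpeg_message" element name are assumptions for illustration, not taken from the feed):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: count elements with SAX, without building a DOM tree.
final class SaxElementCounter {
    static int count(String xml, String elementName) {
        final int[] hits = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Called once per opening tag as the parser streams through the input.
                if (qName.equals(elementName)) {
                    hits[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return hits[0];
    }

    public static void main(String[] args) {
        String xml = "<tpeg_document><tpeg_message/><tpeg_message/></tpeg_document>";
        System.out.println(count(xml, "tpeg_message")); // prints 2
    }
}
```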
                          • 10. Re: Parsing XML
                            Kronin555 Level 1
                            Oh, and I tested this all on ColdFusion 8.
                            • 11. Re: Parsing XML
                              Sam_Ham Level 1
                              Now we're getting somewhere...

                              I saved the XML file locally and removed line 2:

                              <!DOCTYPE tpeg_document PUBLIC "-//EBU/tpegML/EN" "http://www.bbc.co.uk/travelnews/xml/tpegml_en/tpegML.dtd"[

                              The XML file now parses!

                              Now I need to work out how to handle the DTD from the live feed?
                              • 12. Re: Parsing XML
                                Kronin555 Level 1
                                Sam, don't just remove the DTD declaration. All of those entity declarations in the ___.ent file are needed to make any sense out of the file. If you remove the DTD line from the XML file, none of those entities will be resolved.
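Here's a small Java illustration of why the declarations matter, at least with Java's built-in parser: the same entity reference resolves when the document declares it and is rejected when it doesn't. The class name and the entity value "roadworks" are made up, and the declaration is inlined in the DOCTYPE only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: an entity reference with and without its declaration.
final class EntityDeclDemo {
    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Text of the root element, with entity references expanded by the parser.
    static String rootText(String xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // True if the parser accepts the document.
    static boolean parses(String xml) {
        try {
            parse(xml);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String declared = "<!DOCTYPE msg [<!ENTITY rtm31_4 \"roadworks\">]>"
                + "<msg>Cause: &rtm31_4;</msg>";
        String undeclared = "<msg>Cause: &rtm31_4;</msg>";
        System.out.println(rootText(declared)); // prints: Cause: roadworks
        System.out.println(parses(undeclared)); // prints: false
    }
}
```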
                                • 13. Re: Parsing XML
                                  Sam_Ham Level 1
                                  Thanks Kronin,

                                  I posted this before I noticed your last post... thanks very much for all the help.

                                  I'm going to have to give this a go tomorrow, it's getting late.

                                  I'll keep this topic updated :)
                                  • 14. Re: Parsing XML
                                    Level 7
                                    Kronin555 wrote:
                                    >
                                    > Note that it takes quite a while. My guess is that CF uses a DOM parser versus
                                    > a SAX parser. If you wanted to speed this up, you could probably use a Java SAX
                                    > XML parser.
                                    >

                                    I wonder whether, using a Java SAX XML parser, one could just use the
                                    DTD directly and not need to pull the files down and concatenate them, as
                                    one apparently has to for ColdFusion.

                                    • 15. Re: Parsing XML
                                      Sam_Ham Level 1
                                      I've done a bit of research, and it would make sense to parse the XML file using SAX rather than ColdFusion's DOM parser, because of the size of the file and processing speeds.

                                      I've had a go at doing this, but with little success. Because of my inexperience with XML I feel like I could be going down the wrong route.

                                      There is no real documentation online about using ColdFusion with SAX.

                                      Does anybody have any knowledge on this?
                                      • 16. Re: Parsing XML
                                        Sam_Ham Level 1
                                        Kronin,

                                        Unfortunately I can't use your method because I'm using ColdFusion 6 and XmlParse only accepts 2 parameters:

                                        XmlParse(xmlString [, caseSensitive ] )

                                        No luck!
                                        • 17. Re: Parsing XML
                                          Kronin555 Level 1
                                          I borrowed some code from here:
                                          http://www.javacommerce.com/displaypage.jsp?name=saxparser1.sql&id=18232

                                          Here's your ColdFusion code:
                                          <cfset myHandler = CreateObject("Java","MyHandler")>
                                          <cfset myHandler.init()>
                                          <cfset xmlcontent = myHandler.parseXmlToString("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml")>
                                          <cfset xmldoc = xmlparse(xmlcontent)>

                                          And here's the MyHandler.java source. I have no idea what version of Java you're on, still being on ColdFusion 6, so I have no idea if this is going to compile for you or not. It runs fine for me on ColdFusion 8 with Java 1.5.
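The MyHandler.java listing itself does not appear in this copy of the thread. The sketch below is not that code. It only illustrates the general idea the ColdFusion snippet relies on, namely letting Java parse the document with its entities and hand back plain XML text with every entity already expanded, so that a two-argument XmlParse has nothing left to resolve. The class name, method name and the entity value "roadworks" are invented, and it uses a JAXP identity transform rather than a hand-written SAX handler.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical stand-in, NOT the original MyHandler: re-serialise XML with
// all entity references expanded.
final class EntityExpander {
    static String expandToString(String xml) {
        try {
            // A Transformer created without a stylesheet copies input to output;
            // the underlying parser expands entities along the way.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not expand entities", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE msg [<!ENTITY rtm31_4 \"roadworks\">]><msg>&rtm31_4;</msg>";
        System.out.println(expandToString(xml));
    }
}
```

Like the original, whether this compiles and runs on the Java version bundled with ColdFusion 6 is untested.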