8 Replies Latest reply on Apr 29, 2009 12:17 AM by ntsiii

    Having issue with XML in CDATA tag

    mazukas

      I've been stuck on a major issue for the last four days.  I run a search against our data provider (let's say the search term is 'foobar').  Our server gets back an XML document along with a list of objects, each giving the start index of a hit term and the length of the hit from that index (our data provider calculates all of this for us).  We then take that XML document exactly as our data provider returned it, wrap it in a CDATA section, put it into another XML document, and pass that to our Flex app.  The document coming back from the server (we're using REST) looks something like this.  This is not the true original document; it was shortened for readability.

       

      <?xml version="1.0" encoding="UTF-8"?>

      <Document>

      <ID>123456</ID>

      <HITTERMS>

      <HITTERM index="45" length="6" />

      <HITTERM index="105" length="6" />

      <HITTERM index="260" length="6" />

      </HITTERMS>

      <DocumentXML><![CDATA[<?xml version="1.0" encoding="UTF-8"?>

      <DOC DOCUMENT_ID="123456" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

      <METADATA>

      <ID>123456</ID>

      <LANGUAGE>ENGLISH</LANGUAGE>

      <SOURCEDATA>1/1/2009</SOURCEDATA>

      <SOURCE>AP</SOURCE>

      </METADATA>

      <ARTICLE>

      <TITLE>Some title with foobar</TITLE>

      <TEXT>

      There would be just standard text.

       

      Some breaks for example like the start of new paragraphs, but otherwise all the foobar text would be condensed like this.

      </TEXT>

      </ARTICLE>

      </DOC>]]></DocumentXML>

      </Document>
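
      To show what the HITTERM offsets are for, here is a minimal sketch of the highlighting step. It is written in TypeScript rather than our actual Flex code, and the names `HitTerm` and `highlightHits` as well as the `[[`/`]]` markers are made up for illustration. It also assumes the index is zero-based and counts characters in the document string exactly as received.

```typescript
// Hypothetical shape for one <HITTERM index="..." length="..." /> entry.
interface HitTerm {
  index: number;  // start offset into the document string (assumed zero-based)
  length: number; // number of characters in the hit
}

// Wrap each hit in marker strings. Hits are applied from the last offset
// to the first, so inserting markers never shifts a hit still to be processed.
function highlightHits(
  text: string,
  hits: HitTerm[],
  open: string = "[[",
  close: string = "]]"
): string {
  const ordered = hits.slice().sort((a, b) => b.index - a.index);
  let result = text;
  for (const hit of ordered) {
    const end = hit.index + hit.length;
    result =
      result.slice(0, hit.index) +
      open +
      result.slice(hit.index, end) +
      close +
      result.slice(end);
  }
  return result;
}

// "foobar" starts at offset 16 in this sample string.
console.log(highlightHits("Some title with foobar", [{ index: 16, length: 6 }]));
// prints "Some title with [[foobar]]"
```

      This only works if the string being sliced is character-for-character the one the offsets were computed against.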

       

       

      I've confirmed the XML coming from the server looks exactly as it does above.  The issue is that whenever I try to get the text out of '<DocumentXML>', it comes back reformatted in a way that won't work for me: the reformatting throws off the offsets of the hit terms and changes the original document.  Whenever I call .toString() on the XML, it puts line breaks before each '<' and after each '>', so everything inside the CDATA section gets spaced out.  The XML looks like this when turned into a string:

       

       

      <?xml version="1.0" encoding="UTF-8"?>

      <Document>

      <ID>123456</ID>

      <HITTERMS>

      <HITTERM index="45" length="6" />

      <HITTERM index="105" length="6" />

      <HITTERM index="260" length="6" />

      </HITTERMS>

      <DocumentXML><![CDATA[

      <?xml version="1.0" encoding="UTF-8"?>

       

      <DOC DOCUMENT_ID="123456" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

       

      <METADATA>

       

      <ID>123456</ID>

       

      <LANGUAGE>ENGLISH</LANGUAGE>

       

      <SOURCEDATA>1/1/2009</SOURCEDATA>

       

      <SOURCE>AP</SOURCE>

       

      </METADATA>

       

      <ARTICLE>

       

      <TITLE>Some title with foobar</TITLE>

       

      <TEXT>

       

      There would be just standard text.

       

      Some breaks for example like the start of new paragraphs, but otherwise all the foobar text would be condensed like this.

       

      </TEXT>

       

      </ARTICLE>

       

      </DOC>

      ]]>

      </DocumentXML>

      </Document>
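
      To make the offset problem concrete, here is a small TypeScript sketch. The fragment is made up, and the `replace` call only imitates the extra line breaks; it is not meant to reproduce the actual formatting rules Flex applies.

```typescript
// Made-up fragment standing in for the CDATA payload.
const original: string =
  "<TITLE>Some title with foobar</TITLE><TEXT>more foobar text</TEXT>";

// Imitate a serializer that adds a line break after every '>'
// (an illustration only, not the real Flex behavior).
const reformatted: string = original.replace(/>/g, ">\n");

// Offset as the data provider would compute it: against the original string.
const hit = { index: original.indexOf("foobar"), length: 6 };

const fromOriginal = original.slice(hit.index, hit.index + hit.length);
const fromReformatted = reformatted.slice(hit.index, hit.index + hit.length);

console.log(JSON.stringify(fromOriginal));    // the hit term itself
console.log(JSON.stringify(fromReformatted)); // shifted by the inserted break
```

      Every inserted character before a hit moves that hit one position further along, so the same index and length no longer land on 'foobar' in the reformatted string.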

       

      I need to keep the original document exactly as it is in the first example so that I can calculate where the hit terms are and highlight them.  There are also things further down the road that I'm going to need to do, so unfortunately a simple search and replace will not get the job done.  Has anyone encountered this before, or does anyone have an idea how to fix it?  Thanks in advance to anyone who can help with this.
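
      One direction that might avoid the reformatting altogether is to cut the CDATA payload out of the raw response text before anything parses or serializes it. The sketch below is in TypeScript, not ActionScript, and rests on two assumptions: the raw response is available as a plain string, and it contains a single CDATA section. The function name `extractCData` is hypothetical.

```typescript
// Return the text between the first "<![CDATA[" and the next "]]>",
// character-for-character, or null if no complete CDATA section is found.
function extractCData(raw: string): string | null {
  const openMarker = "<![CDATA[";
  const closeMarker = "]]>";

  const start = raw.indexOf(openMarker);
  if (start === -1) {
    return null;
  }
  const contentStart = start + openMarker.length;

  // A CDATA section ends at the first "]]>" after it opens.
  const end = raw.indexOf(closeMarker, contentStart);
  if (end === -1) {
    return null;
  }
  return raw.slice(contentStart, end);
}

const response =
  "<Document><DocumentXML><![CDATA[<DOC>\n<ID>123456</ID>\n</DOC>]]></DocumentXML></Document>";
console.log(extractCData(response));
```

      Because the payload is sliced straight from the response string, no line breaks are added or removed, so offsets computed against the original document should still line up.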