3 Replies Latest reply on Dec 15, 2008 12:20 PM by PAEz.

    JS/HTML: HTML to DOM so I can XPath

      I'm new to all this and all I want to do is get info from a page and redisplay it. As the title suggests, I'd like to start with the HTML source, parse it into a DOM, and then use XPath to get the info I want. I've only ever done this sort of thing in Greasemonkey before, though, and I don't seem to be able to get it working here.
      Any help would be appreciated.
        • 1. Re: HTML to DOM so I can XPath
          Just to answer my own question ;)
          quote:


          <html>
          <head>
          <title>Get It!</title>
          <link href="sample.css" rel="stylesheet" type="text/css"/>
          <script type="text/javascript" src="lib/air/AIRAliases.js"></script>
          <script type="text/javascript" src="lib/air/AIRIntrospector.js"></script>
          <script type="text/javascript">
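          // AIR-related functions created by the developer

          // runs once the HTMLLoader has finished parsing the string passed to loadString()
          function onHTMLLoadComplete(e)
          {
              // get a reference to the top-level HTML document of the loaded page
              var doc = html.window.document;
              // alternative: var doc = e.target.window.document;
              // evaluate the XPath expression and take the first matching node
              var result = doc.evaluate("//title", doc, null, XPathResult.ANY_TYPE, null);
              var node = result.iterateNext();
              // to visit every match instead of only the first:
              // var thisNode;
              // while ((thisNode = result.iterateNext())) {
              //     alert( thisNode.textContent );
              // }
              var elem = document.createElement( 'div' );
              // guard against pages that have no <title> element
              elem.innerText = 'Title of Page is: ' + (node ? node.textContent : '(no title found)');
              document.body.appendChild( elem );
          }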

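          // declared here so the loader is not an implicit global
          var html;

          // loads the content of a remote URL
          function doRequest(url) {
              var req = new XMLHttpRequest();
              req.onreadystatechange = function() {
                  // readyState 4 means the response has been fully received
                  if (req.readyState == 4) {
                      var str = req.responseText;

                      // parse the source in an HTMLLoader; COMPLETE fires when its DOM is ready
                      html = new air.HTMLLoader();
                      html.addEventListener(air.Event.COMPLETE, onHTMLLoadComplete);
                      html.loadString(str);
                  }
              };
              req.open('GET', url, true);
              req.send(null);
          }

          // opens the URL in the system's default browser
          function openInBrowser(url) {
              air.navigateToURL( new air.URLRequest(url));
          }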

          </script>
          </head>

          <body>
          <h3>HTML to DOM for XPath</h3>

          <ul>

          <li>XMLHttpRequest object can reach into remote domains &mdash; the following loads http://www.adobe.com:
          <br/>
          <input type="button" onclick='doRequest("http://www.adobe.com");' value='doRequest("http://www.adobe.com");'/>
          </li>
          </ul>

          </body>
          </html>



          Now I'd like to know if there's an option to stop it loading images when it parses the DOM. Judging by how long the load took (I'm on dialup), I assume it's still fetching the images. If there's no such option, I wonder whether a regular expression could wreck the image URLs (by renaming the src attribute to something else) so that I could then search for the new attribute with XPath. Pity I'm no good at regular expressions.
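
The regex idea above could be sketched roughly as follows. This is only a minimal sketch: the function name blockImageSources and the replacement attribute name data-blocked-src are made up for illustration, and it assumes plain img tags with no ">" characters inside attribute values. It is meant to run on the source string before loadString() is called, so that the parsed img elements have no src left to fetch.

```javascript
// Hypothetical helper: rename the src attribute on every <img> tag.
// "data-blocked-src" is an assumed attribute name, not an AIR feature.
function blockImageSources(htmlSource) {
    // Find each opening <img ...> tag, then rename src only inside that tag,
    // so src attributes on other elements (e.g. <script>) are left alone.
    return htmlSource.replace(/<img\b[^>]*>/gi, function (tag) {
        return tag.replace(/(\s)src(\s*=)/gi, '$1data-blocked-src$2');
    });
}
```

After loading the rewritten string, an XPath expression such as //img[@data-blocked-src] should find the renamed attributes.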
          • 2. JS/HTML: HTML to DOM so I can XPath
            rrhyne
            quote:
            Now I'd like to know if there's an option to stop it loading images when it parses the DOM. Judging by how long the load took (I'm on dialup), I assume it's still fetching the images. If there's no such option, I wonder whether a regular expression could wreck the image URLs (by renaming the src attribute to something else) so that I could then search for the new attribute with XPath. Pity I'm no good at regular expressions.


            The easiest way I can think of is to add the jQuery library and remove the images from the DOM tree in onHTMLLoadComplete before you display your edited DOM. This would look like:

            $("img").remove();

            http://docs.jquery.com/Manipulation/remove#expr

            You could also achieve the same in plain JavaScript; this example has a nifty function that replaces the images with placeholders:

            http://www.quirksmode.org/dom/fir.html
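
For comparison, a plain JavaScript version of the jQuery one-liner might look something like the sketch below. The function name removeImages is hypothetical, and it simply deletes the img elements rather than swapping in placeholders as the quirksmode example does.

```javascript
// Hypothetical helper: remove every <img> element from the given document
// and return how many were removed.
function removeImages(doc) {
    var imgs = doc.getElementsByTagName('img');
    var count = imgs.length;
    // getElementsByTagName returns a live list in browsers, so walk it
    // backwards: removing an element never shifts an index still to visit.
    for (var i = imgs.length - 1; i >= 0; i--) {
        imgs[i].parentNode.removeChild(imgs[i]);
    }
    return count;
}
```

Inside onHTMLLoadComplete this would be called as removeImages(html.window.document).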

            • 3. Re: JS/HTML: HTML to DOM so I can XPath
              The problem is that as soon as the DOM is created it starts fetching the images, whether or not the page is actually displayed. As you can see in my code, I'm using that event (onHTMLLoadComplete) and the images are still loading (I wonder if I could put a cancelLoad in there? I'll try that later). So what I'm doing now is getting the source, ripping the scripts out with a regex, and changing img tags to blockedimg tags (which causes some error messages in the console, but who cares). I don't actually want the images gone, just not loaded.
              Thanks for the help, though. I've got to get round to looking at jQuery, but I'm just learning JS at the moment and don't want to complicate things (I don't learn easily).
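
A rough sketch of the source clean-up described in this reply, for anyone wanting to try the same thing. The function name prepareSource and both regular expressions are assumptions for illustration, not the exact code used by the poster, and regex-based HTML rewriting like this will likely miss unusual markup (for example a literal closing script tag inside a string).

```javascript
// Hypothetical helper: strip scripts and neutralise images in raw HTML source.
function prepareSource(htmlSource) {
    return htmlSource
        // Drop every <script>...</script> block, including its contents.
        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
        // Rename <img ...> to <blockedimg ...>; the attributes are kept,
        // so the original URL is still available on the renamed element.
        .replace(/<img\b/gi, '<blockedimg');
}
```

The result would be passed to html.loadString() in place of req.responseText, and the renamed tags could then be looked up with an XPath expression such as //blockedimg.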