6 Replies Latest reply on Jul 14, 2011 6:09 AM by Claudiu Ursica

    parsing HTML




      Is it possible to parse an HTML page using Flex and AIR, and then save the information, preferably in an Excel sheet?


      Detailed Problem statement -


      I need to parse a forum (say, the Flex forum itself) and extract information such as the links to the posted queries, the dates they were posted, who replied to each query and on which date, and when the query was actually resolved.


      Our program should be able to log in to the forum and then parse and save all the information mentioned above from the page.


      I would be very thankful if someone could suggest an approach, ideally with an example (say, logging in to the forum).



        • 1. Re: parsing HTML
          Claudiu Ursica Level 4

          Presumably, if the HTML source is valid XHTML, you can parse it as regular XML. Otherwise, regex?
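To illustrate the "parse it as regular XML" idea, here is a minimal, language-neutral sketch in Python rather than ActionScript (in Flex/AIR the analogous step would presumably be handing the response text to the XML class). The XHTML fragment, the URLs, and the helper names `extract_links` and `is_well_formed` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML standing in for a forum listing page.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul>
      <li><a href="/thread/101">parsing HTML</a></li>
      <li><a href="/thread/102">AMF question</a></li>
    </ul>
  </body>
</html>"""

# XHTML elements live in this namespace, so queries must name it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def extract_links(xhtml_text):
    """Return (href, link text) pairs for every <a> element."""
    root = ET.fromstring(xhtml_text)
    return [(a.get("href"), (a.text or "").strip())
            for a in root.iterfind(".//x:a", NS)]


links = extract_links(XHTML)
```

If `is_well_formed` returns False for the real page, the XML route is out and the page has to be handled as plain text instead.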


          Depending on the source, you could parse the feeds instead of the site. If you need to scrape the page, you need to find out which requests go on behind the scenes and fake them from Flex/AIR.
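As a rough sketch of what "parse the feeds" could look like, assuming the forum exposes a plain RSS 2.0 feed (an assumption; the feed content, the URLs, and the `parse_feed` helper below are hypothetical), again in Python for illustration:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed; a real forum feed may use other elements.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example forum</title>
    <item>
      <title>parsing HTML</title>
      <link>http://forum.example.com/thread/101</link>
      <pubDate>Wed, 13 Jul 2011 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>AMF question</title>
      <link>http://forum.example.com/thread/102</link>
      <pubDate>Thu, 14 Jul 2011 06:09:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(feed_text):
    """Return one dict per <item> with its title, link, and posting date."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.iterfind("./channel/item"):
        posted = parsedate_to_datetime(item.findtext("pubDate"))
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": posted.date().isoformat(),
        })
    return items


entries = parse_feed(FEED)
```

A feed is already well-formed XML, which is why it tends to be easier to work with than the rendered page.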



          I don't have an example because I never had to do it, but Google is your friend here.




          • 2. Re: parsing HTML
            kokorito Level 4

            Yes, it's possible.

            The site I was parsing wasn't strict enough to convert to XML, so I had to treat it as a string.
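For the "treat it as a string" case, a minimal sketch using a regular expression, in Python for illustration. The sloppy markup, the pattern, and the `extract_links` helper are assumptions for the example, not the code actually used; a pattern like this only copes with simple anchors and would need adjusting to the real page.

```python
import re
from html import unescape

# Hypothetical non-strict HTML: unquoted attribute, unclosed cells,
# inconsistent tag case. An XML parser would reject this.
HTML = """<table>
<tr><td><a href=/thread/101>parsing HTML</A><td>Jul 13, 2011
<tr><td><a href="/thread/102">AMF &amp; PHP</a><td>Jul 14, 2011
</table>"""

# Matches <a href=...>text</a> with or without quotes around the URL.
LINK_RE = re.compile(
    r'<a\s+href=["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html_text):
    """Return (href, link text) pairs, with HTML entities decoded."""
    return [(href, unescape(text.strip()))
            for href, text in LINK_RE.findall(html_text)]


links = extract_links(HTML)
```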


            Eventually I found it easier to do it server-side using PHP and a DOM parser, http://simplehtmldom.sourceforge.net/, and then send an array to Flex using AMF.

            • 3. Re: parsing HTML
              ashok.tech Level 1

              Thanks Claudiu,


              Can you tell me how to fake the requests going on behind the scenes from Flex/AIR?


              And could you please elaborate on how to parse feeds?


              I could not find any relevant example. Could you write a small piece of code?


              Many thanks in advance.

              • 4. Re: parsing HTML
                Claudiu Ursica Level 4

                I believe you need to start with the browser and a packet-monitoring tool; Wireshark, for example, will do. Look at the traffic and see which requests the browser makes, along with the payloads and headers attached to them. Those are the calls you will need to make yourself from Flex/AIR. The server will treat your call like any other and reply with the same response as if you had called from the browser. Once you have the response, you can start parsing it.


                I don't have any specific code, but the calls can be made like any other URLLoader/HTTPService calls; it's just a matter of setting the parameters.
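To make "setting the parameters" a bit more concrete, here is a minimal sketch that rebuilds a login POST of the kind a traffic monitor might reveal. It is written in Python for illustration; in Flex/AIR the same pieces (URL, method, form data, headers) would be set on a URLRequest before loading it. The URL, the field names `username` and `password`, the header values, and the `build_login_request` helper are all assumptions; the real ones have to be copied from the captured traffic. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(url, username, password, cookie=None):
    """Build (but do not send) a form-encoded POST resembling a browser login."""
    # Form fields: hypothetical names, use whatever the real form submits.
    body = urlencode({"username": username, "password": password}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (example)",
    }
    if cookie:
        # Session cookie from an earlier response, if the site needs one.
        headers["Cookie"] = cookie
    return Request(url, data=body, headers=headers, method="POST")


req = build_login_request("http://forum.example.com/login", "demo_user", "s3cret")
```

Passing `req` to `urllib.request.urlopen` would actually send it; the response body is then what gets parsed.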




                • 5. Re: parsing HTML
                  kokorito Level 4

                  Firefox has a nice add-on called Firebug.

                  With it you can see the structure of the HTML as well as monitor the headers and bodies of requests and responses.

                  • 6. Re: parsing HTML
                    Claudiu Ursica Level 4

                    You are right about Firebug; I forgot these are only HTTP calls. I was doing some RTMP debugging with Wireshark and it was the first thing to come to mind.