9 Replies Latest reply on Oct 20, 2012 8:34 AM by BKBK

    cfhttp screen scrape. How to get information 'between'

    wmkolcz

      I am scraping one of our sites for information to display on another site. I can trigger the page to display the correct information, but the returned content needs to be parsed so that only some of the text is kept. I see that I can narrow the information down to the text between a span tag and a horizontal rule. Is there a way to grab just the information between '<span class="sml">' and '<hr />' and render it to the screen?

       

      Sample cfhttp.fileContent:

      <span class="sml">Administration &gt; Business Administration &gt; Managerial</span><br /> 100116 - <strong>Academic And/Or Research Program Officer Intermediate</strong> - AD220 - Independently manages a large academic or research program. Designs and develops major program components, develops and maintains curricula, develops research, leads professional conferences and provides public relations support. Develops ideas and options for faculty review and decision, and develops and implements instruction and research programs that reflect faculty interests. Evaluates effectiveness of curriculum and effectiveness of program in meeting goals. May teach seminars and workshops and participate with faculty on research. Plans, directs and controls program budget. Supervises program staff. Education and Experience: Academic background and experience in selected subject area. Requires advanced degree, preferably Ph.D. in selected subject area. Requires several years experience in academic work related to particular area of research. The primary duty of employees in this classification is the management of a customarily recognized department or subdivision, including the supervision of three or more full-time equivalent employees every week. Direction is over a permanent status-continuing function, not a collection of employees assigned to complete a project. Management duties include interviewing, selecting and training of employees; setting and adjusting their rates of pay and hours of work; planning and directing their work; appraising their productivity and efficiency for the purpose of recommending promotions or other changes in their status; handling their complaints and grievances and disciplining them when necessary. Management responsibilities include the authority to hire, fire, or promote assigned employees or make recommendations that are given particular weight. 
Employees have impact on budgeting, controlling costs, planning, scheduling, and procedural change. Under FLSA, incumbents in this position meet the criteria for exempt status.<hr />

       

       

      Thanks!

        • 1. Re: cfhttp screen scrape. How to get information 'between'
          Adam Cameron. Level 5

          Yep, you can do a regex find to get the start position and length of the match, and then extract that from the string.

           

          Have a read up on reFind():

          http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7e9a.html

           

          And there's a link from there through to CF's regex support:

          http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0a38f-7fff.html

           

          Give that a blast...
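
As a rough illustration of that approach (a sketch only, not tested on BlueDragon 7): passing true as the fourth argument to reFindNoCase() makes it return a structure with pos and len arrays rather than a single position, and those values can be fed to mid(). The sample string and the variable names html, match and between are made up for this sketch; in practice html would be cfhttp.fileContent. It also assumes the engine supports the non-greedy quantifier and lets the period match line breaks, which is my understanding of ColdFusion MX and later.

```cfml
<!--- Hypothetical sample; in real use this would be cfhttp.fileContent --->
<cfset html = '<td><span class="sml">Administration &gt; Managerial</span><br /> 100116 - <strong>Program Officer</strong> - AD220 - Manages a program.<hr /></td>' />

<!--- Fourth argument true = return pos/len arrays.
      The parentheses capture everything between the two markers;
      .*? is non-greedy, so the match stops at the first <hr /> --->
<cfset match = reFindNoCase('<span class="sml">(.*?)<hr />', html, 1, true) />

<cfif match.pos[1] GT 0>
    <!--- Element 1 describes the whole match, element 2 the captured group --->
    <cfset between = mid(html, match.pos[2], match.len[2]) />
    <cfoutput>#trim(between)#</cfoutput>
<cfelse>
    <cfoutput>No match found.</cfoutput>
</cfif>
```

When nothing matches, pos[1] comes back as 0, which is why the check sits in front of the mid() call.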

           

           

          --

          Adam

          • 2. Re: cfhttp screen scrape. How to get information 'between'
            wmkolcz Level 1

            Unfortunately I am stuck with Blue Disaster/Dragon 7. Will that work with that too? Most of the answers I found online call for one of the two (8 or 9).

            • 4. Re: cfhttp screen scrape. How to get information 'between'
              wmkolcz Level 1

              Here is the dump from the actual cfhttp:

               

              <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"><!-- InstanceBegin  template="/Templates/interior.dwt.asp" codeOutsideHTMLIsLocked="false"  --> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"  /> <!-- InstanceBeginEditable name="doctitle" --> <title>career path navigator</title> <!-- InstanceEndEditable --> <link href="styles.css" rel="stylesheet" type="text/css" /> <link href="print.css" rel="stylesheet" type="text/css" media="print"  /> <link href="tabs.css" rel="stylesheet" type="text/css" /> <script type="text/javascript">      function showResponseWin(URL) {       aWindow=window.open(URL,"thewindow","scrollbars=1,width=525,height=425,resizable=yes");      } </script> <!-- InstanceBeginEditable name="head" --> <!-- InstanceEndEditable --> </head> <body> <div class="container">  <div class="header"><div class="tpinner"><a  href="http://www.hr.umich.edu/compclass/" target="_blank"  class="tp">Compensation &amp; Classification</a> | <a  href="http://www.hr.umich.edu/" target="_blank" class="tp">University  Human Resources</a> | <a href="http://www.umjobs.org"  target="_blank" class="tp">U-M Jobs</a></div><a  href="http://www.umich.edu/~hraa/compclass/index.html" class="noborder"  target="_blank"><img src="images/logo.gif" alt="Compensation and  Classification" width="194" height="52" border="0"  /></a></div> <div class="search"><a href="/default.asp"><img  src="images/navigator.gif" alt="career path navigator" width="358"  height="51" border="0" class="blockIMG" /></a></div> <div id="multi-level">  <img class="pad" src="images/nav/nav_shading.gif" alt="" width="43"  height="39" />  <ul class="menu">      <li class="top p1"><a href="/CFCSOverview.asp" id="what"  class="top_link"><span>What is?</span><!--[if IE  7]><!--></a><!--<![endif]-->           <!--[if lte IE  6]><table><tr><td><![endif]-->           <ul 
class="sub">                <li><a href="/CFCSOverview.asp">Career Family  Classification System (CFCS)</a></li>                <li><a href="/PathLevels.asp">Path  Levels</a></li>        </ul>           <!--[if lte IE  6]></td></tr></table></a><![endif]-->      </li>      <li class="top p2"><a href="/GettingStarted.asp" id="start"  class="top_link"><span>Getting  Started</span></a></li>      <li class="top p3"><a href="/FAQ.asp" id="faq"  class="top_link"><span>FAQ</span></a></li>      <li class="top p4"><a href="/CareerFamilies.asp" id="mapping"  class="top_link"><span>Mapping to the  Market</span></a></li>      <li class="top p5"><a href="/search.asp" id="search"  class="top_link"><span>Search</span></a></li>      <li class="top p6"><a href="/OtherResources.asp" id="other"  class="top_link"><span>Other  Resources</span></a></li> </ul>  </div> <!-- end multi-level -->  <div class="content">  <!-- InstanceBeginEditable name="content" -->    <h2>Market Title(s)</h2>  <div class="messageholder">      <p>The Career Path Navigator references common qualifications  from the labor market for a position. Please refer to the U-M job  posting for a specific position for the required  qualifications.</p>     <span class="bottom"></span> </div>  <form name="frm" method="get" action="print.asp"> <table width="100%"  border="0" cellspacing="0" cellpadding="0">       <tr>           <td valign="top"><input type="checkbox" name="mTitle"  value="100223"> </td>           <td><span class="sml">Administration &gt; General  Office/Administrative Support &gt; Professional</span><br  />                   100223 - <strong>Academic Records Assistant  Intermediate</strong> - AD140 - Under general supervision,  performs a variety of more complex duties to prepare, process, maintain  and provide information regarding student academic records and/or  reports. 
Work requires an overall understanding of procedures and  systems related to the record function in order to identify and resolve  complex inquiries and problems. May train and direct workflow of other  students or clerical employees. Education and Experience: High School  graduate and 2 to 3 years of related experience required. Under FLSA,  incumbents in this position are nonexempt.  <hr /> </td> </tr> </table> <p><input type="submit" name="Submit" value="Printable  Preview"></p> </form>      <!-- InstanceEndEditable -->  </div> <!-- end content -->  <div class="footer"><p>To provide feedback please email,  <a  href="mailto:careerpathfeedback@umich.edu">careerpathfeedback@umich.edu</a>.<br  /> Copyright &copy;    2011 <a href="http://www.regents.umich.edu/" target="_blank">The  Regents</a> of the University of Michigan</p>   <div>Last updated: 5/30/2011 2:00:22 AM<br /> <br /> </div> <!-- end sml -->  </div> <!-- end footer -->  </div> <!-- end container --> <script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ?  "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost +  "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> var pageTracker = _gat._getTracker("UA-99999999"); pageTracker._initData(); pageTracker._trackPageview(); </script></body> <!-- InstanceEnd --></html>

               

               

              The part I need to grab and render (bolded in my original post) is the block from '<span class="sml">' to '<hr />'.
              • 5. Re: cfhttp screen scrape. How to get information 'between'
                ilssac Level 5

                As the reFind() functionality has been around since at least version 4.5, I suspect your engine should have some capability for it.

                • 6. Re: cfhttp screen scrape. How to get information 'between'
                  Adam Cameron. Level 5

                  wmkolcz wrote:

                   

                  Unfortunately I am stuck with Blue Disaster/Dragon 7. Will that work with that too? Most of the answers I found online call for one of the two (8 or 9).

                   

                   

                  To be completely blunt... why is it you think that I (or anyone else here) should waste their time testing this out for you?

                   

                  Why don't you just try it for yourself and find out?

                   

                  --

                  Adam

                  • 7. Re: cfhttp screen scrape. How to get information 'between'
                    cfwild Level 1

                    @wmkolcz

                     

                    Something simple like the following "could" work:

                     

                    <cfset string = cfhttp.filecontent />
                    <cfset StartText = '<span class="sml">' />
                    <cfset Start = FindNoCase(StartText, string, 1) />
                    <cfset EndText='<hr />' />
                    <cfset Length=Len(StartText) />
                    <cfset End = FindNoCase(EndText, string, Start) />
                    <cfset parse = Mid(string, Start+Length, End-Start-Length) />

                    <cfset parse = trim(parse) />
                    <cfoutput>#parse#</cfoutput>

                     

                    Good Luck!

                     

                    <cfwild />

                    • 8. Re: cfhttp screen scrape. How to get information 'between'
                      MarcovandenOever

                      Thanks for this example, just what I needed!

                      • 9. Re: cfhttp screen scrape. How to get information 'between'
                        BKBK Adobe Community Professional & MVP

                        cfwild wrote:

                         

                        Something simple like the following "could" work:

                         

                        <cfset string = cfhttp.filecontent />
                        <cfset StartText = '<span class="sml">' />
                        <cfset Start = FindNoCase(StartText, string, 1) />
                        <cfset EndText='<hr />' />
                        <cfset Length=Len(StartText) />
                        <cfset End = FindNoCase(EndText, string, Start) />
                        <cfset parse = Mid(string, Start+Length, End-Start-Length) />

                        <cfset parse = trim(parse) />
                        <cfoutput>#parse#</cfoutput>

                        Brave attempt. However, this would fail if the HTTP client returned HTML tags that contained arbitrary spaces, like <span class = "sml" > and <hr      />.
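
A small check makes the difference visible. This is only a sketch: the string spaced and the variable names are invented for the comparison, and it assumes the engine understands \s in regular expressions.

```cfml
<!--- Hypothetical markup with extra spaces inside the tags --->
<cfset spaced = '<span class = "sml" >Some text</span><hr      />' />

<!--- Literal search: the exact text is not present, so this returns 0 --->
<cfset literalPos = findNoCase('<span class="sml">', spaced) />

<!--- Regex search: \s+ and \s* absorb the extra spaces --->
<cfset regexPos = reFindNoCase('<span\s+class\s*=\s*"sml"\s*>', spaced) />

<cfoutput>literal: #literalPos#, regex: #regexPos#</cfoutput>
```

The literal search reports 0 (not found), while the regex finds the tag at position 1.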

                         

                        In any case, you have provided a basis for a possible solution by means of regular expressions. For example,

                         

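                        <cfset httpContent = cfhttp.fileContent>
                        <cfset startCount = REFindNoCase('<span\s+class\s*=\s*"sml"\s*>', httpContent)>
                        <cfif startCount GT 0>
                            <!--- search for the rule from the span onwards, so an earlier <hr /> cannot match --->
                            <cfset endCount = REFindNoCase('<hr\s*/>', httpContent, startCount)>
                            <cfif endCount GT startCount>
                                <cfoutput>#mid(httpContent, startCount, endCount-startCount)#</cfoutput>
                            </cfif>
                        </cfif>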
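
Note that mid() starting at startCount returns the opening span tag along with the text. If only the text after the tag is wanted, one possible variation (a sketch, not verified on BlueDragon) is to ask reFindNoCase() for the length of the matched tag and start just past it. The names startMatch, contentStart, endPos and between are made up here, and the sample string stands in for cfhttp.fileContent.

```cfml
<!--- Hypothetical sample; replace with cfhttp.fileContent in real use --->
<cfset httpContent = '<td><span class = "sml" >Administration &gt; Professional</span><br /> 100223 - <strong>Records Assistant</strong> - AD140 - Maintains records. <hr   /></td>' />

<!--- Ask for pos/len so the length of the opening tag is known --->
<cfset startMatch = reFindNoCase('<span\s+class\s*=\s*"sml"\s*>', httpContent, 1, true) />

<cfif startMatch.pos[1] GT 0>
    <!--- First character after the opening tag --->
    <cfset contentStart = startMatch.pos[1] + startMatch.len[1] />

    <!--- Look for the rule only from that point onwards --->
    <cfset endPos = reFindNoCase('<hr\s*/>', httpContent, contentStart) />

    <cfif endPos GT contentStart>
        <cfset between = trim(mid(httpContent, contentStart, endPos - contentStart)) />
        <cfoutput>#between#</cfoutput>
    <cfelse>
        <cfoutput>Closing rule not found.</cfoutput>
    </cfif>
<cfelse>
    <cfoutput>Opening span not found.</cfoutput>
</cfif>
```

The extracted fragment still contains the closing span tag, since that sits between the two markers; it could be stripped with replaceNoCase() if the leftover tag causes trouble when rendered.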