cfhttp screen scrape. How to get information 'between'

Explorer, May 31, 2011

I am scraping one of our sites for information to display on another site. I can trigger the page to display the correct information, but the returned content needs to be parsed for only some of the text. I see that I can narrow the information down to the text between a span tag and a horizontal rule. Is there a way to grab just the content between '<span class="sml">' and '<hr />' and render it to the screen?

Sample cfhttp.fileContent:

<span class="sml">Administration &gt; Business Administration &gt; Managerial</span><br /> 100116 - <strong>Academic And/Or Research Program Officer Intermediate</strong> - AD220 - Independently manages a large academic or research program. Designs and develops major program components, develops and maintains curricula, develops research, leads professional conferences and provides public relations support. Develops ideas and options for faculty review and decision, and develops and implements instruction and research programs that reflect faculty interests. Evaluates effectiveness of curriculum and effectiveness of program in meeting goals. May teach seminars and workshops and participate with faculty on research. Plans, directs and controls program budget. Supervises program staff. Education and Experience: Academic background and experience in selected subject area. Requires advanced degree, preferably Ph.D. in selected subject area. Requires several years experience in academic work related to particular area of research. The primary duty of employees in this classification is the management of a customarily recognized department or subdivision, including the supervision of three or more full-time equivalent employees every week. Direction is over a permanent status-continuing function, not a collection of employees assigned to complete a project. Management duties include interviewing, selecting and training of employees; setting and adjusting their rates of pay and hours of work; planning and directing their work; appraising their productivity and efficiency for the purpose of recommending promotions or other changes in their status; handling their complaints and grievances and disciplining them when necessary. Management responsibilities include the authority to hire, fire, or promote assigned employees or make recommendations that are given particular weight. Employees have impact on budgeting, controlling costs, planning, scheduling, and procedural change. 
Under FLSA, incumbents in this position meet the criteria for exempt status.<hr />

Thanks!

LEGEND, May 31, 2011

Yep, you can do a regex find to get the start position and length of the match, and then extract that from the string.

Have a read up on reFind():

http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7e9a.html

And there's a link from there through to CF's regex support:

http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0a38f-7fff.html

Give that a blast...
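
A minimal sketch of that idea, not a tested solution: it assumes the scraped page is in cfhttp.fileContent, and the variable names (pageHtml, openTag, closeTag, bodyStart) are made up for illustration. Passing true as the fourth argument makes REFindNoCase() return a struct holding pos and len arrays instead of a plain position. The pattern for the rule also accepts <hr> without the slash, which is an assumption about the markup.

```cfm
<!--- Assumption: a prior <cfhttp> call has populated cfhttp.fileContent --->
<cfset pageHtml = cfhttp.fileContent>

<!--- returnsubexpressions=true gives a struct: pos[1] = match start, len[1] = match length --->
<cfset openTag = REFindNoCase('<span\s+class\s*=\s*"sml"\s*>', pageHtml, 1, true)>

<cfif openTag.pos[1] GT 0>
    <!--- First character after the opening span tag --->
    <cfset bodyStart = openTag.pos[1] + openTag.len[1]>

    <!--- Look for the horizontal rule only after the span --->
    <cfset closeTag = REFindNoCase('<hr\s*/?>', pageHtml, bodyStart, true)>

    <cfif closeTag.pos[1] GT bodyStart>
        <!--- Everything between the end of the span tag and the start of the rule --->
        <cfoutput>#Trim(Mid(pageHtml, bodyStart, closeTag.pos[1] - bodyStart))#</cfoutput>
    </cfif>
</cfif>
```

Nothing is output if either marker is missing, which avoids handing Mid() a zero or negative length.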

--

Adam

Explorer, May 31, 2011

Unfortunately I am stuck with Blue Disaster/Dragon 7. Will that work there too? Most of the answers I found online call for ColdFusion 8 or 9.

Explorer, May 31, 2011

Oops.

Explorer, May 31, 2011

Here is the dump from the actual cfhttp:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"><!-- InstanceBegin  template="/Templates/interior.dwt.asp" codeOutsideHTMLIsLocked="false"  --> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"  /> <!-- InstanceBeginEditable name="doctitle" --> <title>career path navigator</title> <!-- InstanceEndEditable --> <link href="styles.css" rel="stylesheet" type="text/css" /> <link href="print.css" rel="stylesheet" type="text/css" media="print"  /> <link href="tabs.css" rel="stylesheet" type="text/css" /> <script type="text/javascript">      function showResponseWin(URL) {       aWindow=window.open(URL,"thewindow","scrollbars=1,width=525,height=425,resizable=yes");     } </script> <!-- InstanceBeginEditable name="head" --> <!-- InstanceEndEditable --> </head> <body> <div class="container">  <div class="header"><div class="tpinner"><a  href="http://www.hr.umich.edu/compclass/" target="_blank"  class="tp">Compensation & Classification</a> | <a  href="http://www.hr.umich.edu/" target="_blank" class="tp">University  Human Resources</a> | <a href="http://www.umjobs.org"  target="_blank" class="tp">U-M Jobs</a></div><a  href="http://www.umich.edu/~hraa/compclass/index.html" class="noborder"  target="_blank"><img src="images/logo.gif" alt="Compensation and  Classification" width="194" height="52" border="0"  /></a></div> <div class="search"><a href="/default.asp"><img  src="images/navigator.gif" alt="career path navigator" width="358"  height="51" border="0" class="blockIMG" /></a></div> <div id="multi-level">  <img class="pad" src="images/nav/nav_shading.gif" alt="" width="43"  height="39" />  <ul class="menu">      <li class="top p1"><a href="/CFCSOverview.asp" id="what"  class="top_link"><span>What is?</span><!--[if IE  7]><!--></a><!--<![endif]-->           <!--[if lte IE  6]><table><tr><td><![endif]-->           <ul class="sub">        
        <li><a href="/CFCSOverview.asp">Career Family  Classification System (CFCS)</a></li>                <li><a href="/PathLevels.asp">Path  Levels</a></li>        </ul>           <!--[if lte IE  6]></td></tr></table></a><![endif]-->      </li>      <li class="top p2"><a href="/GettingStarted.asp" id="start"  class="top_link"><span>Getting  Started</span></a></li>      <li class="top p3"><a href="/FAQ.asp" id="faq"  class="top_link"><span>FAQ</span></a></li>      <li class="top p4"><a href="/CareerFamilies.asp" id="mapping"  class="top_link"><span>Mapping to the  Market</span></a></li>      <li class="top p5"><a href="/search.asp" id="search"  class="top_link"><span>Search</span></a></li>      <li class="top p6"><a href="/OtherResources.asp" id="other"  class="top_link"><span>Other  Resources</span></a></li> </ul>  </div> <!-- end multi-level -->  <div class="content">  <!-- InstanceBeginEditable name="content" -->    <h2>Market Title(s)</h2>  <div class="messageholder">      <p>The Career Path Navigator references common qualifications  from the labor market for a position. Please refer to the U-M job  posting for a specific position for the required  qualifications.</p>     <span class="bottom"></span> </div>  <form name="frm" method="get" action="print.asp"> <table width="100%"  border="0" cellspacing="0" cellpadding="0">       <tr>           <td valign="top"><input type="checkbox" name="mTitle"  value="100223"> </td>           <td><span class="sml">Administration &gt; General  Office/Administrative Support &gt; Professional</span><br  />                   100223 - <strong>Academic Records Assistant  Intermediate</strong> - AD140 - Under general supervision,  performs a variety of more complex duties to prepare, process, maintain  and provide information regarding student academic records and/or  reports. 
Work requires an overall understanding of procedures and  systems related to the record function in order to identify and resolve  complex inquiries and problems. May train and direct workflow of other  students or clerical employees. Education and Experience: High School  graduate and 2 to 3 years of related experience required. Under FLSA,  incumbents in this position are nonexempt.  <hr /> </td> </tr> </table> <p><input type="submit" name="Submit" value="Printable  Preview"></p> </form>      <!-- InstanceEndEditable -->  </div> <!-- end content -->  <div class="footer"><p>To provide feedback please email,  <a  href="mailto:careerpathfeedback@umich.edu">careerpathfeedback@umich.edu</a>.<br  /> Copyright &copy;    2011 <a href="http://www.regents.umich.edu/" target="_blank">The  Regents</a> of the University of Michigan</p>   <div>Last updated: 5/30/2011 2:00:22 AM<br /> <br /> </div> <!-- end sml -->  </div> <!-- end footer -->  </div> <!-- end container --> <script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ?  "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost +  "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> var pageTracker = _gat._getTracker("UA-99999999"); pageTracker._initData(); pageTracker._trackPageview(); </script></body> <!-- InstanceEnd --></html>

The part I need to grab and render (bolded in the original post) runs from <span class="sml"> to <hr />.

Valorous Hero, May 31, 2011

As the REFind() functionality has been around since at least version 4.5, I suspect your engine has some capability for it.

LEGEND, May 31, 2011

wmkolcz wrote:

Unfortunately I am stuck with Blue Disaster/Dragon 7. Will that work there too? Most of the answers I found online call for ColdFusion 8 or 9.

To be completely blunt... why is it you think that I (or anyone else here) should waste their time testing this out for you?

Why don't you just try it for yourself and find out?

--

Adam

Guest, Jun 01, 2011

@wmkolcz

Something simple like the following "could" work:

<cfset string = cfhttp.filecontent />
<cfset StartText = '<span class="sml">' />
<cfset Start = FindNoCase(StartText, string, 1) />
<cfset EndText='<hr />' />
<cfset Length=Len(StartText) />
<cfset End = FindNoCase(EndText, string, Start) />
<cfset parse = Mid(string, Start+Length, End-Start-Length) />

<cfset parse = trim(parse) />
<cfoutput>#parse#</cfoutput>
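
One caveat with the snippet above: if either marker is missing, Mid() ends up with a bad start or length. A guarded variant could look roughly like the sketch below. The function name textBetween and its argument names are hypothetical, not part of CFML, and it still relies on exact marker strings just as the original does.

```cfm
<!--- Hypothetical helper: returns the text between two markers, or "" if either is missing --->
<cffunction name="textBetween" returntype="string" output="false">
    <cfargument name="haystack" type="string" required="true">
    <cfargument name="startText" type="string" required="true">
    <cfargument name="endText" type="string" required="true">

    <!--- Local variables are declared first so the function also works on older engines --->
    <cfset var startPos = FindNoCase(arguments.startText, arguments.haystack)>
    <cfset var bodyStart = 0>
    <cfset var endPos = 0>

    <!--- Start marker not found --->
    <cfif startPos EQ 0>
        <cfreturn "">
    </cfif>

    <!--- Skip past the start marker, then look for the end marker from there --->
    <cfset bodyStart = startPos + Len(arguments.startText)>
    <cfset endPos = FindNoCase(arguments.endText, arguments.haystack, bodyStart)>

    <!--- End marker not found, or nothing between the two markers --->
    <cfif endPos LTE bodyStart>
        <cfreturn "">
    </cfif>

    <cfreturn Trim(Mid(arguments.haystack, bodyStart, endPos - bodyStart))>
</cffunction>

<cfoutput>#textBetween(cfhttp.fileContent, '<span class="sml">', '<hr />')#</cfoutput>
```

Returning an empty string rather than throwing is a design choice for a display page; code that needs to tell "not found" apart from "empty" would want a different signal.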

Good Luck!

<cfwild />

Participant, Oct 13, 2012

Thanks for this example, just what I needed!

Community Expert, Oct 20, 2012

cfwild wrote:

Something simple like the following "could" work:

<cfset string = cfhttp.filecontent />
<cfset StartText = '<span class="sml">' />
<cfset Start = FindNoCase(StartText, string, 1) />
<cfset EndText='<hr />' />
<cfset Length=Len(StartText) />
<cfset End = FindNoCase(EndText, string, Start) />
<cfset parse = Mid(string, Start+Length, End-Start-Length) />

<cfset parse = trim(parse) />
<cfoutput>#parse#</cfoutput>

Brave attempt. However, this would fail if the HTTP response contained HTML tags with arbitrary spaces, like <span class = "sml" > and <hr      />.

In any case, you have provided a basis for a possible solution by means of regular expressions. For example,

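<!--- httpContent is the scraped page, e.g. cfhttp.fileContent --->
<cfset startCount = REFindNoCase('<span\s+class\s*=\s*"sml"\s*>', httpContent)>
<!--- Search for the rule from the span onward so an earlier <hr /> cannot match --->
<cfset endCount = REFindNoCase('<hr\s*/>', httpContent, Max(startCount, 1))>
<cfif startCount GT 0 AND endCount GT startCount>
<cfoutput>#Mid(httpContent, startCount, endCount - startCount)#</cfoutput>
</cfif>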
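
If the scraped page can list more than one title (an assumption; the dump in this thread shows only one), the same two patterns can be applied in a loop. This is an untested sketch: blocks, searchFrom, openTag and closeTag are illustrative names, and the pattern for the rule is loosened to accept <hr> without the slash.

```cfm
<!--- Assumption: a prior <cfhttp> call has populated cfhttp.fileContent --->
<cfset httpContent = cfhttp.fileContent>
<cfset blocks = ArrayNew(1)>
<cfset searchFrom = 1>

<cfloop condition="searchFrom LTE Len(httpContent)">
    <!--- Next opening span at or after searchFrom; true returns pos/len arrays --->
    <cfset openTag = REFindNoCase('<span\s+class\s*=\s*"sml"\s*>', httpContent, searchFrom, true)>
    <cfif openTag.pos[1] EQ 0>
        <cfbreak>
    </cfif>

    <!--- First rule after that span --->
    <cfset closeTag = REFindNoCase('<hr\s*/?>', httpContent, openTag.pos[1] + openTag.len[1], true)>
    <cfif closeTag.pos[1] EQ 0>
        <cfbreak>
    </cfif>

    <!--- Keep the span tag itself so its closing </span> stays balanced --->
    <cfset ArrayAppend(blocks, Mid(httpContent, openTag.pos[1], closeTag.pos[1] - openTag.pos[1]))>

    <!--- Continue searching after the rule just consumed --->
    <cfset searchFrom = closeTag.pos[1] + closeTag.len[1]>
</cfloop>

<cfoutput>
<cfloop index="i" from="1" to="#ArrayLen(blocks)#">
    <p>#blocks[i]#</p>
</cfloop>
</cfoutput>
```

Each pass moves searchFrom beyond the matched rule, so the loop ends once no further span or rule is found.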
