Extracting Data after httpGetResult()

Posted by Dan M 
Dan M
Extracting Data after httpGetResult()
July 12, 2009 03:20AM
I am using HTTPRequest and then HTTPGetResult to capture the HTML of a page.

I now want to extract data from this page.

I only need a few pieces of data, which are in a table.

Is there an example of how to do this? Or can someone tell me which functions I should be looking at to extract the needed data?

I am trying to extract the part number, manufacturer, quantity, and price.

The code I need to work with starts in a <tr> tag, which is followed by
<td class="middesc">. That is how I know where one record starts and where the next record starts.

Is there a way to locate a specific tag and then extract XX characters to the right, or everything until the closing tag, then find the next <TAG> and extract XX characters to the right, and so on?

Or is there a way to find all the information between a given opening and closing tag?

Here is a piece of the HTML code that I am trying to extract data from (there are 7 records). There is a bunch of code above this, but nothing I need, and I don't think it is relevant.

I tried to put the code here, but the forum will not accept my post if I do. How can I display the HTML code?

Can I use Snagit, or something else?
Al
Re: Extracting Data after httpGetResult()
July 12, 2009 06:53AM
Hello Dan

The code you attached caused your message to be picked up by the spam filter.
Can you use Snagit or another screenshot tool and then use the "TinyPic" option shown at the top of the forum page to display the image?

Regards
Al
DanM
Re: Extracting Data after httpGetResult()
July 12, 2009 02:04PM

Here is the link to view the html from which I want to extract the data

[i28.tinypic.com]
Dan M
Extracting Data after httpGetResult()
July 12, 2009 03:15AM
Here is a piece of the HTML code that I am trying to extract data from (there are 7 records). There is a bunch of code above this, but nothing I need, and I don't think it is relevant.


<tr>
<!--<td class="listcell"><input type="checkbox" name="comp1"></td>-->
<td class="listcell" style="color: #cccccc; width: 104px; text-align: center;">



<div style="height: 80px; width: 104px; overflow: hidden;">



<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KE3-EUPC">
<img src="/images/catalog/picture-na_s.jpg" alt="FZ1600R12KE3 - more info" border="0" style="font-size: 11px;"width="100px"><br> </a>
</div>
</td>
<td class="listcell desc">
<div class="proddesc">
<table style="height: 78px;">
<tr><td class="topdesc">
<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KE3-EUPC">
EUPEC<br>TRANSISTOR </a>
</td></tr>
<tr><td class="middesc">IGBT 1600A 1200V SINGLE</td></tr>
<tr><td class="botdesc"><span class="bold">ITEM # </span><A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KE3-EUPC">FZ1600R12KE3</A></td></tr> </table>
</div> <!-- proddesc -->
</td>
<td class="listcell" align="center">
<div class="stockstat">NO STOCK</div>
<div class="shipmsg">Est. Lead Time<br>28 days</div>
</td>
<td class="listcell" style="text-align: center; vertical-align: top;">
<div style="padding: 21px 8px 6px 8px; font-weight: bold;">
$1,859.60 </div>
<div class="volmsg">Volume<br>Discounts<br>Available<br></div> </td>
<td class="listcell" align="center">
<!--
<a href="/scripts/cgiip.exe/wa/wcat/shopcart.r?listtype=Catalog&amp;pnum=FZ1600R12KE3-EUPC&amp;mfgr=EUPEC" style="color: #ff0033; font-weight: 700;">ADD to CART</a>
-->
<span class="bold">QTY. </span>
<input type="text" maxlength="8" value="1" class="descr" style="width: 35px; text-align: right;" name="part_1">
<br><input type="image" value="Submit" src="/images/buttons/r-addtocart2.gif"
onClick="return addToCart(part_1.value,this.form,'FZ1600R12KE3-EUPC');"
name="Add to Cart" id="add_1" style="position: relative; top: 4px;">
</td>
</tr>


<tr>
<!--<td class="listcell"><input type="checkbox" name="comp2"></td>-->
<td class="listcell" style="color: #cccccc; width: 104px; text-align: center;">
<div style="height: 80px; width: 104px; overflow: hidden;">


<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KF4-EUPC">
<img src="/images/catalog/picture-na_s.jpg" alt="FZ1600R12KF4 - more info" border="0" style="font-size: 11px;"width="100px"><br> </a>
</div>
</td>
<td class="listcell desc">
<div class="proddesc">
<table style="height: 78px;">
<tr><td class="topdesc">
<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KF4-EUPC">
EUPEC<br>TRANSISTOR </a>
</td></tr>
<tr><td class="middesc">IGBT 1600A 1200V SINGLE</td></tr>
<tr><td class="botdesc"><span class="bold">ITEM # </span><A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KF4-EUPC">FZ1600R12KF4</A></td></tr> </table>
</div> <!-- proddesc -->
</td>
<td class="listcell" align="center">
<div class="stockstat">NO STOCK</div>
<div class="shipmsg">Est. Lead Time<br>27 days</div>
</td>
<td class="listcell" style="text-align: center; vertical-align: top;">
<div style="padding: 21px 8px 6px 8px; font-weight: bold;">
$2,208.28 </div>
<div class="volmsg">Volume<br>Discounts<br>Available<br></div> </td>
<td class="listcell" align="center">
<!--
<a href="/scripts/cgiip.exe/wa/wcat/shopcart.r?listtype=Catalog&amp;pnum=FZ1600R12KF4-EUPC&amp;mfgr=EUPEC" style="color: #ff0033; font-weight: 700;">ADD to CART</a>
-->
<span class="bold">QTY. </span>
<input type="text" maxlength="8" value="1" class="descr" style="width: 35px; text-align: right;" name="part_2">
<br><input type="image" value="Submit" src="/images/buttons/r-addtocart2.gif"
onClick="return addToCart(part_2.value,this.form,'FZ1600R12KF4-EUPC');"
name="Add to Cart" id="add_2" style="position: relative; top: 4px;">
</td>
</tr>


<tr>
<!--<td class="listcell"><input type="checkbox" name="comp3"></td>-->
<td class="listcell" style="color: #cccccc; width: 104px; text-align: center;">
<div style="height: 80px; width: 104px; overflow: hidden;">


<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KL4C-EUPC">
<img src="/images/catalog/picture-na_s.jpg" alt="FZ1600R12KL4C - more info" border="0" style="font-size: 11px;"width="100px"><br> </a>
</div>
</td>
<td class="listcell desc">
<div class="proddesc">
<table style="height: 78px;">
<tr><td class="topdesc">
<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KL4C-EUPC">
EUPEC<br>TRANSISTOR </a>
</td></tr>
<tr><td class="middesc">IGBT 1600A 1200V SINGLE</td></tr>
<tr><td class="botdesc"><span class="bold">ITEM # </span><A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R12KL4C-EUPC">FZ1600R12KL4C</A></td></tr> </table>
</div> <!-- proddesc -->
</td>
<td class="listcell" align="center">
<div class="stockstat">NO STOCK</div>
<div class="shipmsg">Est. Lead Time<br>28 days</div>
</td>
<td class="listcell" style="text-align: center; vertical-align: top;">
<div style="padding: 21px 8px 6px 8px; font-weight: bold;">
$2,208.28 </div>
<div class="volmsg">Volume<br>Discounts<br>Available<br></div> </td>
<td class="listcell" align="center">
<!--
<a href="/scripts/cgiip.exe/wa/wcat/shopcart.r?listtype=Catalog&amp;pnum=FZ1600R12KL4C-EUPC&amp;mfgr=EUPEC" style="color: #ff0033; font-weight: 700;">ADD to CART</a>
-->
<span class="bold">QTY. </span>
<input type="text" maxlength="8" value="1" class="descr" style="width: 35px; text-align: right;" name="part_3">
<br><input type="image" value="Submit" src="/images/buttons/r-addtocart2.gif"
onClick="return addToCart(part_3.value,this.form,'FZ1600R12KL4C-EUPC');"
name="Add to Cart" id="add_3" style="position: relative; top: 4px;">
</td>
</tr>


<tr>
<!--<td class="listcell"><input type="checkbox" name="comp4"></td>-->
<td class="listcell" style="color: #cccccc; width: 104px; text-align: center;">
<div style="height: 80px; width: 104px; overflow: hidden;">


<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-B2-EUPC">
<img src="/images/catalog/picture-na_s.jpg" alt="FZ1600R17KE3-B2 - more info" border="0" style="font-size: 11px;"width="100px"><br> </a>
</div>
</td>
<td class="listcell desc">
<div class="proddesc">
<table style="height: 78px;">
<tr><td class="topdesc">
<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-B2-EUPC">
EUPEC<br>TRANSISTOR </a>
</td></tr>
<tr><td class="middesc">IGBT SINGLE</td></tr>
<tr><td class="botdesc"><span class="bold">ITEM # </span><A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-B2-EUPC">FZ1600R17KE3-B2</A></td></tr> </table>
</div> <!-- proddesc -->
</td>
<td class="listcell" align="center">
<div class="stockstat">NO STOCK</div>
<div class="shipmsg">Est. Lead Time<br>28 days</div>
</td>
<td class="listcell" style="text-align: center; vertical-align: top;">
<div style="padding: 21px 8px 6px 8px; font-weight: bold;">
$2,862.80 </div>
<div class="volmsg">Volume<br>Discounts<br>Available<br></div> </td>
<td class="listcell" align="center">
<!--
<a href="/scripts/cgiip.exe/wa/wcat/shopcart.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-B2-EUPC&amp;mfgr=EUPEC" style="color: #ff0033; font-weight: 700;">ADD to CART</a>
-->
<span class="bold">QTY. </span>
<input type="text" maxlength="8" value="1" class="descr" style="width: 35px; text-align: right;" name="part_4">
<br><input type="image" value="Submit" src="/images/buttons/r-addtocart2.gif"
onClick="return addToCart(part_4.value,this.form,'FZ1600R17KE3-B2-EUPC');"
name="Add to Cart" id="add_4" style="position: relative; top: 4px;">
</td>
</tr>


<tr><!--<td class="listcell"><input type="checkbox" name="comp5"></td>-->
<td class="listcell" style="color: #cccccc; width: 104px; text-align: center;">
<div style="height: 80px; width: 104px; overflow: hidden;">


<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-EUPC">
<img src="/images/catalog/picture-na_s.jpg" alt="FZ1600R17KE3 - more info" border="0" style="font-size: 11px;"width="100px"><br> </a>
</div>
</td>
<td class="listcell desc">
<div class="proddesc">
<table style="height: 78px;">
<tr><td class="topdesc">
<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-EUPC">
EUPEC<br>TRANSISTOR </a>
</td></tr>
<tr><td class="middesc">IGBT SINGLE</td></tr>
<tr><td class="botdesc"><span class="bold">ITEM # </span><A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-EUPC">FZ1600R17KE3</A></td></tr> </table>
</div> <!-- proddesc -->
</td>
<td class="listcell" align="center">
<div class="stockstat">NO STOCK</div>
<div class="shipmsg">Est. Lead Time<br>28 days</div>
</td>
<td class="listcell" style="text-align: center; vertical-align: top;">
<div style="padding: 21px 8px 6px 8px; font-weight: bold;">
$2,208.28 </div>
<div class="volmsg">Volume<br>Discounts<br>Available<br></div> </td>
<td class="listcell" align="center">
<!--
<a href="/scripts/cgiip.exe/wa/wcat/shopcart.r?listtype=Catalog&amp;pnum=FZ1600R17KE3-EUPC&amp;mfgr=EUPEC" style="color: #ff0033; font-weight: 700;">ADD to CART</a>
-->
<span class="bold">QTY. </span>
<input type="text" maxlength="8" value="1" class="descr" style="width: 35px; text-align: right;" name="part_5">
<br><input type="image" value="Submit" src="/images/buttons/r-addtocart2.gif"
onClick="return addToCart(part_5.value,this.form,'FZ1600R17KE3-EUPC');"
name="Add to Cart" id="add_5" style="position: relative; top: 4px;">
</td>
</tr>


<tr>
<!--<td class="listcell"><input type="checkbox" name="comp6"></td>-->
<td class="listcell" style="color: #cccccc; width: 104px; text-align: center;">
<div style="height: 80px; width: 104px; overflow: hidden;">


<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KF6-B2-EUPC">
<img src="/images/catalog/picture-na_s.jpg" alt="FZ1600R17KF6-B2 - more info" border="0" style="font-size: 11px;"width="100px"><br> </a>
</div>
</td>
<td class="listcell desc">
<div class="proddesc">
<table style="height: 78px;">
<tr><td class="topdesc">
<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KF6-B2-EUPC">
EUPEC<br>TRANSISTOR </a>
</td></tr>
<tr><td class="middesc">IGBT 1600A 1700V SINGLE</td></tr>
<tr><td class="botdesc"><span class="bold">ITEM # </span><A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KF6-B2-EUPC">FZ1600R17KF6-B2</A></td></tr> </table>
</div> <!-- proddesc -->
</td>
<td class="listcell" align="center">
<div class="stockstat">NO STOCK</div>
<div class="shipmsg">Est. Lead Time<br>28 days</div>
</td>
<td class="listcell" align="center" style="color: #cccccc;">&#149;</td>
<td class="listcell" align="center">
<!--
<a href="/scripts/cgiip.exe/wa/wcat/shopcart.r?listtype=Catalog&amp;pnum=FZ1600R17KF6-B2-EUPC&amp;mfgr=EUPEC" style="color: #ff0033; font-weight: 700;">ADD to CART</a>
-->
<span class="bold">QTY. </span>
<input type="text" maxlength="8" value="1" class="descr" style="width: 35px; text-align: right;" name="part_6">
<br><input type="image" value="Submit" src="/images/buttons/r-addtocart2.gif"
onClick="return addToCart(part_6.value,this.form,'FZ1600R17KF6-B2-EUPC');"
name="Add to Cart" id="add_6" style="position: relative; top: 4px;">
</td>
</tr>
<tr>
<!--<td class="listcell"><input type="checkbox" name="comp7"></td>-->
<td class="listcell" style="color: #cccccc; width: 104px; text-align: center;">
<div style="height: 80px; width: 104px; overflow: hidden;">



<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KF6C-B2-EUPC">
<img src="/images/catalog/picture-na_s.jpg" alt="FZ1600R17KF6C-B2 - more info" border="0" style="font-size: 11px;"width="100px"><br> </a>
</div>
</td>
<td class="listcell desc">
<div class="proddesc">
<table style="height: 78px;">
<tr><td class="topdesc">
<A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KF6C-B2-EUPC">
EUPEC<br>TRANSISTOR </a>
</td></tr>
<tr><td class="middesc">IGBT</td></tr>
<tr><td class="botdesc"><span class="bold">ITEM # </span><A HREF="/scripts/cgiip.exe/wa/wcat/itemdtl.r?listtype=Catalog&amp;pnum=FZ1600R17KF6C-B2-EUPC">FZ1600R17KF6C-B2</A></td></tr> </table>
</div> <!-- proddesc -->
</td>
<td class="listcell" align="center">
<div class="stockstat">NO STOCK</div>
<div class="shipmsg">Est. Lead Time<br>28 days</div>
</td>
<td class="listcell" style="text-align: center; vertical-align: top;">
<div style="padding: 21px 8px 6px 8px; font-weight: bold;">
$2,933.14 </div>
<div class="volmsg">Volume<br>Discounts<br>Available<br></div> </td>
<td class="listcell" align="center">
<!--
<a href="/scripts/cgiip.exe/wa/wcat/shopcart.r?listtype=Catalog&amp;pnum=FZ1600R17KF6C-B2-EUPC&amp;mfgr=EUPEC" style="color: #ff0033; font-weight: 700;">ADD to CART</a>
-->
<span class="bold">QTY. </span>
<input type="text" maxlength="8" value="1" class="descr" style="width: 35px; text-align: right;" name="part_7">
<br><input type="image" value="Submit" src="/images/buttons/r-addtocart2.gif"
onClick="return addToCart(part_7.value,this.form,'FZ1600R17KF6C-B2-EUPC');"
name="Add to Cart" id="add_7" style="position: relative; top: 4px;">
</td>
</tr>
</table>



<table width="98%" border="0"><tr><td width="20%" align="center" class="small"><i>1 - 7 of 7 Matches</i> </td><td width="22%" class="small">&nbsp;</td><td width="58%" align="right" class="small">
&nbsp; </td> </tr> </table>
</form>
</div>
</td></tr></table>
<div class="gvvtext" style="text-align: right;"><br>
</div>
</div>
</div> <!-- mainbody -->
</div> <!-- core -->
<div class="tfoot">
<!-- tfoot.inc -->
<div class="spacer"></div>
<div class="orangebot">
<div class="footl">
<span id="siteseal"><script type="text/javascript" src="[seal.godaddy.com];
</div>
<div class="footc">
<div class="small bold">26010 Pinehurst Drive, Madison Heights, MI &nbsp;48071</div>
<div class="small pad">
<a class="tlnk" href="/fabout.htm">About Us</a> | &copy; Copyright 2009 Galco Industrial Electronics, All Rights Reserved | <a class="tlnk" href="/terms.htm">Terms of Use</a>
</div>
</div>
<div class="footr">
<!-- START SCANALERT CODE -->
<a target="_blank" href="[www.mcafeesecure.com] width="94" height="54" border="0" src="//images.scanalert.com/meter/www.galco.com/23.gif" alt="McAfee Secure sites help keep you safe from identity theft, credit card fraud, spyware, spam, viruses and online scams" oncontextmenu="alert('Copying Prohibited by Law - McAfee Secure is a Trademark of McAfee, Inc.'); return false;"></a>
<!-- END SCANALERT CODE -->
</div>
</div>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "[ssl.]" : "[www.]");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-709262-1");
pageTracker._initData();
pageTracker._trackPageview();
</script>
<!-- End of tfoot.inc -->
</div> <!-- tfoot -->
</div> <!-- content -->
</body>
</html>




Stefan Bentvelsen
Re: Extracting Data after httpGetResult()
July 12, 2009 10:33AM
Hi Dan,

If you have the page in a string, I think you can use Position() for that.
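
A rough sketch of the position-based idea, written in Python only because it is easy to run and check; `str.find` plays the role that Position() is suggested for here. The helper name `text_between` and the shortened sample HTML are assumptions made for illustration, not code from anyone in this thread.

```python
def text_between(page, start_marker, end_marker, start_at=0):
    """Return (text, next_index) for the first span found after start_at,
    or (None, -1) when either marker is missing."""
    begin = page.find(start_marker, start_at)
    if begin == -1:
        return None, -1
    begin += len(start_marker)
    end = page.find(end_marker, begin)
    if end == -1:
        return None, -1
    return page[begin:end].strip(), end + len(end_marker)


# Shortened sample modelled on the table rows posted earlier in the thread.
sample = (
    '<tr><td class="middesc">IGBT 1600A 1200V SINGLE</td></tr>'
    '<tr><td class="middesc">IGBT SINGLE</td></tr>'
)

descriptions = []
pos = 0
while True:
    # Each call resumes the search where the previous match ended.
    text, pos = text_between(sample, '<td class="middesc">', "</td>", pos)
    if text is None:
        break
    descriptions.append(text)

print(descriptions)
```

Passing the returned index back in as the next starting point is what lets the same search walk through all the records instead of finding the first one repeatedly.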
Chris L
Re: Extracting Data after httpGetResult()
July 12, 2009 02:04PM
Dan

I've done quite a bit of this stuff over the past few years, extracting data from web pages typically (though not always) in table form.

The good news is that it can be done and it works well, even with some fairly complicated pages. (One application I use accesses a main page, extracts links to subsidiary pages and then extracts the data from these subsidiary pages, which is in the form of groups of tables.)

The bad news is that it takes a lot of time and effort to set this up. I haven't found any easy way of doing this, although I have discovered certain functions in WinDev which have made this easier.

I did think the new HTMLToText function in version 14 would be a great help but because it strips all the formatting and just leaves a long line of text, I have not been able to use this. The HTMLToRTF function leaves formatting like Bold but doesn't leave the built-in structure of the web page.

Basically, I save the web page to a text file and then parse it. The parsing is what takes the time; it is essentially a trial-and-error process. I use a test window to do this: one with two large edit fields showing before and after views. I build up the code gradually to get the information required.

I don't know whether you're interested but here's some detail of what I've done.


Writing to a text file

FileId = fOpen("meeting.html", foCreate)  // can also be a .txt file

IF FileId <> -1 THEN
	
	Callres = HTTPRequest(sWebAddress)
	
	// Retry to cope with delays in connection.
	// The cap of 10 attempts is an assumed limit; without a cap the loop never ends if the site cannot be reached.
	iCount3 = 0
	WHILE Callres = False AND iCount3 < 10
		iCount3 ++
		FOR icount2 = 1 TO 1000	// crude delay between attempts
		END
		Callres = HTTPRequest(sWebAddress)
	END
		
	IF Callres = True THEN
		// Save the page retrieved into the file
		fWrite(FileId, HTTPGetResult())
	ELSE
		Trace("There's a problem with timing.")	
	END
	fClose(FileId)
END



Next step: get to the body of the text and clean it up a bit (but leaving essential breaks and formatting)

IF fSize("meeting.html") > 1000 	
	
	FileId = fOpen("meeting.html",foRead)
	
	sTextVersion = fRead(FileId,fSize("meeting.html"))
	
	icount2 = PositionOccurrence(sTextVersion,"/h6 h2",1,IgnoreCase)
	
	sTextVersion = sTextVersion[[(icount2+9) TO ]]
	icount2 = PositionOccurrence(sTextVersion,"div id='Footer'>",1,IgnoreCase)
	sTextVersion = sTextVersion[[1 TO (icount2-27)]]
	
	sTextVersion = Replace(sTextVersion,"/table>","&&&"+CR+CR,IgnoreCase)	/// "&&&" used as a marker for later substitution
	sTextVersion = Replace(sTextVersion,"/tbody>>","",IgnoreCase)	
	sTextVersion = Replace(sTextVersion,"h2>","",IgnoreCase)	
	sTextVersion = Replace(sTextVersion,"/tr> tr> td>","",IgnoreCase)	
	sTextVersion = Replace(sTextVersion,"<tbody><tr><td>",CR,IgnoreCase)	
	sTextVersion = Replace(sTextVersion," /td> td>",TAB,IgnoreCase)	
	sTextVersion = Replace(sTextVersion," /td> td>",TAB,IgnoreCase)	
	sTextVersion = Replace(sTextVersion," /td>","",IgnoreCase)	
	sTextVersion = Replace(sTextVersion," /tr>&&&","&&&",IgnoreCase)	
	sTextVersion = Replace(sTextVersion," /h2> h3>","##",IgnoreCase)	
	
	icount2 = PositionOccurrence(sTextVersion,"&&&",1,FromEnd)
	sTextVersion = sTextVersion[[1 TO (icount2+2)]]
	
	
	sTextFile = sVenue+".txt"
	
	FileID2 = fOpen(sTextFile,foCreate)
	
	IF FileID2 <> -1 THEN
		WriteRes = fWrite(FileID2,sTextVersion)
		
		IF WriteRes = -1 THEN
			INFO("Oops! Error in writing sTextVersion")		
		END
		fClose(FileID2)
	END
	fClose(FileId)	// close the source file opened for reading above
END

I've had to remove all the "<" brackets from the front of the HTML codes because they wreak havoc with the code displayed in this posting. However, in my original code these left angle brackets are present.

Two features of WinDev provide the framework for parsing the HTML code into something useful.

PositionOccurrence determines the position of a particular HTML code group. Notice I say 'code group' because it is typically a particular sequence of HTML codes that distinguishes one piece or section of data from another. (In the above code, you'll see the use of "/tr> tr> td>" because that is what was necessary to separate this piece of data from something else in the table.)

String slicing is the other important component of this process. Having determined the position of a code or code group, I now know where to get the data. If it's a fixed length I can use that in the arguments of the string slice; if the length is variable then I need to determine the position of the end marker or the start of the next piece of data.



When you've got rid of the dross and broken the file down into useable sections then use the ExtractString function.

sRaceString = ExtractString(sTextVersion,firstRank,"&&&")
	
	NumberOfRaces = 0
	WHILE Length(sRaceString)>10
		ProcessOneRace(sRaceString)
		NumberOfRaces += 1
		sRaceString = ExtractString(sTextVersion,nextRank,"&&&")
	END

You'll recall I said I used the "&&&" as a marker for later substitution, maybe a TAB or CR. In this case, I've used it as a marker for a slab of text. In this particular application, sRaceString is a long string which contains all the data from one race (but one race only) extracted from a page which lists many races as well as general data about the meeting.
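
The rank-by-rank loop described above can be sketched as follows. This is Python rather than WLanguage, and `extract_string` is a hypothetical stand-in that only approximates how ExtractString is used in this thread (1-based rank, separator of any length); it makes no claim to match WinDev's own behaviour in every case. The sample text is invented.

```python
def extract_string(text, rank, separator):
    """Return the rank-th (1-based) piece of text split on separator,
    or None when rank is past the last piece."""
    pieces = text.split(separator)
    if 1 <= rank <= len(pieces):
        return pieces[rank - 1]
    return None


# Invented stand-in for the cleaned-up page text, with "&&&" closing each slab.
cleaned = "Race 1: runners A, B, C&&&Race 2: runners D, E, F&&&"

races = []
rank = 1
piece = extract_string(cleaned, rank, "&&&")
# Stop when the pieces run out or only a short leftover fragment remains.
while piece is not None and len(piece) > 10:
    races.append(piece)
    rank += 1
    piece = extract_string(cleaned, rank, "&&&")

print(len(races), races)
```

The length check mirrors the `Length(sRaceString)>10` test in the loop above: the text after the final marker is empty or too short to be a real record, so it ends the loop.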

I next work on this sRaceString to extract what could be called the 'header' information and I am then left with a slab of text which contains just the data about the runners. I again use the ExtractString function to pick out the information for the individual runners, one string per runner. Finally, another ExtractString function picks up the individual items from each runner: name, rider, weight, etc. (This ExtractString function has proved enormously useful, saving much time and effort compared with my earlier versions, when I had to work on PositionOccurrence with HTML codes right down to the individual items of data. The sooner you can use ExtractString in your parsing process, the quicker and easier this process becomes.)



I don't know whether any of this helps but it might give you some ideas for what you're doing. As I said at the beginning, I haven't found any short easy way of doing this - it takes literally hours of trial-and-error. The only saving grace is that WD makes it very quick and easy to make a change and test it out. Write something in your code, test the window, close it and you're immediately back to your code with the cursor in the same position as you left it.

Best of luck.

Chris L
Melbourne, Oz






DanM
Re: Extracting Data after httpGetResult()
July 12, 2009 03:29PM
Chris,

Thanks for pointing me in the right direction.

By using ... EDT_Edit1 = HTMLToText(HTTPGetResult()) (in WinDev12)

(in the Entry of EDT1) I am able to get the data to look like this (see below). It is getting very close.

Now, there is a bunch of data before the table with the data I want to extract, a lot more than I am showing here. The data below "1 - 7 of 7 Matches ..." is the data I want to extract.

How do I tell WinDev where to start reading data, or how do I trim the string starting after this point? For example, there is a heading phrase:

IMAGE DESCRIPTION AVAILABILITY PRICE

I want to trim the string down to only the data after this phrase. This should leave me with just the data I want to save to a data table (except for the junk at the end).

Also, what is the function to read data by line? It looks like each record below is 16 lines, starting with the word EUPEC (the manufacturer).

I have been playing with PositionOccurrence, Position, ExtractString ... but no luck getting it to work yet ...

Thanks for your help so far ... Dan

===============================================================

Wire Duct
Flexible Duct
Panel Duct
Wireholders
Wireholders
 Coming soon ...
 Coming soon ...
 Coming soon ...
Buy Products > Search: FZ1600

Narrow your Search
 
1 - 7 of 7 Matches Show  Items / Page   
IMAGE DESCRIPTION AVAILABILITY PRICE  


EUPEC
TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KE3

NO STOCK
Est. Lead Time
28 days

$1,859.60
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KF4

NO STOCK
Est. Lead Time
27 days

$2,208.28
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KL4C

NO STOCK
Est. Lead Time
28 days

$2,208.28
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT SINGLE
ITEM # FZ1600R17KE3-B2

NO STOCK
Est. Lead Time
28 days

$2,862.80
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT SINGLE
ITEM # FZ1600R17KE3

NO STOCK
Est. Lead Time
28 days

$2,208.28
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT 1600A 1700V SINGLE
ITEM # FZ1600R17KF6-B2

NO STOCK
Est. Lead Time
28 days
? QTY.


EUPEC
TRANSISTOR
IGBT
ITEM # FZ1600R17KF6C-B2

NO STOCK
Est. Lead Time
28 days

$2,933.14
Volume
Discounts
Available
QTY.
1 - 7 of 7 Matches    

DanM
Re: Extracting Data after httpGetResult()
July 12, 2009 06:52PM
There is a phrase in the string that is the beginning of where I need to extract the data ...

1 - 7 of 7 Matches Show Items / Page
IMAGE DESCRIPTION AVAILABILITY PRICE

What is the function I can use to determine the position of the end of this phrase?

I think from there I could use "right" function to grab everything from that point on????

Any ideas??

Dan
DanM
Re: Extracting Data after httpGetResult()
July 12, 2009 08:01PM
I am getting closer but ... this will not allow me to extract correctly every time ...

First, I am able to use ...

gResStart = HTTPRequest("[www.onlinecomponents.com]")

... and then ...

EDT_Edit1 = HTMLToText(HTTPGetResult())

this gets me to the point where the HTML from the page is in Text format in a string ...

Now I am trying to get rid of all the information I do not need. I am currently doing it manually, by slowly figuring out at what position the data starts and stops, as below ...

MyString is string = EDT_Edit1

FirstCut is a string = Right(MyString , 640) // keeps the last 640 characters
EDT_Edit2 = Left(FirstCut, 220) // then keeps the first 220 of those

Does anyone know of a way to use ExtractString between two tags or phrases?
Then I would be able to extract the information I need without identifying the position or location of the beginning of the data.

OR ... How do I identify the position or location of a phrase? I think I would then be able to use the Right and Left functions on the string to remove the unneeded data.

Any thoughts, suggestions, ideas ...

Dan



Piet van Zanten
Re: Extracting Data after httpGetResult()
July 12, 2009 11:36PM
Hi Dan,

The beauty of ExtractString is that it accepts not only a single character as a separator but also a multi-character string. So if you wanted to extract the body part of an HTML string, you would use:

sBody=Extractstring(Extractstring(sHTML,1,"[/body]"),2,"[body]")
The HTML page is split in two by the body end tag. The inner ExtractString returns the part in front of [/body]. This becomes the first argument of the outer ExtractString, which returns the part after the [body] tag (rank 2).
Using this technique you can narrow down your search step by step.
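
The same two-step narrowing can be sketched in Python, shown here only as an illustration of the logic: first keep what precedes the closing marker, then keep what follows the opening marker. The function name `between` and the tiny sample page are assumptions.

```python
def between(text, open_marker, close_marker):
    """Keep what precedes close_marker, then what follows open_marker.
    Returns None if either marker is absent."""
    head, found_close, _ = text.partition(close_marker)
    if not found_close:
        return None
    _, found_open, body = head.partition(open_marker)
    if not found_open:
        return None
    return body


html = "<html><head><title>t</title></head><body><p>Hello</p></body></html>"
print(between(html, "<body>", "</body>"))
```

An opening tag that carries attributes would not match the exact open marker used here, so in practice the opening marker may need to be a shorter, attribute-free prefix followed by a second cut at the next closing bracket.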

Regards,
Piet

Note: Because the forum will not display any pointy brackets or anything between them, I replaced them with square brackets.
DanM
Re: Extracting Data after httpGetResult()
July 13, 2009 03:05AM
Piet,

That is amazing ... Thank you ...

but I am still stuck when it comes to the column qty and prices ...

This allows me to get down to ...

1TL1-2G
In Stock: 36 pcs. can ship now
Factory Lead-Time: 6 weeks
Pricing for 1TL1-2G Quantity Price
1 - 24 $34.56
25 - 49 $28.69
50 - 249 $25.24
250 - 999 $23.50
1000 + $21.43

Now, many part numbers have different column quantities (e.g. 1-2, 3-5, 6-10 or 1-9, 10-24, 25-99 or 1-99, 100-499, 500+).

How would I use the ExtractString function to separate by line? Or can I extract by line?
Is there a "CR" character I can extract by, or some other way?

Example (based on above data)

$partnumber = 1TL1-2G
etc ... (I now understand how to strip the additional fields up top)

$qty_on_hand = ExtractString(ExtractString($qtyStep1,1,"pcs."),2,"In Stock")

but what can I do about the multiple column qty & pricing since the column qty & prices change?

... I will not know what the beginning and end strings will be to extract by?

If there is a way to extract by line I could use the $ in the price to do a ... (from beginning of line to $ and a from $ to end of line)

does that make sense ???

this would be the needed outcome ....

$col1_qty = 1 - 24
$col1_price = $34.56
$col2_qty = 25 - 49
$col2_price = $28.69
$col3_qty = 50 - 249
$col3_price = $25.24
$col4_qty = 250 - 999
$col4_price = $23.50
$col5_qty = 1000 +
$col5_price = $21.43













Piet van Zanten
Re: Extracting Data after httpGetResult()
July 13, 2009 08:55AM
Hi Dan,

To extract parts separated by a repeating string look at:
FOR EACH STRING sPart OF sContent SEPARATED BY "anything"
Discard what you don't need and add the relevant sParts to an array and break down the elements of the array with the same technique.
Use breakpoints and the debugger to track the results and narrow down your search.

You must find patterns that are always the same, otherwise it will be impossible to extract the data. Typically you can look for tables: <table </table, <tr </tr (table row), <td </td (table cell), or <div </div, or <br (line break).
Note that there can be various class indicators and/or style attributes included in between the < and the >.
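
As an illustration of splitting on a repeating separator, here is a minimal Python sketch that walks the pricing text line by line and cuts each line at the dollar sign. It is not WLanguage, the function name `parse_price_tiers` is an assumption, and the sample is the pricing block quoted earlier in this thread.

```python
def parse_price_tiers(block):
    """Split a text block into lines and return (quantity, price) pairs
    for every line that has a number after a dollar sign."""
    tiers = []
    for line in block.splitlines():
        qty, dollar, price = line.partition("$")
        if not dollar:
            continue  # heading or blank line: no "$" at all
        try:
            amount = float(price.strip().replace(",", ""))
        except ValueError:
            continue  # "$" present but not followed by a plain number
        tiers.append((qty.strip(), amount))
    return tiers


sample = """Pricing for 1TL1-2G Quantity Price
1 - 24 $34.56
25 - 49 $28.69
50 - 249 $25.24
250 - 999 $23.50
1000 + $21.43"""

for qty, price in parse_price_tiers(sample):
    print(qty, price)
```

Because every line is handled the same way, the number of quantity breaks does not need to be known in advance; parts with three tiers and parts with five tiers go through the same loop.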

Regards,
Piet
Chris L
Re: Extracting Data after httpGetResult()
July 13, 2009 03:34PM
Dan

Piet's highlighted the problem/challenge. You have to find a separator for each part, whether you're using ExtractString or the FOR EACH construction mentioned above. That's why I ended up rejecting the HTMLToText function. As I mentioned in my previous posting, this function removed all the formatting which meant that it was next to impossible to find unique separators.

One thing I tried was analysing the ASCII codes of every character in the string. Sometimes there were hidden characters, characters which did not show on the screen when displayed as standard text. For instance, the end of a line could be an ASCII code 13 or 10 or both. That's why looking for a standard CR doesn't always work.
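A minimal sketch of that kind of inspection, assuming the text to check is in a string. The variable sLine and its sample content are hypothetical:

```wlanguage
// Sketch only: sLine stands in for a piece of text taken from the page
sLine is string = "pcs." + Charact(13) + Charact(10)
i is int

FOR i = 1 TO Length(sLine)
	// Asc returns the ASCII code of the character passed to it
	Trace(NumToString(i) + ": " + NumToString(Asc(Middle(sLine, i, 1))))
END
```

With the sample above, the last two lines of the trace show the codes 13 and 10, which is the kind of hidden line ending that a plain search for a visible character misses.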

If you still can't find the unique characters or character combinations to distinguish lines in your example then the only solution I can think of is to go back to your original HTML file (before transformation with the HTMLToText function), work out the particular combination of codes which distinguish each line and then substitute/replace with a special string such as "&&&"; you can then use this later as your unique separator.
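One way this substitution might look, as a sketch rather than a tested solution. It assumes the tag that opens each record is <td class="middesc"> (from Dan's first post) and that "&&&" never occurs in the page itself. The variable names sHTML and sRecord are illustrative:

```wlanguage
// Sketch only: sHTML is assumed to hold the untransformed page source
sHTML is string = HTTPGetResult()

// Normalise the line endings so that only code 10 remains
sHTML = Replace(sHTML, Charact(13) + Charact(10), Charact(10))
sHTML = Replace(sHTML, Charact(13), Charact(10))

// Swap the tag that opens each record for the special marker
sHTML = Replace(sHTML, "<td class=""middesc"">", "&&&")

// The marker now works as a unique separator
// (the first piece is whatever came before the first record)
FOR EACH STRING sRecord OF sHTML SEPARATED BY "&&&"
	Trace(HTMLToText(sRecord))
END
```

Doing the replacement on the raw HTML first, and only calling HTMLToText on each piece afterwards, keeps the separator intact while still stripping the leftover tags.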

Rest assured, it's all possible (speaking from considerable experience parsing a range of web pages) but as I've indicated, it's very much grunt work, plain hard laborious slogging.

Have fun.

Chris
DarrenF
Re: Extracting Data after httpGetResult()
July 13, 2009 05:32PM
Hi,

I don't know the source of the HTML file you are trying to process, but is it possible to get the information supplied to you in XML?

Piet and Chris are quite correct - it's all about finding unique separators that can't appear anywhere else - even the word "Display" (for example) could appear in the product description area. But all this also assumes that the layout of the HTML page doesn't change over time.

If you can get it in XML it's much easier to process...

Cheers...
Dan M
Re: Extracting Data after httpGetResult()
July 13, 2009 06:23PM
Darren,

Unfortunately, we are not able to get XML from many of the suppliers, so I will need to do this for many (too many) suppliers until they get up to speed on the XML feed.

Chris,

Did you say ... it's very much grunt work, plain hard laborious slogging. Have fun.

... all in the same breath? I do not recall the last time I was slogging and had fun ... LOL

but ... I am so close ... I think if I can figure out this last piece ... I will be on my way ...

Here it is ...

I am back to the HTML (without the HTMLToText) as you discussed. My line of code is ...

sPriceBreakQty = ExtractString(ExtractString(sAPriceBreakLine, 1, "[/td]"),2,"[td width="50%" align="center" class="regprice"]")

but, that does not work because the HTML has quotes around all the data (50%, center, and regprice)

So then I tried removing the quotes around those 3 pieces of data ... but that does not work either.

so now I am thinking about what you said earlier about unique identifiers ... I thought I would replace any of the HTML that had the quotes in it ...

What if I replace [/td]"),2,"[td width="50%" align="center" class="regprice"]
... with ENDOFPRICE as a unique identifier ...

BUT ...

How do you do a replace when the item you want to replace has quotes in it when you need to put quotes around the item???

Any Ideas??

Dan
Ruben Sanchez Peña
Re: Extracting Data after httpGetResult()
July 13, 2009 08:11PM
Hi. If you want to use " in a string, you must write "" for each one.

sPriceBreakQty = ExtractString(ExtractString(sAPriceBreakLine, 1, "[/td]"),2,"[td width=""50%"" align=""center"" class=""regprice""]")


Chris L
Re: Extracting Data after httpGetResult()
July 14, 2009 03:35AM
Dan

"... it's very much grunt work, plain hard laborious slogging. Have fun."

Yes, it was said tongue in cheek! As I think you've discovered, it's both boring and frustrating. You'd go out of your mind if you tried to do it without breaks of doing other work.

Having completed the work, I must admit it is satisfying to be able to click a button and see thousands of pieces of data stored in the appropriate files. Until of course someone decides to update the website! That happened to me at the beginning of this year; the webmaster of the major site I use decided to modernise with style sheets and the rest. It certainly looks better but it meant that many of the codes changed so I had to start over again. Luckily it gets easier second (third, fourth, etc) time round!

Ruben's info on putting quotes within quotes is worth noting - it crops up in all sorts of places. If you want to use quotes within a string defined with quotes, simply double the quotes.
sItemDescription = "Microprocessor transistor (known as ""MOSFET"")"
It looks strange when the quotes are at the end of the string and you have three quotes together but that's correct.
sItemDescription = "Transistor ""BJT"""

I'll change my parting salutation to ""Best of luck""!

Chris
DanM
Re: Extracting Data after httpGetResult()
July 14, 2009 04:06PM
Chris,

Well, I have made it ... I ended up needing 1 more scrub of the data and this was how it was done ... (by adding the HTMLToText)

before adding HTMLToText : [tinypic.com]

after adding HTMLToText : [tinypic.com]

sPriceBreakQty = HTMLToText(ExtractString(ExtractString(sAPriceBreakLine, 1, "</td>"),2,"<td width=""50%"" align=""center"" class=""regprice"">"))

sPriceBreakPrice = HTMLToText(ExtractString(ExtractString(sAPriceBreakLine, 1, "</span>"),2,"<span class=""padright"">"))

Apparently, there was still some HTML formatting in the string which was preventing me from getting to the actual data. I was able to see the data in an infobox, but when I attempted to display it in a table it would come up as a "blank". Once I added the HTMLToText, whatever was in front of the data was stripped and now all is good.

Thank you ALL for your help on this journey !!

Dan
Chris L
Re: Extracting Data after httpGetResult()
July 15, 2009 07:45AM
Well done, Dan!

Perseverance pays off.

It'll be easier next time!

Chris