Spider and get tag information of one web page  
_BNC
Posted: Wed May 09 02:24:20 CDT 2007

ASP.Net >> Spider and get tag information of one web page

Hi all,
I would like to know if anyone knows of a code sample.
Take this page, for example:
http://www.hide-link.com/ ;_ylt=AowRaqOx9PVGQC1OCxcj9vsEgFoB;_ylu=X3oDMTBhNjRqazhxBHNlYwNzZWFyY2g-?p=+friendship+roses+&did=&x=51&y=10

As you can see, the page lists a lot of items.
I need to get the image link, navigation URL, price,
description, etc. of each item and then store them in a database.

I know there is a way to search the HTML code and return
values, but I don't know how.
Any help would be appreciated.
Thank you,

Alexey
Posted: Wed May 09 02:24:20 CDT 2007

> I know that there is a way of searching in the html code and return
> values (but don't know how)

Use Regular Expressions.
More info: http://www.google.com/search?hl=en&q=regular+expressions+asp.net

In your case, download the page's HTML as text and parse it with a pattern.

Here is a complete pattern that captures the link, name, description and
price:

(?<=\<h2\>\<a\shref=\")
(?<url>(.|\n)*?)(\"\>)(?<name>(.|\n)*?)(\<\/a\></h2\>\n\<br\/\>)
(?<description>(.|\n)*?)(\n)
(.|\n)*?
(\<span\sclass\=\"price\"\>)(?<price>.*?)(\<\/span\>)

Note: in the code, the pattern has to be written on one line.

Here's an example of the code:

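// Requires: using System.Text.RegularExpressions;
string t = "html_from_yahoo"; // the downloaded page HTML
// Verbatim string (@"...") so the backslashes are not read as C# escapes;
// inside it, write each " of the pattern as "".
string e = @"(?<=\<h2\>............(\<\/span\>)";

Regex r = new Regex(e, RegexOptions.Compiled);
MatchCollection matches = r.Matches(t);

foreach (Match m in matches)
{
    // Use the loop variable m, and read each named group's captured text
    Response.Write("name=" + m.Groups["name"].Value);
    Response.Write("description=" + m.Groups["description"].Value);
    Response.Write("url=" + m.Groups["url"].Value);
    Response.Write("price=" + m.Groups["price"].Value);
}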
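
For anyone who wants to try the named-group approach outside a page, here is a minimal console sketch. It does not use the Yahoo pattern above: the sample markup, the simplified pattern, and the names NamedGroupDemo, SampleHtml and FindItems are all made up for illustration, and the description group is left out to keep it short. Treat it as a sketch under those assumptions, not as something that will match the real page.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupDemo
{
    // Hypothetical markup, invented for this example; the real page will differ.
    public const string SampleHtml =
        "<h2><a href=\"/item/1\">Red roses</a></h2>\n" +
        "<span class=\"price\">$19.99</span>\n" +
        "<h2><a href=\"/item/2\">Friendship bouquet</a></h2>\n" +
        "<span class=\"price\">$24.50</span>\n";

    // Verbatim strings: backslashes reach the regex engine untouched,
    // and each literal " is written as "".
    const string Pattern =
        @"<h2><a\shref=""(?<url>[^""]*)"">(?<name>[^<]*)</a></h2>\s*" +
        @"<span\sclass=""price"">(?<price>[^<]*)</span>";

    // Returns one Match per item found in the given HTML.
    public static MatchCollection FindItems(string html)
    {
        return Regex.Matches(html, Pattern);
    }

    static void Main()
    {
        foreach (Match m in FindItems(SampleHtml))
        {
            // Each named group is read by name from the match
            Console.WriteLine("url={0} name={1} price={2}",
                m.Groups["url"].Value,
                m.Groups["name"].Value,
                m.Groups["price"].Value);
        }
    }
}
```

Using [^"]* and [^<]* instead of (.|\n)*? keeps each group from running past its own attribute or tag, which tends to be easier to debug when the markup changes.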

Hope it helps

 
 
discountonall
Posted: Sun May 13 15:08:08 CDT 2007



I have the full HTML of the page as a string.
What is the syntax to find, for example, everything from
<table class="item_table"
up to the next occurrence, and return it as a string?

 
 
Alexey
Posted: Sun May 13 15:36:48 CDT 2007



I guess something similar to this (written on one line):

(\<table\sclass\=\"item_table\")(.|\n)*?(?=\<table\sclass\=\"item_table\")
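
A small sketch of how that idea could be used to split a page into per-item strings. One caveat with a lookahead for the next table is that the last table on the page has no table after it, so this version also accepts the end of the input (\z) as a stopping point. The class name TableBlockDemo, the method SplitBlocks and the sample HTML are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class TableBlockDemo
{
    // Lazily match from one opening tag up to (not including) the next one,
    // or up to the end of the input for the last block.
    const string Pattern =
        @"<table\sclass=""item_table"".*?(?=<table\sclass=""item_table""|\z)";

    public static List<string> SplitBlocks(string html)
    {
        List<string> blocks = new List<string>();
        // Singleline lets '.' match newlines, replacing the (.|\n) idiom.
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.Singleline))
        {
            blocks.Add(m.Value);
        }
        return blocks;
    }

    static void Main()
    {
        // Hypothetical page fragment with two item tables
        string html =
            "<table class=\"item_table\"><tr><td>A</td></tr></table>\n" +
            "<table class=\"item_table\"><tr><td>B</td></tr></table>\n";

        foreach (string block in SplitBlocks(html))
        {
            Console.WriteLine("[" + block.Trim() + "]");
        }
    }
}
```

Note that the last block runs to the end of the string, so on a real page it would also pick up whatever follows the final table; each block can then be parsed with the item pattern on its own.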