Coffeehouse Thread

7 posts

Forum Read Only

This forum has been made read only by the site admins. No new threads or comments can be added.

Screen scraping AskVille

  • Steve411

    Hey everyone,

    I'm having a hard time with this one. Since AskVille does not provide an RSS feed I can work with, I need to write a regular expression which will retrieve all of the results of a search.

    So, for example, on askville.com, if I were to search for "google", I'd get referred to this page:
    http://askville.amazon.com/SearchRequests.do?search=google&open=true&closed=false

    It returns about 10 individual results. I need to grab the URL and title of each result. Can anyone help me with the regular expression for this?

    Thanks.
    Steve

  • Sven Groot

    Forget about regular expressions. Extracting data from HTML with regular expressions is painful and extremely error-prone.

    Instead, I suggest you use the HTML Agility Pack, a free, full-blown HTML parser that can handle malformed HTML (which matters here, because the page you linked to is not well-formed), and then use an XPath expression to get the elements you want.

    It seems the results are contained in a <div> with id "ad_contained_1", so all you need is the XPath expression //div[@id='ad_contained_1']/a to get all the anchor elements for the results. Then it's a simple matter of reading the contents and the href attribute of each element to get the text and link.

  • Sven Groot

    Sven Groot said:
    *snip*
    Quick sample to get you started:

    // Requires: using System; using System.IO; using System.Net;
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    Uri baseUrl = new Uri("http://askville.amazon.com/SearchRequests.do?search=google&open=true&closed=false");
    using( WebClient client = new WebClient() )
    using( Stream s = client.OpenRead(baseUrl) )
    {
        doc.Load(s);
    }
    HtmlAgilityPack.HtmlNodeCollection results = doc.DocumentNode.SelectNodes("//div[@id='ad_contained_1']/a");
    foreach( HtmlAgilityPack.HtmlNode node in results )
    {
        Uri url = new Uri(baseUrl, node.GetAttributeValue("href", ""));
        string text = node.InnerText.Trim();
        // Do something with it.
    }

  • Steve411

    Sven Groot said:
    *snip*
    Very cool! That saved me an hour or two of sitting back here!

    I owe you one.
    Steve

  • Steve411

    Sven Groot said:
    *snip*
    By the way, is there any way I can use the Agility Pack to add data to the controls of a web page? For example, an input box.

  • Sven Groot

    Steve411 said:
    Sven Groot said:
    *snip*
    By the way, is there any way I can use the Agility Pack to add data to the controls of a web page? For example, an input box.
    No. If you want to send a form, you'd have to build the HTTP request yourself.
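
    The point above — that the Agility Pack only parses HTML, so submitting a form means constructing the HTTP request yourself — can be sketched with WebClient.UploadValues, which sends name/value pairs as an application/x-www-form-urlencoded POST. The form action URL and field names below are hypothetical; check the page's actual <form> markup for the real action, method, and input names.

    ```csharp
    using System;
    using System.Collections.Specialized;
    using System.Net;
    using System.Text;

    class FormPostSketch
    {
        static void Main()
        {
            // Hypothetical form fields -- inspect the page's <form>
            // element for the real action URL and input names.
            NameValueCollection fields = new NameValueCollection();
            fields.Add("search", "google");
            fields.Add("open", "true");

            using (WebClient client = new WebClient())
            {
                // UploadValues encodes the fields as an
                // application/x-www-form-urlencoded POST body.
                byte[] response = client.UploadValues(
                    "http://askville.amazon.com/SearchRequests.do", fields);

                // The response bytes are the resulting HTML page;
                // feed it to HtmlDocument.LoadHtml as before.
                string html = Encoding.UTF8.GetString(response);
            }
        }
    }
    ```

    If the form uses GET instead of POST, you can skip UploadValues and just append the URL-encoded fields to the query string, then use OpenRead as in the earlier sample.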

  • Ion Todirel

    Sven Groot said:
    *snip*
    regular expressions all the way down! ;)
