I'm trying to use the System.Net.WebClient class to scrape a page behind a form-based authentication system. I have the credentials for it, but I'm not sure how to provide them to the WebClient class.
Any help is much appreciated.
-
-
It can be simple or complicated. You need to know the names of the authentication fields and construct an application/x-www-form-urlencoded string from them. Doing this with WebClient is awkward. You can send post data to a server by calling OpenWrite, but I can't see a way to get back the result of that post that way (UploadData and UploadValues do return the response body). In many cases the server also accepts the fields as a query string, so you can just append them to the URL and use that with WebClient.
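As a rough illustration of building that string, here is a minimal sketch. The class name FormEncoder and the field names "username" and "password" are assumptions made for the example; the real field names have to be read from the login form's HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FormEncoder
{
    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (body.Length > 0)
            {
                body.Append('&');
            }
            body.Append(EscapeComponent(field.Key));
            body.Append('=');
            body.Append(EscapeComponent(field.Value));
        }
        return body.ToString();
    }

    // Percent-encodes one component. Form encoding writes spaces as '+';
    // a literal '+' has already been escaped to %2B, so the replace is safe.
    private static string EscapeComponent(string text)
    {
        return Uri.EscapeDataString(text).Replace("%20", "+");
    }
}
```

With the hypothetical fields username=alice and password=p@ss word&1, this yields username=alice&password=p%40ss+word%261, which can be sent as the POST body or appended to the URL after a "?".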
However, you'll most likely need to keep the cookie the server returns if you want to access any pages other than the login page with the supplied credentials. WebClient doesn't expose cookies on its own, so you'll have to use HttpWebRequest/HttpWebResponse to pull this off.
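A sketch of that approach, assuming the login is a plain form POST: one CookieContainer is shared between the login request and every later request, so whatever the server sets during login is sent back afterwards. The class name LoginSession, its method names, and any URLs passed in are placeholders, not taken from a real site.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class LoginSession
{
    // Creates the POST request for the login form, attached to the shared cookie container.
    public static HttpWebRequest CreateLoginRequest(string loginUrl, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(loginUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;
        return request;
    }

    // Creates a GET request that sends back whatever cookies the container holds.
    public static HttpWebRequest CreatePageRequest(string pageUrl, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(pageUrl);
        request.CookieContainer = cookies;
        return request;
    }

    // Writes the already-encoded form body, sends the request and returns the response text.
    // Cookies from the response headers end up in the request's CookieContainer.
    public static string Post(HttpWebRequest request, string formBody)
    {
        byte[] payload = Encoding.UTF8.GetBytes(formBody);
        request.ContentLength = payload.Length;
        using (Stream output = request.GetRequestStream())
        {
            output.Write(payload, 0, payload.Length);
        }
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Typical use would be to create one CookieContainer, call CreateLoginRequest and Post with the encoded credentials, then fetch the protected pages through CreatePageRequest with the same container.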
To make matters worse, if the page you're accessing uses ASP.NET you'll probably have to download the login page first and extract the viewstate so you can send it back with the post.
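For the viewstate step, a minimal sketch could look like the following. It assumes the login page renders the usual hidden input named __VIEWSTATE with the name attribute appearing before the value attribute; the class name ViewStateReader is made up for the example, and a real page may need a proper HTML parser instead of a regular expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ViewStateReader
{
    // Matches <input ... name="__VIEWSTATE" ... value="..."> and captures the value.
    private static readonly Regex ViewStateInput = new Regex(
        "<input[^>]*name=\"__VIEWSTATE\"[^>]*value=\"([^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns the viewstate value found in the page, or null if there is none.
    public static string Extract(string html)
    {
        Match match = ViewStateInput.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

The extracted value is Base64 text, so it can contain "+", "/" and "=" and has to be URL-encoded like any other field when it is posted back under the name __VIEWSTATE.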
If you want an example of how to do this, download the source of either C9Avatar or C9Music. Both include the Channel9.Profile assembly, which logs on to your C9 account and edits your profile. Check the ProfileEditor.Login method to see how it logs in (this is .NET 2.0 code, but I think it's pretty much the same as it would be in .NET 1.1; you didn't say which you're using).
-
Thanks for the pointer, Sven, I'm making my way through the C9IAC right now...
I'm proceeding with the HttpWebRequest / HttpWebResponse classes, but the cookies just aren't coming down, even when I'm requesting a non-protected page. In a browser, the site sends down some cookies even for non-protected pages.
It's a PHP page. Is there a setting that the browser sends up to say "I accept cookies"? If so, is my HttpWebRequest not doing that, so the page isn't sending down cookies?
-
That's something I came across as well with the Channel9.Profile stuff. You can see in the Login method the following line:
request.CookieContainer = new CookieContainer();
You must do that; otherwise no cookies will be returned in the response.
EDIT: This is noted in the docs for the HttpWebRequest.CookieContainer property: "You must assign a CookieContainer object to the property to have cookies returned in the Cookies property of the HttpWebResponse returned by the GetResponse method." -
Yeah, that's the puzzling part. I do have the CookieContainer created first before the GetResponse() <-- this is where the request is sent up to the site, right? Do you mind lending an extra pair of eyes? This code always returns a CookieCollection count of zero, even though, visiting the same page w/ a browser sets 3 cookies. I'm really confused.
// First pass
HttpWebRequest request = CreateRequest("http://www.gamasutra.com/php-bin/article_display.php");
request.CookieContainer = new CookieContainer();
CookieCollection cookies = null;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    // Store the session cookie.
    cookies = response.Cookies;
    System.Windows.Forms.MessageBox.Show("First pass cookies=" + cookies.Count.ToString());
}
-
Okay, this took some digging, but I found it. The three mystery cookies are set by JavaScript. They are not returned in the headers of the page. You can check this by disabling JavaScript and revisiting the site (if necessary, clear session cookies first) and you'll see the cookies aren't there.
Since HttpWebRequest can't execute JavaScript (and the actual offending script file is in fact never even downloaded), you won't see these cookies. Only cookies that the server returns in the response headers of the file you actually request are picked up.
Judging by the cookies and the script file that sets them, it's also safe to just ignore them. -
Sven Groot wrote:
Okay, this took some digging, but I found it. The three mystery cookies are set by javascript.
Aahh. Thanks for lending a hand, Sven. You're right: after disabling JavaScript and logging in with Firefox, I'm getting only 3 cookies (there used to be 6).
At least I can now concentrate on why THESE 3 cookies aren't sent back by the site even though I set up the POST data for the username / password (correctly I hope).
I'm tempted to just copy these 3 cookies verbatim & send that up with the request -- I wonder if that would work.
Oh well, thanks again.
-
Woot!
Manually creating the cookies w/ values matching those reported by FireFox let me in. Not as sexy as submitting the username/password, but it works. -
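For reference, copying cookies over by hand can be sketched roughly like this. The class name ManualCookies, the host, and the cookie names and values are all placeholders; the real ones are whatever the browser's cookie viewer reports for the site.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class ManualCookies
{
    // Builds a CookieContainer holding the given name/value pairs for one host.
    public static CookieContainer Build(string host, IDictionary<string, string> values)
    {
        CookieContainer container = new CookieContainer();
        foreach (KeyValuePair<string, string> pair in values)
        {
            // Path "/" makes the cookie apply to every page on the host.
            // CookieContainer.Add(Cookie) requires the Domain to be set.
            container.Add(new Cookie(pair.Key, pair.Value, "/", host));
        }
        return container;
    }
}
```

The resulting container is assigned to request.CookieContainer before calling GetResponse. Cookies copied this way stop working once the server expires the session, so they have to be refreshed from the browser from time to time.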
It is possible to use the WebClient with cookie management.
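using System;
using System.Net;

// WebClient subclass that attaches a CookieContainer to every HTTP request it makes.
public class WebClientExtended : WebClient
{
    private CookieContainer myContainer;
    private string myMethod;

    // Optional HTTP method override. Leave it null to keep the method WebClient chooses.
    public string Method
    {
        get { return myMethod; }
        set { myMethod = value; }
    }

    // Cookies shared by all requests made through this client.
    public CookieContainer Cookies
    {
        get
        {
            if (myContainer == null)
            {
                myContainer = new CookieContainer();
            }
            return myContainer;
        }
        set
        {
            myContainer = value;
        }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        // Only HTTP requests carry cookies; other schemes (file:, ftp:) pass through untouched.
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            // Assigning a null Method would throw, so only override when one was set.
            if (!string.IsNullOrEmpty(Method))
            {
                httpRequest.Method = Method;
            }
            httpRequest.CookieContainer = Cookies;
        }
        return request;
    }

    // Use the request that is passed in rather than a cached field, so that
    // overlapping requests cannot pick up each other's response.
    protected override WebResponse GetWebResponse(WebRequest request)
    {
        return base.GetWebResponse(request);
    }

    protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
    {
        return base.GetWebResponse(request, result);
    }
}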