Web resources copied and pasted here for backup and reference


8 February 2008

When a Crawler Runs Into Page Redirects

User #49896   2841 posts
Whirlpool Forums Addict
I'm using HttpWebRequest and CookieContainer to log in to an external site by passing POST parameters to an HTTPS page. I've done this before with a different page and it worked fine. Trying to do it again today in a new project, but I get a WebException saying "Too many automatic redirections were attempted."

Googled that but nothing came up. Anyone come across this before?
posted 2006-Mar-18, 7pm AEST
User #80562   1044 posts
Whirlpool Enthusiast
Sounds like the HttpWebRequest object is seeing too many 302 responses or similar (trying to fetch the one page) and has given up. Perhaps the target page has a 302 loop?

I have not specifically used HttpWebRequest, but looking at its documentation it seems very similar to LWP::UserAgent (Perl) or cURL (PHP).

I would guess that similar problems affect them all.
posted 2006-Mar-18, 7pm AEST
edited 2006-Mar-18, 7pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
It works fine when going through the process in a web browser. Could this be a security feature?

For any USYD students, I'm trying to log into the timetable page.
posted 2006-Mar-18, 7pm AEST
edited 2006-Mar-18, 7pm AEST
User #80562   1044 posts
Whirlpool Enthusiast
kufu writes...
It works fine when going through the process in a web browser.
Try using some program to record exactly which HTTP headers your browser is sending and receiving during the login process.

Could this be a security feature?
I doubt it, but stupidity is not picky about where it crops up :-)
posted 2006-Mar-18, 7pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
erroneousBollock writes...
Try using some program to record exactly which HTTP headers your browser is sending and receiving during the login process.

I used the Live HTTP Headers extension for Firefox to do just that. The login process seems to go through two different 302 redirects. Flow is as follows:

1. I post login data to a login cgi script.
2. Server responds with a 302 redirect to a new location and also sets a cookie variable.
3. Browser requests new URL.
4. Server responds with another 302 and another cookie is set.
5. Browser requests new URL again.
6. Server responds with final 200 destination.

Any way to get this to work with ASP.NET?
posted 2006-Mar-18, 8pm AEST
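
The six-step flow above only works if every cookie set on a 302 response is sent back on the following request. A minimal sketch of that bookkeeping, written in Python rather than .NET purely for illustration: `merge_set_cookies` and `cookie_header` are hypothetical helper names, and the sketch ignores cookie domain, path, and expiry rules, which a real client would have to respect.

```python
from http.cookies import SimpleCookie


def merge_set_cookies(jar, set_cookie_headers):
    """Fold raw Set-Cookie header values into a plain name -> value dict.

    Simplification (assumption): attributes such as Path, Domain and
    Expires are parsed but then ignored.
    """
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar


def cookie_header(jar):
    """Build the value of the Cookie request header for the next hop."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())
```

After step 2 the jar holds the first cookie, after step 4 it holds both, and `cookie_header(jar)` is what would go out with the request in step 5.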
User #80562   1044 posts
Whirlpool Enthusiast
Post directly to the final destination.
You may need to set the referer header to be the url in the 2nd destination.

Re-read: mmmm, cookies. It may be that they've glued multiple disparate systems together with some hackish security thrown in. You may indeed have to make 3 separate requests to do the login. Not difficult.
posted 2006-Mar-18, 8pm AEST
edited 2006-Mar-18, 8pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
I've uploaded a header trace here (User credentials and session keys have been altered). The trace starts with the POST request. --- lines separate request/response pairs.

Points of interest: the destURL post parameter and the 2 cookies being set in each of the redirect responses. I believe the first cookie identifies your login and the second cookie identifies that you were redirected by the login page.

If I remove the destURL post parameter from the login procedure then the process stops after the first redirect and only the first cookie is set. If I then type the destURL directly in the browser then this takes me to where I want to go, but the header trace shows that it goes back through the login page and the 2 redirects.
posted 2006-Mar-18, 9pm AEST
edited 2006-Mar-18, 9pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
erroneousBollock writes...
It may be that they've glued multiple disparate systems together with some hackish security thrown in

Yes, I'm almost positive that's the case. Each system had a separate login previously even though the same credentials are used. They've recently merged these into a common portal which only requires you to log in once.

You may indeed have to make 3 separate requests to do the login. Not difficult.

I'll give it a try tomorrow. Thanks!

Still don't know why that error happens though. It should just follow the redirects like a browser does *shrugs*
posted 2006-Mar-18, 9pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
The HttpWebRequest class has two members: AllowAutoRedirect and MaximumAutomaticRedirections. Thought this might be the problem but the defaults are true and 50. I set these explicitly but same thing happens :(
posted 2006-Mar-19, 10am AEST
User #80562   1044 posts
Whirlpool Enthusiast
Ok, that sounds like it's having trouble following the redirects.
Just interpret them yourself... code around it.

Is the post actually accepted by the first, second or third URL?
Are the cookies sent back at each stage needed by the next stage?
posted 2006-Mar-19, 10am AEST
User #49896   2841 posts
Whirlpool Forums Addict
I just tried setting the max redirects value to something crazy like 1 million. When I noticed the page kept loading and I had a constant 10KBps up/down stream I knew something was wrong. Wonder if I triggered any uni DoS alarms, lol. Looks like it's entering some sort of redirect loop as you said before.
posted 2006-Mar-19, 10am AEST
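
Rather than raising the redirect limit, a redirect loop like this can be spotted by recording each URL as it is visited and stopping when one repeats. A rough Python sketch of the idea (not .NET code): `trace_redirects` and its `next_location` callback are hypothetical names, and the callback stands in for "request this URL and return the Location header, or None if the response is not a redirect". Treating a repeated URL as a loop is only a heuristic, since a cookie-based login may legitimately visit the same URL twice with different cookies.

```python
def trace_redirects(next_location, start_url, max_hops=50):
    """Follow redirects until a final page, a repeated URL, or max_hops.

    Returns (chain, outcome) where outcome is "done", "loop" or "limit".
    """
    chain = [start_url]
    seen = {start_url}
    url = start_url
    for _ in range(max_hops):
        target = next_location(url)
        if target is None:
            # Not a redirect: this is the final destination.
            return chain, "done"
        chain.append(target)
        if target in seen:
            # Same URL requested again: probably a redirect loop.
            return chain, "loop"
        seen.add(target)
        url = target
    return chain, "limit"
```

With a target page that bounces to the login page and back, the chain ends after three entries with the outcome "loop" instead of hammering the server until the limit is hit.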
User #49896   2841 posts
Whirlpool Forums Addict
Well, this is just weird. I used Ethereal to trace the HTTP headers being sent by HttpWebRequest. First two stages are correct and follow the browser headers. Cookies appear to be set correctly too. But in the 3rd stage, when it tries to go to the actual page, it gets a 302 - 'Please login first' and redirects to the login page, which then enters an infinite loop.
posted 2006-Mar-19, 12pm AEST
User #80562   1044 posts
Whirlpool Enthusiast
Is there an Authorization header returned in one of the previous pages that you're not sending to the third page?
posted 2006-Mar-21, 1pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
Not that I can see. And besides, it should be able to do this all in 1 call to HttpWebRequest, following all the redirects as needed and whatever else. I would expect it to work the same as a browser.

I tried setting AllowAutoRedirect to false and doing the process manually one step at a time but I have the same problem. Final redirect location says 'login first' and refers me back to the beginning.
posted 2006-Mar-21, 2pm AEST
User #80562   1044 posts
Whirlpool Enthusiast
kufu writes...
Ok, so you dumped the traffic...
Are Referer headers sent on each subsequent request? Are you sending those as well?

I would expect it to work the same as a browser.
What leads you to this expectation?
Do the docs say "this works exactly like Internet Explorer" ?

:-)
posted 2006-Mar-21, 2pm AEST
edited 2006-Mar-21, 2pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
erroneousBollock writes...
Do the docs say "this works exactly like Internet Explorer" ?

hehe, no, but they should :p

I think I tested by setting the referer headers explicitly as well but might give it another shot when I have some time.
posted 2006-Mar-21, 3pm AEST
User #49896   2841 posts
Whirlpool Forums Addict
After two hours of debugging the HTTP headers with Ethereal and comparing the browser request/response stream vs the .NET stream, I finally solved it!

I did some googling and found other people with the same symptoms, i.e. you cannot use HttpWebRequest with automatic redirects enabled to do a login process involving 302s and cookies because the cookies don't get set until the end of the whole process.

The solution was to disable auto redirects and implement the whole login process manually on a step-by-step basis (get the 'Location' header of 302 redirect responses, as well as the 'Set-cookie' header, and pass these down to successive steps as needed).

I tried using a CookieContainer to handle the cookies automatically but that didn't work either. Reading the 'Cookie' and 'Set-Cookie' headers directly did the trick though.
posted 2006-Mar-22, 2pm AEST
edited 2006-Mar-22, 2pm AEST
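
The manual approach described above (auto redirects off, read 'Location' and 'Set-Cookie' from each 302, pass them to the next step) might look roughly like this. It is a Python sketch of the technique, not the poster's .NET code: `manual_login` is a hypothetical name, and `send(method, url, headers, body)` is an assumed transport that does not follow redirects itself and returns `(status, [(header_name, header_value), ...])`. Cookie path and domain scoping are ignored for brevity.

```python
from http.cookies import SimpleCookie
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def manual_login(send, url, form_body, max_hops=10):
    """POST the login form, then follow redirects by hand with cookies."""
    cookies = {}
    method, body = "POST", form_body
    for _ in range(max_hops):
        headers = {}
        if cookies:
            # Replay every cookie collected so far on this hop.
            headers["Cookie"] = "; ".join(
                f"{name}={value}" for name, value in cookies.items()
            )
        status, response_headers = send(method, url, headers, body)

        location = None
        for name, value in response_headers:
            if name.lower() == "set-cookie":
                parsed = SimpleCookie()
                parsed.load(value)
                for key, morsel in parsed.items():
                    cookies[key] = morsel.value
            elif name.lower() == "location":
                location = value

        if status not in REDIRECT_CODES or location is None:
            return status, url, cookies

        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        if status in (301, 302, 303):
            # Browsers conventionally switch to GET after these codes.
            method, body = "GET", None
    raise RuntimeError("too many redirects")
```

Keeping the transport as a parameter means the same loop can be exercised against a fake server before pointing it at the real login page.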
User #80562   1044 posts
Whirlpool Enthusiast
kufu writes...
After two hours of ...., I finally solved it!
Fantastic.

you cannot use HttpWebRequest with automatic redirects enabled to do a login process involving 302s and cookies because the cookies don't get set until the end of the whole process.
So HttpWebRequest sucks then. :-)
LWP::UserAgent and WWW::Mechanize do not suffer from such limitations.
posted 2006-Mar-22, 4pm AEST
edited 2006-Mar-22, 4pm AEST
User #32016   422 posts
Forum Regular
erroneousBollock writes...
So HttpWebRequest sucks then. :-)
LWP::UserAgent and WWW::Mechanize do not suffer from such limitations.


Well sure they're quite obviously superior because they behave differently in this one instance, damn I'm dropping .Net now and picking up PHP/Perl, tree hugging and crack smoking.

Thanks erroneousBollock, you've really made me see the light.
posted 2006-Mar-22, 6pm AEST
edited 2006-Mar-22, 6pm AEST
User #80562   1044 posts
Whirlpool Enthusiast
Wicked Sticks writes...
Thanks erroneousBollock, you've really made me see the light
Wait! Back-up... remove "PHP" and "tree hugging" and we can agree :-)

And for the record, I said HttpWebRequest sucks, not the .NET platform.
posted 2006-Mar-22, 6pm AEST
edited 2006-Mar-22, 6pm AEST
User #32016   422 posts
Forum Regular
That's right....leave Perl and crack smoking

You'd need to be on something to actually want to work in Perl >:)
posted 2006-Mar-22, 7pm AEST
edited 2006-Mar-22, 7pm AEST
User #80562   1044 posts
Whirlpool Enthusiast
perl5 is my preferred (app-level) language right now...
perl6 & haskell may get more of my time in the future.
I find MSIL to be rather limiting.

I'll let you draw your own conclusions.
posted 2006-Mar-22, 7pm AEST
edited 2006-Mar-22, 7pm AEST
