Cleaning Up ASP.NET Sessions in Google

ASP.NET and Dirty Urls

There are two things that have been bothering me about pages from an ASP.NET application getting indexed in Google. The first is that ASP.NET session URLs are somehow ending up in the Google index. This is bad because searchers who actually click these links will likely get a 500 error (internal server error), because they'll be requesting a page from an expired session.

Indexed Session Urls in Google Sitemap tools

How is Google finding all these 'bad' urls?

Well, apparently there is no browser definition in ASP.NET 2.0 for the Googlebot user-agent string, so when the spider hits your ASP.NET page, its browser capabilities are not defined.

Edit: The default browser capabilities are defined to use cookies; the issue occurs because the base Mozilla definition is defined to NOT use cookies. If the browser can't accept cookies, .NET works around this by inserting the session information into the URL and issuing a 302 (Moved Temporarily) in the response header.
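For illustration, this is roughly what the cookieless fallback looks like: a request to /Default.aspx gets 302-redirected to a URL with the session ID embedded in it, and it's that second form the spider ends up indexing (the session ID here is made up). The behaviour is driven by the sessionState element in web.config; my understanding is that "UseDeviceProfile" is the ASP.NET 2.0 default, which is why it kicks in without any explicit configuration:

```xml
<!-- web.config (sketch). With cookieless="UseDeviceProfile" -- the default --
     ASP.NET embeds the session ID in the URL whenever the matched browser
     definition reports that cookies are unsupported, e.g.
     /Default.aspx  ->  /(S(lit3py55t21z5v55vlm25s55))/Default.aspx  -->
<configuration>
  <system.web>
    <sessionState cookieless="UseDeviceProfile" timeout="20" />
  </system.web>
</configuration>
```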

This default behaviour is both a good and a bad thing. It's good in that if I'm browsing an ASP.NET site on a PDA that doesn't support cookies, I still can. However, just about every search engine spider ever created has its own user-agent string, making it a tough task to serve the standard non-crufted URL to all of them. One solution to session URLs being indexed in Google is to tell your ASP.NET application that Googlebot supports cookies, and the problem is solved.
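As a rough sketch of that fix, you can drop a .browser definition file into the App_Browsers folder that matches the Googlebot user-agent and declares cookie support (the file name and the exact match pattern here are my own assumptions, not a definitive implementation):

```xml
<!-- App_Browsers/Crawler.browser (hypothetical file name).
     Inherits from the Mozilla definition but overrides the cookies
     capability, so no session ID is pushed into the URL for Googlebot. -->
<browsers>
  <browser id="Googlebot" parentID="Mozilla">
    <identification>
      <userAgent match="Googlebot" />
    </identification>
    <capabilities>
      <capability name="cookies" value="true" />
      <capability name="crawler" value="true" />
    </capabilities>
  </browser>
</browsers>
```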

To read more about the solution please see my next post.

Dynamic Captcha build-up

Another dynamic aspect used on this site is Captcha images, and yes, Google's image spider finds those too. A Google image search on my domain shows it's littered with Captcha images! I've added an exclude to the robots.txt file for this as well.

Captcha image build-up in Google image search

Solution for Captcha images build-up and stop-gap solution for Session Urls

Here is my "robots.txt" file so far for my SingleUserBlog install. *Note the last two lines: "Disallow: /(A(*" should exclude any ASP.NET session URLs (this is only a stop-gap and not recommended unless you have also fixed the Mozilla detection hole), and the last line should exclude any Captcha images from being indexed.

User-agent: *
Disallow: /LoginPage.aspx
Disallow: /Administration/
Disallow: /(A(*
Disallow: /Captcha.ashx*$

.NET Code Snippets SingleUserBlog Errors and Bugs ASP.NET SEO.NET
Posted by: Brendan Kowitz
Last revised: 21 Sep 2013 12:15PM

Comments

12/12/2006 12:04:15 PM
Blimey. Great post (and the followup). Any idea why SingleUserBlog would need session state? I know asp.net has it on by default, but it's not actually used anywhere, is it?

Cheers
Matt
12/12/2006 2:21:52 PM
I use a session variable to track user stats with the "Online Presence" webpart. But it looks like the comment form uses session stuff too:

Session["Comment_RandomText"]
12/12/2006 10:37:01 PM
Ah. Looks like I'd better turn it back on then! I'll add the browser definition from your other post - dead useful, thanks.
10/19/2008 4:39:28 PM
Just a friendly helpful hint, no-takey-offense, but "it's" is a contraction for "it is".

When you want to speak of a possessive form such as "there's a dog. It's wagging its tail." you drop the apostrophe.

Hope this helps! Spread the word!
10/19/2008 8:14:14 PM
Awesome, I'll be sure to tag this post with 'English+lessons' :)
