
Re: [Full-disclosure] Google's robots.txt handling



On Mon, Dec 10, 2012 at 3:21 PM, James Lay <jlay@xxxxxxxxxxxxxxxxxxx> wrote:

> On 2012-12-10 12:25, Hurgel Bumpf wrote:
> > Hi list,
> >
> >
> > I tried to contact Google, but as they didn't answer my email, I am
> > forwarding this to FD.
> > This "security" feature is not clearly a Google vulnerability, but it
> > exposes website information that is not really intended to be
> > public.
> >
> > (Additionally, I should say that I advocate robots.txt files without
> > sensitive content, combined with working security mechanisms.)
> >
> > Here is an example:
> >
> > An admin has a public web service running with folders that contain
> > sensitive information. He enters these folders in his robots.txt to
> > "protect" them from being indexed by spiders. As he doesn't want the
> > /admin/ GUI to appear in the search results, he also puts /admin in
> > the robots.txt, and finally he makes a backup to the folder /backup.
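> >
> > In robots.txt terms, that scenario boils down to something like this
> > (illustrative only; the paths come from the example above):
> >
> >     User-agent: *
> >     Disallow: /admin/
> >     Disallow: /backup/
> >
> > Anyone who fetches the file now has a tidy list of exactly the paths
> > the admin wanted to keep quiet.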
> >
> > Admittedly, these folders aren't browsable, but they might contain
> > files with easy-to-guess names, unencrypted authentication (simple
> > AUTH), you name it...
> >
> > Without a robots.txt nobody would know about the existence of these
> > folders; but as some of them might be linked somewhere, they could
> > appear in search results if they are not listed in the robots.txt.
> > The admin finds himself in a catch-22 situation, and he seems to
> > prefer the robots.txt file.
> >
> > Long story short.
> >
> > Although Google accepts and respects the directives of the robots.txt
> > file, Google INDEXES these files.
> >
> > This is my concern.
> >
> >
> >
> http://www.google.com/search?q=inurl:robots.txt+filetype%3Atxt+Disallow%3A+%2Fadmin
> >
> >
> http://www.google.com/search?q=inurl:robots.txt+filetype%3Atxt+Disallow%3A+%2Fbackup
> >
> >
> http://www.google.com/search?q=inurl:robots.txt+filetype%3Atxt+Disallow%3A+%2Fpassword
> >
> > While these searches are of limited use for targeted attacks, they
> > are all the more useful for finding victims.
> >
> >
> >
> http://www.google.com/search?q=inurl:robots.txt+filetype%3Atxt+%2FDisallow%3A+wp-admin
> >
> >
> http://www.google.com/search?q=inurl:robots.txt+filetype%3Atxt+%2FDisallow%3A+typo3
> > <Just be creative>
> >
> > This shouldn't be a discussion about bad practice, but about the
> > Google feature itself.
> >
> > Indexing a file which is used to prevent indexing... isn't that just
> > paradoxical and hypocritical?
> >
> > Thanks,
> >
> >
> > Conan the bavarian
>
>
> I'm wondering if, perhaps in .htaccess, one could allow ONLY site
> crawlers access to the robots.txt file, and then add robots.txt to
> robots.txt itself... would this mitigate some of the risk?
>
> James
>

You'd probably end up accidentally blocking bots, though, if you did it by
IP range, and you wouldn't be any safer if you did it by User-Agent string.
I think it's safer to just assume robots.txt is going to be looked at by
someone other than actual robots.
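
For what it's worth, a rough sketch of the User-Agent variant in Apache 2.2
.htaccess terms (mod_setenvif plus mod_authz_host; the crawler names are
just examples) would be something like:

    <Files "robots.txt">
        # Flag requests whose User-Agent claims to be a known crawler
        SetEnvIfNoCase User-Agent "Googlebot|bingbot" allow_bot
        # Everyone else is denied access to robots.txt
        Order Deny,Allow
        Deny from all
        Allow from env=allow_bot
    </Files>

But since any client can send whatever User-Agent header it likes, this
only hides the file from people who weren't trying to read it anyway.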





-- 
====
Q. How many Prolog programmers does it take to change a lightbulb?
A. No.
_______________________________________________
Full-Disclosure - We believe in it.
Charter: http://lists.grok.org.uk/full-disclosure-charter.html
Hosted and sponsored by Secunia - http://secunia.com/