[vox-tech] how to modify .htaccess to prevent wget or the likes from downing my site?

Alex Mandel tech_dev at wildintellect.com
Wed May 25 12:52:41 PDT 2011


On 05/25/2011 12:10 PM, Chanoch (Ken) Bloom wrote:
> On Wed, 2011-05-25 at 14:50 -0400, Hai Yi wrote:
>> Hello all:
>>
>> I first asked this question to the support of my web host, and they
>> redirected me to this link:
>> http://www.webhostingtalk.com/showthread.php?t=437549
>>
>> and the snippet on that page looks like:
>>
>>
>> SetEnvIfNoCase User-Agent "^Wget" bad_bot
>>
>> <Limit GET POST>
>>    Order Allow,Deny
>>    Allow from all
>>    Deny from env=bad_bot
>> </Limit>
> 
> This snippet will only block wget, if wget deigns to identify itself as
> wget by saying so in the user-agent string.
> 
>>
>> I copied and pasted it to the .htaccess under /public_html. Still, I
>> am able to use this command to fetch my site:
>>
>> wget --wait=20 --limit-rate=20K -r -p -U Mozilla www.my_site.com
> 
> Yup. Wget decided to identify itself as Mozilla in the user-agent
> string. That means you have no way at all of knowing that someone's
> trying to use Wget to download from your site.
> 
>> However, if I try the same wget with a slight change in the command
>> line (without "-U Mozilla"):
>>
>>  wget --wait=20 --limit-rate=20K -r -p www.my_site.com
>>
>> I get this:
>>
>> --2011-05-25 14:30:36--  http://www.my_site.com/
>> Resolving www.my_site.com... xxx.xx.xxx.xx
>> Connecting to www.my_site.com|xxx.xx.xxx.xx|:80... connected.
>> HTTP request sent, awaiting response... 403 Forbidden
>> 2011-05-25 14:30:37 ERROR 403: Forbidden.
> 
> Wget deigned to identify itself as wget this time.
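
To make the two results above concrete: the rule only compares the text of
the User-Agent header against the case-insensitive pattern ^Wget. Here is a
rough shell sketch of that comparison, using grep -i as a stand-in for
Apache's matcher. The ua_is_blocked name and the sample header values are
made up for illustration.

```shell
#!/bin/sh
# Stand-in for: SetEnvIfNoCase User-Agent "^Wget" bad_bot
# Succeeds (exit status 0) when the given User-Agent string starts with
# "wget", ignoring case. ua_is_blocked is a hypothetical helper name.
ua_is_blocked() {
    printf '%s\n' "$1" | grep -qi '^wget'
}

# Sample header values (illustrative, not taken from real logs).
for ua in "Wget/1.12 (linux-gnu)" "Mozilla"; do
    if ua_is_blocked "$ua"; then
        echo "$ua -> 403 Forbidden"
    else
        echo "$ua -> served"
    fi
done
```

Nothing in the check looks at how the client behaves, so any client that
sends a different header value passes it.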
> 
>> Now I have three questions:
> 
>> 1. Why didn't the code in .htaccess prevent the downloading? Did I
>> miss something?
> 
> (See my explanation above.)
> 
>> 2. Are there other tools that act like wget? How can we prevent them
>> all from downloading the site content?
> 
> There are other tools that act like wget. You can't prevent them *all*
> from downloading, though you could blacklist specific ones the way you
> did with Wget. Of course, they may also change their User-Agent string,
> and then you have no way of telling at all.
> 
>> 3. If someone is downloading, can we have some log file that can
>> expose the downloader's info?
> 
> Your web server logs will have their IP address, but I doubt you could
> do anything useful with that information. If your users log in to the
> site, you could try to keep track of that yourself somehow, but that
> could be very complex depending on what you're trying to prevent.
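
For a quick look at who is fetching the most, the access log can be
summarized per client address. This is a small sketch, assuming the log uses
Apache's Common or Combined Log Format (client address in the first field).
The top_clients name is hypothetical, and the location of the log file
depends on the host.

```shell
#!/bin/sh
# Print the busiest client addresses in an access log, most active first,
# one per line as a request count followed by the address.
# Assumes the client address is the first space-separated field, as in the
# Common and Combined Log Formats. top_clients is a hypothetical helper name.
#   $1: path to the access log
#   $2: how many addresses to show (defaults to 10)
top_clients() {
    log_file=$1
    limit=${2:-10}
    awk '{ print $1 }' "$log_file" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Calling top_clients with the path to your log and, say, 5 lists the five
addresses with the most requests. A recursive wget run tends to show up as
one address with a very large count.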
> 
> ...
> 
> 
> In other words, the protection you're asking for is basically impossible
> against a determined downloader.
> 

If you're using Apache, there is a per-IP connection limiter you can use
to restrict the usage of download accelerators. Just be careful not to
clamp down too far, because you can end up limiting legitimate users who
merely appear to share one IP address, for example behind a NAT gateway
or a proxy.
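
One module of this kind is the third-party mod_limitipconn. Whether it is
the tool meant above is an assumption, and it has to be installed by whoever
runs the server, so it may not be available on shared hosting. Here is a
minimal sketch for the main server configuration, with the limit of 3 and
the /downloads path chosen only for illustration:

```apache
# mod_limitipconn needs mod_status loaded with extended status enabled.
ExtendedStatus On

<IfModule mod_limitipconn.c>
    <Location /downloads>
        # At most 3 simultaneous connections from any one client IP.
        MaxConnPerIP 3
        # Do not apply the limit to image requests.
        NoIPLimit image/*
    </Location>
</IfModule>
```

The lower the number, the more likely it is to hurt several real visitors
who share one public address, since they all draw on the same allowance.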

You can also do bandwidth throttling; search for QoS (quality of service)
to see how to handle that.

Enjoy,
Alex


