[vox-tech] Performance tuning for http file serving

Bill Broadley bill at broadley.org
Tue Apr 13 01:55:39 PDT 2010


On 04/13/2010 01:09 AM, Alex Mandel wrote:
> Bill Broadley wrote:
>> On 03/31/2010 05:12 PM, Alex Mandel wrote:
>>> I'm looking for some references and tips on how to tune a server
>>> specifically for serving large files over the internet. ie 4 GB iso
>>> files. I'm talking software config tweaks here.
>>
>> How many 4GB ISO files are there?  How many simultaneous files?
>> Clients?  How fast is the uplink to er, umm, wherever the clients are
>> (on the internet)?
>>
> Most clients will be downloading 1-3 files, and those are likely to be
> the same files for everyone.

Well, if the set of files that you need to serve mostly fits in RAM, I 
suspect things will work well.  It's when the set of files you are 
serving is substantially larger than RAM that it becomes a big problem.

BitTorrent, by the way, is a great way to serve out large files, though 
of course it won't work in many environments.

>> What gets ugly is if you have 2 or more clients accessing 2 or more
>> files.  Suddenly it becomes very very important to intelligently handle
>> your I/O.  Say you have 4 clients, reading 4 ISO files, and a relatively
>> stupid/straightforward I/O system.  Say you read 4KB from each file (in
>> user space), then do a send.
>>
>> Turns out a single disk can only do 75 seeks/sec or so, which means you
>> only get 75*4KB = 300KB/sec.  Obviously you want to read ahead on the
>> files.
>>
> Is there a way to configure this?

I haven't seen much in the way of I/O tuning for Apache.  There is 
mod_mmap_static, but it looks largely experimental.

I also found:
http://httpd.apache.org/docs/2.2/caching.html

But it doesn't look like they deal with the case of using, say, 1GB of 
cache for the current hot spots on three 4GB files when you only have 
8GB of RAM.  Hard to say.  Ideally you could give Apache 1GB for caching 
and have any read on a large file pull in at least 64MB at a time. 
That way your I/O system would only be seeing 64MB transfers, and as you 
get more clients for the same set of files you'd be likely to get more 
cache hits.
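
Just to make the idea concrete, here's a rough sketch of streaming a big 
file in 64MB windows with a kernel readahead hint.  This is not an Apache 
knob, just Python on Linux (os.posix_fadvise, Python 3.3+), and the 64MB 
figure is only the number from the paragraph above:

# Rough sketch: stream a large file in 64MB windows, hinting the kernel to
# read each window ahead of time.  posix_fadvise() is a Linux hint, not an
# Apache setting; the 64MB chunk size is only illustrative.
import os

CHUNK = 64 * 1024 * 1024  # 64MB

def stream_with_readahead(path, out, chunk=CHUNK):
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while offset < size:
            # Tell the kernel we want this window soon, so the disk sees one
            # large sequential request instead of many small seeks.
            os.posix_fadvise(fd, offset, chunk, os.POSIX_FADV_WILLNEED)
            data = os.pread(fd, chunk, offset)
            if not data:
                break
            out.write(data)
            offset += len(data)
    finally:
        os.close(fd)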

Oh, http://httpd.apache.org/docs/2.2/mod/core.html#enablesendfile
discusses the zero copy tweak I mentioned.
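
To make the zero-copy point concrete, here's a toy contrast of the naive 
read()/write() loop with os.sendfile(), Python's wrapper around the same 
sendfile() syscall Apache uses when EnableSendfile is on.  "conn" is 
assumed to be an already-connected socket; this is just an illustration, 
not Apache's code:

# Toy contrast of the copy-through-userspace loop with the zero-copy path.
import os

def serve_naive(conn, path, bufsize=4096):
    # read() copies into userspace, write() copies back into the kernel:
    # two copies plus two syscalls per buffer.
    with open(path, 'rb') as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            conn.sendall(buf)

def serve_sendfile(conn, path):
    # sendfile() hands the file to the socket inside the kernel; nothing
    # is copied through userspace.
    with open(path, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent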

I'd use the HTTP benchmark of your choice, or even just a few wget runs 
from multiple clients, to measure the benefit of any tweaking you do.
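
If you want something a bit more repeatable than hand-run wgets, a few 
lines of Python will time concurrent fetches; the URL and client count 
below are just placeholders:

# Rough concurrent-download timer, a stand-in for running wget from several
# clients at once.  The URL and client count are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://example.org/big.iso"
CLIENTS = 4

def fetch(url):
    start = time.time()
    total = 0
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(1 << 20)  # 1MB reads
            if not chunk:
                break
            total += len(chunk)
    return total, time.time() - start

with ThreadPoolExecutor(max_workers=CLIENTS) as pool:
    for nbytes, secs in pool.map(fetch, [URL] * CLIENTS):
        print("%.0f MB in %.1fs = %.1f MB/s" % (nbytes / 1e6, secs, nbytes / 1e6 / secs))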

>> You might assume that a "RAID with striping" will be much faster than a
>> single disk on random workloads.  I suggest testing this with your
>> favorite benchmark, something like postmark or fio.  Set it up to
>> simulate the number of simultaneous random accesses over the size of the files
>> you expect to be involved.
>
> It's RAID 6 and there's no changing it.

Well, even so, I'd still benchmark it.  It's nice to have an idea of the 
upper limit on how fast Apache is going to be able to read a large file.
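
Even a crude random-read test in the spirit of fio will give you that 
number.  A rough sketch, assuming a large file already sits on the RAID-6 
volume (the path is a placeholder); note the page cache will flatter the 
result unless the file is much larger than RAM:

# Crude random-read test: a few thousand random 4KB reads spread over a
# large file, reported as reads/sec and MB/s.
import os
import random
import time

PATH = "/srv/isos/test.iso"   # placeholder path on the RAID-6 volume
READS = 2000
BLOCK = 4096

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
start = time.time()
for _ in range(READS):
    os.pread(fd, BLOCK, random.randrange(0, size - BLOCK))
elapsed = time.time() - start
os.close(fd)
print("%.0f reads/sec, %.2f MB/s" % (READS / elapsed, READS * BLOCK / 1e6 / elapsed))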

>> Once you fix that problem you run into others as the bandwidth starts
>> increasing.  Normal configurations often use basically read(file...) to
>> write(socket...).  This involves extra copies and context switches.
>> Context switches relate to kernel performance much like random reads do
>> to I/O performance.  Fixes include using mmap or sendfile.
>>
> I'll have to read up more on those.

As mentioned, Apache can mmap files and is sendfile-capable; pretty much 
anything performance-oriented should be.  Apache requires that the files 
don't change for mmap (as mentioned in the docs).
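
For illustration, a toy mmap-based send loop (using Python's mmap module, 
not Apache's internals) looks something like this.  The file-must-not-change 
restriction exists because slices of the mapping go out exactly as they sit 
on disk at that moment:

# Toy mmap-based send loop: map the file read-only and hand 1MB slices of
# the mapping straight to the socket.  Only safe if the file never changes
# while mapped, which is the restriction the Apache docs mention.
import mmap

def serve_mmap(conn, path, window=1 << 20):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for offset in range(0, len(m), window):
                conn.sendall(m[offset:offset + window])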

>> Which brings me to my next question... do you have a requirement for
>> apache or is that just the path of least resistance?  Various servers
>> are highly optimized for serving large static files quickly.  Tux
>> springs to mind, although it is somewhat dated these days.  A decent
>> summary is at:
>>     http://en.wikipedia.org/wiki/TUX_web_server
>>
>> A more recent entry among the simple/fast web servers is nginx, fairly
>> popular for a niche server.  Various popular sites like wordpress.com,
>> hulu, and sourceforge use it.
>
> Apache is least resistance as it's in use on all the machines in this
> cluster of various web services and all the admins know how to configure

Well, another approach would be to watch the bandwidth served.  There are 
a number of ways to do this: watching logs, running a log summary 
program, enabling/watching mod_status, running wget yourself, etc.  If 
it's not actually a problem you could just ignore it until it is.
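
For the log-watching route, a few lines of Python will total up the bytes 
column in the access log.  This assumes the default common/combined log 
format and a made-up log path:

# Total up the bytes column from an Apache access log, per day.
import re
from collections import defaultdict

LOG = "/var/log/apache2/access.log"   # placeholder path
# ... [13/Apr/2010:01:55:39 -0700] "GET /big.iso HTTP/1.1" 200 4294967296
line_re = re.compile(r'\[(\d+/\w+/\d+):.*?\] ".*?" \d{3} (\d+)')

bytes_per_day = defaultdict(int)
with open(LOG) as f:
    for line in f:
        m = line_re.search(line)
        if m:
            bytes_per_day[m.group(1)] += int(m.group(2))

for day, total in bytes_per_day.items():
    print("%s  %.1f GB" % (day, total / 1e9))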

> it. I'm open to exploring anything that's in Debian+Backports and is
> current/supported. Most of the oddball things people have mentioned so
> far are long abandoned projects from 2005 and earlier. nginx sounds
> promising and I'll look into that.

Indeed, an article from a little while ago claims 16M sites were using 
nginx.
