[vox-tech] (forw) Re: (forw) Re: Need Partitioning Advice

Bill Broadley bill at cse.ucdavis.edu
Fri Jun 19 06:15:01 PDT 2009


> I also make a point of noting that there's nothing particularly "wrong"
> with the One Big Partition school (though root plus swap is still
> generally recommended).  

You say that here, but the 3 partitions in the actual document say otherwise.

> 
>     *** IN BOLD TEXT!!! ***
> 
>     *** IN THE THIRD PARAGRAPH OF THE FAQ!!!! ***
> 
> I guess you can't please 'em all.

You sound much more reasonable in your above statements than you do in the
document.  In the mentioned "THIRD PARAGRAPH OF THE FAQ", still in bold, you
admit the increasing popularity of "minimal partitioning", which you go on to
explain actually means 5 partitions.  Still in that same paragraph you explain
that you advocate more than this minimum for "control and redundancy".  I fail
to see either, unless by control you mean frequently needing to spend a day
"rationalizing" your partition tables or to play games like making a ton of
symbolic links from one partition to the next.

> Even swapfiles are largely tenable these days as performance issues have
> been addressed.

Interesting, I have some old habits as well... swap always on partitions.  I
admit I've not quantified any difference.  I guess it's nice that a swap
partition is invisible to df and du and can't accidentally be deleted.
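
For the record, either setup is only a couple of commands these days; the
device and size below are just placeholders:

  # old habit: a dedicated swap partition
  mkswap /dev/sda2
  echo '/dev/sda2  none  swap  sw  0 0' >> /etc/fstab
  swapon -a

  # the swap file alternative: carve out space, then the same dance
  dd if=/dev/zero of=/swapfile bs=1M count=2048
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile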

> This and several other comments lead me to believe that the guide wasn't
> so much read as imputed....

I apologize for that.  I've seen similar mindsets and justifications for
design decisions like having tons of partitions that were not directly gleaned
from your document.  So while your document was useful and relevant when
written, I think it's a poor "Basic recommendation" for everyone, especially
the person who started this thread, and basically everyone else.  Sure, a
specific environment could lead to your exact decisions, but that environment
is a very small percentage of all unix boxes in 2009.

> There actually are some scaling issues above 2TB which I don't address,

RHEL shipped a particularly old kernel for a long time with an artificially
low limit.  Currently the ext3 limit on most hardware is 8TB; I regularly make
3-4T ext3 filesystems with no problems:
/dev/md2              3.6T  9.9G  3.4T   1% /export/1
/dev/md3              3.6T  197M  3.4T   1% /export/2
/dev/md4              3.5T  197M  3.3T   1% /export/3

Speaking of which, I'm a fan of splitting very large file servers (like 16*1T)
into a bunch of partitions.  Ease of maintenance, disaster recovery,
reliability, etc. all basically come down to the same thing: 5 disks holding
(or recovering) your data are a much easier unit to find space for, wait on an
fsck for, or drive somewhere than 16 disks are.  Especially since, for most of
our uses, three 3.4T filesystems aren't much less useful than a single 10T one.
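
Roughly how one of those ~3.6T chunks gets built; the device names and the
RAID level are just for illustration:

  # five 1T drives into one md device, then one ext3 filesystem on top
  mdadm --create /dev/md2 --level=5 --raid-devices=5 /dev/sd[b-f]1
  mkfs.ext3 /dev/md2
  mkdir -p /export/1
  mount /dev/md2 /export/1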


> though I've started encountering them at work.  These include
> both partitioning and filesystem support.
> 
> DOS partitioning no longer works above this size, instead, Intel EFI/GPT
> must be used.  fdisk can't support >2TB filesystems, so use GNU parted.
> And older versions of ext3fs will fail as well (I don't recall the exact
> transition point).

Indeed.
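
For anyone following along, the GPT route is painless enough; the disk name is
just an example and the exact parted syntax varies a bit by version:

  # GPT label, one big partition, ext3 on top
  parted -s /dev/sdb mklabel gpt
  parted -s /dev/sdb mkpart primary ext3 0% 100%
  mkfs.ext3 /dev/sdb1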

>> Only if you mount rw by default, and only on filesystems of significant
>> size and complexity.
> 
> Actually, both incorrect.
> 
> Journals allow bypassing of fsck on most boots.  Journal integrity is
> validated, and inconsistent inodes deleted.  However, unless disabled,
> once the fsck check interval or reboot/remount count is exceeded, a full
> fsck is still executed, and this is strictly governed by disk size.

True, but the journal reduces the average fsck time: in the worst case (the
check interval is up) it's the same, but it's often much faster.  So it's
still a benefit.

> Worse, fsck time scales poorly as disk sizes increase, an issue kernel
> developers have been quite cognizant of.  Read Ted Tso or Val Henson /
> Val Aurora on this.  Ext4 is supposed to address this, as will SSD:

Sure, and as mentioned a 32GB partition takes almost 2 minutes, which IMO is
not a big deal, especially since that full fsck happens by default only, I
believe, around twice a year.
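
And if that twice-a-year check bothers you, it's visible and adjustable per
filesystem; the device and values here are just examples:

  # see the current mount-count and check-interval settings
  tune2fs -l /dev/md2 | grep -Ei 'mount count|check'
  # stretch them out: full fsck every 60 mounts or 180 days
  tune2fs -c 60 -i 180d /dev/md2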

> 
> http://thunk.org/tytso/blog/2009/02/26/fast-ext4-fsck-times-revisited/

I read that one.

> http://lwn.net/Articles/337680/

Subscriber only.

> http://valhenson.livejournal.com/41912.html

Points at the subscriber-only article and provides some data, without
explaining exactly why.

> 
> While journals speed *boot* times (and greatly assist in ensuring
> filesystem integrity), they do nothing for fsck times.

Sounds like a semantic difference.  An fsck can be clean and take close to
zero time, the common case on boot after an orderly shutdown.  It can be a
journal replay (fast).  It can also be a full fsck, and slow: 2 minutes for
32 GB, as mentioned in your linked article.  All 3 are fscks.

Not sure we actually are disagreeing.  Journals help with maintaining
integrity in the case of a crash, and in the speed of recovery (assuming the
180 day timer doesn't go off).  In the worst (and uncommon) case it's no
faster than ext2, but still has better integrity.

>> Consider /usr on a server where that is kept mounted read-only except
>> during installation/removal of packages.  Why have the overhead of a 
>> journal?
>>
>> Consider also a /tmp filesystem where you want high performance, and for
>> some reason don't want to use tmpfs.  (Maybe you prefer /tmp to be
>> persistent between reboots.)  Again, why do you want the overhead of a
>> journal on _/tmp_?
>>
>> The example of /usr will not lead to long fsck times because it's synced
>> at all times (except rare occasions when you remount it rw for package
>> operations).
>>
>> The example of /tmp doesn't lead to long fsck times because, well, it's
>> /tmp -- isn't huge, doesn't have large amounts of stuff in it.
> 
> ... all of which covers the basic rationale for partitioning:  it allows
> you to use appropriate features for different filesystems.

Right, at the cost of a fair bit of work and complexity, and an increased
future chance of having to either repartition or start playing games with lots
of symbolic links, which causes other administrative burdens... not to mention
it reduces the value of each partition for anything but its assigned role.  If
you tune each partition, set a different backup schedule, play games with perf
vs. integrity, surviving reinstalls, mount flags, and the like, then they are
poor candidates for extending other partitions.
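
Just to make that cost concrete, the per-filesystem tuning being advocated
ends up as something like this, every line being a decision you now have to
maintain (devices and choices purely illustrative):

  /dev/sda5  /usr   ext3  ro,nodev                0 2
  /dev/sda6  /var   ext3  rw,nosuid,nodev         0 2
  /dev/sda7  /tmp   ext2  rw,nosuid,nodev,noexec  0 2
  /dev/sda8  /home  ext3  rw,nosuid,nodev         0 2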

>>> * Rare/expensive unix systems that ran tons of services and had
>>>   shells for users.  Which required protecting services from users
>>>   and vice versa.
>> Actually, protecting the system from misbehaving processes, and the
>> system from the sysadmin, and the system from poor recoverability, are
>> rather more the point.  So, for example, the more the root filesystem
>> is isolated by having non-essential things be in separate filessytems,
>> the more likely you will be able to mount / at boot time despite
>> problems that may have arisen in, say, /usr or /var.
>>
>> There's a really good reason why system recovery/restore/repair tools
>> are all in /bin and /sbin:  That's so they'll not be unavailable if /usr
>> is temporarily hosed and cannot be mounted.  Why else do you think those 
>> and /usr/bin / /usr/sbin aren't simply merged for simplicity's sake?
> 
> I agree strongly with Rick's view here.

Have you noticed that /sbin is no longer self-contained?  That it depends
heavily on shared libraries, and that it's fairly common for some of those to
live under /usr?  The days of /sbin being a recovery area of statically linked
binaries that you can repair a system with are long gone.  Installation
CDs/DVDs and the like have taken over such duties.
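
Easy enough to check on any current box; pick any binary in /sbin (output
varies by distro):

  # how much of /sbin is dynamically linked, and does any of it reach /usr?
  ldd /sbin/fsck.ext3
  find /sbin -type f | xargs -r ldd 2>/dev/null | grep '/usr/'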

>>> * Crude partition based backups
>> You'd rather provide an explicit and laundry list of directories (that
>> must then be maintained), when just adding "-x" (don't cross filesystem
>> boundaries) to your rsync command solves that problem entirely?  Really?
> 
> Bingo.

I already replied to this one, but I'll just summarize: I think backing up
the important bits shouldn't be tied to partition boundaries, regardless of
how many partitions you have.  If I think the valuable state of a machine is
in /etc, /home, and /opt (or /usr/local if you prefer), then I should be able
to write a simple config file for a backup system to back up those
directories.
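
For example, something along these lines (host and paths hypothetical) grabs
the state I care about no matter how the disk is carved up, versus the
partition-bounded approach from the quote:

  # back up the directories that actually hold state
  rsync -a --delete /etc /home /opt backuphost:/backups/$(hostname)/

  # the "-x" alternative: whatever happens to be on the root filesystem
  rsync -ax --delete / backuphost:/backups/$(hostname)/root/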

>> Ironically for your comment (above), Karsten _does_ mention LVM in a
>> laudatory fashion, as an option -- though he doesn't employ it in his
>> examples.
> 
> Yep.

Personally I can't imagine keeping the 9 or so partitions listed in your
config (I don't count /mnt/dos, since a dos partition is well justified if you
need it) in the 19-83% full range without easy-to-resize partitions.  You seem
to have put quite a bit of thought into it, still had to spend an additional
day revising your decisions, and even after that still had to make symbolic
links to handle overflow.

> I'll refine that point a bit, if I may speak for the original author....
> 
>   - The usual security threat is the operator.  Preventing
>     fumble-fingered mistakes is a great way to avoid harm and downtime.

I hadn't thought of rm -rf /usr being mistakenly typed by someone running as
root, but I admit that a ro /usr is handy in that case.  Whether it's handy
enough to justify the pain of remounting rw for every future patch, versus the
cost of restoring /usr (and the downtime) in that (IMO) rather unlikely event,
is more debatable.

>   - Even a stultifyingly low bar can be tremendously effective at
>     avoiding common exploit paths.  The exploits are common because the
>     low-hanging-fruit preventive measures are so infrequently taken.
>     Any time you can eliminate an entire class of exploit avenues, as
>     the OpenBSD folks have proven time and again, you're miles ahead.
>     Stripping SUID and DEV, and EXEC where possible, is damned simple.

Simple, yes.  Likely to help... not so sure.  In a system with just / and
/home, sure, I'd mount /home nosuid,nodev.  But is it really that handy to
make different decisions across 8-10 partitions?  What threat model lets an
attacker plant a rogue device node or suid file without already having root,
with which he could cause unlimited mayhem anyway?
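
In the two-partition case it's one fstab line, or a remount on a running
system (device just an example):

  /dev/sda3  /home  ext3  defaults,nosuid,nodev  0 2

  mount -o remount,nosuid,nodev /home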

> If you can prevent an attacker from creating a lever (device file,
> executable, SUID file) somewhere, they're going to have to try that much
> harder.  And you'll avoid many of the drive-by / automated hacks.

In theory, but if an attacker is already in your system it's too late.  Why
wouldn't they put the device node in, er, /dev, and the suid file in /usr?

> Naturally, if someone's specifically interested, at all cost, in you,
> the picture changes radically.  As always the relevant questions are:
> 
>   - What is your threat model?
>   - Who is your attacker?
>   - What is her attack budget?
>   - What is at risk in an attack?  (Access, control, integrity, DoS,
>     ...)
>   - What reasonable countermeasures / recovery paths are available?

Yup.  In this case, how many man-hours could you save by having just a swap
partition, /, and /home, hours you could then spend improving your security to
the point where attackers aren't writing files to your filesystem at all,
regardless of your mount flags?

> I'll note again, from the FAQ text:
> 
>     I *very* strongly recommend kernel hacker Martin Pool's essay Is swap
>     space obsolete?
>     http://sourcefrog.net/weblog/software/linux-kernel/swap.html

Sure, I'm not arguing against swap, just that if at all possible you shouldn't
let it become a performance bottleneck.  Multiple spindles sound great for
swap, but it's 1000 times better if you don't swap at all.  Adding spindles
only helps in the case where you're not already doing a ton of I/O.  So sure,
use all your idle spindles for swap... but the question becomes whether you
can tell which spindles are going to be idle before you start swapping.
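
If you do have spindles you're confident will stay idle, spreading swap over
them is just a matter of equal priorities so the kernel interleaves between
them (devices illustrative):

  /dev/sda2  none  swap  sw,pri=1  0 0
  /dev/sdb2  none  swap  sw,pri=1  0 0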

> Wait?  There's an OS other than Debian GNU/Linux?  And nobody told me?
> 
> I don't dual boot.
> 
> While multi-booting is indeed a justification for partitioning, it's not
> one I address or much care for at all (VMs are the way to fly anyway, as
> Rick notes).

Fair enough, I don't dual boot either.  I do use virtualization; it's handy
for making simpler, more secure images, and it reduces the need for a bunch of
partitions, since if a VM only does a single simple task there's less
justification for 8-10 partitions.

>>> * the lack of device, pty, /proc, tmpfs and other related virtual or temporary
>>>   filesystems that help offload the duties and security privs required
>>>   of a filesystem.
>> Um, /proc is mentioned.
>>
>> But all of those are irrelevant to _partitioning_, anyway.  Karsten's
>> page is about partitioning strategy.
> 
> Bingo.

I don't follow.  My desktop has 16 virtual mounts, each with customized mount
flags, each performance- or functionality-optimized for doing things that used
to live on a regular filesystem.  Take udev: it's designed for devices, and
every other mount can now be nodev, so I don't need a separate partition for
that protection.  Tmpfs can have the optimal flags, so I don't need a /tmp
partition.  Various tmpfs mounts sprinkled around ensure the flags are
optimal, so that for instance a user can create a lock file there but can't
write an executable file.  Basically these specialty filesystems allow better
protection from attackers without requiring you to chop your filesystem up
into as many partitions.
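
The sort of fstab lines I mean; sizes and mount points are just illustrative:

  tmpfs  /tmp       tmpfs  nosuid,nodev,noexec,size=2G   0 0
  tmpfs  /var/lock  tmpfs  nosuid,nodev,noexec,size=16M  0 0
  tmpfs  /dev/shm   tmpfs  nosuid,nodev,size=1G          0 0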

>  
>> Anyhow, I'd feel a prize chump if I had my server set up as
>> single-filesystem plus swap on quite a few grounds, including
>> performance:  Being able to put the swap in the middle of the spindle,
>> and the most-visited portions of the file tree on either side, is a huge
>> win for keeping average seek time low.  I'd be bloody incompetent if I
>> _didn't_ do that.

Heh, I didn't understand this one.  Swap is random lookups, so performance
doesn't change much depending on where on the disk it sits.  Using it to push
the most-visited portions of the disk farther apart just means that all of
your seeks are going to be longer.  Now if you are looking at bandwidth then
you want the beginning of the disk, but then you can't put anything on both
sides since you are already at an edge.  I agree in principle that using
partitions to put high-I/O areas of the disk together could be a win, I just
don't see how it could apply here.  It seems rather fraught with peril.  Even
if you do succeed in predicting what is most I/O intensive... say mysql...
and place that partition appropriately, you then have to successfully guess
the 2nd most popular (of 8-10) partitions and move it close by.  Because the
partitions are so much smaller it seems inevitable that there will be a 3rd
I/O pattern, or the mysql database will get too big for /var and have to be
moved to /home... or swapping to 2 drives will completely destroy the I/O.
Especially since most (not all, of course) I/O operations are very likely to
be highly local to a directory, and filesystems already optimize for short
seeks within a directory.

So basically, optimizing I/O at the partition level is so expensive to change
(major work to play the partition shuffle game), so crude (1/10th of a
system), and so static that it's going to be hugely less effective than the
current filesystem options (a big /) that will gracefully handle your I/O
loads... especially since wherever you have the most I/O is likely to be the
partition you end up filling because of that activity.

