[vox-tech] Bash scripting newbie - need syntax help

Wed, 28 Apr 2004 12:48:24 -0700 (PDT)

Dave Margolis said:
> Bill Kendrick wrote:
>
>>Say you had files "deleteme", "metoo" and "imouttahere"
>>
>>  find . -type f -exec rm {} \;
>>
>>would cause this to happen:
>>
>>  rm deleteme
>>  rm metoo
>>  rm imouttahere
>>
>>whereas the xargs method:
>>
>>  find . -type f | xargs rm
>>
>>would cause this:
>>
>>  rm deleteme metoo imouttahere
>>
>>
>>A bit quicker; less process forking, yada-yada-yada.
>>
> great explanation, gracias.  is there ever a chance that the set of
> information piped off to xargs could become too big?

Not likely. xargs is designed to collect arguments only until it nears a
size limit, run the command on that batch, and then start collecting the
next batch. Because of this, xargs should not run out of space for the
arguments it passes, and it should be much faster, since far fewer
processes are created.
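
A small sketch of that batching, in case it helps to see it. Here -n 3
forces a tiny batch size so the effect is visible; normally xargs picks
the batch size from a byte limit instead. The function name batch_demo is
just something I made up for the example.

```shell
#!/bin/sh
# Feed ten arguments to xargs, capping each batch at three (-n 3),
# so every output line corresponds to one invocation of echo.
batch_demo() {
    printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | xargs -n 3 echo
}
batch_demo

# The system's upper bound, in bytes, on the arguments plus environment
# handed to a new process. xargs keeps each batch under a limit like this.
getconf ARG_MAX
```

batch_demo prints four lines ("1 2 3", "4 5 6", "7 8 9", "10"), one per
echo that xargs started. The ARG_MAX value differs from system to system.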

> for example:
>
> find . -type f -exec rm {} \;
>     rm 1
>     rm 2
>     ...
>     rm 6,000,000
>
> vs.
>
> find . -type f | xargs rm
> [where xargs has to push 6 million things at rm and things get whacky
> fast]

The cost of creating a process is very large compared to the cost of
parsing the list of items to delete. Either way, 6000000 arguments get
parsed in total: one rm parsing a list of 6000000 items does about the
same parsing work as 6000000 rm processes each parsing 1 argument, so
that part of the comparison comes out equal or in favor of the single rm.
Also, I expect that xargs would split the list into batches and start
passing them to rm long before it reached 6 million.
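
One way to check this without deleting anything important is to count
how many times the command actually gets started. This is only a sketch:
the function name count_invocations and the figure of 200 files are made
up, "sh -c 'echo run'" stands in for rm, and mktemp -d is assumed to be
available (it is on most systems, but it is not in POSIX).

```shell
#!/bin/sh
# Count how often the command is started under each approach.
# "sh -c 'echo run'" prints one line per start, whatever its arguments.
count_invocations() {
    dir=$(mktemp -d) || return 1

    # Create 200 empty scratch files.
    i=1
    while [ "$i" -le 200 ]; do
        : > "$dir/file$i"
        i=$((i + 1))
    done

    # -exec ... \; starts the command once for every file found.
    per_file=$(find "$dir" -type f -exec sh -c 'echo run' sh {} \; |
        wc -l | tr -d ' ')

    # xargs starts the command once for every batch of file names.
    batched=$(find "$dir" -type f | xargs sh -c 'echo run' sh |
        wc -l | tr -d ' ')

    rm -rf "$dir"
    echo "-exec starts: $per_file   xargs starts: $batched"
}
count_invocations
```

The -exec count should equal the number of files. The xargs count should
be far smaller, typically 1 for a list this short, and it only grows as
the file list outgrows what fits in a single batch.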

> it seems (but i have no idea) that find would handle this one rm process
> at a time, and though handling it slower, would struggle through it better
> than xargs might with one huge data set.

I would expect the reverse. The cost of process creation is very large in
comparison. Consider, on x86 Intel, the cost of just a function call: you
must save various registers to the stack, allocate space on the stack for
local variables, run the body of the function (which may allocate and
free memory with malloc/free), and then tear down the frame and restore
the registers after the return. Starting a new process costs much more,
since far more of the OS is involved. Even if the rm executable is cached
in memory, so that each process does not actually need to go to disk to
load it, each new rm still has to have the executable mapped into its own
address space. The kernel also has to create and manage the stack space
for each process, keep its per-process bookkeeping (what you see in
/proc) up to date, and allocate and then free memory for each one. All of
that is enormous in comparison to a single rm (or a few) parsing long
lists.
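
If you want to see the difference on your own machine, here is a rough
timing sketch. Assumptions: the variable N, the helper make_files, and
the default of 2000 files are made up for this example; mktemp -d is
available; and the temporary path contains no whitespace, since plain
"xargs" splits its input on whitespace. date +%s only has one-second
resolution, so raise N if both numbers come out as 0.

```shell
#!/bin/sh
# Rough timing of both deletion styles on throwaway files.
N=${N:-2000}

make_files() {
    # Create N empty files in directory $1.
    i=1
    while [ "$i" -le "$N" ]; do
        : > "$1/f$i"
        i=$((i + 1))
    done
}

dir=$(mktemp -d) || exit 1

make_files "$dir"
start=$(date +%s)
find "$dir" -type f -exec rm {} \;      # one rm process per file
exec_secs=$(( $(date +%s) - start ))

make_files "$dir"
start=$(date +%s)
find "$dir" -type f | xargs rm          # one rm process per batch
xargs_secs=$(( $(date +%s) - start ))

# Both runs should have emptied the directory.
left=$(find "$dir" -type f | wc -l | tr -d ' ')
rmdir "$dir"
echo "-exec: ${exec_secs}s  xargs: ${xargs_secs}s  files left: $left"
```

The actual numbers depend heavily on the machine and filesystem, so treat
them as a rough indication only. Run it as, say, "N=20000 sh timing.sh"
(the script name is hypothetical) to make the gap easier to see.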

Good questions. Keep them coming.