[vox-tech] [help@google.com: Re: [#19464334] Searching for dotfiles]

Thu Jan 13 12:40:14 PST 2005

On Thu, Jan 13, 2005 at 03:13:23PM -0500, Peter Jay Salzman wrote:
> Anyone who searches for ".vimrc" means ".vimrc".  In this case, the dot is a
> literal, so ".forward" is as distinct from "forward" as "cat" is distinct
> from "dog".

These (Unix dotfiles) are a good special case that Google should consider
handling.  General punctuation would be hell, though.

Right now they probably only index ".*forward.*" as "forward", and stick
the page in that word's bucket.  This means they can't search for ".forward"
or "forward?" distinctly from "forward", "forward." or "forward:".

Not quickly, at least.  See my other post about the possibility of using
their own Google Cache feature for searching full text, once the general
term has been found on a set of pages...

I guess you could think of it like this:

  "I want to search for '.forward' on my computer"

  $ find / -type f -exec grep -l "\.forward" {} \;

That'd be slow.  But if we had a list of common terms that are contained
in pages (indexed on a regular basis, but NOT right when you go to search),
it'd be a lot faster.  This is what most search engines do.

A kind of backwards way of doing it could be like this:

  $ find / -type f -exec grep -l "forward" {} \; > files-with-forward.txt
  $ find / -type f -exec grep -l "backward" {} \; > files-with-backward.txt
  $ find / -type f -exec grep -l "upsidedown" {} \; > files-with-upsidedown.txt

(Really, what a search engine does is just "find every file", and then,
rather than 'grep' for a set of known words, it looks at "what words are in
this file/page?", keeps an index of those words, and adds a reference to the
page under each one.  If it finds a new word, it just creates a new 'bucket'
to store page references in...)
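
Here's a rough sketch of that "look at every word and drop the page in its
bucket" idea, in case it helps.  The temp directory, the three sample files
and the index.tsv name are all made up for illustration, and real search
engines certainly don't do it with tr and awk:

```shell
#!/bin/sh
# Hypothetical sandbox: three tiny "pages" in a temp directory.
docs=$(mktemp -d)
printf 'Put your address in .forward to redirect mail.\n' > "$docs/a.txt"
printf 'Look forward, not backward.\n' > "$docs/b.txt"
printf 'Nothing relevant here.\n' > "$docs/c.txt"

# One pass over every file: split on anything that is not a letter or
# digit, lowercase the words, and emit "word<TAB>file" once per distinct
# word in each file.  Each distinct word is one 'bucket'.
index="$docs/index.tsv"
for f in "$docs"/*.txt; do
    tr -cs '[:alnum:]' '\n' < "$f" | tr '[:upper:]' '[:lower:]' | sort -u |
        awk -v file="$f" 'NF { print $0 "\t" file }'
done | sort > "$index"

# Looking a word up is now a scan of one small file, not of every page.
hits=$(awk -F '\t' '$1 == "forward" { print $2 }' "$index")
printf '%s\n' "$hits"
```

Note that a.txt lands in the "forward" bucket even though it only contains
".forward", since the dot is thrown away while splitting.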

So okay, now that we have an index of files containing particular words,
we can search for them.  Instead of doing:

  $ find / -type f -exec grep -l "forward" {} \;

we can just do:

  $ cat files-with-forward.txt

MUCH speedier!

So my proposal, albeit also a slow one (on the Internet scale, at least), is
this.  Say we want to find all files with the term ".forward" in them.
First, we take the term and massage it into something we keep track of.
(In this case, a kind of stemming: strip the punctuation to get just the
word "forward".)

  $ cat files-with-forward.txt

But, like with Google, that gives us EVERYTHING, regardless of punctuation.

So we just 'search our Google cache', like so:

  $ grep -l "\.forward" `cat files-with-forward.txt`
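
Put together as one script, the two stages might look something like this.
It's only a sketch: the sample files, the separate index directory and the
variable names are invented here, and it assumes file names without
whitespace (xargs would split them otherwise):

```shell
#!/bin/sh
# Hypothetical sandbox standing in for "the web", plus a separate
# directory for the pre-built index.
docs=$(mktemp -d)
idx=$(mktemp -d)
printf 'Put your address in .forward to redirect mail.\n' > "$docs/a.txt"
printf 'Look forward, not backward.\n' > "$docs/b.txt"
printf 'Nothing relevant here.\n' > "$docs/c.txt"

# Pretend this bucket was built ahead of time by the indexer.
grep -l 'forward' "$docs"/*.txt > "$idx/files-with-forward.txt"

query='.forward'

# Stage 1: strip punctuation to get the indexed term, then pick its bucket.
term=$(printf '%s' "$query" | tr -d '[:punct:]')
bucket="$idx/files-with-$term.txt"

# Stage 2: grep only the candidate files for the literal query.
# -F treats the dot as a plain character, so no backslash is needed;
# -e guards against queries that start with "-".
matches=$(xargs grep -lF -e "$query" < "$bucket")
printf '%s\n' "$matches"
```

The bucket lists both a.txt and b.txt, but only a.txt survives the second
pass, since b.txt has "forward," and never ".forward".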

*WHEW!*  Make sense?  I hope I didn't make any glaring mistakes. ;)

<snip>
> Would that REALLY cause their database to melt down in panic?

It would if suddenly every variation of a 'word' became its own searchable
thing.  In my above example, we'd go from one 'bucket' labelled
"pages with the word 'forward' in them", to one for every variation...

A bucket for ".forward", a bucket for "forward.", a bucket for
"forward," a bucket for "forward;", a bucket for "forward?", a bucket for
"forward!", ... and so on. :^)  (Oh hey, maybe we want to search for
"forward...", too... distinct from "forward." ;^) )

But, again, I AM arguing that they SHOULD take into account dotfile naming
conventions.  (At LEAST in their  http://www.google.com/linux  sub-site!)
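
To put rough numbers on the bucket problem, here's a toy count over one
made-up sentence.  The sample text and the three tokenizing rules are my own
illustration, not anything Google is known to do:

```shell
#!/bin/sh
# Hypothetical sample text with several punctuated variants of one word.
text='Use .forward to forward mail. Forward? forward! Go forward, forward; forward...'

# Lowercase, then split on whitespace only, so punctuation stays attached.
words=$(printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n')

# Rule 1: every punctuated variant is its own bucket.
kept=$(printf '%s\n' "$words" | sort -u | grep -cF 'forward')

# Rule 2: strip all punctuation (split on anything not a letter or digit).
stripped=$(printf '%s\n' "$words" | tr -cs '[:alnum:]' '\n' | sort -u |
           grep -cFx 'forward')

# Rule 3: strip trailing punctuation only, so a leading dot survives.
dotaware=$(printf '%s\n' "$words" | sed 's/[[:punct:]]*$//' | sort -u |
           grep -cF 'forward')

echo "punctuation kept:      $kept buckets"
echo "punctuation stripped:  $stripped bucket"
echo "leading dot preserved: $dotaware buckets"
```

For this sentence that comes out as 7, 1 and 2 buckets: keeping everything
multiplies the buckets, while special-casing the leading dot only adds one.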

-bill!
bill at newbreedsoftware.com          April shower bring Kompressor power!
http://newbreedsoftware.com/