Finding the desired output consists of running the compressed data through zcat and piping it to grep with the desired search string, adding -i by default so that case is ignored.
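A minimal, self-contained sketch of that search step. The real script reads /var/db/wtf/dirlist.text.gz; here a tiny throwaway list stands in for it, and "SRC" is an arbitrary search string:

```shell
# Build a small compressed name list, then decompress and grep it,
# case-insensitively.  Paths are illustrative only.
printf '/usr/src\n/usr/share\n/home/nick\n' | gzip > /tmp/wtf-demo.gz
zcat /tmp/wtf-demo.gz | grep -i SRC     # -> /usr/src
```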
Originally, my plan was to have one data file, and then grep for files vs. directories using different regular expressions. However, testing quickly showed that searching for directories (my common use case) through a file containing all the file names was very slow, so I ended up creating TWO files -- a list of all file names and a list of just directory names. This gave me very acceptable performance when searching for directories, while still leaving the file search option available.
An effect I have noted a few times in my career is that performing two searches over the same basic data set can use the system disk cache efficiently for the second query. I suspected it might be possible to get both the complete list of all files AND the list of just directories with two parallel searches, and it turns out this is correct in my environment. So the directory-list find and the file-list find are run in parallel by backgrounding them.
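A trimmed sketch of that parallel build: two finds over the same tree, backgrounded with "&", so the second walk benefits from the warm disk cache. The /tmp/wtfdemo tree and output paths here are made up for illustration; the real script below walks / and writes to /var/db/wtf.

```shell
# Create a tiny tree to walk, then run both finds in parallel.
mkdir -p /tmp/wtfdemo/sub
touch /tmp/wtfdemo/sub/a.txt
find /tmp/wtfdemo -type f | gzip > /tmp/filelist.gz &
find /tmp/wtfdemo -type d | gzip > /tmp/dirlist.gz &
wait    # block until both background jobs finish
```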
find(1) tends to find files in an order that may not seem obviously logical, and the order of results sometimes annoys me. I thought about sorting the output of the find command before running it through gzip, and figured that would also improve the compression. Somewhat to my surprise, it did just the opposite -- the compressed files increased in size, with a huge penalty in time and a lot of tmp space used to sort the massive files. So I dropped that idea -- if I really want sorted output, I'll run it through sort after the search.
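Sorting after the fact is just a matter of piping whatever the search produces through sort(1); with the script installed this would look something like "wtf src | sort". The data here is made up for demonstration:

```shell
# Unsorted list in, sorted list out -- no need to sort before compressing.
printf '/z/b\n/a/c\n/a/b\n' | gzip > /tmp/wtf-unsorted.gz
zcat /tmp/wtf-unsorted.gz | sort     # -> /a/b, /a/c, /z/b
```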
A friend and former coworker of mine always likes to go for absolute maximum compression, using bzip2 instead of gzip and the -9 option to get a tiny amount of extra compression in trade for a large amount of additional time. I thought this application might be one where gzip -9 or even bzip2 -9 would make sense: the additional time required to compress the file might be trivial compared to the time to find all the data, and the extra compression might be appreciated. However, my testing showed the search times went up a lot and the file size went down very little. Not worth it, in my opinion.
========================================================
#!/bin/ksh
#
# Copyright (c) 2022 Nicholas Holland
#
# Permission to use, copy, modify, and distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

DATADIR=/var/db/wtf         # where WTF data files live
FILELIST=filelist.text.gz   # all file names
DIRLIST=dirlist.text.gz     # all directory names
DATAFILE=$DATADIR/$DIRLIST  # default to directories
BUILDUSER="root"            # verify user being used to build the lists
ZCAT="/usr/bin/zcat"        # command to 'cat' a compressed file to stdout
GREPCASE="-i"
GREPOPTS=""

if [[ -z $1 ]]; then
	cat <<-__ENDHELP
	usage: $0 [-c] [-dDfF] searchstring
	options:
	   -d "dirname"  : search for string in directory names
	   -D "dirname"  : search for exactly specified directory
	   -f "filename" : search for string in file names
	   -F "filename" : search for exact file name
	   -c            : case sensitive searches
	   -build        : rebuild the WTF database (run as $BUILDUSER)
	Note: parser is lame, -c must come first if used.
	__ENDHELP
	exit
fi

if [[ "$1" = "-build" ]]; then
	if [[ $(whoami) != "$BUILDUSER" ]]; then
		print "Must run as $BUILDUSER"
		exit
	fi
	mkdir -pm 755 $DATADIR
	(find / -type f | gzip -v > $DATADIR/$FILELIST.new && \
	    mv $DATADIR/$FILELIST.new $DATADIR/$FILELIST) &
	(find / -type d | gzip -v > $DATADIR/$DIRLIST.new && \
	    mv $DATADIR/$DIRLIST.new $DATADIR/$DIRLIST) &
	sleep 2
	print "Database build running in the background."
	chmod 644 $DATADIR/*
	exit
fi

while [[ -n $1 ]]; do
	case $1 in
	-c )	GREPCASE=""
		shift ;;
	-f )	GREPOPTS="$GREPCASE -e \"$2[^/]*$\""
		DATAFILE=$DATADIR/$FILELIST
		shift ; shift ;;
	-F )	GREPOPTS="$GREPCASE -e \"/$2\$\""
		DATAFILE=$DATADIR/$FILELIST
		shift ; shift ;;
	-d )	GREPOPTS="$GREPCASE -e \"$2[^/]*$\""
		DATAFILE=$DATADIR/$DIRLIST
		shift ; shift ;;
	-D )	GREPOPTS="$GREPCASE -e \"/$2\$\""
		DATAFILE=$DATADIR/$DIRLIST
		shift ; shift ;;
	-* )	print "Invalid option $1"
		exit ;;
	* )	GREPOPTS="$GREPCASE -e $1"
		shift ;;
	esac
done

COMMAND="$ZCAT $DATAFILE |grep $GREPOPTS"
print "=$COMMAND="   # Diagnostic. Probably comment out in production.
eval $COMMAND
========================================================
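The -d/-f and -D/-F options differ only in the regular expression built for grep. The patterns below are copied from the script; the directory list is made up to show the difference:

```shell
# "src" in the last path component (-d/-f style) vs. last component
# exactly "src" (-D/-F style).
printf '/usr/src\n/usr/mysrc\n/usr/src/sys\n' > /tmp/wtf-dirs.txt
grep -e "src[^/]*$" /tmp/wtf-dirs.txt   # matches /usr/src and /usr/mysrc
grep -e "/src$" /tmp/wtf-dirs.txt       # matches only /usr/src
```

Note that "src[^/]*$" does not match /usr/src/sys: the "src" there is followed by another "/", so it is not part of the final path component.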
Teaming this up with disknice might be wise.
After that, just run "wtf" and whatever you want to search for. As I've presented it, it searches for directories, not file names, but if you are more often more interested in file names, you can change the logic to default to $FILELIST rather than $DIRLIST.
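Changing that default is a one-line edit near the top of the script -- point DATAFILE at the file list instead of the directory list (variable values copied from the script above):

```shell
DATADIR=/var/db/wtf
FILELIST=filelist.text.gz
DATAFILE=$DATADIR/$FILELIST   # default to file names instead of $DIRLIST
```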
Copyright 2022, Nick Holland, Holland Consulting.