I've been working on a little Perl search engine module (current
working title DBIx::DocSearch, although for now it still has a
module called FooBar.pm). I decided to give it a thorough test by
indexing all the RFCs (about 5900, I think) which are installed by
the appropriate Debian packages. They're all gzipped, but that
wasn't a problem because I could just read them through a pipe from
zcat.

My first problem was dealing with character encodings in Perl 5.8.
I spent ages trying to work out how to convert an ISO-8859-1 string
into UTF-8 and have Perl's regex engine actually believe that
it was in UTF-8. I gave up on that and just removed non-ASCII
characters before tokenising the file. I'll have to fix that later,
but for this test it doesn't matter too much if I miss out the few
words which contain non-ASCII letters. At the same time I also used
y/\0//d to delete null bytes, because at least one RFC has a
load of those in, and they seem to break Postgres, or DBD::Pg or
something.
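
For illustration, here is a rough sketch of that stopgap, plus one
way the proper fix might eventually look. The helper names
strip_for_indexing and latin1_to_chars are made up for this example
and are not part of the module. The second helper assumes the input
really is ISO-8859-1 and uses decode() from the core Encode module,
which returns a character string that the regex engine treats as
Unicode text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the stopgap described above.
sub strip_for_indexing {
    my ($bytes) = @_;
    $bytes =~ tr/\0//d;            # delete null bytes (same as y/\0//d)
    $bytes =~ tr/\x00-\x7F//cd;    # delete every byte outside the ASCII range
    return $bytes;
}

# Hypothetical helper: a possible later fix, assuming ISO-8859-1 input.
# decode() turns the raw bytes into a Perl character string.
sub latin1_to_chars {
    my ($bytes) = @_;
    return decode('ISO-8859-1', $bytes);
}

my $raw = "na\xEFve caf\xE9\0 text";

# Stopgap: accented letters and the null byte simply disappear.
print strip_for_indexing($raw), "\n";

# Decoded: \w now matches the accented letters as well.
my @words = latin1_to_chars($raw) =~ /(\w+)/g;
print scalar(@words), " words found\n";
```

The stripping version loses the accented letters altogether, which is
why it is only good enough for a test run.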

So with that problem solved, or at least sidestepped, I tried
again. I left my machine running overnight and the indexing process
eventually finished, but I noticed that it hadn't quite worked. The
time command told me it had been going for
1569m5.761s (over 26 hours!). And it had terminated without doing
all the documents (only 4050 of them), complaining “IO::Pipe:
Cannot fork: Resource temporarily unavailable”.

Further investigation revealed that IO::Pipe doesn't automatically
clean up the processes it starts. It doesn't have a DESTROY
sub to do a waitpid(), so you have to call the close()
method. I hadn't realised that, so I must have had 4050 defunct
zcat processes hanging around. I think it died because I
have ulimit set to only allow 4095 ‘user processes’.

This is rather annoying, although I quite like the fact that I could
use my machine for a day without noticing that it had 4050 processes
lying around.
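
The IO::Pipe cleanup problem described above can be avoided by
closing each pipe explicitly. This is a minimal sketch, not the
module's actual code: read_gzipped is a hypothetical helper, and it
assumes zcat is on the PATH and that each file is small enough to
slurp into memory.

```perl
use strict;
use warnings;
use IO::Pipe;

# Hypothetical helper: read one gzipped file through a pipe from zcat.
sub read_gzipped {
    my ($path) = @_;

    my $pipe = IO::Pipe->new;
    $pipe->reader('zcat', $path);    # forks a child and execs zcat in it

    # Slurp the whole decompressed document in one read.
    my $text = do { local $/; <$pipe> };

    # close() is what waits for the child; without it the finished
    # zcat process stays around as a zombie until the script exits.
    $pipe->close;

    return $text;
}

# Usage: print the decompressed size of each file named on the command line.
for my $path (@ARGV) {
    my $text = read_gzipped($path);
    print length(defined $text ? $text : ''), " $path\n";
}
```

With the close() call in place there should only ever be one zcat
child at a time, however many documents get indexed.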