Search indexing and zombies

I've been working on a little Perl search engine module (current working title DBIx::DocSearch, although for now it still has a module called FooBar.pm). I decided to give it a thorough test by indexing all the RFCs (about 5900 I think) which are installed by the appropriate Debian packages. They're all gzipped, but that wasn't a problem because I could just read them through a pipe from zcat.

My first problem was dealing with character encodings in Perl 5.8. I spent ages trying to work out how to convert an ISO-8859-1 string into UTF-8 and have Perl's regex engine actually believe that it was in UTF-8. I gave up on that and just removed non-ASCII characters before tokenising the file. I'll have to fix that later, but for this test it doesn't matter too much if I miss out the few words which contain non-ASCII letters. At the same time I also used y/\0//d to delete null bytes, because at least one RFC has a load of those in, and they seem to break Postgres, or DBD::Pg or something.
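As a minimal sketch of the workaround described above, and of one route that might serve as the later fix, something like the following could be used. The helper names `ascii_only` and `latin1_to_chars` are hypothetical, and using the core Encode module to decode is an assumption rather than anything the indexer is known to do.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the workaround. Delete null bytes and any byte
# outside the ASCII range before the text is tokenised.
sub ascii_only {
    my ($bytes) = @_;
    $bytes =~ tr/\0//d;          # same effect as y/\0//d
    $bytes =~ tr/\x80-\xFF//d;   # drop every non-ASCII byte
    return $bytes;
}

# Hypothetical helper: a possible later fix. Decode ISO-8859-1 bytes into
# a Perl character string instead of throwing the letters away.
sub latin1_to_chars {
    my ($bytes) = @_;
    return decode('ISO-8859-1', $bytes);
}
```

The decoded string would still need null bytes removed, and it would have to be encoded again on the way out to the database, so this is only a starting point.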

So with that problem solved, or at least sidestepped, I tried again. I left my machine running overnight and the indexing process eventually finished, but I noticed that it hadn't quite worked. The time command told me it had been going for 1569m5.761s, which is over 26 hours! And it had terminated after indexing only 4050 of the documents, complaining “IO::Pipe: Cannot fork: Resource temporarily unavailable”.

Further investigation revealed that IO::Pipe doesn't automatically clean up the processes it starts. It doesn't have a DESTROY sub to do a waitpid(), so you have to call the close() method. I hadn't realised that, so I must have had 4050 defunct zcat processes hanging around. I think it died because I have ulimit set to allow only 4095 ‘user processes’. This is rather annoying, although I quite like the fact that I could use my machine for a day without noticing that it had 4050 processes lying around.
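A minimal sketch of reading through a pipe and then reaping the child might look like this. The helper name `read_from_command` is hypothetical, and slurping the whole output into one string is an assumption made to keep the example short.

```perl
use strict;
use warnings;
use IO::Pipe;

# Hypothetical helper: run a command, return everything it prints, and
# make sure the child process is reaped before returning.
sub read_from_command {
    my (@command) = @_;
    my $pipe = IO::Pipe->new;
    $pipe->reader(@command);               # forks and execs the command
    my $text = do { local $/; <$pipe> };   # slurp all of its output
    $pipe->close;                          # close() also waits for the child
    return defined $text ? $text : '';
}
```

For the RFCs it would be called as `read_from_command('zcat', $path)`, where `$path` is the gzipped file. The explicit `close` is the important part: leaving it out is what lets the defunct zcat processes pile up.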
