From: root
To: reiserfs@devlinux.com
Date: Thu, 20 May 1999 01:35:44 -0700 (PDT)
Subject: (reiserfs) What would you do to optimize Reiserfs for SQUID?

We (Namesys) have an opportunity to earn some money by optimizing for SQUID.
The customer sells boxes which run squid, and some performance metrics software for the boxes.

We see the following things as possibilities for optimizing for SQUID:

turning off preserve lists (losing files is not a big deal for SQUID).

eliminating the hashing of names that squid does, and storing files in a directory whose name is the URL's site, and with a filename equal to the pathname.

implementing effective SMP.

optimizing for creates and reads equally, and not optimizing for stat data (this means using plan A). Maybe even putting stat data near file bodies; we'll have to review old benchmarks and see.

I see us creating a #define OPTIMIZE_FOR_SQUID.

Finally, and this is what I am least inspired by, we need to create some proprietary features we can give them such that we won't care about whether everyone else gets them, but which will be useful to them in differentiating themselves from their (not existing at this time) competition. The customer really wants this. I tried to convince them it isn't really as useful as the advertising you get when everybody uses a product which displays a "sponsored by X" banner, but they weren't fully swayed. Perhaps simply the #define OPTIMIZE_FOR_SQUID, and its ifdefs stuff. I need ideas.
Hans

From: Lutz Vieweg
Organization: Innovative Software GmbH
To: reiserfs@devlinux.com
Date: Thu, 20 May 1999 12:55:47 +0200
Subject: (reiserfs) Re: What would you do to optimize Reiserfs for SQUID?

root wrote:
> What would you do to optimize Reiserfs for SQUID?

Dump Squid, and write a better software :-)

Now seriously: I wrote specialized high-performance HTTP servers and proxies for the Innovative Software GmbH, and these are currently used for the top-traffic public server in Germany. Before I wrote my software I experimented with existing servers (such as apache and squid) and found that their performance limits are not really the filesystems... much more important to gain better performance is a more clever memory caching strategy and optimisations around the networking code.

But of course, if you're paid to improve reiserfs to tweak squid a little - why not... probably helpful for the reiserfs project :-)

> We see the following things as possibilities for optimizing for SQUID:
...
> eliminating the hashing of names that squid does, and storing files in a
> directory whose name is the URL's site, and with a filename equal to the
> pathname.

Uh-oh, I hope you're aware that you'll see even the most strange-looking URLs, hundreds of bytes in length (even though they may violate the HTTP limit of 256 characters), and every stupid special character on earth... I think using hash-sums as names makes things a lot easier and shouldn't hurt the performance too much, as you'd otherwise need to escape disturbing characters anyway...

> implementing effective SMP.

When you do not dynamically generate complex HTML content (as we do), user-CPU time really shouldn't be the limiting factor; the avoidance of excessive context-switches, mutex-waits etc. is probably more important.

Regards,

Lutz Vieweg

--
Dipl. Phys.
Lutz Vieweg              | email: lkv@isg.de
Innovative Software GmbH | Phone/Fax: +49-69-505030 -20/-50
Feuerbachstrasse 26-32   | http://www.isg.de/persons/lkv/
60325 Frankfurt am Main  | ^^^ PGP key available here ^^^

From: Oskar Pearson
To: root
Cc: reiserfs@devlinux.com
Date: Fri, 21 May 1999 10:19:13 +0200
Subject: Re: (reiserfs) What would you do to optimize Reiserfs for SQUID?

Hi

The Squid-Dev people have been talking about creating their own FS for a long time now. Given the polygraph results (polygraph = proxy benchmark) some of the core Squid developers have become interested in this. Duane Wessels (wessels@ircache.net) is currently putting code into Squid to support user-level fs's, similar to the way INN has plugins for different store methods.

I don't have an archive of the squid-dev discussions of this any more. I trashed it when I moved jobs: I do have a printed out version. I'll try and do some searching for an online archive this weekend.

There is an alpha version of Squid that includes some user-level fs stuff at ftp.packetstorm.on.ca. (hmm. Perhaps these people are your clients ;)

> turning off preserve lists (losing files is not a big deal for SQUID).

Agreed. If you have a file on disk, though, it _must_ be the file that was originally placed there. Otherwise banking sites can end up being porn sites. (I think Squid does do some checking here: I haven't looked at that section of the code.)

> eliminating the hashing of names that squid does, and storing files in a
> directory whose name is the URL's site, and with a filename equal to the
> pathname.

I wouldn't say "eliminating" here.
If you have a 100 gig cache, and the average object size is 13kb, you will have about 7.5 million objects on disk. Objects are not spread evenly across the directory space either, so if you create lots of directories (called www.netscape.com, www.linux.org, for example) 90% of them will probably only have a few files in them. If each site has (say) 30 objects in its directory path, you have something like 200 000 "www.sitename.example" directories in the top level directory. I don't know if Reiserfs (or any other fs) will handle that well.

Next issue with that: Squid doesn't just use the URL name as the key. The URL isn't a unique key: because http can contain additional info in the HTTP headers, you should only return the same object when the headers are the same. Squid generates different filenames for these objects.

The cache can also be spread across multiple disk drives: Squid also decides which disk to save the object to based on the hash value. You can get around this with raid0, but the advantage of doing this with Squid is that you can remove a cache_dir line from the squid.conf and the disk is ignored. With raid0 you will lose all objects; with raid5 things get slow.

> implementing effective SMP.

Cpu is almost _never_ the limiting factor on caches. It can be if you have a slowish cpu, and you are doing transparency, but then all the cpu time is spent in the network code.

I used to use Squid 1.0, where the whole of the code was a single thread. Using "strace -T" (thanks to Michael O'Reilly, one of the major Squid guys) we confirmed that about 70% of the wall time was spent waiting for open(), and 15% of the time was waiting for close().

Squid 1.1 and above can use threads which just call open and close. This has dramatically increased performance (our cache latency dropped by an order of magnitude.)

If you can ensure that these functions return immediately, then you can get rid of that section of the Squid code.
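The open()/close() measurement described above can be roughly approximated from user space. What follows is only a minimal sketch: it is not Squid code and not a substitute for "strace -T". It times os.open, os.write and os.close on a scratch directory and reports what share of the measured time went to each call. The helper names (time_open_close, shares), the file count, and the use of a temporary directory are assumptions made for illustration.

```python
import os
import tempfile
import time


def time_open_close(directory, count=200):
    """Create `count` small files and time each open/write/close call.

    Hypothetical helper for illustration; returns the total seconds
    spent in each phase, keyed by phase name.
    """
    totals = {"open": 0.0, "write": 0.0, "close": 0.0}
    payload = b"x" * 13 * 1024  # 13 KB, the average object size quoted above
    for i in range(count):
        path = os.path.join(directory, "obj%06d" % i)

        start = time.perf_counter()
        # O_TRUNC so that an existing file would be overwritten by truncation
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        totals["open"] += time.perf_counter() - start

        start = time.perf_counter()
        os.write(fd, payload)
        totals["write"] += time.perf_counter() - start

        start = time.perf_counter()
        os.close(fd)
        totals["close"] += time.perf_counter() - start
    return totals


def shares(totals):
    """Convert per-phase seconds into fractions of the measured time."""
    whole = sum(totals.values())
    if whole == 0:
        return {name: 0.0 for name in totals}
    return {name: spent / whole for name, spent in totals.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for name, share in shares(time_open_close(scratch)).items():
            print("%-5s %5.1f%%" % (name, 100 * share))
```

A run like this on an idle machine with a few hundred fresh files will not reproduce the 70%/15% split reported above; those figures came from a loaded cache with millions of files. The sketch only shows how such a breakdown can be collected.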
As I see it (and that doesn't mean anything) the ext2 code has to check if a file exists before it can return from open(). Because there are so many files on disk (a few million) the last time the kernel tried to open that file was a long time ago (if you keep objects for three days, it's possible that you haven't touched that inode for three days...) Any dcache info for that file has probably been thrown out a long time ago: there are many memory pressures on the machine (network buffers, and although caching disk reads is nearly pointless, disk read caching.)
(Why disk caching is less useful on a cache is covered shortly)

Because the file info is not cached, an open involves lots of seeks: you probably (no tests to verify this) have to stat a couple of directories, find the appropriate subdirectories, find the filename, truncate the file, update time info and then you can return. Note the truncate there: Squid constantly overwrites old files with new ones, and to do this it truncates an existing file, and then starts writing with the same filename.

dancer@zeor.simegen.com says:
----------
Linux EXT2 is very fast at creating new files and data (does rarely block at all unless disk is saturated). Truncating (truncate/unlink) of existing files is comparably much slower than one would expect in my filesystem tests, and so is also writing to truncated files (actually slower than creating new files).

Results of some simple saturation measurements on Linux 2.2.6ac1:

Operation                Time taken
unlink empty files          3
create empty files          9
create files with data     40
overwrite truncated        50
unlink files with data     65
truncate to 0 size         70
overwrite files           145 (O_TRUNC used)
----------

I would _really_ like to try and use Stephen's iostat patches on a cache, but I don't have access to a loaded cache these days. This would give us an idea of what the real load is.

> optimizing for creates
> and reads equally,

That doesn't sound right to me.
Create is very important, but the reads stuff there might be a bit wrong. When a cache gets a 40% hit rate by object count, it doesn't mean that 40% of the files in the cache are accessed: it just means that a small proportion of the objects are accessed a _lot_. There are thus so-called "hot objects", which account for a large portion of a cache's hit rate (objects on netscape.com and altavista's home pages, for example.)

These hot objects will probably fit in a small portion of ram (each object is 13kb, so the top 1000 requested objects will sit in a few Mb.) Squid can already keep these objects in ram: and it understands its own disk-usage pattern. If people relied on the OS to cache this info on very large sites, the cached data would almost certainly be purged incorrectly (no, I don't have stats on this either.)

So, the objects that are important are small, and may be cached by Squid, rather than the OS. And it can probably do a pretty good job of it.

Thus, you have the standard "predominance of writes" problem. If you aggressively cache inode info (I don't know: make the dcache hang around instead of caching disk reads so much) so that creates are fast, ensure that truncating is fast and aggressively cache disk writes (so that the process essentially never blocks) then things should be fairly happy. Probably an order of magnitude speed up, and then the problems become Squid's once again.

The Netapp (www.netapp.com) filesystem is optimised for raid: you really can't run a large cache (think 20 disks...) without raid, otherwise you have excessive downtime. Their advantage is that they can bundle writes into big blocks, and write these blocks sequentially across all stripes, across all heads at the same time. If you write lots and lots of small objects, as a cache does, their logging filesystem is much more efficient with raid devices (I believe: raid5 is unhappy with lots of small writes, correct?)
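On the raid5 question: in the textbook model, updating a single block on raid5 is a read-modify-write (read old data, read old parity, write new data, write new parity), while a full-stripe write needs no reads because parity is computed in memory. The sketch below counts disk I/Os under that simplified model. The function names, the one-block-per-disk stripe layout, and the absence of any controller or OS write cache are assumptions made for illustration, and nothing here describes how the Netapp filesystem is actually implemented.

```python
def raid5_small_write_ios(blocks):
    """I/Os when each block is updated on its own (read-modify-write).

    Each update reads the old data block and the old parity block, then
    writes the new data block and the new parity block: 4 I/Os per block.
    """
    return 4 * blocks


def raid5_full_stripe_ios(blocks, disks):
    """I/Os when blocks are batched into full stripes.

    Assumes one block per disk per stripe, so a stripe on `disks` drives
    holds disks - 1 data blocks plus one parity block. Parity is computed
    in memory, so no reads are counted. A partly filled final stripe is
    assumed to be padded rather than read back.
    """
    data_per_stripe = disks - 1
    stripes = -(-blocks // data_per_stripe)  # ceiling division
    return blocks + stripes  # data writes plus one parity write per stripe


if __name__ == "__main__":
    disks = 20      # the "think 20 disks" example from the text
    blocks = 1900   # arbitrary illustrative number of blocks to write
    print("one block at a time:", raid5_small_write_ios(blocks))
    print("full stripes:       ", raid5_full_stripe_ios(blocks, disks))
```

Under these assumptions, 1900 blocks cost 7600 I/Os when written one at a time and 2000 when written as full stripes across 20 disks, which is the kind of gap that bundling writes into big sequential blocks is meant to close.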
I believe one of the problems with the BSD-type filesystems and caching is the amount of fragmentation you get. People say that by using only 1/2 their available disk space, they can drastically increase performance. I presume that the reiserfs allocation of space is nicer in this regard, right? I haven't done _any_ debugging here: I do have a squid-dev thread readily available here (only a little info in here, though), so I am going to send it to you, Hans.

> and not optimizing for stat data (this means using plan A).

I presume that creating new files will still be fast? And truncating them? This is only the stat data that ls's use, right?

> I see us creating a #define OPTIMIZE_FOR_SQUID.

Good ;)

You might be able to find people (ISPs, for example) that are willing to sponsor your efforts as gnu source. The commercial alternatives are very expensive: if you find a few ISPs that use Squid, and that are willing to contribute a modest amount to the development, you will probably still find the project viable. Speak to Dancer, address above, for example: they use Squid on Linux for large projects.
(hmm, perhaps they are your client ;)

Oskar

From: root
To: Oskar Pearson
Cc: "Alex Chan", "Michael Chan", reiserfs@devlinux.com
Date: Fri, 21 May 1999 08:47:56 -0700 (PDT)
Subject: Re: (reiserfs) What would you do to optimize Reiserfs for SQUID?

Good point Oskar.
This suggests that either one big directory with filenames equal to URLs would be optimal, or one could use the second to last component of the URL as the parent directory, and use multi-file read_ahead (reiserfs currently supports this), and read everything in the directory when reading one file in that directory for any directory below a certain size. We should try it both ways, and measure it.

Hans

Oskar Pearson writes:
> Hi
>
> > Reiserfs hashes names, I am just proposing to do it twice, once for the
> > site, and once for the rest of the URL. By doing it this way, files
> > from the same site get put near each other.
>
> This could be a win in some circumstances: but I have a hunch that
> it's a false optimisation. On a busy cache (where speed is very
> important) your heads are seeking all over the place (hundreds of
> seeks per second.)
>
> The advantage I see of having the objects close to one another is
> that a client downloading an html page will hit the same jpg and
> gif files next. Right?
>
> The problem is that the html download and interpretation can take
> 1/2 a second (on ethernet) and seconds on a modem. By the time that
> their browser requests the jpg and gif files, the cache disks may
> have done another 50 seeks, and in the worst case (modem) they may
> have done 500 seeks. Locality of reference isn't such an issue on a
> very heavily used cache. On an unloaded cache it's probably going
> to speed up a given sequence of requests a bit, but I don't see
> the advantage being huge.
>
> Oskar

From: root
To: Oskar Pearson
Cc: reiserfs@devlinux.com, "Keith K Chau", "Alex Chan", "Michael Chan"
Date: Fri, 21 May 1999 09:09:08 -0700 (PDT)
Subject: Re: (reiserfs) What would you do to optimize Reiserfs for SQUID?

Oskar Pearson writes:
> Hi
>
> The Squid-Dev people have been talking about creating their own FS
> for a long time now. Given the polygraph results (polygraph = proxy
> benchmark) some of the core Squid developers have become interested
> in this.
> Duane Wessels (wessels@ircache.net) is currently putting
> code into Squid to support user-level fs's, similar to the way INN
> has plugins for different store methods.
>
> I don't have an archive of the squid-dev discussions of this any more.
> I trashed it when I moved jobs: I do have a printed out version. I'll
> try and do some searching for an online archive this weekend.
>
> There is an alpha version of Squid that includes some user-level fs
> stuff at ftp.packetstorm.on.ca. (hmm. Perhaps these people are your
> clients ;)
>
> > turning off preserve lists (losing files is not a big deal for SQUID).
>
> Agreed. If you have a file on disk, though, it _must_ be the file
> that was originally placed there Otherwise banking sites can end
> up being porn sites. (I think Squid does do some checking here:
> I haven't looked at that section of the code.)
>
> > eliminating the hashing of names that squid does, and storing files in a
> > directory whose name is the URL's site, and with a filename equal to the
> > pathname.
>
> I wouldn't say "eliminating" here. If you have a 100 gig cache,
> and the average object size is 13kb, you will have about 7.5
> million objects on disk. Objects are not spread evenly across
> the directory space either, so if you create lots of directories
> (called www.netscape.com, www.linux.org, for example) 90% of them
> will probably on have a few files in them. If each site has (say)
> 30 objects in it's directory path, you have something like 200 000
> "www.sitename.example" directories in the top level directory. I
> don't know if Reiserfs (or any other fs) will handle that well.

It will. More efficiently than squid will, since we won't need to do a lot of lookups.

>
> Next issue with that: Squid doesn't just use the URL name as the
> key. The URL isn't a unique key: because http can contain additional
> info in the HTTP headers, you should only return the same object
> when the headers are the same.
> Squid generates different filenames
> for these objects.
>
> The cache can also be spread across multiple disk drives: Squid also
> decides which disk to save the object to based on the hash value. You
> can get around this with raid0, but the advantage of doing this with
> Squid is that you can remove a cache_dir line from the squid.conf and
> the disk is ignored. With raid0 you will lose all objects; with raid5
> things get slow.
>
> > implementing effective SMP.
>
> Cpu is almost _never_ the limiting factor on caches. It can be if you
> have a slowish cpu, and you are doing transparency, but then all the
> cpu time is spent in the network code.
>
> I used to use Squid 1.0, where the whole of the code was a
> single-thread. Using "strace -T" (thanks to Michael O'Reilly, one of
> the major Squid guys) we confirmed that about 70% of the wall time was
> spent waiting for open(), and 15% of the time was waiting for close()
>
> Squid 1.1 and above can use threads which just call open and close.
> This has dramatically increased performance (our cache latency dropped
> by an order of magnitude.)
>
> If you can ensure that these functions return immediately, then you
> can get rid of that section of the Squid code.

We could, not sure we should if squid has an effective work-around. The code we'd have to write might not be simpler than the code squid has. I'd have to look at it in detail.

>
> As I see it (and that doesn't mean anything) the ext2 code has to
> check if a file exists before it can return from open(). Because
> there are so many files on disk (a few million) the last time the
> kernel tried to open that file was a long time ago (if you keep
> objects for three days, it's possible that you haven't touched that
> inode for three days...) Any dcache info for that file has probably
> been thrown out a long time ago: there are many memory pressures
> on the machine (network buffers, and although caching disk reads
> is nearly pointless, disk read caching.)
> (Why disk caching is less useful on a cache is covered shortly)
>
> Because the file info is not cached, an open involves lots of seeks:
> you probably (no tests to verify this) have to stat a couple of
> directories, find the appropriate subdirectories, find the filename,
> truncate the file, update time info and then you can return. Note the
> truncate there: Squid constantly overwrites old files with new ones,
> and to do this it truncates an existing file, and then starts writing
> with the same filename.
>
> dancer@zeor.simegen.com says:
> ----------
> Linux EXT2 is very fast at creating new files and data (does rarely block
> at all unless disk is saturated). Truncating (truncate/unlink) of existing
> files is comparably much slower than one would expect in my filesystem
> tests, and so is also writing to truncated files (actually slower than
> creating new files).

reiserfs is faster at creating and truncating and deleting.

>
> Results of some simple saturation measurements on Linux 2.2.6ac1:
>
> Operation                Time taken
> unlink empty files          3
> create empty files          9
> create files with data     40
> overwrite truncated        50
> unlink files with data     65
> truncate to 0 size         70
> overwrite files           145 (O_TRUNC used)
> ----------
>
> I would _really_ like to try and use Stephen's iostat patches on a
> cache, but I don't have access to a loaded cache these days. This
> would give us an idea of what the real load is.
>
> > optimizing for creates
> > and reads equally,
>
> That doesn't sound right to me. Create is very important, but the
> reads stuff there might be a bit wrong. When a cache gets a 40%
> hit rate by object count, it doesn't mean that 40% of the files
> in the cache are accessed: it just means that a small proportion
> of the objects are accessed a _lot_. There are thus so-called "hot
> objects", which account for a large portion of a cache's hit rate
> (objects on netscape.com and altavista's home pages, for example.)
>
> These hot objects will probably fit in a small portion of ram (each
> object is 13kb, so the top 1000 requested objects will sit in a few
> Mb.) Squid can already keep these objects in ram: and it understands
> it's own disk-usage pattern. If people relied on the OS to cache
> this info on very large sites, the cached data would almost certainly
> be purged incorrectly (no, I don't have stats on this either.)
>
> So, the objects that are important are small, and may be cached by
> Squid, rather than the OS. And it can probably do a pretty good job of
> it.
>
> Thus, you have the standard "predominance of writes" problem. If
> you aggressively cache inode info (I don't know: make the dcache
> hang around instead of caching disk reads so much) so that creates
> are fast, ensure that truncating is fast and aggressively cache
> disk writes (so that the process essentially never blocks) then things
> should be fairly happy. Probably an order of magnitude speed up, and
> then the problems become Squid's once again.
>
> The Netapp (www.netapp.com) filesystem is optimised for raid: your
> really can't run a large cache (think 20 disks...) without raid,
> otherwise you have excessive downtime. Their advantage is that
> they can bundle writes into big blocks, and write these blocks
> sequentially across all stripes, across all heads at the same
> time. If you write lots and lots of small objects, as a cache does,
> their logging filesystem is much more efficient with raid devices
> (I believe: raid5 is unhappy with lots of small writes, correct?)
>
> I believe one of the problems with the BSD-type filesystems and
> caching is the amount of fragmentation you get. People say that
> by using only 1/2 their available disk space, they can drastically
> increase performance. I presume that the reiserfs allocation of space
> is nicer in this regard, right?
> I haven't done _any_ debugging here:
> I do have a squid-dev thread readily available here (only a little
> info in here, though), so I am going to sent it to you, Hans.
>
> > and not optimizing for stat data (this means using plan A).
>
> I presume that creating new files will still be fast? And truncating
> them? This is only the stat data that ls's use, right?

Yes, you understand perfectly.

>
> > I see us creating a #define OPTIMIZE_FOR_SQUID.
>
> Good ;)
>
> You might be able to find people (ISPs, for example) that are willing
> to sponsor you efforts as gnu source. The commercial alternatives
> are very expensive: if you find a few ISPs that use Squid, and that
> are willing to contribute a modest amount to the development, you will
> probably still find the project viable. Speak to Dancer, address
> above, for example: they use Squid on Linux for large projects. (hmm,
> perhaps they are your client ;)
>
> Oskar

Thanks much, Oskar. You've done a lot to make me feel confident that reiserfs could do this task much better than ext2. SQUID hits all our performance strongpoints.

Hans
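The two naming schemes debated in this thread (a directory per site with the rest of the URL as the filename, versus a hash of the whole URL) can be sketched side by side. This is a hypothetical illustration only: it is neither Squid's actual store-key code nor anything in reiserfs. The function names, the choice of SHA-256, the percent-escaping, and the 255-byte NAME_MAX limit are all assumptions. It also keys on the URL alone, which, as noted earlier in the thread, is not a unique key once HTTP headers are taken into account.

```python
import hashlib
from urllib.parse import quote, urlsplit

NAME_MAX = 255  # assumed per-component filename limit; check the target filesystem


def site_and_name(url):
    """Return (directory, filename): the URL's site, and the escaped rest."""
    parts = urlsplit(url)
    site = parts.netloc.lower()
    rest = parts.path or "/"
    if parts.query:
        rest += "?" + parts.query
    # safe="" escapes "/" as well, so the rest of the URL is one path component
    return site, quote(rest, safe="")


def hashed_name(url):
    """Return a fixed-length name derived from the whole URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def store_name(url):
    """Use the readable name when it fits, else fall back to the hash."""
    site, name = site_and_name(url)
    if len(name) > NAME_MAX:
        name = hashed_name(url)
    return site, name


if __name__ == "__main__":
    for url in ("http://www.linux.org/a/b.gif",
                "http://example.com/" + "a" * 300):
        print(store_name(url))
```

The fallback in store_name reflects the objection about very long URLs and special characters: escaping keeps the name legal, but escaped names can exceed the filename limit, at which point some hash is needed anyway.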