Did the asynchronous write stuff (as it was in fr1) ever get into kernel
software raid?
I see from the raid acceleration ("ioat") patching going on that some
sort of asynchronicity is being contemplated, but blessed if I can make
head or tail of the descriptions I've read. It looks vaguely like
p
While travelling the last few days, a theory has occurred to me to
explain this sort of thing ...
> A user has sent me a ps ax output showing an enbd client daemon
> blocked in get_active_stripe (I presume in raid5.c).
>
> ps ax -of,uid,pid,ppid,pri,ni,vsz,rss,wchan:30,stat,tty,time,comman
A user has sent me a ps ax output showing an enbd client daemon
blocked in get_active_stripe (I presume in raid5.c).
ps ax -o f,uid,pid,ppid,pri,ni,vsz,rss,wchan:30,stat,tty,time,command
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
5 0 26540 1 23
"Also sprach Gabor Gombas:"
> On Thu, Aug 17, 2006 at 08:28:07AM +0200, Peter T. Breuer wrote:
>
> > 1) if the network disk device has decided to shut down wholesale
> >(temporarily) because of lack of contact over the net, then
> >retries and writes a
"Also sprach ptb:"
> 4) what the network device driver wants to do is be able to identify
>the difference between primary requests and retries, and delay
>retries (or repeat them internally) with some reasonable backoff
>scheme to give them more chance of working in the face of a
>
Hi Neil ..
"Also sprach Neil Brown:"
> On Wednesday August 16, [EMAIL PROTECTED] wrote:
> > 1) I would like raid request retries to be done with exponential
> >delays, so that we get a chance to overcome network brownouts.
> >
> > 2) I would like some channel of communication to be available
"Also sprach Molle Bestefich:"
>
> > See above. The problem is generic to fixed bandwidth transmission
> > channels, which, in the abstract, is "everything". As soon as one
> > does retransmits one has a kind of obligation to keep retransmissions
> > down to a fixed maximum percentage of the poten
"Also sprach Molle Bestefich:"
> Peter T. Breuer wrote:
> > > You want to hurt performance for every single MD user out there, just
> >
> > There's no performance drop! Exponentially staged retries on
"Also sprach Molle Bestefich:"
> Peter T. Breuer wrote:
> > We can't do a HOT_REMOVE while requests are outstanding, as far as I
> > know.
>
> Actually, I'm not quite sure which kind of requests you
"Also sprach Molle Bestefich:"
> Peter T. Breuer wrote:
> > I would like raid request retries to be done with exponential
> > delays, so that we get a chance to overcome network brownouts.
>
> Hmm, I don't think MD even does retries of requests.
I had a "
"Also sprach Molle Bestefich:"
> Peter T. Breuer wrote:
> > 1) I would like raid request retries to be done with exponential
> >delays, so that we get a chance to overcome network brownouts.
> >
> > I
Hello -
I believe the current kernel raid code retries failed reads too
quickly and gives up too soon for operation over a network device.
Over the (my) enbd device, the default mode of operation used to be
to have the enbd device time out requests after 30s of net
stalemate and maybe even
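What I would like instead is exponentially staged retries, so that retry
traffic stays a bounded fraction of the bandwidth while a brownout of a
minute or two can still be ridden out. A minimal user-space sketch of the
idea (submit_once() is a hypothetical stand-in for resubmitting the failed
request; this is not md code):

/* Exponentially staged retries with a ceiling. */
#include <unistd.h>

#define INITIAL_RETRY_DELAY 1   /* seconds */
#define MAX_RETRY_DELAY    64   /* cap keeps retry load bounded */
#define MAX_RETRIES        10

extern int submit_once(void *req);      /* 0 on success, nonzero on failure */

int submit_with_backoff(void *req)
{
    unsigned int delay = INITIAL_RETRY_DELAY;
    int tries;

    for (tries = 0; tries < MAX_RETRIES; tries++) {
        if (submit_once(req) == 0)
            return 0;           /* got through at last */
        sleep(delay);           /* back off before retrying */
        if (delay < MAX_RETRY_DELAY)
            delay *= 2;         /* exponential stage-up */
    }
    return -1;                  /* give up; caller faults the component */
}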
Robbie Hughes <[EMAIL PROTECTED]> wrote:
> Number Major Minor RaidDevice State
>    0       0       0       -1      removed
>    1      22      66        1      active sync   /dev/hdd2
>    2       3       3        0      spare         /dev/hda3
> The main problem i have now is
Molle Bestefich <[EMAIL PROTECTED]> wrote:
> There seems to be an obvious lack of a properly thought out interface
> to notify userspace applications of MD events (disk failed --> go
> light a LED, etc).
Well, that's probably truish. I've been meaning to ask for a per-device
sysctl interface for s
tmp <[EMAIL PROTECTED]> wrote:
> I've read "man mdadm" and "man mdadm.conf" but I certainly don't have
> an overview of software RAID.
Then try using it as well as (or instead of) reading about it, and you
will obtain a more comprehensive understanding.
> OK. The HOWTO describes mostly a raidtools conte
tmp <[EMAIL PROTECTED]> wrote:
> 1) I have a RAID-1 setup with one spare disk. A disk crashes and the
> spare disk takes over. Now, when the crashed disk is replaced with a new
> one, what is then happening with the role of the spare disk? Is it
> reverting to its old role as spare disk?
Try it an
Doug Ledford <[EMAIL PROTECTED]> wrote:
> > > Now, if I recall correctly, Peter posted a patch that changed this
> > > semantic in the raid1 code. The raid1 code does not complete a write to
> > > the upper layers of the kernel until it's been completed on all devices
> > > and his patch made it s
I forgot to say "thanks"! Thanks for the breakdown.
Doug Ledford <[EMAIL PROTECTED]> wrote:
(of event count increment)
> I think the best explanation is this: any change in array state that
OK ..
> would necessitate kicking a drive out of the array if it didn't also
> make this change in state
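In other words, the event count is bumped on exactly those state changes
that a drive cannot be allowed to miss, so at assembly time a lagging
count identifies a stale component. A minimal user-space sketch of that
comparison (illustrative names, not md.c; whether a lag of exactly one
event is tolerated depends on the md version):

#include <stdint.h>

struct component {
    uint64_t events;    /* event count read from the superblock */
    int      stale;     /* set if this component missed an event */
};

void mark_stale_components(struct component *c, int n)
{
    uint64_t freshest = 0;
    int i;

    for (i = 0; i < n; i++)
        if (c[i].events > freshest)
            freshest = c[i].events;

    for (i = 0; i < n; i++)
        /* a lagging count means this disk missed at least one
         * state change that was committed elsewhere */
        c[i].stale = (c[i].events < freshest);
}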
Luca Berra <[EMAIL PROTECTED]> wrote:
> On Tue, Mar 29, 2005 at 01:29:22PM +0200, Peter T. Breuer wrote:
> >Neil Brown <[EMAIL PROTECTED]> wrote:
> >> Due to the system crash the data on hdb is completely ignored. Data
> >
> >Neil - can you explain the
Neil Brown <[EMAIL PROTECTED]> wrote:
> On Tuesday March 29, [EMAIL PROTECTED] wrote:
> >
> > Don't put the journal on the raid device, then - I'm not ever sure why
> > people do that! (they probably have a reason that is good - to them).
>
> Not good advice. DO put the journal on a raid device
Neil Brown <[EMAIL PROTECTED]> wrote:
> Due to the system crash the data on hdb is completely ignored. Data
Neil - can you explain the algorithm that stamps the superblocks with
an event count, once and for all? (until further amendment :-).
It goes without saying that sb's are not stamped at ev
Schuett Thomas EXT <[EMAIL PROTECTED]> wrote:
> And here the fault happens:
> By chance, it reads the transaction log from hda, then sees, that the
> transaction was finished, and clears the overall unclean bit.
> This cleaning is a write, so it goes to *both* HDs.
Don't put the journal on the ra
md_exit calls mddev_put on each mddev during module exit. mddev_put
calls blk_put_queue under spinlock, although it can sleep (it clearly
calls kblockd_flush). This patch lifts the spinlock to do the flush.
--- md.c.orig Fri Dec 24 22:34:29 2004
+++ md.c	Sun Mar 27 14:14:22 2005
@@ -173,7
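The shape of that change, as a hedged user-space analogue (a pthread mutex
standing in for the spinlock and blocking_flush() for kblockd_flush; none
of this is the md.c patch itself):

#include <pthread.h>

extern void blocking_flush(void);       /* hypothetical stand-in */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int refcount = 1;

void put_object(void)
{
    int last;

    pthread_mutex_lock(&lock);
    last = (--refcount == 0);           /* bookkeeping under the lock ...   */
    pthread_mutex_unlock(&lock);

    if (last)
        blocking_flush();               /* ... the sleepable work outside it */
}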
Luca Berra <[EMAIL PROTECTED]> wrote:
> we can have a series of failures which must be accounted for and dealt
> with according to a policy that might be site specific.
>
> A) Failure of the standby node
> A.1) the active is allowed to continue in the absence of a data replica
> A.2) disk writ
Paul Clements <[EMAIL PROTECTED]> wrote:
> system A
> [raid1]
>  /    \
> [disk] [nbd] --> system B
>
> 2) you're writing, say, block 10 to the raid1 when A crashes (block 10
> is dirty in the bitmap, and you don't know whether it got written to the
> disk on A or B, neither, o
Luca Berra <[EMAIL PROTECTED]> wrote:
> If we want to do data-replication, access to the data-replicated device
> should be controlled by the data replication process (*), md does not
> guarantee this.
Well, if one writes to the md device, then md does guarantee this - but
I find it hard to parse
Neil Brown <[EMAIL PROTECTED]> wrote:
> However I want to do raid5 first. I think it would be much easier
> because of the stripe cache. Any 'stripe' with a bad read would be
There's the FR5 patch (fr5.sf.net) which adds a bitmap to raid5. It
doesn't do "robust read" for raid5, however.
> flagg
Paul Clements <[EMAIL PROTECTED]> wrote:
OK - thanks for the reply, Paul ...
> Peter T. Breuer wrote:
> > But why don't we already know from the _single_ bitmap on the array
> > node ("the node with the array") what to rewrite in total? All writes
> > m
Paul Clements <[EMAIL PROTECTED]> wrote:
> At any rate, this is all irrelevant given the second part of that email
> reply that I gave. You still have to do the bitmap combining, regardless
> of whether two systems were active at the same time or not.
As I understand it, you want both bitmaps in
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> The point a) is moot, because this whole structure is used in raid1.c ONLY.
> (I don't know why it is placed into raid1.h header file instead of into
> raid1.c directly, but that's a different topic).
Hmm. I'm a little surprised. I would be worried that
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> > Uh OK. As I recall one only needs to count, one doesn't need a bitwise
> > map of what one has dealt with.
>
> Well. I see read_balance() is now used to resubmit reads. There's
> a reason to use it instead of choosing "next" disk, I think.
I can't
Guy <[EMAIL PROTECTED]> wrote:
> I agree, but I don't think a block device can do a re-sync without
> corrupting both. How do you merge a superset at the block level? AND the 2
Don't worry - it's just a one-way copy done efficiently (i.e., leaving
out all the blocks known to be unmodified both s
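A sketch of that one-way copy, under the assumption of one dirty-block
bitmap per side (copy_block() is a hypothetical stand-in for the actual
transfer): only blocks set in the OR of the two maps need visiting, since
a block clean on both sides cannot differ.

#include <stddef.h>

extern void copy_block(size_t blk);     /* copy source -> target, one way */

void resync_union(const unsigned char *src_map,
                  const unsigned char *dst_map, size_t nblocks)
{
    size_t blk;

    for (blk = 0; blk < nblocks; blk++) {
        unsigned char bit = 1u << (blk % 8);
        if ((src_map[blk / 8] | dst_map[blk / 8]) & bit)
            copy_block(blk);    /* dirty on at least one side */
    }
}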
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2005-03-19T16:06:29, "Peter T. Breuer" <[EMAIL PROTECTED]> wrote:
>
> I'm cutting out those parts of the discussion which are irrelevant (or
> which I don't consider worth pursuing; maybe you'll
Michael Tokarev <[EMAIL PROTECTED]> wrote:
>
> Peter T. Breuer wrote:
> []
> > The patch was originally developed for 2.4, then ported to 2.6.3, and
> > then to 2.6.8.1. Neil has recently been doing
Mario Holbe <[EMAIL PROTECTED]> wrote:
> Peter T. Breuer <[EMAIL PROTECTED]> wrote:
> > Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> >> Split-brain is a well studied subject, and while many prevention
> >> strategies exist, errors occur even i
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2005-03-19T14:27:45, "Peter T. Breuer" <[EMAIL PROTECTED]> wrote:
>
> > > Which one of the datasets you choose you could either arbitate via some
> > > automatic mechanisms (drbd-0.8 has a couple) o
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2005-03-19T12:43:41, "Peter T. Breuer" <[EMAIL PROTECTED]> wrote:
>
> > Well, there is the "right data" from our point of view, and it is what
> > should be on (one/both?) device by now. O
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> Ok, you intrigued me enough already.. what's the FR1 patch? I want
> to give it a try... ;) Especially I'm interested in the "Robust Read"
> thing...
That was published on this list a few weeks ago (probably needs updating,
but I am sure you can help
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> Luca Berra wrote:
> > On Fri, Mar 18, 2005 at 02:42:55PM +0100, Lars Marowsky-Bree wrote:
> >
> >> The problem is for multi-nodes, both sides have their own bitmap. When a
> >> split scenario occurs, and both sides begin modifying the data, that
> >> bi
Paul Clements <[EMAIL PROTECTED]> wrote:
> Peter T. Breuer wrote:
> > I don't see that this solves anything. If you had both sides going at
> > once, receiving different writes, then you are sc&**ed, and no
> > resolution of bitmaps will help you, since bo
Mario Holbe <[EMAIL PROTECTED]> wrote:
> Peter T. Breuer <[EMAIL PROTECTED]> wrote:
> > Yes, you can "sync" them by writing any one of the two mirrors to the
> > other one, and need do so only on the union of the mapped data areas,
>
> As far as I unders
Mario Holbe <[EMAIL PROTECTED]> wrote:
> Peter T. Breuer <[EMAIL PROTECTED]> wrote:
> > different (legitimate) data. It doesn't seem relevant to me to consider
> > if they are equally up to date wrt the writes they have received. They
> > will be in the wrong
Paul Clements <[EMAIL PROTECTED]> wrote:
> [ptb]
> > Could you set out the scenario very exactly, please, for those of us at
> > the back of the class :-). I simply don't see it. I'm not saying it's
> > not there to be seen, but that I have been unable to build a mental
> > image of the situation f
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2005-03-18T13:52:54, "Peter T. Breuer" <[EMAIL PROTECTED]> wrote:
>
> > (proviso - I didn't read the post where you set out the error
> > situations, but surely, on theoretical grounds, all that can ha
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> Minor cleanup:
>
> > @@ -1325,24 +1336,24 @@ repeat:
> >
> > dprintk("%s ", bdevname(rdev->bdev,b));
> > if (!rdev->faulty) {
> > - err += write_disk_sb(rdev);
> > + md_super_write
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2005-03-15T09:54:52, Neil Brown <[EMAIL PROTECTED]> wrote:
> > I think any scheme that involved multiple bitmaps would be introducing
> > too much complexity. Certainly your examples sound very far fetched
> > (as I think you admitted yourself).
Neil Brown <[EMAIL PROTECTED]> wrote:
> On Tuesday March 8, [EMAIL PROTECTED] wrote:
> > Have you remodelled the md/raid1 make_request() fn?
>
> Somewhat. Write requests are queued, and raid1d submits them when
> it is happy that all bitmap updates have been done.
OK - so a slight modification o
Paul Clements <[EMAIL PROTECTED]> wrote:
> Peter T. Breuer wrote:
> > Neil - can you describe for me (us all?) what is meant by
> > intent-logging here.
>
> Since I wrote a lot of the code, I guess I'll try...
Hi, Paul. Thanks.
> > Well, I can guess -
NeilBrown <[EMAIL PROTECTED]> wrote:
> The second two fix bugs that were introduced by the recent
> bitmap-based-intent-logging patches and so are not relevant
Neil - can you describe for me (us all?) what is meant by
intent-logging here.
Well, I can guess - I suppose the driver marks the bitmap
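Presumably the order of operations is the one below; a minimal sketch,
with illustrative names rather than the real md bitmap API: the intent
bit must be durable before the data write is issued, and is cleared
lazily once every mirror has acked.

extern void set_bit_durable(unsigned long chunk);   /* write+flush bitmap */
extern void write_all_mirrors(unsigned long chunk, const void *data);
extern void clear_bit_lazy(unsigned long chunk);    /* may be deferred */

void intent_logged_write(unsigned long chunk, const void *data)
{
    set_bit_durable(chunk);         /* 1. log the intent              */
    write_all_mirrors(chunk, data); /* 2. then do the real writes     */
    clear_bit_lazy(chunk);          /* 3. clear once everything acked */
}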
Can Sar <[EMAIL PROTECTED]> wrote:
> the driver just cycles through all devices that make up a soft raid
> device and just calls generic_make_request on them. Is this correct, or
> does some other function involved in the write process (starting from
> the soft raid level down) actually wait on
[EMAIL PROTECTED] wrote:
> I've been going through the MD driver source, and to tell the truth, can't
> figure out where the read error is detected and how to "hook" that event and
> force a re-write of the failing sector. I would very much appreciate it if
I did that for RAID1, or at least most
berk walker <[EMAIL PROTECTED]> wrote:
> What might the proper [or functional] syntax be to do this?
>
> I'm running 2.6.10-1.766-FC3, and mdadm 1.90.
Substitute the word "missing" for the corresponding device in the
mdadm create command.
(quotes manual page)
To create a "degraded" array i
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> >>Unable to handle kernel paging request at virtual address f8924690
> >
> > That address is bogus. Looks more like a negative integer. I suppose
> > ram corruption is a possibility too.
>
> Ram corruption in what sense? Faulty DIMM?
Anything.
> Well
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> And finally I managed to get an OOPs.
What CPU? SMP? How many?
Which kernel? Is it preemptive?
> Created fresh raid5 array out of 4 partitions,
> chunk size = 4kb.
> Created ext3fs on it.
> Tested write speed (direct-io) - it was terrible,
> about
[EMAIL PROTECTED] wrote:
> We are waiting for the one day where the same block on all mirrors has
> read problems. Ok, we're now waiting for about 15 years because the
> HPUX mirror strategy is the same. Quite a long time without disaster
> but it will happen (till today Murphy was right in any cas
J. David Beutel <[EMAIL PROTECTED]> wrote:
> Peter T. Breuer wrote, on 2005-Feb-23 1:50 AM:
>
> > Quite possibly - I never tested the rewrite part of the patch, just
> >
> >wrote it to indicate how it should go and stuck it in to encourage
> >others to go on f
In gmane.linux.raid Nagpure, Dinesh <[EMAIL PROTECTED]> wrote:
> I noticed the discussion about robust read on the RAID list and similar one
> on the EVMS list so I am sending this mail to both the lists. Latent media
> faults which prevent data from being read from portions of a disk has always
>
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> (note raid5 performs faster than a single drive, it's expectable
> as it is possible to write to several drives in parallel).
Each raid5 write must include at least ONE write to a target. I think
you're saying that the writes go to different targets fr
J. David Beutel <[EMAIL PROTECTED]> wrote:
> I'd like to try this patch
> http://marc.theaimsgroup.com/?l=linux-raid&m=110704868115609&w=2 with
> EVMS BBR.
>
> Has anyone tried it on 2.6.10 (with FC2 1.9 and EVMS patches)? Has
> anyone tried the rewrite part at all? I don't know md or the ker
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> Peter T. Breuer wrote:
> > Michael Tokarev <[EMAIL PROTECTED]> wrote:
> >
> >>When debugging some other problem, I noticed that
> >>direct-io (O_DIRECT) write speed on a software raid5
> >
> &
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> When debugging some other problem, I noticed that
> direct-io (O_DIRECT) write speed on a software raid5
And normal write speed (over 10 times the size of ram)?
> is terrible slow. Here's a small table just to show
> the idea (not numbers by itself a
No email <[EMAIL PROTECTED]> wrote:
>
> Forgive me as this is probably a silly question and one that has been
> answered many times, I have tried to search for the answers but have
> ended up more confused than when I started. So thought maybe I could
> ask the community to put me out of my miser
[EMAIL PROTECTED] wrote:
> just for my understanding of RAID1. When is a partition set faulty?
> As soon as a read hits a bad block or only when a write attempts to
> write to a bad block?
>
> I'm a little bit confused as I read the thread 'Robust read patch for raid1'.
> Does it mean that a read
Peter T. Breuer <[EMAIL PROTECTED]> wrote:
> Allow me to remind what the patch does: it allows raid1 to proceed
> smoothly after a read error on a mirror component, without faulting the
> component. If the information is on another component, it will be
> returned. If all com
I've had the opportunity to test the "robust read" patch that I posted
earlier in the month (10 Jan, Subject: Re: Spares and partitioning huge
disks), and it needs one more change ... I assumed that the raid1 map
function would move a (retried) request to another disk, but it does not;
it always moves
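The change needed is that a retried read gets steered round the other
mirrors instead of back to the one that just failed. A minimal sketch
(try_read() is a hypothetical stand-in for resubmitting the read to one
component; this is not the raid1 map function itself):

extern int try_read(int disk, void *buf);   /* 0 on success */

int robust_read(int failed_disk, int ndisks, void *buf)
{
    int i;

    for (i = 1; i < ndisks; i++) {
        int disk = (failed_disk + i) % ndisks;  /* never the failed one */
        if (try_read(disk, buf) == 0)
            return disk;        /* satisfied from another component */
    }
    return -1;                  /* every mirror failed this block */
}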
Just a followup ...
Neil said he has never seen disks corrupt spontaneously. I'm just making
the rounds of checking the daily md5sums on one group of machines with
a view to estimating the corruption rates. Here's one of the typical
(one bit) corruptions:
doc013:/usr/oboe/ptb% cmp --verbose /tm
Luca Berra <[EMAIL PROTECTED]> wrote:
> On Sun, Jan 23, 2005 at 07:52:53PM +0100, Peter T. Breuer wrote:
> >Making special device files "on demand" requires the cooperation of the
> >driver and devfs (and since udev apparently replaces devfs, udev). One
> >wou
Luca Berra <[EMAIL PROTECTED]> wrote:
> I believe the correct solution to this would be implementing a char-misc
> /dev/mdadm device that mdadm would use instead of the block device,
> like device-mapper does. Alas I have no time for this in the foreseeable
> future.
It's a generic problem (or non-p
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2005-01-23T16:13:05, Luca Berra <[EMAIL PROTECTED]> wrote:
>
> > the first one adds an auto=dev parameter
> > rationale: udev does not create /dev/md* device files, so we need a way
> > to create them when assembling the md device.
>
> Am I missi
David Dougall <[EMAIL PROTECTED]> wrote:
> Jan 10 11:56:06 linux-sg2 kernel: SCSI disk error : host 0 channel 0 id 0
> lun 47
> return code = 802
That is sda.
> Jan 10 11:56:08 linux-sg2 kernel: I/O error: dev 08:10, sector 343219280
Well, I don't really understand - that is sdb, no? No? (
David Dougall <[EMAIL PROTECTED]> wrote:
> If I am running software raid1 and a disk device starts throwing I/O
> errors, Is the filesystem supposed to see any indication of this? I
No - not if the error is on only one disk. The first error will fault
the disk from the array and the driver will r
maarten <[EMAIL PROTECTED]> wrote:
> On Wednesday 19 January 2005 21:19, Peter T. Breuer wrote:
> > Poonam Dalya <[EMAIL PROTECTED]> wrote:
> > > I mounted my /dev/md1 on /mnt/raid. and then wrote a
> > > file on it. Then I tried to mount the raid disks
>
Poonam Dalya <[EMAIL PROTECTED]> wrote:
> I mounted my /dev/md1 on /mnt/raid. and then wrote a
> file on it. Then I tried to mount the raid disks
> /dev/hda10 on some other mount point and checked that
> mount point. But there was nothing in that mount
> point. Please could you please help me with
Hans Kristian Rosbach <[EMAIL PROTECTED]> wrote:
> On Mon, 2005-01-17 at 17:46, Peter T. Breuer wrote:
> > Interesting. How did you measure latency? Do you have a script you
> > could post?
>
> It's part of another application we use internally at work. I'll c
Hans Kristian Rosbach <[EMAIL PROTECTED]> wrote:
> -It selects the disk that is closest to the wanted sector by remembering
> what sector was last requested and what disk was used for it.
> -For sequential reads (sucha as hdparm) it will override and use the
> same disk anyways. (sector = lastsec
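A sketch of that heuristic (not raid1.c itself; the sequential-read test
is assumed to be simply "the next sector after the last one served from
that disk"):

#include <stdlib.h>

struct mirror { long last_sector; };    /* the head position we remember */

int choose_disk(const struct mirror *m, int ndisks,
                long sector, int last_disk)
{
    long best_dist;
    int i, best = 0;

    /* sequential read: stay on the disk that is already streaming */
    if (sector == m[last_disk].last_sector + 1)
        return last_disk;

    best_dist = labs(sector - m[0].last_sector);
    for (i = 1; i < ndisks; i++) {
        long d = labs(sector - m[i].last_sector);
        if (d < best_dist) {
            best_dist = d;      /* closer head wins */
            best = i;
        }
    }
    return best;                /* caller records the new last_sector */
}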
Mikael Abrahamsson <[EMAIL PROTECTED]> wrote:
> if read error then
>     recreate the block from parity
>     write to sector that had read error
>     wait until write has completed
>     flush buffers
>     read back block from drive
>     if block still bad
>         fail disk
>     log result
Well, I haven'
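That sequence, as a hedged sketch in which every function is a
hypothetical stand-in rather than a real md or kernel interface:

extern void recreate_from_parity(long blk, void *buf);
extern int  write_block(int disk, long blk, const void *buf);  /* 0 = ok */
extern void flush_disk(int disk);
extern int  read_block(int disk, long blk, void *buf);         /* 0 = ok */
extern void fail_disk(int disk);

void handle_read_error(int disk, long blk)
{
    char buf[4096];

    recreate_from_parity(blk, buf);         /* rebuild the lost data    */
    write_block(disk, blk, buf);            /* rewrite the bad sector   */
    flush_disk(disk);                       /* make sure it hit media   */
    if (read_block(disk, blk, buf) != 0)    /* read back to verify      */
        fail_disk(disk);                    /* still bad: kick the disk */
}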
Michael Tokarev <[EMAIL PROTECTED]> wrote:
> That is all to say: yes indeed, this lack of "smart error handling" is
> a noticeable omission in linux software raid. There are quite some
> (sometimes fatal to the data) failure scenarios that'd not have happened
> provided the smart error handling where