[ https://issues.apache.org/jira/browse/KUDU-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wong updated KUDU-2040:
------------------------------
    Description: 
At the time of creation, a tablet's DataDirGroup will avoid using directories 
that are full and directories that have failed. This can lead to the creation 
of groups that are below the flag-specified target number of dirs. This isn't 
necessarily an error, but if the disks do come back to a healthy state, there is 
no way to resize an undersized group.
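
A minimal sketch of the selection behavior described above, written in Python 
for brevity rather than Kudu's C++. The names DataDir, create_group, and 
target_num_dirs are hypothetical and do not mirror Kudu's actual API; the real 
selection policy may well differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataDir:
    # Hypothetical stand-in for a data directory and its health state.
    path: str
    is_full: bool = False
    has_failed: bool = False

    def is_usable(self) -> bool:
        return not (self.is_full or self.has_failed)


def create_group(dirs: List[DataDir], target_num_dirs: int) -> List[str]:
    """Pick up to target_num_dirs usable dirs, skipping full or failed ones."""
    usable = [d.path for d in dirs if d.is_usable()]
    # If fewer usable dirs exist than the target, the group ends up undersized.
    return usable[:target_num_dirs]


dirs = [
    DataDir("/data/1"),
    DataDir("/data/2", is_full=True),
    DataDir("/data/3", has_failed=True),
    DataDir("/data/4"),
]
group = create_group(dirs, target_num_dirs=3)
print(group)  # ['/data/1', '/data/4']
```

With a target of three dirs and only two usable ones, the resulting group has 
two members and nothing in this sketch would ever grow it later.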

The assumption in this implementation is that these states are permanent, which 
isn't necessarily the case. A full disk may have tablets removed; when disk 
refreshes become supported by Kudu, disk failure will also become transient. As 
such, it's worth considering if/when/how undersized DataDirGroups should be 
resized.

A few notes on this:
- once a disk group has been created, the tablet's data will be spread across 
the disks in that group, so completely changing the group will require that the 
tablet's data is rewritten
- another approach might be to replicate the understriped tablet (either on the 
same server or elsewhere) in hopes that more disks are available
- as of writing this, recovery from a disk failure is not implemented, so disk 
failure is currently not considered transient (this will change once it is 
implemented)
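
One possible way to approach the "if/when" part, sketched in Python under 
stated assumptions: periodically compare each tablet's group against the 
currently healthy dirs and report the groups that could be grown. The function 
find_resizable_groups and its parameters are hypothetical, not existing Kudu 
code.

```python
from typing import Dict, List, Set


def find_resizable_groups(
    groups: Dict[str, List[str]],
    healthy_dirs: Set[str],
    target_num_dirs: int,
) -> Dict[str, List[str]]:
    """Map each undersized tablet group to healthy dirs it does not yet use."""
    candidates = {}
    for tablet_id, group in groups.items():
        missing = target_num_dirs - len(group)
        if missing <= 0:
            continue  # Group already meets the target.
        spare = sorted(healthy_dirs - set(group))
        if spare:
            candidates[tablet_id] = spare[:missing]
    return candidates


groups = {
    "tablet-a": ["/data/1", "/data/4"],
    "tablet-b": ["/data/1", "/data/2", "/data/4"],
}
healthy = {"/data/1", "/data/2", "/data/4"}
candidates = find_resizable_groups(groups, healthy, target_num_dirs=3)
print(candidates)  # {'tablet-a': ['/data/2']}
```

This only identifies candidates. As the first note says, completely changing a 
group still requires rewriting the tablet's data, so how the extra dirs get 
used is left open here.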

  was:
At the time of creation, a tablet's DataDirGroup will avoid using directories 
that are full and directories that have failed. This can lead to the creation 
of groups that are below the flag-specified target number of dirs. This isn't 
necessarily a error, but if the disks do come back to a healthy state, there is 
no way to resize an undersized group.

The assumption in this implementation is that these states are permanent, which 
isn't necessarily the case. A full disk may have tablets removed; when disk 
refreshes become supported by Kudu, disk failure will also become transient. As 
such, it's worth considering if/when/how undersized DataDirGroups should be 
resized.

A few notes on this:
- once a disk group has been created, the tablet's data will be spread across 
the disks in that group, so completely changing the group will require that the 
tablet's data is rewritten
- another approach might be to replicate the understriped tablet (either on the 
same server or elsewhere) in hopes that more disks are available
- recovery from a disk failure not implemented at this time, so disk failure is 
currently not considered transient (this will change once recovery is 
implemented)


> Coordinate data dir lifecycle with DataDirGroups
> ------------------------------------------------
>
>                 Key: KUDU-2040
>                 URL: https://issues.apache.org/jira/browse/KUDU-2040
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs, tserver
>            Reporter: Andrew Wong



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
