On 7/12/23 02:44, lina wrote:
Dear all,
My computer only has 2 TB of data storage capacity, and I want 100 TB
to store and analyze data. I am thinking of adding 5 hard drives of
18 TB each and merging them into one volume. Or should I get a file
server? What is the best option for me, and what would the budget be?
Thanks so much for your advice, best, lina
On 7/12/23 04:48, lina wrote:
Currently I do not plan to keep the data; once the analysis is
finished, I can just remove it.
On 7/12/23 06:00, lina wrote:
I need to extract the data for downstream analysis. After that, the
data can be removed.
It is hard to provide recommendations without knowing your computer,
your network, your analysis, your quality metrics, or your budget.
I use ZFS. Given an x86_64/amd64 computer with Debian, sufficient HDD
bays, and sufficient HBA ports, yes, you could install 5 @ 18 TB HDDs
and merge them into one 90 TB ZFS pool. If your computer has 5 bays and
ports, this will be your lowest-cost solution, but it is unlikely to be
your "best" solution.
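The single-pool layout above would look roughly like this. This is a
minimal sketch assuming Debian with the zfsutils-linux package; the
pool name "tank" and the device names sdb through sdf are placeholders,
not recommendations. Note that a plain striped pool has no redundancy:
losing any one drive loses the whole pool.

```shell
# Create one ~90 TB striped pool from five 18 TB drives.
# Placeholder device names; prefer stable /dev/disk/by-id/ paths in practice.
sudo zpool create tank sdb sdc sdd sde sdf

# Verify the pool came up and all five drives are ONLINE.
zpool status tank
```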
ZFS likes memory; the more the better. (I use ECC memory.) For 90 TB,
I would consider filling all memory slots with the fastest and largest
modules that are supported.
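On Linux the ARC (ZFS's in-memory cache) is allowed to grow to roughly
half of RAM by default. If the analysis jobs need memory for
themselves, the ARC can be capped with a module parameter. A sketch,
assuming the OpenZFS kernel module on Debian; the 64 GiB figure is an
arbitrary example, not a recommendation:

```shell
# Cap the ARC at 64 GiB (value in bytes); applies at module load,
# so reboot or reload the zfs module afterwards.
echo 'options zfs zfs_arc_max=68719476736' | sudo tee /etc/modprobe.d/zfs.conf
```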
ZFS allows SSDs to be added as read cache devices (L2ARC) and/or
intent-log devices (SLOG) for synchronous writes. Done correctly,
either or both can improve performance at a fraction of the cost of
all-SSD storage.
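Adding such devices to an existing pool is a one-line operation each.
A sketch, assuming a pool named "tank" and placeholder NVMe device
names; note the log device only accelerates synchronous writes, and
mirroring it protects in-flight sync data:

```shell
# Add an SSD as L2ARC read cache (placeholder device name).
sudo zpool add tank cache nvme0n1

# Add a mirrored pair of SSDs as the intent log (SLOG) for sync writes.
sudo zpool add tank log mirror nvme1n1 nvme2n1
```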
If your analysis can make use of concurrent I/O, more drives of smaller
size each will improve performance. One or more external chassis may be
desirable:
6 @ 15 TB
9 @ 10 TB
10 @ 9 TB
15 @ 6 TB
18 @ 5 TB
30 @ 3 TB
45 @ 2 TB
90 @ 1 TB
And, smaller drives make RAID more feasible. E.g. 20 @ 6 TB arranged as
5 raidz1 virtual devices (vdevs) of 4 drives each would provide 90 TB
of storage, support 5 concurrent I/O operations, and tolerate 1 drive
failure per vdev, at an incremental cost of +33%. By comparison, 10 @
18 TB drives arranged as 5 mirror vdevs of 2 drives each would provide
90 TB of storage, support 5 concurrent I/O operations, and tolerate 1
drive failure per vdev, at an incremental cost of +100%. But, the
latter will resilver faster when you replace a failed drive (or a spare
activates).
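The capacity and overhead arithmetic above is easy to check. The
sketch below works through the 20 @ 6 TB raidz1 layout; the zpool
command in the comment uses placeholder device names:

```shell
# Layout sketch (placeholder /dev names; 20 x 6 TB as 5 raidz1 vdevs of 4):
#   zpool create tank \
#       raidz1 sdb sdc sdd sde   raidz1 sdf sdg sdh sdi \
#       raidz1 sdj sdk sdl sdm   raidz1 sdn sdo sdp sdq \
#       raidz1 sdr sds sdt sdu

# Each raidz1 vdev loses one drive's worth of capacity to parity.
drives=20 per_vdev=4 size_tb=6
vdevs=$((drives / per_vdev))
usable_tb=$((vdevs * (per_vdev - 1) * size_tb))
overhead_pct=$(( (drives - vdevs * (per_vdev - 1)) * 100 / (vdevs * (per_vdev - 1)) ))
echo "${usable_tb} TB usable, +${overhead_pct}% drive overhead"
```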
If your analysis can be partitioned across multiple threads and the
threads have independent memory and I/O patterns, putting the data onto
a file server (or NAS) would allow multiple computers to work together
and do the analysis in less time. You will want a fast connection
between the analysis computers and the storage server (e.g. 10+ Gbps
Ethernet), or alternatively a storage area network (SAN).
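If you go the file-server route with ZFS, exporting a dataset over NFS
is built in. A sketch, assuming a pool "tank", a dataset "data", and an
example client subnet; all names here are placeholders:

```shell
# On the storage server: create a dataset and share it over NFS,
# restricted to an example analysis subnet.
sudo zfs create tank/data
sudo zfs set sharenfs='rw=@192.168.1.0/24' tank/data

# On each analysis client (placeholder server name and mount point):
sudo mount -t nfs server:/tank/data /mnt/data
```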
David