Top 5 most common mistakes when choosing storage for a server
Today, we will compare different solutions for building storage and discuss the most common mistakes that can occur.
In general, we will focus on the differences between currently popular solutions and compare them, before moving on to the materials from this link, primarily the Storage types section.
The solutions we will focus on, mainly from the perspective of solution architecture, and which form the main topic of today’s discussion, are:
- HW RAID
- ZFS
- CEPH
We will take into account the issue of virtualization in Proxmox, but from a general perspective, it will be more about the world of Linux.
HW RAID
The technology itself grew up at a time when disks did not have today’s capacities, the need for redundancy was high, and the software world was not as mature as it is now. As an older product, it carries several problems from that era.
The main problems include:
- Vendor lock/compatibility
- Difficulty defining what HW RAID is
- Problematic component replacement in case of failure
- High price
Vendor lock/compatibility
Nowadays it is not as prevalent, but there were times when an HW RAID purchased from a certain manufacturer could only be restored on the same hardware from the same manufacturer. Fortunately, this phenomenon has largely disappeared with the decreasing demand for RAID servers. Nevertheless, it still occasionally occurs and is always a difficult issue to resolve. For sensible companies, the ideal solution is prevention: simply keep a spare RAID card, or an identical server, in stock.
Difficulty defining what HW RAID is
Different manufacturers have different ways of delivering products, and on the market you will find many ways of meeting the customer’s price point. To be clear: a true HW RAID must have a dedicated card; essentially, it is a dedicated machine inside the server. Because HW RAID was, and still is, expensive, manufacturers have looked for savings, modifying the technology in various ways to reach lower end-customer prices. You can therefore find “HW RAID” that basically simulates SW RAID, but is cheap, and there are more tricks of this kind on the market. The concept of HW RAID is thus quite vague, and it is always necessary to look at the circumstances of its use in practice.
Problematic component replacement
In the case of an HW RAID failure, it is often difficult to find the same component or someone who still distributes it. This is a natural state in the market environment, which crystallizes over time.
High price
High-quality HW RAID is typically balanced by a high price. If the price of the server is low, it probably does not include a high-quality RAID card.
And since we are talking about a dedicated card, it is also a single point of failure (SPoF) for the whole server.
From today’s perspective, another disadvantage of HW RAID is its physical limitations, that is, the maximum capacities it can handle. We can argue about this point, but most will agree that an array of several PB is not really a job for HW RAID.
To this day, SW solutions have surpassed HW RAID in practically every respect, yet HW RAID is still found in companies. The reasons differ. Often it is administrators who learned about HW RAID in school years ago and consider it the standard. More often, though, it is the result of buying a server that happens to include RAID without anyone asking what kind, because the invoice with the final price is simply more important than the product on which the whole company depends. So let’s look at how the world of software solutions has approached the issue of capacity growth.
ZFS
ZFS, or Zettabyte File System, can be summed up in one sentence:
A file system that can repair itself, take snapshots, is focused on working with large amounts of data, and therefore supports capacities up to zettabytes.
ZFS allows administrators to achieve almost unattainable capacities in everyday practice. To quote from Wikipedia:
Some theoretical limits in ZFS are:
16 exbibytes (2^64 bytes): maximum size of a single file
2^48: number of entries in any individual directory
16 exbibytes: maximum size of any attribute
2^56: number of attributes of a file (actually constrained to 2^48 for the number of files in a directory)
256 quadrillion zebibytes (2^128 bytes): maximum size of any zpool
2^64: number of devices in any zpool
2^64: number of file systems in a zpool
2^64: number of zpools in a system
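The units above are easy to sanity-check. Here is a quick illustration in plain Python arithmetic (nothing from the ZFS code base, just converting the powers of two into the units quoted):

```python
# Sanity-check the quoted ZFS limits by converting powers of two
# into the units used above. Purely illustrative arithmetic.

EIB = 2**60  # one exbibyte in bytes
ZIB = 2**70  # one zebibyte in bytes

max_file_bytes = 2**64    # maximum size of a single file
max_zpool_bytes = 2**128  # maximum size of any zpool

print(max_file_bytes // EIB)   # 16 -> "16 exbibytes" per file
print(max_zpool_bytes // ZIB)  # maximum zpool size expressed in zebibytes
```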
From the administrator’s point of view, managing ZFS is very easy. ZFS also lets you work with auxiliary devices that can increase efficiency (options such as SLOG, L2ARC, and SPECIAL). ZFS likewise offers disk redundancy analogous to RAID, just under a slightly different nomenclature than is customary in the RAID world.
General disadvantages of ZFS include:
- Lower speed, which is offset by the ability to use cache devices and by the better data integrity the array provides.
- The need for equal disk capacities in the array (ideally); disks can be larger, but not smaller.
- RAM consumption. If you’re wondering how much RAM ZFS needs, the answer is: ALL of it.
- It is easy to grow a pool, but shrinking a pool is always a problem.
A beautiful comparison of HW RAID, SW RAID, and ZFS is available at this link
CEPH
To better understand what CEPH is, let me use an analogy. Imagine a RAID6 array: it is resilient to the failure of two disks. Now imagine something like RAID6 between servers. Simply put, if two servers fail, the entire infrastructure still works.
CEPH is software-defined storage. All abstraction runs at the software level and communicates with hardware that can be cheap, commodity, and heterogeneous. The technology focuses on building high-capacity arrays with essentially no upper limit.
The strength of CEPH lies in several points, such as:
- Universality (it can be deployed on any hardware)
- High redundancy (resistance to the failure of n servers)
- Self-management (CEPH manages itself)
- Self-healing (CEPH fixes itself)
- Distributed data (data is distributed pseudo-randomly across all machines in CEPH using the CRUSH map)
- Massive speed (with some approximation, it can be said that the speed of the entire CEPH increases linearly with the number of servers)
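To illustrate the distributed-data point, here is a minimal sketch of deterministic pseudo-random placement. It uses rendezvous-style hashing as a stand-in for CRUSH; the server names, replica count, and hashing scheme are our own simplifications, not the real CRUSH algorithm:

```python
import hashlib

# Minimal sketch of deterministic pseudo-random placement, in the
# spirit of CRUSH (NOT the real CRUSH algorithm): every client can
# compute which servers hold an object, with no central lookup table.

SERVERS = ["node1", "node2", "node3", "node4"]
REPLICAS = 3

def place(object_name: str) -> list[str]:
    """Pick REPLICAS distinct servers for an object, deterministically."""
    ranked = sorted(
        SERVERS,
        key=lambda s: hashlib.sha256(f"{object_name}:{s}".encode()).hexdigest(),
    )
    return ranked[:REPLICAS]

print(place("vm-100-disk-0"))  # the same input always yields the same placement
```

The key property mirrors CRUSH: any client computes an object’s location independently, so there is no central metadata server to query or keep in sync.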
However, we cannot just sing praises for CEPH. Often, we encounter situations where a company wants to have CEPH but does not know how to administer it. Administration can be easy, but it can also be difficult. Under normal circumstances, CEPH on Proxmox “just runs” and is just like any other array, but in the case of various atypical situations, which usually arise from a lack of knowledge of this technology, the advantage can easily turn into a disadvantage. Let us look at a small example that demonstrates how CEPH differs from typical RAID arrays.
I have 4 servers, and each server has 4 slots for 2.5-inch disks. In each server, 2 of those disks are dedicated to running Proxmox. I place a 1TB SSD in each server, giving a total of 4TB of SSD across the 4 servers.
- What will be the resulting capacity of CEPH storage?
- What will be the effective working capacity of CEPH storage?
- What will be the minimum usable capacity of CEPH storage?
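For orientation, here is one hedged way to reason about those questions, assuming Ceph’s default 3-way replication (size=3) and the commonly used ~85% near-full warning threshold; both are configurable, so treat the numbers as illustrative:

```python
# Back-of-the-envelope Ceph capacity math for the example above:
# 4 servers, one 1TB SSD (one OSD) each. Assumes the default size=3
# replication and an ~0.85 near-full ratio; both are tunable.

servers = 4
osd_tb_per_server = 1.0
replication_size = 3
nearfull_ratio = 0.85

raw_tb = servers * osd_tb_per_server      # raw capacity across all OSDs
logical_tb = raw_tb / replication_size    # what clients can actually store
safe_tb = logical_tb * nearfull_ratio     # stay under the near-full warning

print(f"raw: {raw_tb:.2f} TB")            # 4.00 TB
print(f"logical: {logical_tb:.2f} TB")
print(f"safe working: {safe_tb:.2f} TB")
```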
The above example shows exactly how important it is to understand the pool in question. This complexity, and the higher demands it places on expertise, are considered a general disadvantage of CEPH: dealing with such issues is simply more challenging. You can judge for yourself from this video:
Most common mistakes
Finally, we come to the most common mistakes found in practice with the solutions above. The list is not exhaustive, but it contains the TOP mistakes that keep happening over and over again, which is why this article was written: to help minimize them.
Error 1: HW RAID under ZFS
If I have storage on HW RAID, it makes no sense to build ZFS on top of that pool. It will work, but it’s wrong, and it is the most common mistake of all. Why? Simply because HW RAID can lie about some commands or about the state of the pool. ZFS expects a bare disk underneath and handles everything itself: it reads the disk’s status and lifespan data and, above all, works with checksums. With ZFS sitting on top of RAID, it can happen that ZFS calculates checksums, the RAID feeds it wrong information, ZFS flags a block as faulty even though it is not, and a problem is born. On the internet you will find daring people who have put ZFS on top of HW RAID, and it really works for them; and yes, it will really work, right up until the first problem.
Hardware RAID controllers should not be used with ZFS. While ZFS will likely be more reliable than other filesystems on Hardware RAID, it will not be as reliable as it would be on its own
Here are some other materials from the internet on this issue, often answering why HW RAID should not be used under ZFS:
Error 2: HW RAID under CEPH
HW RAID under CEPH is not recommended. It can be done if the RAID card supports IT mode, but even then it is not a good idea. Here is an excerpt from the documentation:
Disk controllers (HBAs) can have a significant impact on write throughput. Carefully consider your selection of HBAs to ensure that they do not create a performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency than simpler “JBOD” (IT) mode HBAs. The RAID SoC, write cache, and battery backup can substantially increase hardware and maintenance costs. Some RAID HBAs can be configured with an IT-mode “personality”.
In addition, using an HW RAID pool under CEPH means stacking redundancy on top of redundancy:
Yes, a RAID controller takes care of redundancy with a handful of disks in one chassis. But that’s cost and complexity when you run already redundant, multi node distributed storage solutions like Ceph. Why bother mirroring a physical disk when Ceph already has multiple copies of it?
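The capacity cost of that double redundancy is easy to quantify. A sketch, assuming RAID1 mirroring under Ceph’s default 3-way replication (the 12TB figure is a made-up example):

```python
# Why RAID under Ceph wastes capacity: the overheads multiply.
# Assumes RAID1 (half the raw space) under Ceph size=3 replication.

raw_tb = 12.0            # hypothetical raw disk capacity per node
raid1_efficiency = 0.5   # mirrored pairs keep half the raw space
ceph_size = 3            # default 3-way replication

usable_without_raid = raw_tb / ceph_size                  # 4.0 TB
usable_with_raid = raw_tb * raid1_efficiency / ceph_size  # 2.0 TB

print(usable_without_raid, usable_with_raid)
```

With RAID in the stack you pay for six copies of every block while still only tolerating the failures Ceph alone already covers.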
Error 3: Improperly chosen hardware for deploying ZFS/CEPH
One of the critical issues in infrastructures is the use of inappropriate hardware for CEPH or ZFS. A solution built on CEPH can be operated on commodity hardware, as the CEPH documentation itself states:
A Ceph Node leverages commodity hardware and intelligent daemons, and a Ceph Storage Cluster accommodates large numbers of nodes, which communicate with each other to replicate and redistribute data dynamically.
However, people interpret this passage about commodity hardware differently. Some believe they can deploy whatever they find in the warehouse and expect it to work. This is where another common hardware-selection problem arises: we have an old HP server in stock, so we put it in the cluster. A few months later, someone is looking for a CEPH expert to fix a cluster that isn’t working, and nobody knows why. CEPH has certain minimum requirements, and then recommended requirements. For production, always go by the recommended requirements; the minimums are suitable for laboratory conditions and study. From our experience, we recommend adhering to the following minimums when building a CEPH cluster:
- 2x10G network cards and at least 4x1G ports.
- All the RAM that can be installed. The more the server can handle, the better.
- 16 or more cores; hyper-threading is welcome.
- No HW RAID.
- NVMe disks (in the case of low-cost solutions, SSD disks).
- 4 nodes in one cluster.
- Backup solution – e.g., PBS (Proxmox Backup Server)
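To make the checklist actionable, here is a small sketch that validates a planned node against the minimums above. The field names and thresholds are our own, not an official Proxmox or Ceph schema:

```python
# Validate a planned Ceph node against the recommended minimums above.
# The dict keys and thresholds mirror our checklist, not any official spec.

def check_node(node: dict) -> list[str]:
    """Return a list of problems; an empty list means the node qualifies."""
    problems = []
    if node.get("nics_10g", 0) < 2:
        problems.append("need at least 2x10G NICs")
    if node.get("cores", 0) < 16:
        problems.append("need 16+ CPU cores")
    if node.get("hw_raid", False):
        problems.append("HW RAID must be removed or flashed to IT mode")
    if node.get("disk_type") not in ("nvme", "ssd"):
        problems.append("use NVMe (or SSD for low-cost builds)")
    return problems

node = {"nics_10g": 2, "cores": 24, "hw_raid": True, "disk_type": "nvme"}
print(check_node(node))  # ['HW RAID must be removed or flashed to IT mode']
```

Running a check like this against the inventory quickly shows which stock servers actually qualify before they ever join the cluster.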
If the above is not adhered to, CEPH becomes more demanding to build, and the natural consequence will be complications in its operation.
Error 4: Misunderstanding of CEPH documentation
One frequent problem is misunderstanding the documentation. CEPH works in such a way that the more servers are in the cluster, the greater the redundancy, and speed grows almost linearly. The documentation states that the recommended minimum number of servers is four, while the minimum functional number is three. Because of price, companies almost always converge on three servers, losing some of CEPH’s benefits, even though a three-node cluster also works. This is partly a legacy of earlier times, when the minimum number of nodes in a cluster was always three.
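The practical difference between three and four nodes can be reduced to one condition: after a host failure, Ceph can only restore full redundancy if enough distinct hosts remain. A minimal sketch, assuming a host-level failure domain and the default size=3:

```python
# With size=3 replication and a host-level failure domain, each object
# needs copies on 3 distinct hosts. A sketch of recovery after failures.

REPLICATION_SIZE = 3

def can_self_heal(total_nodes: int, failed_nodes: int) -> bool:
    """True if enough hosts remain for Ceph to restore full redundancy."""
    return total_nodes - failed_nodes >= REPLICATION_SIZE

print(can_self_heal(3, 1))  # False: a 3-node cluster stays degraded
print(can_self_heal(4, 1))  # True: a 4-node cluster re-replicates and heals
```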
We have gone through the top errors, and now let’s take a look at an interesting feature associated with storage engineering in Proxmox, which is often discussed when creating architecture.
Types of storage in Proxmox and their possibilities
To begin, let’s take a look at this table:
| Description | PVE type | Level | Shared | Snapshots | Stable |
|---|---|---|---|---|---|
| ZFS over iSCSI | zfs | block | yes | yes | yes |
1: On file based storages, snapshots are possible with the qcow2 format.
2: It is possible to use LVM on top of an iSCSI or FC-based storage. That way you get a shared LVM storage.
When designing infrastructure on Proxmox with the freedom to choose hardware, it’s a win: we can freely choose the storage and tailor it to the hardware. In many companies, however, servers are migrated to Proxmox from another platform (VMware/Hyper-V), and the best way to utilize the existing storage must be found.
For the purpose of this article, we are not considering only one node, but a cluster, and therefore we will focus on the shared storage section. We believe this topic is more interesting for our readers.
An example from practice:
Here is a specific example from practice with inputs:
- It is undesirable to have HW RAID on CEPH/ZFS.
- Shared storage is required.
- Only Fibre Channel connections (FC) are available.
- Snapshotting is required.
- Servers have HW RAID and are not in pass-through mode.
Answer and possible solution:
Shared storage options to consider include:
- LVM (with the use of FC)
- iSCSI variants
With FC connections, these options are eliminated:
(Of course, if our model SAN storage could do NFS/CIFS/iSCSI, we’d win, but in the model example, it cannot.)
Since the storage is on an HW controller, the option of CEPH and ZFS is eliminated. In this case, quality is better than quantity.
We have two storage options left to work with: LVM and GlusterFS. LVM can be deployed, but it cannot take snapshots. GlusterFS can take snapshots, subject to a condition:
On file-based storages, snapshots are possible with the qcow2 format.
So, the correct solution to meet the customer’s needs is to use GlusterFS.