By Prasad Surampudi | Sr. Systems Engineer (The ATS Group)
Up until a few years ago, only performance, reliability and ease of administration were considered major factors when selecting a clustered file system for enterprise data storage. But as cloud-based systems and services gain more attention, companies are looking for a complete data management solution that can leverage cloud services to scale cost-effectively and support explosive data growth.
Today’s clustered file systems must be highly scalable across different server architectures, whether they reside on-premises, off-premises, or both. They also need to be highly available, with minimal downtime during maintenance, and should be able to scale across multiple tiers of storage. The file system should also be able to archive legacy and less frequently accessed data to cost-effective storage. It should protect, secure, and encrypt the data to prevent unauthorized access, and it should provide fine-grained access control through ACLs.
In addition, the file system should provide consistent throughput and IOPS across a wide variety of data, ranging from small files to the very large data sets used by big data and analytics applications. It should also support various data access protocols such as POSIX, NFS, CIFS and Object.
IBM Spectrum Scale is a clustered file system that meets all the above requirements. Spectrum Scale is a highly scalable, secure and high-performance file system for large scale enterprise data storage. It is widely used across the world in industries including financial analytics, healthcare, weather forecasting and genomics.
Spectrum Scale has been in development for more than twenty years, since 1998. Earlier versions were known as General Parallel File System (GPFS); IBM rebranded GPFS as Spectrum Scale starting with version 4.1.1.
IBM added many new features and functions with the goal of delivering a complete software-defined storage solution rather than just a clustered file system shared across several nodes. In 2017, IBM released Spectrum Scale 5.0 after a significant development effort to meet the performance and reliability requirements set by the US Department of Energy’s CORAL supercomputing project.
The purpose of this document is to briefly discuss some of the exciting new features and functions of Spectrum Scale 5.0 and to understand how they can be leveraged to meet today’s demanding business requirements.
New Features and Functionality
Let’s look at some of the new features of Spectrum Scale 5.0. They have been categorized based on the Spectrum Scale function group.
Core GPFS Functionality Changes
Variable Sub-Block Size
Earlier versions of Spectrum Scale (GPFS) had a fixed 32 sub-blocks per file system block. With Spectrum Scale 5.0 and above, the number of sub-blocks depends on the file system block size chosen.
[Table: File System Block Size | Number of Sub-blocks]
Starting with Version 5.0, if no block size is specified, Spectrum Scale file systems are created with a default file system block size of 4 MiB and a sub-block size of 8 KiB.
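The relationship between block size, sub-block size and sub-block count is simple division. The following is a minimal Python sketch that uses only the figures quoted above (the 4 MiB default block with 8 KiB sub-blocks, and the fixed 32 sub-blocks of earlier releases). The function names are illustrative and are not part of any Spectrum Scale tooling.

```python
KIB = 1024
MIB = 1024 * KIB

def sub_blocks_per_block(block_size, sub_block_size):
    """Number of sub-blocks in one file system block."""
    if block_size % sub_block_size != 0:
        raise ValueError("block size must be a multiple of the sub-block size")
    return block_size // sub_block_size

def legacy_sub_block_size(block_size, sub_blocks=32):
    """Sub-block size when the count is fixed at 32, as in pre-5.0 releases."""
    return block_size // sub_blocks

# Version 5.0 default: 4 MiB blocks with 8 KiB sub-blocks.
print(sub_blocks_per_block(4 * MIB, 8 * KIB))         # 512
# The same 4 MiB block under the fixed 32 sub-block scheme.
print(legacy_sub_block_size(4 * MIB) // KIB, "KiB")   # 128 KiB
```

In other words, a 4 MiB block that used to be divided into 32 sub-blocks of 128 KiB is now divided into 512 sub-blocks of 8 KiB.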
NSD Server Priority
The preferred or primary NSD server of an NSD can be changed dynamically without unmounting the file system.
File System Rebalancing
With version 5.0, Spectrum Scale uses a lenient round-robin algorithm, which makes rebalancing much faster than the strict round-robin method used in earlier versions.
File System Integrity Check
While a file system integrity check is in progress, if the mmfsck command has been running for a long time, another instance of mmfsck can be launched with the --status-report option to display the current status from all nodes that are running the mmfsck command.
Spectrum Scale cluster health check commands have been enhanced with options to verify file system, SMB and NFS nodes.
The mmcallhome command has a new option, --pmr, which can be used to specify an existing PMR number for data upload.
The Spectrum Scale installation toolkit was introduced with version 4.1, and many enhancements were made in Version 5.0. The installation toolkit now supports deploying protocol nodes in a cluster that uses Spectrum Scale Elastic Storage Server (ESS). It also supports configuring Call Home and File Audit Logging. Deployment of Ubuntu 16.04 LTS nodes as part of the cluster is also supported by the installation toolkit.
Encryption and Compression
The file compression feature was added in Spectrum Scale 4.2 and has been enhanced in Spectrum Scale 5.0 to optimize read performance. Local Read-only cache (LROC) can be used for storing compressed files. Spectrum Scale 5.0 also simplifies IBM SKLM configuration for file encryption.
Starting with Spectrum Scale 4.1.1, data in IBM Spectrum Scale can be accessed using a variety of protocols such as NFS, CIFS and Object. The packaged Samba version has been upgraded to 4.0. Spectrum Scale 5.0 also supports the option to use a Unix primary group in AD. NFS exports can also be modified dynamically without impacting connected clients. CES protocol node functionality is now supported on Ubuntu 16.04 and above.
File Audit Logging
Spectrum Scale File Audit Logging records all file operations, such as create, delete and modify, in a central place. These logs can be used to track user access to the file system.
Files can be compressed in AFM and AFM-DR filesets. Spectrum Scale 5.0 also improves load balancing across AFM gateways. Information Lifecycle Management (ILM) for snapshots is now supported for AFM and AFM-DR filesets. AFM and AFM-DR filesets can be managed using the IBM Spectrum Scale GUI.
Transparent Cloud Tiering (TCT)
TCT now supports remote mounted file systems. Clients can access tiered files on a remotely mounted file system.
Spectrum Scale Big Data Analytics is now certified with Hortonworks Data Platform 2.6 on both POWER8 and x86 platforms, and with Ambari 2.5 for rapid deployment.
Spectrum Scale GUI Changes
The Spectrum Scale GUI was introduced in version 4.1. IBM made significant upgrades to the GUI in IBM Spectrum Scale 5.0. Call Home, monitoring of remote clusters, file system creation and integration of Transparent Cloud Tiering are some of the significant features that were added in version 5.0.
With the enhancements included in version 5.0, Spectrum Scale has truly become an enterprise-class file system for the modern cloud era. Let’s see how we can leverage some of the new features and functions.
File System Block Size
File system block size is a critical parameter that must be chosen for optimal performance before a file system is created and used for large amounts of data. The wide range of block sizes and sub-block sizes offered by Spectrum Scale makes it possible to store files of different sizes in a single file system and still get good throughput and IOPS for various sequential and random workloads.
While a larger block size helps improve throughput performance, having a variable sub-block size and number of sub-blocks enables you to minimize file system fragmentation and use the storage effectively.
But keep in mind that only new file systems created with Spectrum Scale 5.0 and above can take advantage of the variable sub-block size enhancement.
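To see why the smaller sub-block matters for space efficiency, consider how much disk space a small file consumes. The sketch below assumes, for illustration only, that a file's data is allocated in whole sub-blocks, and it ignores metadata and any data held in the inode. The 10 KiB file size is an arbitrary example.

```python
KIB = 1024
MIB = 1024 * KIB

def allocated_bytes(file_size, sub_block_size):
    """Space consumed when data is allocated in whole sub-blocks (assumption)."""
    sub_blocks = (file_size + sub_block_size - 1) // sub_block_size  # round up
    return sub_blocks * sub_block_size

file_size = 10 * KIB    # arbitrary small file
block_size = 4 * MIB

legacy = allocated_bytes(file_size, block_size // 32)   # 32 sub-blocks of 128 KiB
current = allocated_bytes(file_size, 8 * KIB)           # 5.0 default, 8 KiB sub-blocks

print(legacy // KIB, "KiB allocated with 32 sub-blocks per block")   # 128
print(current // KIB, "KiB allocated with 8 KiB sub-blocks")         # 16
```

Under these assumptions, the same 10 KiB file occupies 128 KiB with the older fixed layout and 16 KiB with the 5.0 default, which is where the fragmentation savings come from on file systems holding many small files.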
NSD Server Priority Change
Today’s businesses expect their servers, storage, file systems and applications to run with minimal downtime. Dynamically changing the NSD server priority for each NSD, without unmounting the file system on all NSD servers, helps minimize downtime in several scenarios, such as NSD server retirement.
File System Rebalancing
Network Shared Disks (NSDs) must be added or removed to expand or shrink a Spectrum Scale file system. After an NSD is added or removed, the data in the file system needs to be rebalanced across all NSDs for optimal performance. Most of the time, system administrators let Spectrum Scale rebalance in the background rather than forcing it at the time of the addition or removal. File system performance is not optimal until the NSDs are balanced. With Spectrum Scale 5.0, rebalancing is faster because it uses lenient round-robin instead of strict round-robin.
Within IT organizations, it’s fairly standard for applications to run on a variety of platforms such as AIX, Linux and Windows. Spectrum Scale support for industry standard protocols like CIFS, NFS and Object allows users to access the stored data through these protocols in the most efficient way.
Protocol support enables businesses to consolidate all of their enterprise data into a global name space with unified file and object access, avoiding multiple copies of data. With Spectrum Scale 5.0, customers can now configure servers running Ubuntu 16.04 as protocol nodes, in addition to Red Hat Enterprise Linux.
Configuring the call home feature in Spectrum Scale 5.0 enables IBM to detect cluster and file system issues proactively and enables automatic sending of logs and other required data for timely resolution. This helps customers minimize downtime and improves the reliability of the cluster.
With the installation toolkit, clusters can be configured and deployed seamlessly by defining the cluster topology in a more intuitive way. The installation toolkit performs all necessary prechecks to make sure the required package dependencies are met, automatically installs the Spectrum Scale RPMs, configures the cluster, and configures Protocols, Call Home, File Audit Logging and so on. It simplifies the installation process and eliminates many manual configuration tasks.
Complex tasks such as balancing the NSD servers, installing and configuring Kafka message brokers for File Audit Logging, and setting up Spectrum Scale AD/LDAP authentication are much simpler with the installation toolkit.
With explosive rates of data growth, organizations are always looking for ways to reduce storage costs.
Spectrum Scale file compression, introduced in Version 4.2, addresses the need to minimize storage costs by compressing legacy and less frequently used data. File compression is driven by Spectrum Scale ILM policies and typically provides a compression ratio of 2:1, and up to 5:1 in some cases. Compression not only reduces the amount of storage required, but also improves I/O bandwidth across the network and reduces cache (pagepool) consumption.
With Spectrum Scale 5.0, file compression supports the zlib and lz4 libraries. Zlib is primarily intended for cold data, whereas lz4 is intended for compressing active data. Compression using lz4 favors read-access speed over space savings.
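The trade-off between compression ratio and effort can be illustrated with Python's standard zlib module. This is only an analogy and does not call Spectrum Scale. Because lz4 is not in the Python standard library, a low zlib level stands in here for a lighter, faster setting. The sample data is synthetic and highly repetitive, so the ratios it prints are not representative of real workloads. The helper names are hypothetical.

```python
import zlib

def compression_ratio(data: bytes, level: int) -> float:
    """Return original size divided by compressed size at the given zlib level."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)

def space_needed(logical_size: float, ratio: float) -> float:
    """Physical capacity required to hold logical_size at a given ratio."""
    return logical_size / ratio

# Synthetic, repetitive sample (an assumption, not real file system data).
sample = b"timestamp=2018-01-01 level=INFO msg=heartbeat ok\n" * 2000

for level in (1, 9):   # 1 = fastest, 9 = best compression
    print(f"zlib level {level}: {compression_ratio(sample, level):.1f}:1")

# The 2:1 and 5:1 ratios from the text, applied to 100 TiB of logical data.
for ratio in (2, 5):
    print(f"{ratio}:1 -> {space_needed(100, ratio):.0f} TiB on disk")
```

The last loop shows the capacity arithmetic behind the quoted ratios: 100 TiB of logical data needs 50 TiB at 2:1 and 20 TiB at 5:1.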
Although regular file compression and object compression use the same technology, keep in mind that object compression is available in CES environments, whereas file compression is available only in non-CES environments.
File Encryption was introduced by IBM in Spectrum Scale version 4.1 and is available in Spectrum Scale Advanced and Data-Management editions only.
The data is encrypted at rest. Only data is encrypted, not metadata. Keep in mind that Spectrum Scale encryption protects against storage device misuse and attacks by unprivileged users, but not against deliberate malicious acts by cluster administrators.
Spectrum Scale 5.0 enables encryption of files stored in the local read-only cache (LROC) and simplifies SKLM configuration.
Spectrum Scale encryption can be leveraged where organizations need to store PII, business-critical data and any other confidential data. It can also be used by organizations that are required to meet federal and other security compliance standards, such as GDPR.
Spectrum Scale encryption is also certified as Federal Information Processing Standard (FIPS) compliant.
File Audit Logging
File Audit Logging was introduced in Spectrum Scale 5.0. File Audit Logging addresses the need to track the access of files for auditing purposes. It’s not an easy task to track individual file access in large scale clustered file systems with petabytes of data and billions of files that are accessed by hundreds of applications and thousands of users. Spectrum Scale File Audit Logging is designed to be highly scalable as the file system grows.
File Audit Logging does not need to be installed and configured on every node in the cluster, as some operating system audit logging mechanisms require. It only needs to be configured on a minimum of three quorum nodes in the Spectrum Scale cluster and can be scaled to other nodes as required.
File Audit Logging also supports tracking of file access using NFS, CIFS and Object Protocols.
Addressing GDPR requirements using Spectrum Scale
IBM Spectrum Scale allows organizations to avoid multiple data islands as it provides a single name space for both structured and unstructured data. This helps to have a single point of control when protecting and managing all data that is subject to GDPR compliance.
Spectrum Scale Encryption helps to secure personal data while at rest to meet GDPR security requirements.
IBM Spectrum Scale supports industry standard Microsoft AD and LDAP directory server authentication, along with rich ACL support, to comply with GDPR Right of Access policies.
Active File Management (AFM)
Active File Management can be used to transfer and cache data over a WAN between two Spectrum Scale clusters. One cluster is the home cluster, which stores all data; the other is a cache cluster, which can cache all of the data in the home cluster or only a limited amount of it. AFM can also be implemented as a disaster recovery solution with AFM-DR.
With Spectrum Scale 5.0, AFM and AFM-DR filesets can store data compressed with Spectrum Scale file compression. Load balancing improvements and ILM support for AFM and AFM-DR fileset snapshots have also been added.
Transparent Cloud Tiering (TCT)
As the name implies, Transparent Cloud Tiering is another way to reduce high-performance storage costs by transparently migrating aged data to less expensive cloud storage. This makes more room to ingest new data into the high-performance storage tier. Spectrum Scale ILM policies can be used to scan the file system metadata, identify files that have not been accessed for months or years, and tier them to cloud storage. Since only file data is migrated and not metadata, the migration process is transparent to users and applications. The data is pulled from cloud storage back to local file system storage when users or applications try to access it.
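The idea of selecting cold files by last-access time can be sketched in a few lines of Python. This is not the Spectrum Scale policy engine or its policy language. It simply walks a directory tree and reports files whose access time is older than a threshold. The function name, the 180-day threshold and the path are all hypothetical.

```python
import os
import time

def find_cold_files(root, max_idle_days, now=None):
    """Yield (path, idle_days) for files not accessed within max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                yield path, int((now - atime) // 86400)

if __name__ == "__main__":
    # Hypothetical mount point; candidates would be handed to a tiering step.
    for path, idle in find_cold_files("/gpfs/fs1/projects", max_idle_days=180):
        print(f"{idle:5d} days idle  {path}")
```

A real deployment would rely on the file system's own metadata scan, which avoids walking the directory tree, but the selection rule (access time older than a cutoff) is the same in spirit.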
Spectrum Scale Transparent Cloud Tiering was introduced in Version 4.2 and is available with the Data Management edition only.
With Spectrum Scale 5.0, TCT supports file systems mounted from remote clusters. TCT-enabled filesets can now use different containers. Multiple cloud accounts and containers are also supported.
As companies began leveraging social media feeds and other unstructured data for business analytics, file systems now need to support both structured and unstructured data under a single name space. Spectrum Scale introduced the File Placement Optimizer (FPO) architecture and a Hadoop plug-in starting with Version 4.1 to support big data applications and frameworks. Later versions of Spectrum Scale enhanced Hadoop Distributed File System (HDFS) transparency. Spectrum Scale is certified on Hortonworks Data Platform (HDP) 2.6.5 from Hortonworks, a major big data framework distributor.
Spectrum Scale also supports Ambari for easy and quick deployment in large scale Hadoop clusters.
Spectrum Scale GUI Changes
Ease of deployment and administration is one of the major requirements of large clustered file systems that are deployed across hundreds of servers. The Spectrum Scale GUI, introduced in version 4.1, simplifies the installation, configuration and administration of large scale clusters. IBM made many significant enhancements to the Spectrum Scale GUI to make it more intuitive for routine system administration and monitoring tasks.
The Spectrum Scale GUI can now monitor cluster performance, cluster health, individual node health, and SMB, NFS and Object protocol health, among other enhancements. In certain cases, routine actions can be applied to fix errors with a single click from the Spectrum Scale GUI.
The Spectrum Scale 5.0 GUI now supports monitoring of remote clusters and Transparent Cloud Tiering, and also provides IBM call home support for cluster, node and file system issues.
IBM Spectrum Scale continues to add and align features into a complete data management solution. This solution meets market demand for high scalability across different server architectures, on-premises or in the cloud. It supports new-era big data and artificial intelligence workloads along with traditional applications while ensuring security, reliability and high performance. The IBM Spectrum Scale solution also offers the stability, maturity and trust that come with more than twenty years of development.