Metadata Attribute Caching in the Upstream Kernel
As of Linux 4.9, the upstream kernel client caches metadata attributes, which significantly improves performance by reducing stat traffic over the wire. Any system running the 4.9 kernel or newer benefits from this caching automatically.
LMDB as a metadata alternative
We have provided LMDB as a build-time option for higher-performing metadata operations. It is in the trunk and has shown significant performance improvement over BDB, so we recommend its use where possible. Note, however, that there is no upgrade path from BDB to LMDB: to use LMDB, you must create a new file system with it.
Capability-Based Access Control
The access control project is implementing file system security in a distributed environment by using strong encryption to sign and verify data structures that encode user and file attributes. This approach is also called capability-based access.
PVFS implements standard Posix access control in the form of a file owner, group, and access control list. The client library gets the user ID and group IDs from the OS on the client and sends them along with requests to the server. Because PVFS is implemented at the user level, it is possible for a program to forge its identity when contacting a server, bypassing access control. PVFS is also highly distributed, with metadata stored separately from data, making access control difficult to manage without resorting to shared state.
OrangeFS is implementing a new access control scheme. Two items are signed using a public key encryption scheme: credentials, which hold user identity info, and capabilities, which hold the results of access control decisions. All access control data is stored with a file's metadata. Clients send a credential which contains user attributes to the server holding the metadata. An access control decision is made using a policy that refers both to the user's attributes, and to the file's attributes. The result of the decision is that the user is granted specific access rights (read, write, etc.) to the specific data objects holding the file data. These objects may be on a different server. A capability records this information, is signed, and returned to the client. The client then uses the capability to access data objects.
Credentials and capabilities have timeouts and servers maintain revocation lists, both of which serve to limit the lifetime of the objects. Clients can cache these items to reduce the number of times they must be generated. Capabilities can be shared with other processes owned by the user.
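The sign-and-verify flow for credentials and capabilities can be sketched as follows. This is a minimal illustration, not the OrangeFS wire format: all field names are hypothetical, and an HMAC stands in for the public-key signature scheme so the example is self-contained.

```python
import hmac, hashlib, json, time

# Stand-in for the server's signing key; OrangeFS uses public-key
# signatures, but an HMAC keeps this sketch self-contained.
SERVER_KEY = b"demo-signing-key"

def sign(obj):
    payload = json.dumps(obj, sort_keys=True).encode()
    return hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()

def make_credential(uid, gids, lifetime=300):
    # A credential holds user identity info and carries a timeout.
    cred = {"uid": uid, "gids": gids, "expires": time.time() + lifetime}
    cred["sig"] = sign({k: v for k, v in cred.items() if k != "sig"})
    return cred

def make_capability(cred, handle, rights):
    # The metadata server verifies the credential, applies its access
    # policy, and returns a signed capability naming the data objects
    # (here a single hypothetical handle) and the granted rights.
    body = {k: v for k, v in cred.items() if k != "sig"}
    assert hmac.compare_digest(cred["sig"], sign(body)), "bad credential"
    cap = {"uid": cred["uid"], "handle": handle, "rights": rights,
           "expires": time.time() + 300}
    cap["sig"] = sign({k: v for k, v in cap.items() if k != "sig"})
    return cap

def verify_capability(cap):
    # A data server accepts a capability only if the signature checks
    # out and the timeout has not passed.
    body = {k: v for k, v in cap.items() if k != "sig"}
    return (hmac.compare_digest(cap["sig"], sign(body))
            and cap["expires"] > time.time())

cred = make_credential(1000, [1000, 100])
cap = make_capability(cred, handle=0xABCD, rights=["read"])
print(verify_capability(cap))  # True
```

Because the capability is self-validating, the client can cache it and present it to any data server holding the file's objects without further contact with the metadata server, until the timeout expires.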
Currently, the attributes used in this system are much the same as in Posix file systems: user ID, group, etc., but more elaborate schemes can be developed using different attributes and corresponding access policies. Examples include role-based schemes or tiered security levels. OrangeFS does not handle authentication itself. An existing mechanism, such as local login with a password or remote authentication with Shibboleth, can be used to generate a signed credential which is then used by OrangeFS.
Redundancy on Immutable Project
The redundancy project provides a system administrator with tools to add or remove servers and to move metadata objects from one server to another. The administrator may take servers offline for backup or maintenance, or add new servers, without interfering with normal processing. The administrator may also move objects between servers to manually balance the number of objects per server.
In addition to the new administrator tools, the redundancy project provides automatic failover when a server unexpectedly goes down by duplicating the metadata and physical data across multiple servers. The amount of system-wide redundancy is controlled by the system administrator, while individual users may control the amount of redundancy for their own data.
Currently, the OrangeFS team has implemented the infrastructure that creates copies of physical data files among the network of servers and automatically reads from any of the copies if the primary source is unavailable. The copies are created when the user marks a file immutable, either with chattr on Linux (or a similar utility on other Unix systems) or with the OrangeFS application pvfs2-xattr. Before setting the file immutable, the user must set two other extended attributes, the number of copies and the mirroring mode, using pvfs2-setmattr. The mirroring mode allows the user to turn failover on and off at will, but it must be active when the immutable flag is set in order to trigger the copy creation. The user may view the failover settings for a particular file with pvfs2-getmattr.
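The trigger condition described above can be modeled in a few lines. This is a hypothetical sketch of the rule, not the actual implementation; the attribute names are illustrative.

```python
# Sketch of the mirroring trigger: copies of a file's data are created
# only when the file is marked immutable while its mirroring mode is
# active. Attribute names here are illustrative, not the real ones.

class MirroredFile:
    def __init__(self):
        self.attrs = {"mirror.copies": 0, "mirror.mode": "off",
                      "immutable": False}
        self.copies_created = 0

    def setmattr(self, copies, mode):
        # pvfs2-setmattr analogue: set the copy count and mirroring
        # mode before marking the file immutable.
        self.attrs["mirror.copies"] = copies
        self.attrs["mirror.mode"] = mode

    def set_immutable(self):
        # chattr/pvfs2-xattr analogue: marking the file immutable
        # triggers copy creation, but only if the mirroring mode is
        # active at that moment.
        self.attrs["immutable"] = True
        if self.attrs["mirror.mode"] == "on":
            self.copies_created = self.attrs["mirror.copies"]

f = MirroredFile()
f.setmattr(copies=2, mode="on")
f.set_immutable()
print(f.copies_created)  # 2
```

If the mirroring mode were "off" when set_immutable ran, no copies would be created, which is why the mode must be active before the immutable flag is set.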
This implementation is limited in scope but provides the basic steps upon which to build a fully functional failover system. In the future, the OrangeFS team will provide dynamic mirroring, in which mirrored copies are created as a file is being written, eliminating the need to set the immutable attribute. In addition, the metadata will be mirrored, and the system will continue to function when a server is down; both capabilities are required before full failover can be realized.
High Performance User Interface
This project is developing a complete and robust user interface library that allows OrangeFS users to bypass the local kernel and achieve the best performance from their file system. The interface includes standard Posix-like system call and stdio library functions, as well as extensions that allow use of advanced file system features.
PVFS provides two different user interfaces:
- A Linux kernel interface
- A direct interface via pvfs_lib
The kernel interface is the easiest to use because it makes a PVFS file system appear like most other file systems, allowing most programs to use it directly; it is the interface employed by most users. Unfortunately, the kernel interface offers considerably lower performance and functionality than the direct interface. The Linux file system infrastructure is based on the Posix standard, which in turn was developed for local disk file systems. The PVFS system interface provides "direct" access to the PVFS servers, gives the best performance, and is the most reliable. The problem with the system interface is that the only practical way for a user to access it is through MPI-IO. This is fine for MPI users, but not so convenient for others.
The OrangeFS project is developing a multi-layer user interface that allows programs to link directly to the system interface. There is a common IO layer which is used to implement the higher layers, a new MPI-IO interface designed specifically for PVFS, a Posix-like system call layer (open, close, read, write), and a C stdio library layer (fopen, fclose, fread, fwrite). These layers are designed to work together and provide the common functions used by applications and high-level applications libraries. These layers can be linked directly with an application, or preloaded as a shared library so that existing applications can be run without recompiling. Best of all, there are extensions to these interfaces that allow high performance applications to more directly control their IO and utilize the underlying file system.
Examples of such extended features include:
- Buffering (exists in stdio, but can be adjusted more easily)
- User-level caching (more aggressive than buffering and more features)
- Run-time monitoring - the ability for the program to monitor traffic on servers and adjust
- Distribution - the ability to manage how data is distributed in the file system
This new interface would not replace the existing kernel-level interface for all users, but would supplement it, providing a wider range of choices. Projects are also under way to improve this interface, focusing on the experimental FUSE interface.
This code has been added to the OrangeFS source tree under src/client/usrint and currently consists of the following files:
- stdio.c/stdio.h - implementation of stdio library functions primarily to add special features and bind to the lower-level libraries
- posix.c/posix.h - implements wrappers for all of the Posix IO system calls - these calls use a method table to select the proper implementation to call (either PVFS or not)
- posix-pvfs.c/posix-pvfs.h - implements all of the Posix IO system calls using a descriptor table in user space and directs all IO through PVFS
- filetable-util.c - implements user space file descriptor and stream functions
- iocommon.c - implements IO operations using PVFS sysint calls - supports both Posix and MPI-IO
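The method-table dispatch used by the posix.c wrappers can be sketched as follows. This is a conceptual illustration in Python, not the library's actual C API: every name here is hypothetical, and the point is only the routing pattern, where each user-space descriptor records which backend (PVFS or the host OS) should service it.

```python
import os

# Hypothetical backends: each implements the same operation set, and a
# method table selects between them per descriptor, as posix.c's
# wrappers do between the PVFS and non-PVFS implementations.

class PvfsOps:
    def read(self, fd, n):
        return b"pvfs-data"[:n]   # stand-in for I/O via PVFS sysint calls

class OsOps:
    def read(self, fd, n):
        return os.read(fd, n)     # fall through to the kernel

method_table = {"pvfs": PvfsOps(), "os": OsOps()}
descriptor_table = {}             # user-space fd -> (backend, path)

def my_open(path, backend):
    # User-space descriptor table, as in posix-pvfs.c; numbering is
    # arbitrary here.
    fd = len(descriptor_table) + 1000
    descriptor_table[fd] = (backend, path)
    return fd

def my_read(fd, n):
    # Wrapper looks up the descriptor's backend and dispatches.
    backend, _ = descriptor_table[fd]
    return method_table[backend].read(fd, n)

fd = my_open("/mnt/orangefs/file", "pvfs")
print(my_read(fd, 4))  # b'pvfs'
```

Keeping the descriptor table in user space is what lets the library route PVFS I/O around the kernel entirely while still passing non-PVFS files through to the OS.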
In addition there is an MPI-IO implementation that can be used with either MPICH or OpenMPI on top of the iocommon.c routines. This is not yet installed into the source tree.
Traditionally in PVFS, all entries for a given directory are stored on the server holding the metadata for the directory itself. The distributed directories project will allow directory entries for a given directory to be spread across multiple servers. This will allow very large numbers of files in a directory to be handled efficiently, as multiple tasks may access different parts of a directory in parallel. See the wiki page for more details.
With distributed directories, OrangeFS allocates multiple dspace handles spread among the various servers for storing directory entries in much the same way data is spread among servers for regular files. Which server holds a particular directory entry is determined by applying a hash function to the name of the entry, similar to the approach used in GPFS and GIGA+ to distribute directories. The list of dspace handles is returned by getattr so that a client may use these handles to access directory entry information.
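The placement rule can be sketched in a few lines. This is a hedged illustration: the actual hash function and handle values used by OrangeFS are not specified here, so CRC32 and the handle numbers below are stand-ins.

```python
import zlib

def server_for_entry(name, dirent_handles):
    # A deterministic hash of the entry name selects one of the dspace
    # handles (one per server), so any client can locate an entry
    # without consulting a central table. CRC32 is a stand-in for
    # whatever hash the file system actually uses.
    return dirent_handles[zlib.crc32(name.encode()) % len(dirent_handles)]

# Hypothetical list of dspace handles returned by getattr, one per
# metadata server holding directory entries.
handles = [0x1001, 0x2002, 0x3003, 0x4004]

for name in ["a.txt", "b.txt", "c.txt"]:
    print(name, hex(server_for_entry(name, handles)))
```

Because every client computes the same hash over the same handle list, lookups, creates, and removes for a given name all land on the same server with no coordination, which is what allows directory operations from many tasks to proceed in parallel.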
In the initial implementation, directory entries are spread across all available metadata servers. Future plans include mechanisms to dynamically expand and contract the number of servers holding a directory based on the number of entries present. A system administrator will be able to control system-wide settings for the number of servers among which entries are distributed, and a user will be able to control that setting for his own directories.
Another limitation of the initial implementation is that the root directory is not distributed. As with earlier releases of PVFS, all root directory entries are stored on the server holding the metadata for the root directory.
Future research will evaluate advanced strategies for retrieving entries from multiple servers.