High-frequency trading has undergone a revolution through the application of machine learning for trading strategies, and success is measured by the fast execution of scientific model results. Leading quantitative trading companies are constantly seeking out new strategies to gain better market insight and improve trading outcomes. The teams who work in quantitative trading are among the best minds in math, computer science, and engineering. As academics from major institutions, they look for the most advanced technologies and techniques to develop a competitive edge. The constant evolution of new models presents a high level of unpredictability to the underlying infrastructure and requires a modern architecture that can handle the most demanding of storage workloads – data intensive, latency sensitive applications.


Financial modeling using machine learning algorithms is a hot new market, but the tools that have been traditionally used for quantitative trading were developed for high performance computing (HPC) and fail to meet the performance demands of GPU-centric workloads.  Our customer profiles show that new trading workloads handle millions of tiny files at a very high rate and performing near real-time analytics on the data.


Legacy file systems - both NFS NAS and parallel file systems - have been developed and optimized decades ago for spinning media and cannot not easily leverage low-latency storage media based on modern flash technology. Typically, new financial models are developed on a single GPU node in a sand-box environment, with synthetic data sets. However, when these models are put into production the storage systems buckle under the small file, latency sensitive workload demanded in machine learning. We have had customers tell us they experienced a 75% reduction in wall clock time compared to what they could get with local SSDs inside the GPU node. This performance reduction in the production systems– and associated increase in wall clock time to complete a market simulation - renders the model unusable as the trading window is unacceptable.  The predominant response to this performance hit is to run the workload on local-drive SSD storage inside the GPU nodes to restore a viable trading window.

Local SSD storage inside the individual GPU compute nodes delivers great performance however it introduces complexity to the workload. First, the data sets are now limited to the physical capacity available inside the individual GPU node. This fundamentally breaks the first rule of great machine learning algorithms (i.e the more data used in the model the more accurate the outcome). A second challenge with utilizing local drives for machine learning is the practical limitation of flash itself. Flash technology provides great random, small file performance but unlike disk drives, it has limited endurance and is particularly vulnerable when a workload is write-intensive. Many of the financial machine learning use cases require writing millions of files at very high speed, computing algorithms and then capturing the output.  Under this workload, flash drives will fail creating a daunting challenge – replacing physical media inside the GPU nodes.  This may seem a trivial problem, but our customers have reported burning out their SSDs and suffering the painful downtime of valuable GPU compute resources simply to change out drives.


Matrix is a fully parallel and distributed file system that has been designed from scratch to leverage high performance Flash technology. It is deployed on standard server architecture as a shared storage system that is connected to the GPU nodes via high performance networks. Both data and metadata are distributed across the entire storage infrastructure to ensure massively parallel access to NVMe drives. The software’s ultra-low latency network stack runs over Ethernet or InfiniBand, delivering the lowest latency and highest bandwidth performance for the most demanding data and metadata operations.  In numerous environments Matrix has demonstrated better than local disk performance due to the parallelism of the file system combined with its extreme low latency.

Matrix allows machine learning models to share a data set far greater than can reside on local storage inside the GPU nodes,  In addition by distributing workloads across many drives, it eliminates the challenge of single device wear that is a challenge in write intensive workloads. And when drives need to be replaced, it does not require downtime of the most expensive resources (the GPU servers), Matrix allows for hardware upgrades while the cluster continues to deliver full performance to the GPU nodes.  Finally, Matrix has the ability to seamlessly tier to object storage, allowing the data set to scale to Exabytes of storage in a single namespace.

If it sounds too good to be true, check out a live demo of Matrix that was recently recorded during Storage Field Day.  Notice how WekaIO Matrix continued to deliver great performance even with two full nodes down for maintenance.

Figure 1: WekaIO Matrix Software on Commodity Server Infrastructure

WekaIO will be at the Deep Learning in Finance Summit, in London on 19 & 20 March. They will be presenting WekaIO Matrix: A Modern Data Lake Accelerating Deep Learning Workloads. Join them now by getting your tickets here.