I’ve used the D programming language to implement a high-frequency trading (HFT) platform. I’ve been quite satisfied with the experience and thought I’d share how I got here. It wasn’t a direct path.
In 2008, I stumbled across a book on Amazon called Learn to Tango with D. That grabbed my curiosity, so I decided to research D further. That led me to Digital Mars and Walter Bright. I had first heard of Walter when I learned about Zortech C++, the first native C++ compiler. His work had been a huge influence on my C++ learning experience. So I was immediately interested in the language just because it was his, and excited to learn that he was working with Andrei Alexandrescu on version 2. Still, I decided to wait until they were further along with the new version before I dove in.
In 2010, I bought Andrei’s The D Programming Language as soon as it was published and started reading. At the time, I was working at BNP Paribas using C++ to optimize their HFT platform, so high performance was prevalent in my thoughts. When I saw that D’s classes were reference types, with functions that are virtual by default, I was disappointed. I didn’t see how this could be useful for low-latency programming. I became too busy with work to explore further at the time, so I put the book and the language aside.
In 2014, I began preparing for a new adventure. As part of that, I started working on a feed handler framework from scratch in C++, using my own long-maintained C++ library of low-level components useful in low-latency, high-performance applications. Andrei’s book came to my attention again, so I decided to give it another look.
This time, I read the book through to the end and learned that my initial impression had been misplaced. I found that I liked D’s metaprogramming features and its support for programming in a functional style. By the end of the book, I was ready to give D a try.
I started by porting my C++ library and feed handler to D. It wasn’t difficult. I use very little inheritance in my C++ code, preferring composition and concrete classes. I found myself quite productive with D’s structs, templates, and mixins. All the while, I kept a close eye on performance benchmarks. When D turned out to give me the same performance as my C++ code, I was sold. I found D to be much more elegant, cleaner, more readable, and easier to maintain. I made the switch and never looked back.
My goal was to develop a complete HFT system using D. The system would consist of different subsystems:
- Feed-Handler Framework: receives market data from exchanges; builds the books for all securities; publishes the updates to the other subsystems.
- Strategies Framework: receives market data updates from feed handlers; facilitates communications with the Order Management System; allows for plugging into it strategies that make decisions on stock trades.
- Order Management System: communicates with the exchange and the strategies framework; maintains a database of orders.
- Signal Generator: receives market data updates from feed handlers; generates different signals as indicator values, predictions of stock prices, etc.; sends the different signals to strategies.
Ultimately, I found a new data structure and better design for my feed-handler framework. I developed the new version completely in D. This implementation heavily uses templates. I like D’s template syntax and generally find the error messages clearer than the complex error messages I was used to from C++. I needed to drop down to assembly for some specific x86 instructions and it was straightforward to do in D.
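Editor’s note: as an illustration of how dropping down to assembly looks in D (a hypothetical helper using DMD-style inline asm on x86-64; not the author’s actual instructions):

```d
// Read the x86 timestamp counter with DMD-style inline assembly.
// Assumes an x86-64 target; dmd and ldc2 both accept this syntax.
ulong rdtsc()
{
    asm
    {
        naked;       // we manage the ABI ourselves: result goes in RAX
        rdtsc;       // EDX = high 32 bits, EAX = low 32 bits
        shl RDX, 32;
        or RAX, RDX; // combine into the 64-bit return value
        ret;
    }
}
```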
Later, I needed to work with configuration files. I prefer to write my config files in Lua, a lightweight scripting language that is easy to integrate into a program as an extension via its C API. For this, I found a D Lua binding called DerelictLua. Using, again, D’s metaprogramming facilities, I developed a very easy and practical way to interface with Lua on top of DerelictLua. Editor’s Note: DerelictLua has since been discontinued; new projects should use its successor, bindbc-lua, instead.
The feed handler on the Bats market comes on 31 simultaneous channels, so it is more efficient to use multithreading. For this, I chose not to use the multithreading facilities provided by Phobos. I felt I needed more control in such a low-latency environment, particularly the ability to map each thread to a specific core. I opted to use the pthreads library and its affinity feature. D’s C ABI compatibility made it a straightforward thing to do.
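Editor’s note: a hypothetical sketch of what such thread pinning looks like through D’s C ABI compatibility. This version targets Linux/glibc and declares pthread_setaffinity_np by hand, since druntime doesn’t ship it; on FreeBSD, the author’s platform, cpuset_setaffinity(2) plays the same role.

```d
import core.sys.posix.pthread; // pthread_self, pthread_t (via public imports)

// Minimal hand-written binding for the glibc extension (not in druntime):
struct cpu_set_t { ulong[16] bits; }  // 1024-bit CPU mask, as in glibc
extern (C) int pthread_setaffinity_np(pthread_t, size_t, const(cpu_set_t)*);

/// Pin the calling thread to one core; returns 0 on success.
int pinToCore(uint core)
{
    cpu_set_t set;                               // zero-initialized
    set.bits[core / 64] |= 1UL << (core % 64);   // equivalent of CPU_SET
    return pthread_setaffinity_np(pthread_self(), cpu_set_t.sizeof, &set);
}
```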
I’m running on FreeBSD. For my intercommunication needs, I’m using kernel queues and sockets. The same functionality is available on macOS, my preferred development platform. D did not get in my way in using these APIs on either macOS or FreeBSD. It was as seamless as using kernel queues from C.
A few notes about problems and limitations:
- I encountered one compiler bug. I found a workaround, so it wasn’t a blocker. I was able to reproduce it with a few lines of code and contacted the D community. They solved the problem and had a fix in a later version of the compiler.
- I did not use D’s garbage collector. This is not a strike against D or its GC, though. In a low-latency system like this, even the use of `free` can be costly, so I’m not going to take a chance on a nondeterministic system with unpredictable latency. Instead, I used my library to handle allocation/deallocation via free lists, with memory preallocated upfront. As a consequence, I also decided not to use D’s standard library for anything.
- I had to work with fixed-size ASCII strings that are not NUL-terminated and are, instead, padded with spaces at the end. Without the standard library, I found it easier to manipulate them C-style via pointers.
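Editor’s note: a minimal sketch of handling such space-padded fields C-style, with no standard library and no allocation (the field contents here are hypothetical):

```d
// Trim a fixed-size, space-padded ASCII field without Phobos:
// scan back from the end past the padding and slice the raw pointer.
@nogc nothrow
const(char)[] trimPadded(const(char)* field, size_t len)
{
    while (len > 0 && field[len - 1] == ' ')
        --len;
    return field[0 .. len]; // a view into the original buffer; no copy
}
```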
I was the sole developer on this project but completed it successfully in a relatively short period. Big credit to D and its productivity, readability, and ease of modifications.
Nice. But how about your IDE of choice and integrated debugger?
No IDE and no debugger used.
I used Sublime Text, which had a D plugin for indentation and syntax highlighting.
I used the terminal and the DUB build system, with dmd and ldc2 as compilers.
For this kind of application, with huge amounts of data and multithreading, a debugger could be helpful, but most of the time I found problems using logs.
In D, generating logs is really easy: just pass the struct to writeln and it works. In C++, you need to write something like a dump() function manually.
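Editor’s note: a small example of what this looks like; D formats a struct’s fields automatically (the Order struct here is hypothetical):

```d
// writeln(o) and format("%s", o) produce the same text, something like
// Order(42, IBM, 128.5, 100) -- no hand-written dump() needed.
import std.format : format;
import std.algorithm.searching : canFind;

struct Order { long id; string symbol; double price; int qty; }

string logLine(Order o) { return format("%s", o); }
```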
I tried lldb as a debugger from time to time, but didn’t use it much, just very occasionally.
Two more questions:
1. How big (in LOC) is the project? I ask because I wonder how big a project can rely entirely on logs for debug info.
2. What about database access? A native C interface? Or does D have some nice solution?
I didn’t count LOC; the project has multiple components, it is not one monolithic program.
The components communicate using sockets, or internally using queues.
The logs are huge, as the amount of data is huge. A debugger is not practical for this type of application. It is not related to how big the project is, but to how big the data is, and with multithreading, logs are the best tool for finding problems.
Sure, you need to use grep to find what you are looking for in these huge logs.
No database, as databases are all inadequate for this type of low-latency application.
The needed databases are all custom built and in memory, mostly hash tables.
There is only some asynchronous archiving of the messages between the OMS and the exchange, just in case something goes wrong, so we can rebuild the state of the system and continue working.
Can we test your hot platform?
The platform is not open source, so I cannot deliver it to a third party to test.
If you are interested in a demo, we can arrange that.
I wrote the blog post as a success story of using the D programming language, and to promote D as a real contender to C/C++ in the domain of high-performance, low-latency software.
I bought Zortech C++ in 1988, and loved it. So, I’ve followed D’s progress from a distance for years, but never had a chance to use it professionally.
Yes, me too. I bought Zortech C++ in maybe 1987 or 1988, with the source code library.
It cost me two months’ salary, as I was in Syria at the time doing my obligatory service on a very low salary.
But it was worth it, as I learned a lot from the library and from the book C++ Tools.
I very much like Walter Bright’s style of programming.
Maybe for that reason, I feel at home in D.
It is very hard to convince your employer to use D if it is a big institution, as they want to be able to replace any programmer if he leaves, for example; so for them it is risky.
In a startup it is a possibility, if they are open-minded and willing to take some risk.
Otherwise, you need to build something on your own in your free time.
D is excellent, and it is a shame that it is not mainstream yet. Hopefully that will change with time.
Hi, can you give us tips on how to use D without the GC?
If your question is how to disable the garbage collector, I believe I used GC.disable.
But if your question is how the system survived without freeing memory, the answer is:
I allocated all the memory the system needs at startup, and I used free lists a lot.
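Editor’s note: a minimal sketch of a preallocated free list in this spirit (illustrative; not the author’s library):

```d
// All memory lives in a fixed array allocated up front; allocation and
// deallocation just relink nodes. No GC, no malloc/free on the hot path.
struct FreeList(T, size_t capacity)
{
    private union Node { Node* next; T value; }
    private Node[capacity] storage;
    private Node* head;

    @nogc nothrow:

    void initialize()
    {
        foreach (i; 0 .. capacity - 1)
            storage[i].next = &storage[i + 1];
        storage[capacity - 1].next = null;
        head = &storage[0];
    }

    T* allocate()
    {
        if (head is null) return null; // pool exhausted
        auto n = head;
        head = n.next;
        return &n.value;
    }

    void deallocate(T* p)
    {
        auto n = cast(Node*) p;        // value sits at offset 0 of the union
        n.next = head;
        head = n;
    }
}
```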
The system is not running 24/7, as trading hours are from 9:30 am to 4:00 pm, so every day we start fresh.
There is also an aggressive strategy for when the system needs more memory than has been allocated: we sacrifice some irrelevant data’s space and give it to more important data.
For example, imagine you have a security’s book; level 1000 is less important than the top of the book.
If we get a new top-of-book price and don’t have memory for it, we just remove level 1000 and use its space for the new top of the book.
In some cases the book might not be 100% correct, but that’s OK (trading is an art more than a science).
Every application has its specific requirements; in our case it was possible to manage without allocating more memory.
(Honestly, there is some memory allocation at runtime, but only occasionally and in the first minutes, for example when a hash table bucket needs to grow. We never shrink bucket sizes, and nearly all the buckets soon reach a size beyond which they will not grow, so no more memory allocation is needed.)
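Editor’s note: a toy illustration of the eviction idea, assuming a fixed-capacity bid book that sacrifices its deepest level to make room for a better-priced one:

```d
// Prices kept sorted, best (highest bid) first. When full, a new price
// only gets in by evicting the worst level -- no extra allocation ever.
struct BidBook(size_t capacity)
{
    long[capacity] prices;
    size_t count;

    @nogc nothrow
    void insert(long price)
    {
        if (count == capacity)
        {
            if (price <= prices[count - 1]) return; // worse than worst: drop
            --count;                                // evict the deepest level
        }
        size_t i = count;                           // shift tail down
        while (i > 0 && prices[i - 1] < price)
        {
            prices[i] = prices[i - 1];
            --i;
        }
        prices[i] = price;
        ++count;
    }
}
```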
Any thoughts on the JVM and Scala? I am developing a system in Scala, but I wonder if memory allocation is really going to be a bottleneck. Isn’t it possible to pin down the critical parts, like the depth book, rather than going no-GC from the start?
Thank you for an inspirational success story!
Two questions from my side:
1) The last date in the post is something like “late 2014”. Is the system still in production use? How do you rate D maintenance? How hard is it to keep the system up to date and to develop new functionality? Have you ever had problems with using a new version of the compiler (because since 2014, I believe, some breaking changes must have happened)?
2) How was the experience of adding different exchange-specific protocols to the system (NYSE, NASDAQ)? I’ve heard that the FIX protocol (Financial Information eXchange) is widely used. Did you implement it from scratch?
The system ran in production from 2019 until March 2020, when I was working with a startup. Unfortunately, they didn’t have a winning strategy, even though the system was running perfectly, with very low latency.
They didn’t hire a professional trader to get the best out of the system, so I left that startup, as they didn’t fulfill a lot of their promises.
In December 2020, I started working on my own and built a database for simulation, so I could do strategy simulations. (Now I am working on this simulation system, with some strategies.)
D maintenance is great, and I didn’t have any issues upgrading the compilers.
The platform is developed as multiple components, and each component is a framework. As an example, the feed-handler framework can easily take on a new protocol for a different exchange.
It worked perfectly on Bats (real-time data). It also worked on Nasdaq, from a one-day data file; Nasdaq and Bats are very similar (the ITCH and PITCH protocols). I didn’t try it on NYSE, but I am confident it could be adapted to almost any protocol.
The OMS was tested on BATS only, on real-life data, and it worked perfectly.
I didn’t work on FIX, as it is not a good candidate for HFT: the format is text and very heavy, and most exchanges are using binary protocols these days.
Is it possible to see the platform in action?
Sure, it is possible to see it in simulation mode,
as at the moment I don’t have access to the exchange.
I would need to put a server in colocation at the exchange; that will be the next step.
Hi. Thank you for sharing your story. It was an interesting read.
This isn’t really related to D itself, but could you give generic advice on how one might find an employment opportunity in the trading business as a programmer? I am doing a master’s in applied math (simulation and numerical analysis) and have always wanted to try my luck at such places.
You need to apply to investment banks and hedge funds,
or go through a recruiter. It is not easy to get a job with a trading team, honestly; it is a very closed environment. But you should try, and go in as a junior programmer helping a trader implement his ideas.
Applied math should help you get there. You need to be good in C++, or Python could be enough.
Honestly, I don’t have a better answer.
You could apply to Symmetry Investments – see our monthly Hacker News posting. We use D – and an internal language written in D – as the main language throughout the firm.
Cool project, George, and I am glad you wrote about your experience.
Btw, someone here is working on a FIX parsing library in D, and I guess we could open-source it if people would find it interesting.
Why not? I can consider Symmetry Investments; I need to look at what they are doing.
If there were an opportunity to work together using my trading platform, that would be interesting.
When I was working at BNP Paribas, they were using an open-source C++ FIX parser, but it was very slow. I developed a fast FIX parser for order execution. It wasn’t a full FIX protocol parser; it covered only what their order execution system needed.
I don’t know if an open-source FIX parser in D would be interesting for institutions like banks, as they don’t take the risk of using D; they will continue using Java and C++. And I think the FIX protocol is used much less now, as every exchange is moving to binary protocols, which are much lighter and faster.
But open-sourcing a FIX parser in D would, for sure, be a positive point for D.
Enjoyed this a lot, thanks for sharing.
I’m on the same road, but I’m trying it with Julia; I’m not sure how much I can avoid the GC, though. Have you tried anything latency-sensitive in Julia?
I would like to learn Julia, but unfortunately I haven’t had the time.
It should be a very interesting programming language.
So, no, I didn’t try anything latency-sensitive in Julia.
But this type of platform needs some low-level programming and multithreading, and I don’t know if Julia is good enough at that.
I wish you good luck with your adventure,
and please keep me posted.
Thanks for the response.
Low level is not really an option with Julia, although you can get really fast (but not hard-real-time, latency-sensitive fast, as far as I understand). This talk gives an idea of what can be done: https://www.youtube.com/watch?v=dmWQtI3DFFo. Hence I’m targeting trade executions of around a few seconds.
Multithreaded workloads, on the other hand, are a strength here.
I will post a blog when finished and will let you know.
Julia is very interesting, for sure, and could be used in a lot of domains that may be more interesting than high-frequency trading, like robotics, scientific computing, etc.
By the way, when we speak of latency in these high-frequency trading systems, we don’t measure in seconds; we speak in sub-microseconds (hundreds of nanoseconds).
And a lot of firms use FPGA solutions, as they don’t think it is possible to develop a low-latency trading platform using C/C++ or D.
Yeah, I’m not going to deploy anything in equities. I will be testing mostly in crypto. I will start with executions of a few seconds, then see from there how far I can reduce my holding period. I don’t have the skills (and/or manpower) for anything sub-millisecond. If working on statarb across multiple exchanges, what you can achieve between exchanges would be around 20-50 milliseconds (I have not heard of anyone trying radio communication between exchanges in crypto).
Some new information from the last JuliaCon: https://pretalx.com/juliacon-2022/talk/GUQBSE/
Julia can stop the GC, so ASML decided it is good enough for real-time applications.
I kind of agree with what has been written before about low latency. I don’t know if the assembly generated by Julia’s JIT is good enough for low latency. It would be a nice hobby project to write an order book in Julia to see how it performs. What I tested, and what was described in my talk, is mainly capability for throughput; my experience has mainly been in solving optimization problems in real time. Still, my goal was throughput (solving large systems of equations as quickly as possible), not latency (being able to react to market changes as fast as possible).
How’s hiring in D going? 🙂
I am not at the stage of hiring people for D work yet, as I developed the platform myself and am still working on it.
As the next step, I will start looking for investors; then we can grow and hire people who can help.
Sure, the pool of D developers is tiny, but those developers should be passionate and eager to learn and try new things, and working with one or two of them could be enjoyable and very competitive.
Also, a good C++ programmer can switch very easily to D if he is willing to, especially with some support and help.
You say you have 4 systems and communicate with sockets.
Why not use shared memory for better performance?
How many different processes do you use and how do you distribute them?
Fifteen years ago, I used shared memory for interprocess communication when all the systems ran on the same server. I used shared memory and semaphores, creating queues in the shared memory, and it was faster than sockets at the time.
In the current trading system, in the OMS, I developed and used a lock-free queue based on shared memory and atomic operations; the latency was around 100 nanoseconds.
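Editor’s note: a sketch of a single-producer/single-consumer lock-free ring queue built on atomic operations, similar in spirit to the one described (backed by a plain array here; the real one lives in shared memory):

```d
import core.atomic : atomicLoad, atomicStore, MemoryOrder;

// One producer thread calls push, one consumer thread calls pop; the
// acquire/release pairs on head and tail make each slot's contents
// visible to the other side without any lock.
struct SpscQueue(T, size_t capacity) // capacity must be a power of two
{
    static assert((capacity & (capacity - 1)) == 0);
    private T[capacity] buf;
    private shared size_t head; // advanced by the consumer
    private shared size_t tail; // advanced by the producer

    bool push(T item)
    {
        auto t = atomicLoad!(MemoryOrder.raw)(tail);
        if (t - atomicLoad!(MemoryOrder.acq)(head) == capacity)
            return false;                            // full
        buf[t & (capacity - 1)] = item;
        atomicStore!(MemoryOrder.rel)(tail, t + 1);  // publish the slot
        return true;
    }

    bool pop(ref T item)
    {
        auto h = atomicLoad!(MemoryOrder.raw)(head);
        if (atomicLoad!(MemoryOrder.acq)(tail) == h)
            return false;                            // empty
        item = buf[h & (capacity - 1)];
        atomicStore!(MemoryOrder.rel)(head, h + 1);  // release the slot
        return true;
    }
}
```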
Between the other systems I used sockets, for the flexibility of moving them to different servers,
and discovered that using kqueue was fast enough for the strategies developed at the time.
I didn’t measure the latency of the kqueue and sockets, but it is on the order of sub-microseconds, which was good enough.
The feed handler uses multithreading, and the number of threads depends on the securities needed. We used the feed handler for the Bats market, where the data comes in on 31 channels, so depending on how many channels are needed, the distribution can differ.
Then the strategy/signal generator used 2 threads, and the OMS 2 threads as well: one for communications and managing orders, the other for asynchronous archiving of all events.
Are your strategies developed in D or in a scripting language? (You mentioned Lua for your configuration files.)
The startup I worked with to use my platform didn’t deliver high-frequency strategies.
They delivered some basic, nonsense strategies, which I coded for them in D.
These strategies were never profitable and didn’t need this type of high-frequency platform.
One of the reasons I stopped working with that startup is that, unfortunately, they didn’t deliver on a lot of their promises.
For the past year, I have been working on my own and continuing to develop the platform.
Lua is used for configuration, and it is also possible to extend the system and use Lua for writing strategies, for real life (where latency is not very important) or for simulation.
Hi, could you tell us more about the latencies inside the platform? What times have been achieved?
The latency of the feed handler (BATS) is, on average, 100 nanoseconds.
We had a signal generator whose latency I don’t include, as it is not on the (feed handler -> strategy -> OMS) path; it has some heavy calculations, which took around 30 microseconds, and it sends signals to the strategy.
The strategy was a very simple one (not successful).
The time from the strategy receiving a signal from the signal generator to the packet reaching the output socket of the OMS was around 300 nanoseconds.
The latency of a packet from the feed handler to the signal generator and strategy should be on the order of 200 nanoseconds.
So, you could say 100 nanoseconds for the feed handler + 200 nanoseconds + 300 nanoseconds = 600 nanoseconds.
At the time we had Mellanox network cards. I did the measurement using two similar servers with similar cards, doing ping-pong: the latency of the card, in + out, was 5.4 microseconds.
That makes the entire latency of the system, with these cards, around 6 microseconds.
Recently, I saw some network cards from Solarflare advertising 300 nanoseconds using Onload, or 15-20 nanoseconds using TCPDirect (I haven’t tried these cards).
But if that is true, the latency of the system could improve dramatically.
Thanks for your interest,
Do you think 100 nanoseconds is the fastest you can get in D for an ITCH/PITCH feed? It can definitely be much faster. A pure software-based trading stack in C++ with kernel-bypass cards can easily achieve sub-2 µs, and we have been/are at 1100-1300 nanoseconds for simple stuff.