Deep Q-learning in trading
Recently, some articles appeared in my LinkedIn feed describing approaches for applying RL in trading algorithms. I learned a lot about DL, traditional ML and CV during my master's, but RL is still a new continent for me.
Deep Q-learning is a combination of deep learning and Q-learning, where Q-learning is a subcategory of reinforcement learning named after its action-value function Q.
For readers interested in digging into the mathematical foundations of Q-learning, I recommend reading these Zhihu tutorials: DQN3, DQN4, DQN5. A more pragmatic tutorial with pseudocode for Q-learning can be found in this article. The main focus of this blog, however, is a review of trading-bot, a Python library for applying DQN in trading; the source is here.
Learning Environment for RL in trading
- Action at each timestep t:
- buy, sell or hold
- State at each timestep t:
- Ideally: current market/stock status at timestep t (OHLCV Data)
- Trading-bot: sigmoid-wrapped intraday stock close price difference
- which is Close(t) - Close(t-1)
- Size: 10, which is the backward-looking observation window size (see the state sketch after this list)
- Reward:
- If Action is Sell:
- Reward = Sell Price - Buy Price
- If Action is Buy or Hold:
- Reward = 0
- Policy:
- The decision which action to take, based on the current state
- approximate the policy's Q-function with a DNN:
- Keras MLP with Huber loss (see the model sketch after this list)
- all Dense Layers: 10-128-128-256-256-3
- Input: Current State
- Size: 10
- Output: Q-values, one per possible action
- Size: 3
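To make the state definition concrete, here is a minimal sketch of how such a sigmoid-wrapped state can be built. The names and the explicit price-series argument are my own; trading-bot's actual helper may differ in signature and padding details.

    import numpy as np

    def sigmoid(x):
        # squash the raw close-price difference into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def get_current_state(close_prices, t, window_size=10):
        """State at timestep t: sigmoid of the last window_size close-price differences."""
        # we need window_size + 1 prices to form window_size differences Close(i) - Close(i-1)
        start = t - window_size
        if start >= 0:
            window = list(close_prices[start:t + 1])
        else:
            # at the very beginning of the series, pad by repeating the first price
            window = [close_prices[0]] * (-start) + list(close_prices[:t + 1])
        diffs = [window[i + 1] - window[i] for i in range(window_size)]
        return np.array([sigmoid(d) for d in diffs]).reshape(1, window_size)  # shape (1, 10) for the Keras model

With window_size = 10 this yields exactly the size-10 state described above; at the very start of the series, where no full look-back window exists yet, the differences are zero and the state is filled with 0.5.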
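And a sketch of the policy network: a plain Keras MLP with Huber loss whose output layer has one unit per action. The layer widths and the Huber loss come from the list above; the ReLU activations, Adam optimizer and learning rate are my assumptions, not necessarily the exact trading-bot code.

    from tensorflow import keras

    def keras_4_layers_model(state_size=10, action_size=3, learning_rate=0.001):
        # 10-128-128-256-256-3: four hidden Dense layers between the state input and the action output
        model = keras.Sequential([
            keras.layers.Dense(128, activation="relu", input_shape=(state_size,)),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(action_size),  # linear output: one Q-value per action (buy, sell, hold)
        ])
        model.compile(loss=keras.losses.Huber(), optimizer=keras.optimizers.Adam(learning_rate))
        return model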
Hyperparameters
- Batch Size:
- the number of (state, action, reward) transitions we need to collect from interacting with the environment
- default: 32
- the policy DNN is fitted after each such data collection
- Window Size:
- determines the state size
- look-back period from current timestep t
- default: 10
- in this case, the stock close prices from t-10 to t
- Episode:
- equivalent to an epoch in normal deep learning
- the purpose is to refine the model over repeated passes through the training data
- default: 50
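Put together, a minimal configuration block with these defaults might look as follows. The variable names and the gamma value are my own assumptions; gamma is not listed above but is needed by the training pseudocode later.

    window_size = 10   # look-back period; determines the state size
    batch_size = 32    # number of stored transitions sampled for each DNN fit
    n_episodes = 50    # full passes over the training data, analogous to epochs
    gamma = 0.95       # discount factor for future rewards (assumed value, used in the training pseudocode below)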
Main Loop pseudocode
for episode in range(n_episodes):
    for t in range(len(data) - 1):
        # training
        state = get_current_state(t)         # look back 10 days, collect close-price differences, pass through sigmoid
        next_state = get_current_state(t+1)  # the same look-back window, shifted forward by one day
        action = get_current_action(state)   # sample one of the 3 actions (buy, sell, hold): random while exploring, otherwise the best Q-value; the very first action is a buy (see the sketch below)
        reward = get_reward(action)          # based on the action: 0 for buy/hold, sell_price - buy_price for sell
        done = (t == len(data) - 2)          # True on the last timestep of the episode
        save_status_in_memory(state, action, reward, next_state, done)  # append the transition to the replay memory
        if len(memory) > batch_size:
            loss = train_policy_dnn(batch_size)  # sample batch_size transitions from memory, prepare X, Y, then fit the DNN
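The action selection in the loop above is epsilon-greedy in a typical DQN agent: with probability epsilon a random action is taken, otherwise the action with the highest predicted Q-value. A minimal sketch under that assumption; the epsilon schedule and the action-index mapping are illustrative, not taken from trading-bot.

    import random
    import numpy as np

    BUY, SELL, HOLD = 0, 1, 2   # illustrative action indices matching the 3 network outputs
    epsilon = 1.0               # start fully exploratory ...
    epsilon_min = 0.01          # ... but never drop below this exploration rate
    epsilon_decay = 0.995       # shrink epsilon a little after every call

    def get_current_action(state, model, first_step=False):
        global epsilon
        if first_step:
            action = BUY                           # the very first action is forced to be a buy
        elif random.random() <= epsilon:
            action = random.randrange(3)           # explore: uniformly random among buy, sell, hold
        else:
            q_values = model.predict(state)        # exploit: pick the action with the highest predicted Q-value
            action = int(np.argmax(q_values[0]))
        epsilon = max(epsilon_min, epsilon * epsilon_decay)  # gradually shift from exploration to exploitation
        return action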
Pseudocode for DNN Training
def train_policy_dnn(batch_size):
    """
    target: the "assumed" ground truth of the Q-value for the chosen action
    """
    model = keras_4_layers_model()          # policy model (in practice created once and reused across calls)
    mini_batch = sample_memory(batch_size)  # sample raw transitions from the replay memory
    X_train, Y_train = [], []
    # prepare data
    for state, action, reward, next_state, done in mini_batch:
        if done:
            target = reward  # last instance: no future reward left
        else:
            target = reward + gamma * np.amax(model.predict(next_state)[0])  # Q-value of the best action in the next state
        q_values = model.predict(state)  # current prediction for all 3 actions
        q_values[0][action] = target     # correct the output of the selected action with the "ground truth" (uses future info: next_state)
        X_train.append(state[0])
        Y_train.append(q_values[0])
    loss = model.fit(np.array(X_train), np.array(Y_train), epochs=1, verbose=0).history["loss"][0]
    return loss
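In formula form, the target computed above is the standard Q-learning target:

    target = reward + gamma * max_a' Q(next_state, a')

where gamma is the discount factor: with gamma = 0.95, a reward earned one step later is worth 95% of an immediate one. When done is true there is no next state, so the target collapses to the immediate reward.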
Notes
The code above deals only with the plain single-DQN scenario. After the initial paper introducing DQN, people developed T-DQN and Double DQN, which are also implemented in trading-bot but not discussed here for now. I will discuss them in the future.
The fundamental notion behind these two newer DQN variants is a "model copy" or "model transfer": we update (copy weights into) the target_model only once per given period, and in between we use that previous model (which we believe is still valid for the current context) to calculate the ground truth.
def train_policy_dnn_tdqn(batch_size):
    """
    target: the "assumed" ground truth of the Q-value for the chosen action,
    here computed with a periodically refreshed target_model
    """
    model = keras_4_layers_model()          # policy model (in practice created once and reused across calls)
    target_model = model.copy()             # target model, kept fixed between weight resets
    mini_batch = sample_memory(batch_size)  # sample raw transitions from the replay memory
    if n_iter % reset_count == 0:
        target_model.weights = model.weights  # copy the policy weights into the target model every reset_count iterations
    X_train, Y_train = [], []
    # prepare data
    for state, action, reward, next_state, done in mini_batch:
        if done:
            target = reward  # last instance: no future reward left
        else:
            target = reward + gamma * np.amax(target_model.predict(next_state)[0])  # use target_model to compute the ground truth
        q_values = model.predict(state)  # current prediction for all 3 actions
        q_values[0][action] = target     # correct the output of the selected action with the "ground truth"
        X_train.append(state[0])
        Y_train.append(q_values[0])
    loss = model.fit(np.array(X_train), np.array(Y_train), epochs=1, verbose=0).history["loss"][0]
    return loss