Deep Q-learning in trading
Recently, some articles appeared in my LinkedIn feed describing approaches for applying RL in trading algorithms. I learned a lot about DL, traditional ML and CV during my master's, but RL is still a new continent for me.
Deep Q-learning is a combination of deep learning and Q-learning, where Q-learning is a subcategory of reinforcement learning named after its action-value function Q.
For readers interested in digging into the mathematical foundations of Q-learning, I recommend reading these Zhihu tutorials: DQN3, DQN4, DQN5. A more pragmatic tutorial with pseudocode for Q-learning can be found in this article. The main focus of this blog, however, is a review of trading-bot, a Python library for applying DQN in trading; the source is here.
Learning Environment for RL in trading
- Action at each timestep t:
- buy, sell or hold
- State at each timestep t:
- Ideally: current market/stock status at timestep t (OHLCV Data)
- Trading-bot: sigmoid-wrapped intraday stock close price difference
- which is Close(t) - Close(t-1)
- Size: 10, which is the backward-looking observation window size (see the state sketch after this list)
- Reward:
- If Action is Sell:
- Reward = Sell Price - Buy Price
- If Action is Buy or Hold:
- Reward = 0
- Policy:
- The decision which action to take, based on the current state
- approximate the policy's Q-function with a DNN:
- Keras MLP with Huber loss (see the model sketch after this list)
- all Dense Layers: 10-128-128-256-256-3
- Input: Current State
- Size: 10
- Output: Q-values, one per possible action
- Size: 3
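To make the state definition concrete, here is a minimal sketch of how such a sigmoid-wrapped state can be built. The names and the explicit price-series argument are my own; trading-bot's actual helper may differ in signature and padding details.

    import numpy as np

    def sigmoid(x):
        # squash the raw close-price difference into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def get_current_state(close_prices, t, window_size=10):
        """State at timestep t: sigmoid of the last window_size close-price differences."""
        # we need window_size + 1 prices to form window_size differences Close(i) - Close(i-1)
        start = t - window_size
        if start >= 0:
            window = list(close_prices[start:t + 1])
        else:
            # at the very beginning of the series, pad by repeating the first price
            window = [close_prices[0]] * (-start) + list(close_prices[:t + 1])
        diffs = [window[i + 1] - window[i] for i in range(window_size)]
        return np.array([sigmoid(d) for d in diffs]).reshape(1, window_size)  # shape (1, 10) for the Keras model

With window_size = 10 this yields exactly the size-10 state described above; at the very start of the series, where no full look-back window exists yet, the differences are zero and the state is filled with 0.5.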
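And a sketch of the policy network: a plain Keras MLP with Huber loss whose output layer has one unit per action. The layer widths and the Huber loss come from the list above; the ReLU activations, Adam optimizer and learning rate are my assumptions, not necessarily the exact trading-bot code.

    from tensorflow import keras

    def keras_4_layers_model(state_size=10, action_size=3, learning_rate=0.001):
        # 10-128-128-256-256-3: four hidden Dense layers between the state input and the action output
        model = keras.Sequential([
            keras.layers.Dense(128, activation="relu", input_shape=(state_size,)),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(action_size),  # linear output: one Q-value per action (buy, sell, hold)
        ])
        model.compile(loss=keras.losses.Huber(), optimizer=keras.optimizers.Adam(learning_rate))
        return model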
Hyperparameters
- Batch Size:
- the number of (state, action, reward) transitions we need to collect from interacting with the environment
- default: 32
- the policy DNN is fitted after each such data collection
- Window Size:
- determines the state size
- look-back period from current timestep t
- default: 10
- in this case, the stock close prices from t-10 to t
- Episode:
- equivalent to an epoch in normal deep learning
- the purpose is to refine the model over repeated passes through the training data
- default: 50
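Put together, a minimal configuration block with these defaults might look as follows. The variable names and the gamma value are my own assumptions; gamma is not listed above but is needed by the training pseudocode later.

    window_size = 10   # look-back period; determines the state size
    batch_size = 32    # number of stored transitions sampled for each DNN fit
    n_episodes = 50    # full passes over the training data, analogous to epochs
    gamma = 0.95       # discount factor for future rewards (assumed value, used in the training pseudocode below)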
Main Loop pseudocode
for episode in range(n_episodes):
    for t in range(len(data) - 1):
        # training
        state = get_current_state(t)         # look back 10 days, collect close-price differences, pass through sigmoid
        next_state = get_current_state(t+1)  # the same look-back window, shifted forward by one day
        action = get_current_action(state)   # sample one of the 3 actions (buy, sell, hold): random while exploring, otherwise the best Q-value; the very first action is a buy (see the sketch below)
        reward = get_reward(action)          # based on the action: 0 for buy/hold, sell_price - buy_price for sell
        done = (t == len(data) - 2)          # True on the last timestep of the episode
        save_status_in_memory(state, action, reward, next_state, done)  # append the transition to the replay memory
        if len(memory) > batch_size:
            loss = train_policy_dnn(batch_size)  # sample batch_size transitions from memory, prepare X, Y, then fit the DNN
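The action selection in the loop above is epsilon-greedy in a typical DQN agent: with probability epsilon a random action is taken, otherwise the action with the highest predicted Q-value. A minimal sketch under that assumption; the epsilon schedule and the action-index mapping are illustrative, not taken from trading-bot.

    import random
    import numpy as np

    BUY, SELL, HOLD = 0, 1, 2   # illustrative action indices matching the 3 network outputs
    epsilon = 1.0               # start fully exploratory ...
    epsilon_min = 0.01          # ... but never drop below this exploration rate
    epsilon_decay = 0.995       # shrink epsilon a little after every call

    def get_current_action(state, model, first_step=False):
        global epsilon
        if first_step:
            action = BUY                           # the very first action is forced to be a buy
        elif random.random() <= epsilon:
            action = random.randrange(3)           # explore: uniformly random among buy, sell, hold
        else:
            q_values = model.predict(state)        # exploit: pick the action with the highest predicted Q-value
            action = int(np.argmax(q_values[0]))
        epsilon = max(epsilon_min, epsilon * epsilon_decay)  # gradually shift from exploration to exploitation
        return action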
Pseudocode for DNN Training
def train_policy_dnn(batch_size):
    """
    target: the "assumed" ground truth of the Q-value for the chosen action
    """
    model = keras_4_layers_model()          # policy model (in practice created once and reused across calls)
    mini_batch = sample_memory(batch_size)  # sample raw transitions from the replay memory
    X_train, Y_train = [], []
    # prepare data
    for state, action, reward, next_state, done in mini_batch:
        if done:
            target = reward  # last instance: no future reward left
        else:
            target = reward + gamma * np.amax(model.predict(next_state)[0])  # Q-value of the best action in the next state
        q_values = model.predict(state)  # current prediction for all 3 actions
        q_values[0][action] = target     # correct the output of the selected action with the "ground truth" (uses future info: next_state)
        X_train.append(state[0])
        Y_train.append(q_values[0])
    loss = model.fit(np.array(X_train), np.array(Y_train), epochs=1, verbose=0).history["loss"][0]
    return loss
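In formula form, the target computed above is the standard Q-learning target:

    target = reward + gamma * max_a' Q(next_state, a')

where gamma is the discount factor: with gamma = 0.95, a reward earned one step later is worth 95% of an immediate one. When done is true there is no next state, so the target collapses to the immediate reward.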
Notes
The code above deals only with the plain single-DQN scenario. After the initial paper introducing DQN, people developed T-DQN and Double DQN, which are also implemented in trading-bot but not discussed here for now. I will discuss them in the future.
The fundamental notion behind these two newer DQN variants is a "model copy" or "model transfer": we update (copy weights into) the target_model only once per given period, and in between we use that previous model (which we believe is still valid for the current context) to calculate the ground truth.
def train_policy_dnn_tdqn(batch_size):
    """
    target: the "assumed" ground truth of the Q-value for the chosen action,
    here computed with a periodically refreshed target_model
    """
    model = keras_4_layers_model()          # policy model (in practice created once and reused across calls)
    target_model = model.copy()             # target model, kept fixed between weight resets
    mini_batch = sample_memory(batch_size)  # sample raw transitions from the replay memory
    if n_iter % reset_count == 0:
        target_model.weights = model.weights  # copy the policy weights into the target model every reset_count iterations
    X_train, Y_train = [], []
    # prepare data
    for state, action, reward, next_state, done in mini_batch:
        if done:
            target = reward  # last instance: no future reward left
        else:
            target = reward + gamma * np.amax(target_model.predict(next_state)[0])  # use target_model to compute the ground truth
        q_values = model.predict(state)  # current prediction for all 3 actions
        q_values[0][action] = target     # correct the output of the selected action with the "ground truth"
        X_train.append(state[0])
        Y_train.append(q_values[0])
    loss = model.fit(np.array(X_train), np.array(Y_train), epochs=1, verbose=0).history["loss"][0]
    return loss