
Recently, a few articles showed up in my LinkedIn feed describing how to apply RL to trading algorithms. I learned a lot about DL, traditional ML, and CV during my master's, but RL is still a new continent for me.

Deep Q-learning is a combination of Deep Learning and Q-learning, where Q-learning is a subcategory of Reinforcement Learning named after its action-value function Q.

For readers interested in digging into the mathematical foundations of Q-learning, I recommend these Zhihu tutorials: DQN3, DQN4, DQN5. A more pragmatic tutorial with pseudocode for Q-learning can be found in this article. The main focus of this blog, however, is a review of trading-bot, a Python library for applying DQN to trading; the source is here.
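
As a quick refresher (my own minimal sketch, not code from trading-bot), the tabular Q-learning update that DQN approximates with a neural network looks like this, where alpha is the learning rate and gamma the discount factor:

# Minimal tabular Q-learning update (illustrative only; Q is a plain dict of dicts)
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    best_next = max(Q[next_state].values())         # best Q-value reachable from the next state
    td_target = reward + gamma * best_next          # Bellman target
    Q[state][action] += alpha * (td_target - Q[state][action])  # move the estimate toward the target

A lookup table like this only works for small, discrete state spaces, which is exactly why DQN replaces it with a neural network.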

Learning Environment for RL in trading

  • Action at each timestep t:
    • buy, sell or hold
  • State at each timestep t:
    • Ideally: current market/stock status at timestep t (OHLCV Data)
    • Trading-bot: sigmoid-wrapped day-over-day Close price differences
      • which is Close(t)-Close(t-1)
      • Size: 10, which is the size of the backward-looking observation window
  • Reward:
    • If Action is Sell:
      • Reward = Sell Price-Buy Price
    • If Action is Buy or Hold
      • Reward = 0
  • Policy:
    • The Decision to take action based on current state
    • approximate the Q-function with a DNN (a sketch follows after this list):
      • Keras MLP with Huber Loss
      • all Dense Layers: 10-128-128-256-256-3
      • Input: Current State
        • Size: 10
      • Output: Q-values, one per action
        • Size: 3
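
Here is a minimal Keras sketch of such a policy network, assuming the layer sizes listed above; the ReLU activations and Adam optimizer are my own assumptions, not necessarily what trading-bot uses:

import tensorflow as tf
from tensorflow.keras import layers, models

def build_policy_dnn(state_size=10, action_size=3):
    # Dense layers 10 -> 128 -> 128 -> 256 -> 256 -> 3, trained with Huber loss
    model = models.Sequential([
        layers.Dense(128, activation="relu", input_shape=(state_size,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(action_size),  # one Q-value per action: hold, buy, sell
    ])
    model.compile(optimizer="adam", loss=tf.keras.losses.Huber())
    return model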

Hyperparameters

  • Batch Size:
    • the number of (state, action, reward) transitions we collect from interacting with the environment for one training step
    • default: 32
    • fit the policy model DNN after each data collection
  • Window Size:
    • determines the state size
    • look-back period from current timestep t
    • default: 10
    • in this case, the stock prices from t-10 to t (see the state sketch after this list)
  • Episode:
    • equivalent to Epoch in normal Deep Learning
    • its purpose is to keep refining the model by training over the same data repeatedly
    • default: 50
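
The window size is what shapes the state vector. Below is a minimal sketch of how such a sigmoid-wrapped state could be built; the function name get_state and the padding for early timesteps are my own choices, not necessarily the exact trading-bot implementation:

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def get_state(prices, t, window_size=10):
    # prices is a plain list of daily close prices; take the window from
    # t - window_size to t and squash the day-over-day differences into (0, 1).
    start = t - window_size
    window = prices[start:t + 1] if start >= 0 else [prices[0]] * -start + prices[0:t + 1]
    return [sigmoid(window[i + 1] - window[i]) for i in range(window_size)]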

Main Loop pseudocode

for each episode:
  for each timestep t in range(len(data)):
    # training
    state = get_current_state(t)         # look back 10 days, collect the close prices, pass the differences through a sigmoid
    next_state = get_current_state(t+1)  # the same look-back window, shifted one day forward

    action = get_current_action()        # randomly sample one of the 3 actions: hold(0), buy(1), sell(2); the very first action is always a buy(1)
    reward = get_reward()                # based on the action: 0, or sell_price - buy_price
    memory = save_status_in_memory()     # save: state, action, reward, next_state, done

    if len(memory) > batch_size:
      train_policy_dnn(batch_size)       # sample batch_size transitions from memory, build X, Y, then fit the DNN with X, Y
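
The replay memory used above can be as simple as a bounded deque. A minimal sketch (my own illustration; the maxlen of 10000 is an arbitrary choice), with names mirroring the pseudocode:

import random
from collections import deque

memory = deque(maxlen=10000)  # bounded replay memory; the oldest transitions are dropped automatically

def save_status_in_memory(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))
    return memory

def sample_memory(batch_size):
    # uniformly sample a mini-batch of stored transitions for training
    return random.sample(list(memory), batch_size)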

Pseudocode for DNN Training

def train_policy_dnn():
  """
  target:
    the "assumed" ground truth of the Q-value for the chosen action
  """

  model = keras_4_layers_model()          # the policy DNN (in practice built once and reused across calls)
  mini_batch = sample_memory(batch_size)  # sample raw transitions from the replay memory

  # prepare the training data
  X_train, Y_train = [], []
  for state, action, reward, next_state, done in mini_batch:
    if done:
      target = reward  # last instance: no future reward left to discount
    else:
      target = reward + gamma*np.amax(model.predict(next_state))  # Bellman target: immediate reward plus discounted best future Q-value

    q_values = model.predict(state)  # get the current prediction for all actions
    q_values[action] = target        # correct the output of the selected action with the "ground truth" (future info from next_state)

    X_train.append(state)
    Y_train.append(q_values)

  loss = model.fit(X_train, Y_train, epochs=1).history["loss"]

  return loss


Notes

The code above deals only with the plain single-DQN scenario. After the original paper introducing DQN, people developed T-DQN (DQN with a fixed target network) and Double DQN, which are also implemented in trading-bot but not discussed here for now. I will cover them in a future post.

The fundamental idea behind these two variants is a "model copy" or "model transfer": the target_model is only updated (its weights copied from the online model) once per fixed period; in between, we keep using the previous copy, which we believe is still valid for the current context, to compute the ground-truth targets.

def train_policy_dnn_tdqn():
  """
  target:
    the "assumed" ground truth of the Q-value for the chosen action
  """

  model = keras_4_layers_model()          # the online policy DNN
  target_model = model.copy()             # a frozen copy used only to compute targets

  mini_batch = sample_memory(batch_size)  # sample raw transitions from the replay memory

  if n_iter % reset_count == 0:
    target_model.weights = model.weights  # periodically sync the target network with the online network

  # prepare the training data
  X_train, Y_train = [], []
  for state, action, reward, next_state, done in mini_batch:
    if done:
      target = reward  # last instance: no future reward left to discount
    else:
      target = reward + gamma*np.amax(target_model.predict(next_state))  # here the target_model computes the ground-truth

    q_values = model.predict(state)  # get the current prediction for all actions
    q_values[action] = target        # correct the output of the selected action with the "ground truth" (future info from next_state)

    X_train.append(state)
    Y_train.append(q_values)

  loss = model.fit(X_train, Y_train, epochs=1).history["loss"]

  return loss