In this sequel to “Bush Pilots in the Outback”, we explore using reinforcement learning to schedule thousands of simultaneous orders at scale.
Last year we demonstrated an application of reinforcement learning for bush pilots in the Australian Outback.
We left on a cliffhanger: Can the model scale to an industry-wide application? If you don’t like reading blogs, you can stop after reading these three words: Yes, it can.
(Even if you don’t like reading blogs, please stay tuned; we’ll walk through the process of applying the model to many more pilots.)
Let's start with a brief recap of what reinforcement learning is: given a system in which an agent observes a state and may take any number of actions to progress to a new state, potentially yielding a reward, reinforcement learning works to develop a decision-making policy that maximizes the reward over time.
More simply, reinforcement learning is like strapping a robot into a chair and forcing it to play a video game for years until the robot develops superhuman skills at the game. Then you force the robot to keep playing.
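In code, the core of that loop looks something like the sketch below. The `Environment` and `Policy` objects and their methods are hypothetical stand-ins, not the classes from our simulator; the point is the state, action, reward cycle.

```python
# A minimal sketch of the agent-environment loop at the heart of
# reinforcement learning. The env and policy objects are hypothetical
# stand-ins with assumed methods (reset, step, select_action, record).

def run_episode(env, policy, max_steps=1000):
    """Roll the policy forward for one episode, accumulating reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.select_action(state)     # decide based on current state
        state, reward, done = env.step(action)   # act, observe the new state
        total_reward += reward                   # tally the reward signal
        policy.record(state, action, reward)     # stash experience for learning
        if done:
            break
    return total_reward
```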
Reinforcement learning also takes a different approach to data than other machine learning techniques. Whereas other techniques consume volumes of static data to generalize patterns, reinforcement learning constantly re-samples from the environment. That is a necessity: as the policy changes, so too do the consequences and rewards of each decision over time. But it is also a boon: instead of hounding the Australian government, or NGOs operating in the land down under, for every scrap of data gathered since 1952, we can develop a simulator to serve as our perpetual data source.
We received a few key questions as feedback, and this update aims to answer them.
In my previous blog I developed such a simulator. Adding more pilots, or more origins and destinations, couldn’t be simpler. But what about scheduling? In the previous installment, perfect and complete information was fed into the neural network at every time step (one time step being the time required to move to a neighboring grid point) to make assignment decisions.
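To give a feel for what “perfect and complete information” means here, the sketch below flattens every pilot position and every open order into a single input vector each time step. The function and field names are illustrative, not the ones from the previous post’s simulator.

```python
import numpy as np

# Hypothetical flattening of the full simulation state into one input
# vector for the network: every pilot's grid position plus every open
# order's origin and destination. With perfect information the vector
# grows with the number of pilots and orders, which is what eventually
# strains memory and inference time.

def full_state_vector(pilot_positions, order_origins, order_destinations):
    """Each argument is a list/array of (row, col) grid points."""
    return np.concatenate([
        np.asarray(pilot_positions, dtype=np.float32).ravel(),
        np.asarray(order_origins, dtype=np.float32).ravel(),
        np.asarray(order_destinations, dtype=np.float32).ravel(),
    ])
```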
Perfect information only really works in an ideal setting. There are practical limits on memory and performance to consider if we ever want to schedule more than a dozen or so airplanes. Therefore, we instead experimented with a simple scoping policy that considers only the “most relevant” loads and competitors.
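A scoping policy along those lines might look like the sketch below: for each pilot, keep only the k nearest open orders and the k nearest competing pilots, and feed only that slice to the model. The function and the value of k are assumptions for illustration; the real “most relevant” criterion could be richer than raw distance.

```python
import numpy as np

# A hypothetical scoping step: instead of the full state, each pilot
# only "sees" its k nearest open orders and its k nearest competitors.
# Relevance here is plain Euclidean distance on the grid.

def scope_for_pilot(pilot_pos, order_positions, competitor_positions, k=5):
    pilot_pos = np.asarray(pilot_pos, dtype=np.float32)
    orders = np.asarray(order_positions, dtype=np.float32)
    competitors = np.asarray(competitor_positions, dtype=np.float32)

    nearest_orders = orders[
        np.argsort(np.linalg.norm(orders - pilot_pos, axis=1))[:k]
    ]
    nearest_competitors = competitors[
        np.argsort(np.linalg.norm(competitors - pilot_pos, axis=1))[:k]
    ]
    return nearest_orders, nearest_competitors
```

Because this scope is computed independently per pilot, the work grows linearly with the fleet and parallelizes cleanly, which is the behavior described below.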
The strategy proved successful, with the agent easily keeping up with the “closest available” policy even when scoped to a subset of the full simulation. There is some cost to computing the scope and running inference on the trained model; however, this cost scales linearly with the number of orders and pilots and can be efficiently parallelized.
A key problem we encountered when scaling up to hundreds of simultaneous pilots and deliveries came from an unexpected area: a curse of resources.
We could quote the old proverbs here: an oyster without sand produces no pearl, a tree that never feels the wind falls over, and so on. Either way, in our resource-rich environment, outperforming the “closest available” policy became genuinely hard. Of course it was hard! With a 1:1 ratio of orders to pilots, our pilots never needed more than a dozen or so time steps of deadhead to reach a new order.
So, if we could not compete well in a rich environment, we would instead compete in a restricted one. We tweaked the ratio to 1:0.8 pilots to orders. Now the competition was real: rushing towards the nearest available order would much more frequently mean being beaten to it and coming away with nothing.
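For reference, the “closest available” baseline we keep comparing against amounts to the greedy assignment sketched below. This is a simplification for illustration: in the simulation, an order goes to whichever pilot reaches it first, whereas this sketch resolves contention by simple iteration order.

```python
import numpy as np

# A sketch of the greedy "closest available" baseline: each idle pilot
# heads for the nearest unclaimed order. When orders are scarce, several
# pilots chase the same order and all but one waste the deadhead trip.

def closest_available_assignments(pilot_positions, order_positions):
    pilots = np.asarray(pilot_positions, dtype=np.float32)
    orders = np.asarray(order_positions, dtype=np.float32)
    claimed = set()
    assignments = {}                              # pilot index -> order index
    for i, pilot in enumerate(pilots):
        distances = np.linalg.norm(orders - pilot, axis=1)
        for j in np.argsort(distances):           # nearest order first
            if int(j) not in claimed:
                claimed.add(int(j))
                assignments[i] = int(j)
                break
    return assignments                            # some pilots may go unassigned
```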
In this environment, our RL agent did indeed learn a policy that outperformed “nearest order”.
Not only did our newly trained RL model outperform the greedy policy at the restricted ratio, it also outperformed the greedy policy in the target-rich environment. Furthermore, the RL model performed well at various scales, from 100 simultaneous orders to thousands.
On average, the trained model allowed the pilots to deliver 3.5 percent more orders than with a simple “get the nearest available order” algorithm.
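As a back-of-the-envelope way to read that number: run both policies over the same batch of simulated episodes and compare total deliveries. The snippet below is a hypothetical way to compute the lift, not our actual benchmarking code.

```python
# Hypothetical evaluation helper: given per-episode delivery counts under
# each policy, compute the percent improvement of the RL policy over the
# greedy baseline. A 3.5% lift means roughly 1.035x as many deliveries.

def relative_lift(deliveries_rl, deliveries_greedy):
    """Per-episode delivery counts for each policy -> percent improvement."""
    total_rl = sum(deliveries_rl)
    total_greedy = sum(deliveries_greedy)
    return 100.0 * (total_rl - total_greedy) / total_greedy

# e.g. relative_lift([1035, 1042], [1000, 1005]) -> roughly 3.6
```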
Clearly, our agent learned something about the system in order to gain an edge over the simple heuristic. Perhaps, as in the previous blog entry, the agent learns to wait instead of pursuing every lead. Perhaps the agent learns to yield to competitors when the situation favors it. That’s part of the miracle: the agent has found an edge over the market and has learned to exploit it without explicit instruction.
And why not apply this learning method to a real-world system? Expero has the experience, process, and organization required to model much more sophisticated real-world scenarios. Unless you have perfectly captured the market, we are uniquely positioned to help you find an edge and exploit it.
Again, I must emphasize: in a noisy, random environment with ample opportunities, the agent found a better solution.
Furthermore, we have shown that the solution can scale to much larger problems through scoping. We have shown that a model trained in one environment can be applied to other environments, and that it can even be advantageous to develop models under higher-stress conditions.
We still have more to come: adding greater restrictions and adapting to incidents and events. With further layers of supervision on the agents and the network, and with other classical methods such as alpha-beta (A/B) pruning, we may improve the agents’ performance even further.
What difficult, unpredictable problems do you have? Send Ryan an email!