Counterfactual policy evaluation provides insights to engineers deploying RL models in an offline setting. This graph compares several CPE methods with the logged policy (the system that initially generated the training data). A score of 1.0 means that the RL model and the logged policy match in performance. These results show that the RL model should achieve roughly 2x as much cumulative reward as the logged system.
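To make the interpretation of these scores concrete, here is a minimal sketch of one common CPE estimator, inverse propensity scoring (IPS), normalized by the logged policy's average reward so that 1.0 means parity with the logged system. This is an illustration of the general idea, not Horizon's actual API; the function name, arguments, and toy data are assumptions made for the example.

```python
# Minimal sketch of an inverse propensity scoring (IPS) CPE estimate.
# The score is the estimated value of the new (RL) policy divided by the
# logged policy's average reward, so 1.0 means "matches the logged system".
# Names and data shapes here are illustrative, not Horizon's real interface.
import numpy as np

def ips_score(rewards, logged_probs, target_probs):
    """Estimate (target-policy value) / (logged-policy value) from logged data.

    rewards       -- reward observed for each logged (state, action) pair
    logged_probs  -- probability the logged policy assigned to the taken action
    target_probs  -- probability the new (RL) policy assigns to that action
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logged_probs, dtype=float)

    logged_value = rewards.mean()              # average reward of the logged policy
    target_value = (weights * rewards).mean()  # IPS estimate of the new policy's value
    return target_value / logged_value         # 1.0 => parity; ~2.0 => roughly 2x the reward

# Toy usage: the new policy upweights the actions that earned reward in the logs,
# so its estimated score comes out well above 1.0.
score = ips_score(rewards=[1.0, 0.0, 1.0, 0.0],
                  logged_probs=[0.5, 0.5, 0.5, 0.5],
                  target_probs=[0.9, 0.1, 0.9, 0.1])
print(f"CPE score vs. logged policy: {score:.2f}")  # prints 1.80
```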
