Timing and Parameter Optimization for One-time Motion Problem Based on Reinforcement Learning
Machine Learning Research
Volume 5, Issue 1, March 2020, Pages: 10-17
Received: Feb. 17, 2020;
Accepted: Mar. 3, 2020;
Published: Mar. 24, 2020
Boxuan Fan, Xi’an Research Inst. of High-tech, Xi’an, China
Guiming Chen, Xi’an Research Inst. of High-tech, Xi’an, China
Hongtao Lin, Xi’an Research Inst. of High-tech, Xi’an, China
Many tasks, such as hitting a baseball, swinging a swatter, or catching a football, can be viewed as one-time actions, where the goal is to control the timing and parameters of a single action to achieve the best result. For many one-time motion problems, the optimal policy is difficult to obtain by solving a model, and model-free reinforcement learning has advantages in such cases. However, although reinforcement learning has developed rapidly, there is currently no general algorithm architecture for one-time motion problems. We decompose the one-time motion problem into an action timing problem and an action parameter problem, and construct a suitable reinforcement learning method for each. We design a combination mechanism that allows the two modules to learn simultaneously by passing the estimated value between them while interacting with the environment. We use REINFORCE + DPG for continuous motion parameter spaces, and REINFORCE + Q-learning for discrete motion parameter spaces. To test the algorithm, we designed and implemented an aircraft bombing simulation environment. The test results show that the algorithm converges quickly and stably, and is robust to different time steps and to observation errors.
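The discrete case described above (REINFORCE for timing, Q-learning for the action parameter, coupled through an estimated value) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy reward function, the tabular per-step timing logits, the hyperparameters, and the choice to pass max_a Q(t, a) to the timing module as its return are all assumptions made for illustration. The continuous REINFORCE + DPG variant is not covered.

```python
import math
import random

N_STEPS = 6    # hypothetical episode length (number of decision steps)
N_PARAMS = 3   # hypothetical number of discrete action parameters


def reward(t, a):
    """Toy reward (assumption): best outcome is acting at step 3 with parameter 1."""
    return max(0.0, 1.0 - 0.3 * abs(t - 3) - 0.4 * abs(a - 1))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(episodes=5000, lr_pi=0.1, lr_q=0.2, eps=0.1, seed=0):
    rng = random.Random(seed)
    # Timing module: one logit per step, pi(act | t) = sigmoid(theta[t]).
    theta = [0.0] * N_STEPS
    # Parameter module: tabular Q(t, a) for the one-shot action.
    q = [[0.0] * N_PARAMS for _ in range(N_STEPS)]

    for _ in range(episodes):
        visited = []  # (step, acted?) decisions taken in this episode
        g = 0.0       # return credited to the timing decisions (0 if never acted)
        for t in range(N_STEPS):
            act = rng.random() < sigmoid(theta[t])
            visited.append((t, act))
            if not act:
                continue
            # Epsilon-greedy choice of the action parameter.
            if rng.random() < eps:
                a = rng.randrange(N_PARAMS)
            else:
                a = max(range(N_PARAMS), key=lambda i: q[t][i])
            r = reward(t, a)
            # The action ends the episode, so the Q target is just the reward.
            q[t][a] += lr_q * (r - q[t][a])
            # Assumed coupling: hand the parameter module's value estimate
            # to the timing module instead of the raw sampled reward.
            g = max(q[t])
            break

        # REINFORCE update for a Bernoulli policy:
        # d/d theta log pi = (1 - p) if acted, else -p.
        for t, act in visited:
            p = sigmoid(theta[t])
            grad_log_pi = (1.0 - p) if act else -p
            theta[t] += lr_pi * g * grad_log_pi

    return theta, q


if __name__ == "__main__":
    theta, q = train()
    for t in range(N_STEPS):
        best = max(range(N_PARAMS), key=lambda i: q[t][i])
        print(t, round(sigmoid(theta[t]), 3), best)
```

Passing the value estimate rather than the sampled reward means that exploratory parameter choices do not directly penalize the timing decision; whether this matches the paper's combination mechanism in detail is an assumption.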
Timing and Parameter Optimization for One-time Motion Problem Based on Reinforcement Learning, Machine Learning Research. Vol. 5, No. 1, 2020, pp. 10-17.
Copyright © 2020 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.