融合自適應評判的隨機系統數據驅動(dòng)策略?xún)?yōu)化
doi: 10.16383/j.aas.c230678
1. 北京工業大學信息學部, 北京 100124
2. 計算智能與智能系統北京市重點實驗室, 北京 100124
3. 北京人工智能研究院, 北京 100124
4. 智慧環保北京實驗室, 北京 100124
Data-driven Policy Optimization for Stochastic Systems Involving Adaptive Critic
1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124
2. Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing 100124
3. Beijing Institute of Artificial Intelligence, Beijing 100124
4. Beijing Laboratory of Smart Environmental Protection, Beijing 100124
摘要: 自適應評判技術已經廣泛應用于求解復雜非線性系統的最優控制問題, 但利用其求解離散時間非線性隨機系統的無限時域最優控制問題還存在一定局限性. 本文融合自適應評判技術, 建立一種數據驅動的離散隨機系統折扣最優調節方法. 首先, 針對寬松假設下的非線性隨機系統, 研究帶有折扣因子的無限時域最優控制問題. 所提的隨機系統 Q-learning 算法能夠將初始的容許策略單調不增地優化至最優策略. 基于數據驅動思想, 隨機系統 Q-learning 算法在不建立模型的情況下直接利用數據進行策略優化. 其次, 利用執行-評判神經網絡方案, 實現了隨機系統 Q-learning 算法. 最后, 通過兩個基準系統, 驗證本文提出的隨機系統 Q-learning 算法的有效性.
關鍵詞: 自適應評判設計 / 數據驅動 / 離散系統 / 神經網絡 / Q-learning / 隨機最優控制
Abstract: Adaptive critic technology has been widely employed to solve the optimal control problems of complicated nonlinear systems, but there are still limitations when it is used to solve the infinite-horizon optimal control problems of discrete-time nonlinear stochastic systems. In this paper, we establish a data-driven discounted optimal regulation method for discrete-time stochastic systems involving adaptive critic technology. First, we investigate the infinite-horizon optimal control problem with a discount factor for nonlinear stochastic systems under a relaxed assumption. The developed stochastic Q-learning algorithm can optimize an initial admissible policy to the optimal one in a monotonically nonincreasing way. Based on the data-driven idea, the policy optimization of the stochastic Q-learning algorithm is executed directly from data without building a dynamic model. Then, the stochastic Q-learning algorithm is implemented by utilizing the actor-critic neural network scheme. Finally, two nonlinear benchmarks are given to demonstrate the overall performance of the developed stochastic Q-learning algorithm.
Key words: adaptive critic design / data-driven / discrete systems / neural networks / Q-learning / stochastic optimal control
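As background for the abstract, the display below is a minimal sketch of the discounted stochastic optimal regulation formulation that a stochastic Q-learning scheme of this kind is typically built on. The quadratic utility, the dynamics notation $F(x_k, u_k, w_k)$, and the symbols $x_k$, $u_k$, $w_k$, $\lambda$, $\mathcal{Q}$, $\mathcal{R}$ are assumptions inferred from the parameter table below, not definitions quoted from the paper.

$$
\begin{aligned}
J^{\pi}(x_0) &= \mathbb{E}\left[\sum_{k=0}^{\infty} \lambda^{k}\left(x_k^{\top}\mathcal{Q}x_k + u_k^{\top}\mathcal{R}u_k\right)\right], \qquad x_{k+1} = F(x_k, u_k, w_k), \quad 0 < \lambda \le 1, \\
Q^{\pi}(x_k, u_k) &= x_k^{\top}\mathcal{Q}x_k + u_k^{\top}\mathcal{R}u_k + \lambda\,\mathbb{E}\bigl[Q^{\pi}(x_{k+1}, \pi(x_{k+1}))\bigr], \\
\pi^{*}(x) &= \arg\min_{u} Q^{*}(x, u).
\end{aligned}
$$

Under these assumptions, alternately evaluating $Q^{\pi_j}$ and improving the policy via $\pi_{j+1}(x) = \arg\min_{u} Q^{\pi_j}(x, u)$ is the standard route to the monotonically nonincreasing cost sequence mentioned in the abstract.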
圖 2 執行網(wǎng)絡(luò )權值曲線(xiàn) (基準系統 I)
Fig. 2 Curves of action network weights (Benchmark system I)
圖 3 控制策略測試曲線(xiàn) (基準系統 I)
Fig. 3 Curves of control policies for performance test (Benchmark system I)
圖 6 執行網(wǎng)絡(luò )權值曲線(xiàn) (基準系統 II)
Fig. 6 Curves of action network weights (Benchmark system II)
表 1 隨機 Q-learning 算法的主要參數
Table 1 Main parameters of the stochastic Q-learning algorithm
算法參數      $\mathcal{Q}$    $\mathcal{R}$    $\rho_{\max}$    $\lambda$    $\epsilon$
基準系統I     $2I_2$           2.0              300              0.97         0.01
基準系統II    $0.1I_4$         0.1              500              0.99         0.01

表 2 球臺平衡系統的主要參數
Table 2 Main parameters of the ball-and-beam system
符號及取值                                          物理意義
$S_t = 0.001\;{\rm N/m}$                            驅動機械剛度
$L_\omega = 0.5\;{\rm m}$                           平臺半徑
$L = 0.48\;{\rm m}$                                 電機作用半徑
$f_c = 1\;{\rm N\cdot s/m}$                         驅動電機的機械摩擦系數
$I_\omega = 0.140\;25\;{\rm kg\cdot m^2}$           平臺慣性矩
$g = 9.8\;{\rm m/s^2}$                              重力加速度
$\varpi = 0.016\;2\;{\rm kg}$                       球體質量
$\tau = 0.02\;{\rm m}$                              球體滾動半徑
$I_b = 4.32\times10^{-5}\;{\rm kg\cdot m^2}$        球體轉動慣量
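To make the data-driven, model-free aspect concrete, the following Python sketch shows one way a discounted Q-learning iteration could be run directly on sampled transition data. It is an illustration only: the actor and critic networks are replaced by a polynomial least-squares critic and a grid-search actor for brevity, the data set is synthetic, and the reading of $\rho_{\max}$ as a step budget, $\lambda$ as the discount factor, and $\epsilon$ as a convergence tolerance (values from Table 1, benchmark system I) are assumptions, not the implementation reported in the paper.

```python
import numpy as np

# Assumed settings (Table 1, benchmark system I): Q = 2*I_2, R = 2.0,
# rho_max = 300 iterations, lambda = 0.97, epsilon = 0.01.
Qmat, R, rho_max, lam, eps = 2.0 * np.eye(2), 2.0, 300, 0.97, 0.01

def utility(x, u):
    """Quadratic stage cost x^T Q x + u^T R u (assumed form)."""
    return float(x @ Qmat @ x + R * u * u)

# Hypothetical transition data set {(x_k, u_k, x_{k+1})} collected from the
# plant; no explicit system model is identified from it.
rng = np.random.default_rng(0)
data = [(rng.normal(size=2), rng.normal(), rng.normal(size=2)) for _ in range(500)]

def features(x, u):
    """Polynomial features approximating Q(x, u); stand-in for the critic network."""
    x1, x2 = x
    return np.array([x1*x1, x2*x2, x1*x2, x1*u, x2*u, u*u])

w = np.zeros(6)                 # critic weights
policy = lambda x: 0.0          # initial (assumed admissible) policy

for step in range(rho_max):
    w_old = w.copy()
    # Policy evaluation by least squares: Q(x,u) = U(x,u) + lam * Q(x', pi(x')).
    Phi = np.array([features(x, u) for x, u, xn in data])
    y = np.array([utility(x, u) + lam * features(xn, policy(xn)) @ w_old
                  for x, u, xn in data])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    # Greedy policy improvement over a coarse action grid (actor stand-in).
    grid = np.linspace(-2.0, 2.0, 81)
    policy = lambda x, w=w: grid[np.argmin([features(x, a) @ w for a in grid])]
    if np.linalg.norm(w - w_old) < eps:   # stop once the critic weights settle
        break
```

In an actor-critic neural network realization, the least-squares fit would be replaced by gradient training of a critic network on the same targets, and the grid search by an actor network trained to minimize the critic output.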