Active Learning Method for Imbalanced Concept Drift Data Stream
doi: 10.16383/j.aas.c230233
1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006
Abstract: Data stream classification studies how to provide more reliable data-driven prediction models in open and dynamic environments. The key is to detect and adapt to concept drift in data streams that arrive in real time and change continuously. To detect concept drift and update classification models, current data stream classification methods usually assume that the labels of all samples are known, which is unrealistic in real scenarios. In addition, real data streams may exhibit high and constantly changing class imbalance ratios, which further increases the complexity of the classification task. This paper therefore proposes an active learning method for imbalanced concept drift data stream (ALM-ICDDS). First, a sample prediction certainty measure based on multiple prediction probabilities is defined, and an adaptive adjustment method for the margin threshold matrix is proposed, so that the label query strategy suits imbalanced data streams with a large number of classes. Second, a sample replacement strategy based on memory strength is proposed; it keeps hard-to-distinguish samples, minority-class samples, and samples that represent the current data distribution in the memory window, which improves the classification performance of new base classifiers. Finally, an importance evaluation and update method for base classifiers based on classification accuracy is defined, which updates the ensemble classifier after a drift. Comparative experiments on seven synthetic data streams and three real data streams show that the proposed method achieves better classification performance than six concept drift data stream learning methods.
Key words:
- Data stream classification /
- active learning /
- concept drift /
- multi-class imbalance
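The margin-based label query mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract gives no formulas, so the sketch assumes the certainty of a prediction is the gap between the two largest predicted probabilities and that the threshold matrix is indexed by the predicted class and the runner-up class. The names `top_two`, `should_query`, and `thresholds` are hypothetical, and the adaptive threshold adjustment is not shown.

```python
def top_two(probs):
    """Return the indices of the most likely and second most likely class."""
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return order[0], order[1]


def should_query(probs, thresholds):
    """Decide whether to request the true label of a sample.

    probs      -- predicted class probabilities for one sample
    thresholds -- square matrix; thresholds[i][j] is the margin threshold used
                  when class i is predicted and class j is the runner-up
                  (assumed indexing, not taken from the source)
    """
    best, runner_up = top_two(probs)
    margin = probs[best] - probs[runner_up]
    # A small margin means the classifier is uncertain, so the label is queried.
    return margin < thresholds[best][runner_up]


# Usage: three classes with a uniform threshold of 0.3 for every class pair.
thresholds = [[0.3] * 3 for _ in range(3)]
uncertain = should_query([0.5, 0.4, 0.1], thresholds)     # margin 0.1
confident = should_query([0.9, 0.05, 0.05], thresholds)   # margin 0.85
```

Keeping one threshold per class pair rather than a single global value would let the query rate differ between majority and minority classes, which is presumably why a matrix is used.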
Table 1 Characteristics of the data streams

| No. | Data stream | Samples | Features | Classes | Class distribution | Outliers (%) | Number of drifts |
|---|---|---|---|---|---|---|---|
| 1 | DS1 | 400000 | 25 | 15 | Balanced | 0 | 0 |
| 2 | DS2 | 400000 | 25 | 15 | Balanced | 5 | 3 |
| 3 | DS3 | 400000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5) | 0 | 0 |
| 4 | DS4 | 400000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5) | 5 | 3 |
| 5 | DS5 | 400000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5), (2/2/3/3/5/1/1/1/1/1/1/1/1/1/1) | 0 | 0 |
| 6 | DS6 | 400000 | 25 | 15 | (1/1/1/1/1/1/1/1/1/1/2/2/3/3/5), (2/2/3/3/5/1/1/1/1/1/1/1/1/1/1) | 5 | 3 |
| 7 | DS7 | 400000 | 25 | 50 | Balanced | 5 | 3 |
| 8 | Kddcup99_10% | 494000 | 42 | 23 | - | - | - |
| 9 | Shuttle | 570000 | 10 | 7 | - | - | - |
| 10 | PokerHand | 830000 | 10 | 10 | - | - | - |
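Table 2 Characteristics of the concept drift data streams

| No. | Data stream | Concept drift type | Samples | Features | Classes | Drift width |
|---|---|---|---|---|---|---|
| 1 | DS8 | Sudden | 400000 | 25 | 15 | 1 |
| 2 | DS9 | Recurring | 400000 | 25 | 15 | 1 |
| 3 | DS10 | Incremental | 400000 | 25 | 15 | 10000 |
| 4 | DS11 | Gradual | 400000 | 25 | 15 | 10000 |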
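To make the drift width column concrete, the sketch below shows one common way a width parameter controls how quickly a stream switches concepts. It assumes a sigmoid transition of the kind used by stream generators such as MOA's ConceptDriftStream; the source does not state how DS8 to DS11 were generated, so the function name `new_concept_probability` and the drift position used in the usage lines are illustrative only.

```python
import math


def new_concept_probability(t, drift_position, width):
    """Probability that the sample at time t is drawn from the new concept.

    Assumed sigmoid model: 1 / (1 + exp(-4 * (t - drift_position) / width)).
    A width of 1 gives an almost instantaneous switch; a width of 10000
    spreads the transition over roughly 10000 samples.
    """
    exponent = -4.0 * (t - drift_position) / width
    if exponent > 700:  # avoid float overflow far before the drift point
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))


# Usage with a hypothetical drift centred at sample 100000.
sudden_before = new_concept_probability(99990, 100000, 1)       # effectively 0
sudden_after = new_concept_probability(100010, 100000, 1)       # effectively 1
gradual_midway = new_concept_probability(105000, 100000, 10000)
```

Under this model, widths of 1 and 10000 correspond to an abrupt change and a slow transition respectively, matching the two groups of streams in the table.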
Table 3 P values of the seven algorithms (%)

| Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
|---|---|---|---|---|---|---|---|
| DS1 | 96.89±0.31 | 96.36±0.11 | 98.07±0.43 | 98.01±0.41 | 98.03±0.25 | 97.18±0.48 | 99.07±0.34 |
| DS2 | 90.61±0.21 | 88.63±0.54 | 92.77±0.42 | 93.31±0.14 | 93.27±0.49 | 91.97±0.26 | 94.64±0.15 |
| DS3 | 94.41±0.11 | 96.07±0.23 | 96.74±0.45 | 96.64±0.34 | 96.75±0.56 | 96.46±0.61 | 97.84±0.24 |
| DS4 | 86.91±0.45 | 85.23±0.52 | 88.30±0.29 | 89.90±0.28 | 90.27±0.42 | 89.70±0.72 | 92.06±0.28 |
| DS5 | 93.60±0.48 | 94.04±0.52 | 96.30±0.18 | 94.65±0.49 | 95.47±0.32 | 94.24±0.35 | 96.17±0.19 |
| DS6 | 86.59±0.19 | 84.69±0.48 | 88.02±0.47 | 88.44±0.19 | 88.65±0.25 | 87.41±0.40 | 90.86±0.37 |
| DS7 | 88.25±0.86 | 87.21±0.79 | 90.16±0.92 | 90.49±0.47 | 90.51±0.53 | 89.32±0.38 | 93.67±0.40 |
| Kddcup99_10% | 83.85±0.59 | 81.10±0.15 | 85.56±0.54 | 92.12±0.45 | 92.13±0.31 | 91.24±0.51 | 95.80±0.17 |
| Shuttle | 64.63±0.42 | 63.85±0.27 | 79.07±0.31 | 85.35±0.14 | 85.70±0.32 | 83.48±0.25 | 85.99±0.13 |
| PokerHand | 51.63±0.39 | 50.36±0.35 | 52.51±0.56 | 53.93±0.28 | 54.57±0.50 | 52.90±0.18 | 55.89±0.51 |
Table 6 Kappa values of the seven algorithms (%)

| Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
|---|---|---|---|---|---|---|---|
| DS1 | 95.09±0.43 | 95.47±0.26 | 97.11±0.33 | 97.84±0.18 | 97.52±0.50 | 96.31±0.53 | 98.72±0.18 |
| DS2 | 89.66±0.50 | 88.28±0.45 | 91.80±0.17 | 92.55±0.25 | 92.65±0.28 | 91.27±0.29 | 93.56±0.46 |
| DS3 | 93.08±0.13 | 95.68±0.22 | 95.62±0.53 | 96.50±0.46 | 96.46±0.60 | 96.05±0.36 | 97.69±0.21 |
| DS4 | 86.97±0.46 | 85.86±0.13 | 88.18±0.25 | 89.94±0.24 | 89.99±0.36 | 88.61±0.46 | 90.19±0.57 |
| DS5 | 92.32±0.37 | 94.18±0.45 | 95.86±0.28 | 94.40±0.50 | 95.52±0.14 | 94.29±0.20 | 95.81±0.35 |
| DS6 | 86.59±0.32 | 85.25±0.29 | 87.81±0.54 | 88.90±0.51 | 89.00±0.13 | 87.68±0.47 | 89.80±0.25 |
| DS7 | 88.28±0.46 | 87.51±0.97 | 89.93±0.71 | 90.01±0.92 | 90.19±0.40 | 89.51±0.59 | 93.67±0.54 |
| Kddcup99_10% | 80.94±0.22 | 75.68±0.25 | 79.36±0.35 | 83.32±0.24 | 85.83±0.50 | 84.87±0.16 | 86.81±0.33 |
| Shuttle | 58.73±0.39 | 61.54±0.22 | 73.78±0.20 | 79.39±0.43 | 80.11±0.53 | 80.97±0.24 | 83.56±0.54 |
| PokerHand | 50.34±0.58 | 49.86±0.40 | 50.36±0.16 | 51.24±0.21 | 51.39±0.16 | 50.55±0.41 | 52.25±0.35 |
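For readers unfamiliar with the Kappa statistic reported above, the following sketch computes Cohen's kappa from a confusion matrix. It uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance; whether the authors used exactly this variant is an assumption, and the function name `cohen_kappa` and the example matrix are illustrative.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as
    class j. Undefined (division by zero) when chance agreement equals 1.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Usage: a hypothetical two-class confusion matrix with 70% accuracy.
example = [[20, 5],
           [10, 15]]
kappa = cohen_kappa(example)  # observed 0.7, chance 0.5
```

Because kappa discounts agreement that a chance-level classifier would reach, it tends to be more informative than raw accuracy on imbalanced streams.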
Table 4 R values of the seven algorithms (%)

| Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
|---|---|---|---|---|---|---|---|
| DS1 | 94.78±0.13 | 96.04±0.24 | 96.81±0.59 | 97.87±0.24 | 97.92±0.25 | 96.15±0.31 | 98.63±0.17 |
| DS2 | 88.65±0.25 | 87.86±0.53 | 90.35±0.30 | 91.54±0.54 | 91.84±0.58 | 90.78±0.70 | 92.30±0.24 |
| DS3 | 92.55±0.45 | 95.92±0.32 | 94.80±0.43 | 96.12±0.14 | 97.92±0.54 | 95.99±0.52 | 98.55±0.29 |
| DS4 | 87.03±0.49 | 87.08±0.39 | 88.23±0.31 | 90.50±0.30 | 91.07±0.52 | 90.13±0.43 | 91.15±0.11 |
| DS5 | 91.54±0.11 | 92.33±0.51 | 96.04±0.20 | 93.82±0.55 | 94.94±0.27 | 92.91±0.42 | 96.53±0.42 |
| DS6 | 86.56±0.50 | 85.48±0.24 | 87.83±0.49 | 89.43±0.18 | 88.85±0.36 | 88.39±0.34 | 90.63±0.21 |
| DS7 | 87.19±0.42 | 86.12±0.11 | 87.29±0.36 | 88.41±0.50 | 88.77±0.43 | 87.87±0.20 | 91.61±0.78 |
| Kddcup99_10% | 60.89±0.50 | 63.05±0.50 | 58.26±0.38 | 61.88±0.38 | 63.71±0.54 | 63.42±0.67 | 69.34±0.57 |
| Shuttle | 61.40±0.21 | 50.84±0.31 | 54.36±0.35 | 59.52±0.41 | 63.12±0.59 | 61.79±0.16 | 64.59±0.29 |
| PokerHand | 43.57±0.30 | 44.78±0.46 | 55.21±0.60 | 56.84±0.11 | 52.77±0.54 | 55.36±0.25 | 59.57±0.43 |
Table 5 F1 values of the seven algorithms (%)

| Data stream | LB | BOLE | ARFRE | CALMID | OALM-IDS | ALM-ICDDS-E | ALM-ICDDS |
|---|---|---|---|---|---|---|---|
| DS1 | 95.82±0.18 | 96.20±0.16 | 97.44±0.50 | 97.94±0.30 | 97.97±0.25 | 96.66±0.37 | 98.85±0.23 |
| DS2 | 89.62±0.23 | 88.24±0.53 | 91.54±0.35 | 92.42±0.22 | 92.55±0.53 | 91.37±0.43 | 93.46±0.18 |
| DS3 | 93.47±0.18 | 95.99±0.27 | 95.76±0.44 | 96.38±0.20 | 97.33±0.55 | 96.22±0.57 | 98.19±0.26 |
| DS4 | 86.97±0.47 | 86.15±0.45 | 88.26±0.30 | 90.20±0.29 | 90.67±0.46 | 89.91±0.59 | 91.60±0.16 |
| DS5 | 92.55±0.17 | 93.18±0.30 | 96.17±0.19 | 94.23±0.52 | 95.20±0.29 | 93.57±0.38 | 96.35±0.26 |
| DS6 | 86.57±0.27 | 85.08±0.32 | 87.92±0.48 | 88.93±0.18 | 88.75±0.30 | 87.90±0.35 | 90.74±0.27 |
| DS7 | 87.72±0.56 | 86.66±0.19 | 88.70±0.52 | 89.44±0.48 | 89.61±0.47 | 88.59±0.29 | 92.63±0.40 |
| Kddcup99_10% | 70.55±0.54 | 70.94±0.23 | 69.32±0.45 | 74.03±0.22 | 75.33±0.39 | 74.82±0.54 | 80.45±0.49 |
| Shuttle | 62.97±0.28 | 56.61±0.29 | 64.43±0.33 | 70.13±0.21 | 72.70±0.41 | 71.01±0.20 | 73.77±0.18 |
| PokerHand | 47.26±0.34 | 47.41±0.40 | 53.83±0.57 | 55.35±0.16 | 56.12±0.52 | 54.10±0.23 | 57.67±0.72 |
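The relation between the P, R, and F1 tables can be checked with a short sketch. It assumes that P and R denote precision and recall and that F1 is their harmonic mean; the page does not define these metrics, but the reported means appear consistent with that reading. The function name `f1_from_precision_recall` is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Usage: mean P and R values on DS1 taken from Tables 3 and 4 (in %).
f1_alm_icdds = f1_from_precision_recall(99.07, 98.63)  # Table 5 reports 98.85
f1_lb = f1_from_precision_recall(96.89, 94.78)         # Table 5 reports 95.82
```

The check uses only the reported means; the standard deviations in the tables cannot be combined this way.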