6 Aug 2012 09:19
[imbalanced dataset] Performance drops a lot when applying an independent dataset
王佳 <brandydydy <at> gmail.com>
2012-08-06 07:19:35 GMT
Hi Weka experts,
I am currently doing a project at a social game company.
We want to predict whether a customer is going to pay or not.
The dataset is highly imbalanced (the ratio of paying to non-paying customers is around 1:70).
To increase the TPR (true positive rate), I used SMOTE to oversample the positive (paying) class by up to 1000%.
The TPR increased from 0.018 to about 0.8 (more or less, depending on other factors such as the training method used).
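For reference, here is a small sketch of the arithmetic behind these numbers. The counts (100 paying, 7000 non-paying) are made up only to match the 1:70 ratio, and reading "1000%" as "ten synthetic instances per original positive instance" is an assumption about how the SMOTE percentage is applied.

```python
def tpr(true_positives, false_negatives):
    """True positive rate: share of actual payers that the model flags."""
    return true_positives / (true_positives + false_negatives)


def class_counts_after_oversampling(minority, majority, percentage):
    """Class counts after adding `percentage` % synthetic minority instances."""
    synthetic = minority * percentage // 100
    return minority + synthetic, majority


# Hypothetical counts, chosen only to match the 1:70 ratio.
before = (100, 7000)
after = class_counts_after_oversampling(*before, percentage=1000)
print(before, "->", after)
```

Under these assumptions the positive class is still the minority after oversampling (roughly 1:6.4 instead of 1:70).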
I used cross-validation and a percentage split to test the performance.
But when I supply a completely different dataset, generated from another time frame, the performance of
the model drops a lot. The TPR, for example, drops from 0.76 (with 10-fold cross-validation) to 0.26.
I wonder why the drop is so large.
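One thing that may be worth checking, offered as an assumption rather than a diagnosis: if the oversampling is applied to the whole dataset before cross-validation, the test folds also contain synthetic instances and no longer have the original 1:70 ratio, so the cross-validated TPR and the TPR on an untouched dataset are not measured under the same conditions. The sketch below splits first and oversamples only the training part. It uses plain duplication as a stand-in for SMOTE, and all function names are hypothetical.

```python
import random


def stratified_folds(rows, k, seed=0):
    """Split (features, label) rows into k folds, keeping the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted({lab for _, lab in rows}):
        members = [r for r in rows if r[1] == label]
        rng.shuffle(members)
        for i, row in enumerate(members):
            folds[i % k].append(row)
    return folds


def oversample_minority(train, minority_label, percentage, seed=0):
    """Stand-in for SMOTE: duplicates minority rows instead of interpolating."""
    rng = random.Random(seed)
    minority = [r for r in train if r[1] == minority_label]
    extra_count = len(minority) * percentage // 100
    extra = [rng.choice(minority) for _ in range(extra_count)]
    return train + extra


def cv_splits(rows, k, minority_label, percentage):
    """Yield (train, test) pairs; only the training part is oversampled."""
    folds = stratified_folds(rows, k)
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield oversample_minority(train, minority_label, percentage), test
```

Each test fold then keeps the original class ratio, which is closer to what an independent dataset looks like. In Weka, wrapping the filter and the classifier in `weka.classifiers.meta.FilteredClassifier` is, as far as I know, the usual way to get this behaviour.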
With an imbalanced dataset like this, what should I do? I would also like to try undersampling, but how do I do that in Weka?
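To be clear about what is meant by undersampling here, this is a minimal sketch of random undersampling of the majority class. It is plain Python, not Weka code; the function name and the `max_ratio` parameter are hypothetical. To my knowledge the closest built-in Weka filter is `weka.filters.supervised.instance.SpreadSubsample`, whose `-M` option caps the class distribution spread.

```python
import random


def undersample_majority(rows, majority_label, max_ratio=1.0, seed=0):
    """Randomly drop majority rows until majority:minority <= max_ratio."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[1] == majority_label]
    minority = [r for r in rows if r[1] != majority_label]
    # Never ask for more majority rows than actually exist.
    keep = min(len(majority), int(len(minority) * max_ratio))
    return minority + rng.sample(majority, keep)


# Hypothetical 1:70 dataset: 10 paying (label 1), 700 non-paying (label 0).
rows = [((i,), 1) for i in range(10)] + [((i,), 0) for i in range(10, 710)]
balanced = undersample_majority(rows, majority_label=0, max_ratio=1.0)
print(len(balanced))
```

As with oversampling, this would be applied to the training data only, so that the test data keeps its original class ratio.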
Thanks a lot!!!!
Best regards,
Brandy
--
Jia WANG
Uppsala University
Rackarbergsgatan 24-232
UPPSALA
Sweden
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html