王佳 | 6 Aug 2012 09:19

[imbalanced dataset] Performance drops a lot when applying an independent dataset

Hi Weka experts,


I am currently doing a project at a social game company.

We want to predict whether a customer is going to pay or not.

The dataset is highly imbalanced: the ratio of paying to non-paying customers is around 1:70.

To increase the true positive rate (TPR), I used SMOTE to oversample the positive (paying) class by up to 1000%.

The TPR increased from 0.018 to about 0.8 (more or less, depending on other factors such as the training method used).

I used cross-validation and a percentage split to evaluate performance.

But when I supply a completely different dataset, generated from another time frame, the performance of the model drops a lot. The TPR, for example, drops from 0.76 (with 10-fold cross-validation) to 0.26.

I wonder why the drop is so large.

With an imbalanced dataset like this, what should I do? I would also like to try undersampling, but how should I do it in Weka?

Thanks a lot!!!!

Best regards,
Brandy
--

Jia WANG
Uppsala University

Rackarbergsgatan 24-232
UPPSALA
Sweden



_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
David Sharpe | 6 Aug 2012 21:05

Re: [imbalanced dataset] Performance drops a lot when applying an independent dataset

Hello Brandy,

I doubt that anyone will be able to explain why performance dropped so much without knowing more about the problem you are working on: for example, what kind of data and what kind of features you have.

It sounds like the model may be overfitting the initial dataset. It is also possible that the "totally different dataset which is generated from another time frame" is simply very unlike your initial dataset; either would make performance drop. I have been taught to use more data or fewer features when a model overfits, but that is just a general recommendation which might not apply here.
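
One thing that may be worth ruling out (an assumption on my part, since the thread does not say how the filter was applied): if SMOTE was run on the whole dataset before cross-validation, the test folds contain synthetic positives built from training-fold positives, and the cross-validated TPR can look better than it will be on new data. In Weka, as far as I know, wrapping the filter and the classifier in a FilteredClassifier is the usual way to resample the training folds only. The plain-Java sketch below shows only the ordering (split first, then resample the training part). It uses simple duplication instead of SMOTE, the folds are not stratified, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class FoldwiseOversampling {

    /** Indices held out for testing in one fold (simple i % k assignment, not stratified). */
    static List<Integer> testIndices(int n, int k, int fold) {
        List<Integer> test = new ArrayList<>();
        for (int i = fold; i < n; i += k) {
            test.add(i);
        }
        return test;
    }

    /**
     * Training indices for one fold. Each positive training index is repeated
     * extraCopies more times; held-out indices are never added or duplicated.
     */
    static List<Integer> oversampledTrainIndices(boolean[] positive, int k, int fold, int extraCopies) {
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < positive.length; i++) {
            if (i % k == fold) {
                continue; // belongs to the test fold, leave it untouched
            }
            train.add(i);
            if (positive[i]) {
                for (int c = 0; c < extraCopies; c++) {
                    train.add(i); // duplicate the rare class in the training part only
                }
            }
        }
        return train;
    }

    public static void main(String[] args) {
        // 10 positives among 710 instances, roughly the 1:70 ratio from the question.
        boolean[] positive = new boolean[710];
        for (int i = 0; i < positive.length; i += 71) {
            positive[i] = true;
        }
        int k = 10;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> train = oversampledTrainIndices(positive, k, fold, 10);
            List<Integer> test = testIndices(positive.length, k, fold);
            System.out.println("fold " + fold + ": train=" + train.size() + " test=" + test.size());
        }
    }
}
```

The point of the ordering is that every test fold keeps the original class ratio and contains only real instances, which is closer to what the model will see on data from a new time frame.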

> I wanted to use under sampling method also but how should I do it in weka?

If you want to undersample the negative instances, I assume you can just remove a random subset of them.
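
As a rough illustration of "remove a random subset", here is a minimal plain-Java sketch. It does not use the Weka API; the class, the Example type and the maxRatio parameter are hypothetical names chosen for this example. I believe Weka also ships a supervised instance filter for this purpose (SpreadSubsample, under weka.filters.supervised.instance), but please check the documentation for your version before relying on it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class RandomUndersampler {

    /** One labeled instance; positive marks the rare class (paying customers). */
    static final class Example {
        final double[] features;
        final boolean positive;

        Example(double[] features, boolean positive) {
            this.features = features;
            this.positive = positive;
        }
    }

    /**
     * Keeps every positive instance and a random subset of the negatives,
     * so that negatives number at most maxRatio times the positives.
     */
    static List<Example> undersample(List<Example> data, double maxRatio, long seed) {
        List<Example> positives = new ArrayList<>();
        List<Example> negatives = new ArrayList<>();
        for (Example e : data) {
            if (e.positive) {
                positives.add(e);
            } else {
                negatives.add(e);
            }
        }
        // Shuffle with a fixed seed so the subset is random but reproducible.
        Collections.shuffle(negatives, new Random(seed));
        int keep = (int) Math.min(negatives.size(), Math.round(positives.size() * maxRatio));

        List<Example> result = new ArrayList<>(positives);
        result.addAll(negatives.subList(0, keep));
        // Mix the classes again so the output is not ordered by class.
        Collections.shuffle(result, new Random(seed + 1));
        return result;
    }

    public static void main(String[] args) {
        // Toy data with the 1:70 ratio from the question: 10 payers, 700 non-payers.
        List<Example> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(new Example(new double[] {i}, true));
        }
        for (int i = 0; i < 700; i++) {
            data.add(new Example(new double[] {i}, false));
        }
        List<Example> balanced = undersample(data, 1.0, 42L);
        System.out.println("instances after undersampling: " + balanced.size());
    }
}
```

Undersampling should only be applied to the training data; the test set (or the data from the other time frame) should keep its natural class ratio so that the measured TPR reflects real use.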

I am new to Weka and machine learning too, but I hope this helps.

Regards,

--
David Sharpe
Software Developer
Seeker Solutions Inc.
