utku yabas | 15 Jun 2012 13:29

Working with large datasets, preprocessing and issues with categorical features

Hi,  

I have 3 questions:
1- I have a data set of around 1 GB (50,000 instances and 15,000 attributes).
Can you suggest some ways to work with a data set of this size? I cannot even load the data. I need to apply decision trees, random forests, logistic regression and so on.

2- How can I apply the same pre-processing steps to the test set as to the training set?
These steps include normalization, replacing missing values, etc., as well as data transformations such as PCA.

3- If there are nominal attributes with a huge vocabulary, is there a preprocessing step that keeps the 10 or 20 most common values and groups all the others into a single category?
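
To make the grouping asked about in question 3 concrete, here is a minimal sketch in plain Java. It is not a Weka filter: the class name RareValueGrouper, its methods, and the "OTHER" label are all hypothetical and chosen only for illustration. It assumes the set of values to keep is decided from the training data and then reused for any later data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Weka): keeps the k most frequent nominal values. */
final class RareValueGrouper {

    /** Returns the k most frequent values; ties are broken alphabetically. */
    static Set<String> topK(List<String> trainValues, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : trainValues) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> values = new ArrayList<>(counts.keySet());
        values.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a)); // most frequent first
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new HashSet<>(values.subList(0, Math.min(k, values.size())));
    }

    /** Maps a value to itself if it was kept, otherwise to the catch-all label. */
    static String group(String value, Set<String> kept) {
        return kept.contains(value) ? value : "OTHER"; // "OTHER" is an assumed label
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("a", "a", "a", "b", "b", "c", "d");
        Set<String> kept = topK(train, 2);
        for (String v : Arrays.asList("a", "c", "unseen")) {
            System.out.println(v + " -> " + group(v, kept));
        }
    }
}
```

Values that never occurred in the training data also fall into the catch-all group, so the same mapping can be applied to a test set without changing the attribute's vocabulary.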

Thanks in advance 
Utku

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist <at> list.scms.waikato.ac.nz
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Mark Hall | 20 Jun 2012 10:45

Re: Working with large datasets, preprocessing and issues with categorical features

On 15/06/12 11:29 PM, utku yabas wrote:
> Hi,
> I have 3 questions:
> 1-
> I have a data set around 1GB (50.000 instances and 15.000 attr.)
> Can you advise some ways to work with this data set for me? I can not
> even load the data. I need to apply Decision trees, Random Forests,
> Logistic Regression and so on.
>
> 2- How can I apply the same pre-processing steps as Training set to Test
> Set?
> These steps include normalization, replacing missing values, etc. as
> well as data transformations like PCA's
>
> 3- If there are nominal attributes with huge vocabulary, Is there a
> preprocessing step for grouping the 10 or 20 most common values and
> grouping others as another group.
>
> Thanks in advance
> Utku
>

There is some info on handling large data sets at:

http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka

The FilteredClassifier is the most convenient way of ensuring that the 
pre-processing that occurs during training also gets applied to test 
instances. Since PCA has a run time that is cubic in the number of
attributes, it will be infeasible to apply it to your full data set.
Applying PCA to random subspaces as part of an ensemble approach might 
be one possibility.
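
As a rough illustration of what the FilteredClassifier takes care of, the sketch below learns the preprocessing statistics (means for replacing missing values, minima and maxima for normalization) from the training rows only, and then reuses them unchanged on a test row. It is plain Java, not Weka code: the class name TrainFittedPreprocessor and its fit/transform methods are hypothetical and exist only for this example, and NaN is assumed to mark a missing value.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Weka): preprocessing fitted on training data only. */
final class TrainFittedPreprocessor {
    private double[] mean; // per-attribute mean of the non-missing training values
    private double[] min;  // per-attribute minimum seen in training
    private double[] max;  // per-attribute maximum seen in training

    /** Learns the statistics from the training rows; NaN marks a missing value. */
    void fit(double[][] train) {
        int d = train[0].length;
        mean = new double[d];
        min = new double[d];
        max = new double[d];
        int[] count = new int[d];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : train) {
            for (int j = 0; j < d; j++) {
                if (Double.isNaN(row[j])) continue;
                mean[j] += row[j];
                count[j]++;
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        for (int j = 0; j < d; j++) {
            if (count[j] > 0) {
                mean[j] /= count[j];
            } else { // attribute entirely missing in training
                mean[j] = 0.0;
                min[j] = 0.0;
                max[j] = 0.0;
            }
        }
    }

    /** Applies the training-time statistics to any row, training or test. */
    double[] transform(double[] row) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            double v = Double.isNaN(row[j]) ? mean[j] : row[j]; // replace missing value
            double range = max[j] - min[j];
            out[j] = range == 0.0 ? 0.0 : (v - min[j]) / range; // min-max normalization
        }
        return out;
    }

    public static void main(String[] args) {
        TrainFittedPreprocessor p = new TrainFittedPreprocessor();
        p.fit(new double[][] {{0, 10}, {10, Double.NaN}, {5, 30}});
        System.out.println(Arrays.toString(p.transform(new double[] {20, Double.NaN})));
    }
}
```

Note that a test value outside the training range maps outside [0, 1]. That is expected when the statistics come from the training set alone, and it is the behaviour you want if the test set is to be treated exactly like unseen data.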

Cheers,
Mark.
