Bert Gunter | 17 Aug 2012 19:32

Opinion: Why I find factors convenient to use

Folks:

Over the years, many people -- including some who I would consider
real expeRts -- have criticized factors and advocated the use
(sometimes exclusively) of character vectors instead. I would just
like to point out that, for me, factors provide one feature that I
find to be very convenient: ordering of levels. **

As an example, suppose one has a character vector of labels "small",
"medium", and "large". Then most R functions (e.g. tapply()) will
display results involving this vector in alphabetical order, which I
think most would view as undesirable. By converting to a factor with
levels in the logical order, displays will automatically be "logical."
For example:

> x <- sample(c("small","medium","large"),12,replace=TRUE)
> table(x)
x
 large medium  small
     2      3      7
> y <- factor(x,levels=c("small","medium","large")) ##ordered() also would do, but is not necessary for this
> table(y)
y
 small medium  large
     7      3      2
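
The same ordering carries over to tapply(), which is mentioned above. What follows is only a minimal sketch: the vectors `size` and `value` are made up for illustration and are not from the original post.

```r
# Hypothetical data: six measurements, each labelled with a size class.
size  <- c("small", "large", "medium", "small", "large", "medium")
value <- c(1, 9, 5, 2, 10, 6)

# Character labels: tapply() groups them in alphabetical order.
by_chr <- tapply(value, size, mean)
names(by_chr)

# Factor with explicit levels: groups follow the stated level order.
size_f <- factor(size, levels = c("small", "medium", "large"))
by_fac <- tapply(value, size_f, mean)
names(by_fac)
```

Only the order of the groups changes; the group means themselves are the same either way.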

Naturally, this is just my opinion, and I understand why lots of smart
people find factors irritating (at least!). So contrary opinions
cheerily welcomed. But perhaps these comments might be helpful to
those who have been "bitten" by factors or just wonder what all the
fuss is about.

PIKAL Petr | 17 Aug 2012 19:53

Re: Opinion: Why I find factors convenient to use

I second Bert's opinion: factors can be confusing, but they have some quite nice features which cannot be
easily mimicked by plain character vectors. I find the ability to manipulate their levels extremely useful.

> fac<-factor(sample(letters[1:5], 20, replace=TRUE))
> fac
 [1] e e d d e e c e a e a e b b d e c c d b
Levels: a b c d e
> levels(fac)[2:4]<- "new.level"
> fac
 [1] e         e         new.level new.level e         e         new.level
 [8] e         a         e         a         e         new.level new.level
[15] new.level e         new.level new.level new.level new.level
Levels: a new.level e
>

Regards
Petr

________________________________________
From: r-help-bounces <at> r-project.org [r-help-bounces <at> r-project.org] on behalf of Bert
Gunter [gunter.berton <at> gene.com]
Sent: 17 August 2012 19:32
To: r-help <at> r-project.org
Subject: [R] Opinion: Why I find factors convenient to use

Folks:

Over the years, many people -- including some who I would consider
real expeRts -- have criticized factors and advocated the use
(sometimes exclusively) of character vectors instead. I would just

Jeff Newmiller | 17 Aug 2012 19:58

Re: Opinion: Why I find factors convenient to use

I don't know if my recent post on this prompted your post, but I don't see much to argue with in your
discussion. I find factors to be useful for managing display and some kinds of analysis.  

However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay
conversion until late in the game... but usually I do eventually convert in most cases.
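
A rough sketch of that "convert late" workflow, for anyone who wants to try it. The vector `raw` and its messy labels are invented for illustration; real QC steps will of course differ.

```r
# Hypothetical raw labels with stray whitespace and inconsistent case.
raw <- c("small", " Small", "medium", "LARGE ", "large", "medium")

# Converting immediately keeps every spelling variant as its own level.
early <- factor(raw)
nlevels(early)

# Cleaning as character first, then converting, gives the intended levels.
clean <- tolower(trimws(raw))
late  <- factor(clean, levels = c("small", "medium", "large"))
table(late)
```

Any value that does not match one of the stated levels would become NA in `late`, which can itself serve as a QC check.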
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil <at> dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

Bert Gunter <gunter.berton <at> gene.com> wrote:

>Folks:
>
>Over the years, many people -- including some who I would consider
>real expeRts -- have criticized factors and advocated the use
>(sometimes exclusively) of character vectors instead. I would just
>like to point out that, for me, factors provide one feature that I
>find to be very convenient: ordering of levels. **
>
>As an example, suppose one has a character vector of labels "small",
>"medium", and "large". Then most R functions (e.g. tapply()) will
>display results involving this vector in alphabetical order, which I
>think most would view as undesirable. By converting to a factor with
>levels in the logical order, displays will automatically be "logical."
>For example:

Steve Lianoglou | 17 Aug 2012 20:09

Re: Opinion: Why I find factors convenient to use

Hi,

On Fri, Aug 17, 2012 at 1:58 PM, Jeff Newmiller
<jdnewmil <at> dcn.davis.ca.us> wrote:
> I don't know if my recent post on this prompted your post, but I don't see much to argue with in your
> discussion. I find factors to be useful for managing display and some kinds of analysis.
>
> However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay
> conversion until late in the game... but usually I do eventually convert in most cases.

Agreed here -- I actually haven't been tuned into any such recent
conversation (if there was one), but if I were a gambling man, I'd bet
that the majority of the problems people have with factors can
probably be boiled down to the fact that the default value for
stringsAsFactors is TRUE.
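
For readers on a current R: the default described here was changed to FALSE in R 4.0.0, so which behaviour one gets depends on the R version. Passing the argument explicitly sidesteps the question. A small sketch, assuming an inline CSV string `csv` made up for illustration:

```r
# Hypothetical three-row CSV, supplied inline instead of from a file.
csv <- "id,size\n1,small\n2,large\n3,medium\n"

# State the choice explicitly so the result does not depend on the default.
as_chr <- read.csv(text = csv, stringsAsFactors = FALSE)
as_fac <- read.csv(text = csv, stringsAsFactors = TRUE)

class(as_chr$size)
class(as_fac$size)
```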

I like factors -- that said, I am annoyed by them at times, but I
still like them.

Also, Bert mentioned that he thinks they save space over characters --
I believe that this is no longer true, but I'm not certain.

-steve

--

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Bert Gunter | 17 Aug 2012 20:19

Re: Opinion: Why I find factors convenient to use

Steve, et al.:

Yes, if object.size() is to be believed, you're right:

> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
> y <- factor(x)
> object.size(x)
40120 bytes
> object.size(y)
40336 bytes

I stand (happily) corrected.

-- Bert

On Fri, Aug 17, 2012 at 11:09 AM, Steve Lianoglou
<mailinglist.honeypot <at> gmail.com> wrote:
> Hi,
>
> On Fri, Aug 17, 2012 at 1:58 PM, Jeff Newmiller
> <jdnewmil <at> dcn.davis.ca.us> wrote:
>> I don't know if my recent post on this prompted your post, but I don't see much to argue with in your
>> discussion. I find factors to be useful for managing display and some kinds of analysis.
>>
>> However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay
>> conversion until late in the game... but usually I do eventually convert in most cases.
>
> Agreed here -- I actually haven't been tuned into any such recent
> conversation (if there was one), but if I were a gambling man, I'd bet
> that the majority of the problems people have with factors can

Rui Barradas | 17 Aug 2012 20:34

Re: Opinion: Why I find factors convenient to use

Hello,

No, factors may use less memory. System dependent?

 > x <-sample(c("small","medium","large"),1e4,rep=TRUE)
 > y <- factor(x)
 > object.size(x)
80184 bytes
 > object.size(y)
40576 bytes
 >
 > sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Portugal.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

other attached packages:
[1] Rcapture_1.2-0 xts_0.8-0      zoo_1.7-7

loaded via a namespace (and not attached):
[1] chron_2.3-39   fortunes_1.4-2 grid_2.15.1    lattice_0.20-6 tools_2.15.1

And I agree with what Steve said, stringsAsFactors = FALSE saves hours 
(Continue reading)

Peter Langfelder | 17 Aug 2012 20:42

Re: Opinion: Why I find factors convenient to use

On Fri, Aug 17, 2012 at 11:34 AM, Rui Barradas <ruipbarradas <at> sapo.pt> wrote:
> Hello,
>
> No, factors may use less memory. System dependent?

I think it's a 32-bit vs. 64-bit distinction: I get Rui's results on
64-bit Windows and Linux installations, but Bert's result on a 32-bit
Linux machine.
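
The reason is that a character vector holds one pointer per element (4 bytes on a 32-bit build, 8 on a 64-bit build), while a factor holds one 4-byte integer code per element plus its levels. A rough way to check this on one's own machine, offered as a sketch and not a benchmark (the variable names are just for illustration):

```r
# Compare the footprint of a character vector and the equivalent factor.
x <- sample(c("small", "medium", "large"), 1e4, replace = TRUE)
y <- factor(x)

# Pointer width of the running R build in bytes: 4 on 32-bit, 8 on 64-bit.
ptr_bytes <- .Machine$sizeof.pointer

size_chr <- as.numeric(object.size(x))
size_fac <- as.numeric(object.size(y))

cat("pointer bytes:", ptr_bytes, "\n")
cat("character:", size_chr, "bytes; factor:", size_fac, "bytes\n")
```

The exact byte counts vary with the R version and platform, so no expected output is given here.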

Peter

>
>> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
>> y <- factor(x)
>> object.size(x)
> 80184 bytes
>> object.size(y)
> 40576 bytes

Petr Savicky | 18 Aug 2012 09:48

Re: Opinion: Why I find factors convenient to use

On Fri, Aug 17, 2012 at 07:34:35PM +0100, Rui Barradas wrote:
> Hello,
> 
> No, factors may use less memory. System dependent?
> 
> > x <-sample(c("small","medium","large"),1e4,rep=TRUE)
> > y <- factor(x)
> > object.size(x)
> 80184 bytes
> > object.size(y)
> 40576 bytes
> >
> > sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
> 
> locale:
> [1] LC_COLLATE=Portuguese_Portugal.1252 LC_CTYPE=Portuguese_Portugal.1252
> [3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
> [5] LC_TIME=Portuguese_Portugal.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods base
> 
> other attached packages:
> [1] Rcapture_1.2-0 xts_0.8-0      zoo_1.7-7
> 
> loaded via a namespace (and not attached):
> [1] chron_2.3-39   fortunes_1.4-2 grid_2.15.1    lattice_0.20-6 tools_2.15.1
> 

Jim Lemon | 18 Aug 2012 10:27

Re: Opinion: Why I find factors convenient to use

On 08/18/2012 03:32 AM, Bert Gunter wrote:
> Folks:
> ...
> So contrary opinions
> cheerily welcomed. But perhaps these comments might be helpful to
> those who have been "bitten" by factors or just wonder what all the
> fuss is about.
>
I tend to use stringsAsFactors=FALSE quite a bit, as I am often 
manipulating character strings, and that

Error in strsplit(bugga, "") : non-character argument

is so annoying. Almost as annoying as printing out a list of selected 
cases with some of the fields turning up as integers rather than the 
strings I expected. That said, I often convert the results to factors so 
that some other function will work properly. So I must express my 
gratitude for motivating me to add

options(stringsAsFactors=FALSE)

to that wonderful .First function that makes my life a little happier 
every day.
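
Both annoyances are easy to reproduce. A small sketch, reusing the name `bugga` from the error message above with made-up contents:

```r
# A hypothetical factor of labels, as stringsAsFactors = TRUE would produce.
bugga <- factor(c("ab", "cd", "ab"))

# strsplit() refuses a factor; capture the error message instead of stopping.
res <- tryCatch(strsplit(bugga, ""), error = function(e) conditionMessage(e))
res

# Converting back to character first works.
strsplit(as.character(bugga), "")

# The "fields turning up as integers" surprise: these are the level codes.
as.integer(bugga)
```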

Jim

S Ellison | 20 Aug 2012 13:30

Re: Opinion: Why I find factors convenient to use


> -----Original Message-----
> Over the years, many people -- including some who I would 
> consider real expeRts -- have criticized factors and 
> advocated the use (sometimes exclusively) of character 
> vectors instead. 

Exclusive use of character vectors is not going to do the job.

The concept of a factor is fundamental to a lot of statistics; a programming environment that does not
implement factors and their associated special behaviour is probably not a statistical programming language.

Special behaviours I have in mind include:
- Level order can be arbitrarily specified for display purposes
- A control level can be intentionally chosen for contrasts
- The option of "ordered" factors (for example, for polr and the like)

So I think the language does and will require a 'factor' type in one form or another.
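
The three behaviours listed above can be sketched briefly in base R. The treatment and size labels below are invented for illustration only.

```r
# Hypothetical treatment labels; "placebo" is the intended control.
trt <- factor(c("drugA", "placebo", "drugB", "placebo", "drugA"))
levels(trt)   # sorted order by default

# Choose the control (reference) level used by treatment contrasts.
trt <- relevel(trt, ref = "placebo")
levels(trt)

# An ordered factor records that the levels have a rank, in the stated order.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)

# Comparisons against a level name use the level order, not the alphabet.
size < "large"
```

With plain character vectors the last comparison would be alphabetical, which is rarely what is meant.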

_When_ you decide to convert a character input to a factor is, of course, up to the user, and for cleanup it's
very often better to stick with character early and convert to factor a bit later. But personally, I think
that there is sufficient control over the coding of data to allow user discretion. And on the whole, it
seems to me that character input gets used as factor data so much of the time when it is used at all that
stringsAsFactors=TRUE seems the more sensible default.

S Ellison

*******************************************************************
This email and any attachments are confidential. Any use...{{dropped:8}}


Rui Barradas | 20 Aug 2012 14:03

Re: Opinion: Why I find factors convenient to use

Hello,

On 20-08-2012 12:30, S Ellison wrote:
>   
>
>> -----Original Message-----
>> Over the years, many people -- including some who I would
>> consider real expeRts -- have criticized factors and
>> advocated the use (sometimes exclusively) of character
>> vectors instead.
> Exclusive use of character vectors is not going to do the job.
>
> The concept of a factor is fundamental to a lot of statistics; a programming environment that does not
> implement factors and their associated special behaviour is probably not a statistical programming language.
>
> Special behaviours I have in mind include:
> - Level order can be arbitrarily specified for display purposes
> - A control level can be intentionally chosen for contrasts
> - the option of "ordered" factors (for example, for polr and the like)
>
> So I think the language does and will require a 'factor' type in one form or another.
>
> _When_ you decide to convert a character input to a factor is, of course, up to the user, and for cleanup it's
> very often better to stick with character early and convert to factor a bit later. But personally, I think
> that there is sufficient control over the coding of data to allow user discretion. And on the whole, it
> seems to me that character input gets used as factor data so much of the time when it is used at all that
> stringsAsFactors=TRUE seems the more sensible default.

I disagree with this last point. Just think of the number of questions 
to this list about, say, dates. When read from file using one of the 
(Continue reading)

Nutter, Benjamin | 20 Aug 2012 14:59

Re: Opinion: Why I find factors convenient to use

Whether I use stringsAsFactors=FALSE or stringsAsFactors=TRUE tends to depend on where my data are coming
from.  If the data are coming from our Oracle databases (well controlled data), I import them with
stringsAsFactors=TRUE and everything is great.  If the data are given to me by a fellow in the form of an
Excel spreadsheet, I have a good cry and then set stringsAsFactors=FALSE.  Regardless, before I get to
analyzing the data, I convert them all to factors.  I imagine people's preferences for the default setting
are strongly tied to the quality of the data with which they tend to work.

I would prefer the default argument be left as it is, however.  Mostly because
1) I feel like it assumes you are importing data for analysis and not for data management; and more importantly
2) Changing the default would mean I have to change the way I approach data import--and I don't like to change.

  Benjamin Nutter |  Biostatistician     |  Quantitative Health Sciences
  Cleveland Clinic    |  9500 Euclid Ave.  |  Cleveland, OH 44195  | (216) 445-1365

-----Original Message-----
From: r-help-bounces <at> r-project.org [mailto:r-help-bounces <at> r-project.org] On Behalf Of Rui Barradas
Sent: Monday, August 20, 2012 8:03 AM
To: S Ellison
Cc: r-help
Subject: Re: [R] Opinion: Why I find factors convenient to use

Hello,

On 20-08-2012 12:30, S Ellison wrote:
>   
>
>> -----Original Message-----
>> Over the years, many people -- including some who I would consider 
>> real expeRts -- have criticized factors and advocated the use 
>> (sometimes exclusively) of character vectors instead.

S Ellison | 20 Aug 2012 17:41

Re: Opinion: Why I find factors convenient to use

>> on the whole, it seems to me that character 
>> input gets used as factor data so much of the time when it is 
>> used at all that the default stringsAsFactors=TRUE setting 
>> seems the more sensible default.
> 
> I disagree with this last point. Just think of the number of 
> questions to this list about, say, dates. 

Mileage on this issue is likely to vary within and between useRs. 

For a more than anecdotal answer to the question of whether 'on the whole', character input gets used as
factor data, one would have to construct a more careful survey than this. 

S Ellison


PIKAL Petr | 20 Aug 2012 15:10

Re: Opinion: Why I find factors convenient to use

Hi

> -----Original Message-----
> From: r-help-bounces <at> r-project.org [mailto:r-help-bounces <at> r-
> project.org] On Behalf Of Rui Barradas
> Sent: Monday, August 20, 2012 2:03 PM
> To: S Ellison
> Cc: r-help
> Subject: Re: [R] Opinion: Why I find factors convenient to use
> 
> Hello,
> 
> On 20-08-2012 12:30, S Ellison wrote:
> >
> >
> >> -----Original Message-----
> >> Over the years, many people -- including some who I would consider
> >> real expeRts -- have criticized factors and advocated the use
> >> (sometimes exclusively) of character vectors instead.
> > Exclusive use of character vectors is not going to do the job.
> >
> > The concept of a factor is fundamental to a lot of statistics; a
> programming environment that does not implement factors and their
> associated special behaviour is probably not a statistical programming
> language.
> >
> > Special behaviours I have in mind include:
> > - Level order can be arbitrarily specified for display purposes
> > - A control level can be intentionally chosen for contrasts
> > - the option of "ordered" factors (for example, for polr and the