David Ciemiewicz (JIRA | 1 Jun 06:56 2009

[jira] Created: (PIG-826) DISTINCT as "Function" rather than statement - High Level Pig

DISTINCT as "Function" rather than statement - High Level Pig
-------------------------------------------------------------

                 Key: PIG-826
                 URL: https://issues.apache.org/jira/browse/PIG-826
             Project: Pig
          Issue Type: New Feature
            Reporter: David Ciemiewicz

In SQL, a user would think nothing of doing something like:

{code}
select
    COUNT(DISTINCT(user)) as user_count,
    COUNT(DISTINCT(country)) as country_count,
    COUNT(DISTINCT(url)) as url_count
from
    server_logs;
{code}

But in Pig, we'd need to do something like the following.  And this is about the most
compact version I could come up with.

{code}
Logs = load 'log' using PigStorage()
        as ( user: chararray, country: chararray, url: chararray);

DistinctUsers = distinct (foreach Logs generate user);
DistinctCountries = distinct (foreach Logs generate country);
DistinctUrls = distinct (foreach Logs generate url);
(Continue reading)

David Ciemiewicz (JIRA | 1 Jun 18:27 2009

[jira] Updated: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig


     [
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-826:
---------------------------------

    Summary: DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig  (was:
DISTINCT as "Function" rather than statement - High Level Pig)


Alan Gates (JIRA | 2 Jun 21:35 2009

[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig


    [
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715639#action_12715639
] 

Alan Gates commented on PIG-826:
--------------------------------

It can be done like this:

{code}
Logs = load 'log' using PigStorage()
        as ( user: chararray, country: chararray, url: chararray);

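-- Put every row into a single group so the counts cover the whole input.
Grouped = group Logs all;
-- Inside the nested foreach, distinct runs on each projected column of the bag.
Counts = foreach Grouped {
           duser = distinct Logs.user;
           dcountry = distinct Logs.country;
           durl = distinct Logs.url;
           generate COUNT(duser), COUNT(dcountry), COUNT(durl);
};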
{code}
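
To make clear what the nested foreach above computes, here is a minimal plain-Python model of it. It is only a sketch of the semantics, not Pig code: the sample rows and the helper name `distinct_counts` are made up for illustration, and null handling is ignored.

```python
# Hypothetical sample rows standing in for the 'log' relation: (user, country, url).
logs = [
    ("alice", "US", "/home"),
    ("bob", "US", "/home"),
    ("alice", "DE", "/about"),
    ("carol", "US", "/home"),
]


def distinct_counts(rows):
    """Return the number of distinct values in each column of rows."""
    # zip(*rows) transposes the rows into columns, one per field;
    # set() plays the role of DISTINCT and len() the role of COUNT.
    return tuple(len(set(column)) for column in zip(*rows))


user_count, country_count, url_count = distinct_counts(logs)
print(user_count, country_count, url_count)  # prints "3 2 2"
```

Each column is deduplicated independently, which is why a single pass over the grouped bag can produce all three counts.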


Amr Awadallah (JIRA | 2 Jun 23:01 2009

[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig


    [
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715680#action_12715680
] 

Amr Awadallah commented on PIG-826:
-----------------------------------

neat.


David Ciemiewicz (JIRA | 3 Jun 00:34 2009

[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig


    [
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715726#action_12715726
] 

David Ciemiewicz commented on PIG-826:
--------------------------------------

Alan, thanks!  But what if I want to do the following:

{code}
foreach Grouped {
           dcountryurl = distinct Logs.(country,url);
           generate COUNT(dcountryurl);
};
{code}

Projecting multiple aliases doesn't seem to work. I also tried the following and it doesn't work either.

{code}
foreach Grouped {
           dcountryurl = distinct Logs.country, Logs.url;
           generate COUNT(dcountryurl);
};
{code}
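
For reference, the result being asked for here is the number of distinct (country, url) pairs, which is not derivable from the per-column counts. A small plain-Python sketch of that intended semantics follows; the sample rows and the name `distinct_pair_count` are hypothetical, and this says nothing about which Pig syntax should express it.

```python
# Hypothetical sample rows standing in for the 'log' relation: (user, country, url).
logs = [
    ("alice", "US", "/home"),
    ("bob", "US", "/home"),
    ("alice", "DE", "/about"),
    ("dave", "DE", "/home"),
]


def distinct_pair_count(rows):
    """Count distinct (country, url) combinations across rows."""
    # Project each row to the (country, url) pair first, so the pair is
    # deduplicated as a unit rather than column by column.
    return len({(country, url) for _user, country, url in rows})


print(distinct_pair_count(logs))  # prints "3"
```

In this sample there are only 2 distinct countries and 2 distinct urls, yet 3 distinct pairs, so the projection of both columns has to happen before the distinct.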


Mridul Muralidharan (JIRA | 3 Jun 01:22 2009

[jira] Commented: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig


    [
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715745#action_12715745
] 

Mridul Muralidharan commented on PIG-826:
-----------------------------------------

This would be a welcome change!
Another use case this would enable (one that, IMO, can't be done easily now) is using DISTINCT in a filter.

For example:

B = FILTER A by COUNT(DISTINCT($1)) > 1;
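
One possible reading of the proposed filter, sketched in plain Python: assume each row of A is a (key, bag) pair and that $1 refers to the bag, so the filter keeps rows whose bag holds more than one distinct value. Both the row layout and the helper name `filter_by_distinct_count` are assumptions made for illustration; the comment itself does not spell them out.

```python
# Hypothetical relation A: each row is (key, bag); the bag is the field Pig calls $1.
A = [
    ("k1", ["x", "x", "x"]),
    ("k2", ["x", "y"]),
    ("k3", []),
]


def filter_by_distinct_count(rows, threshold=1):
    """Keep rows whose bag contains more than threshold distinct values."""
    return [row for row in rows if len(set(row[1])) > threshold]


B = filter_by_distinct_count(A)
print(B)  # prints "[('k2', ['x', 'y'])]"
```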


