Thursday, October 2, 2014

Hive Query to get 95th Percentiled ranked item

I was working on a query where I had to convert a complex MySql query which was providing 95th percentile value from a group_concat result.

Following is the hive query to do the same on a simple sample table.

 

Sample Table:

hive> describe coll;

 

col_name            data_type      

proc_id                 int                    

status                    string                 

 

Sample Data:

hive> select * from coll;

proc_id                status

53                           stopped

56                           stopped

1                              started

2                              started

52                           stopped

4                              started

59                           stopped

29                           stopped

13                           stopped

55                           stopped

54                           stopped

63                           stopped

8                              stopped

9                              stopped

51                           stopped

61                           stopped

69                           stopped

6                              stopped

23                           stopped

57                           stopped

3                              started

11                           stopped

7                              stopped

66                           stopped

12                           stopped

67                           stopped

 

How percent_rank() works?

Query: select status, proc_id, percent_rank() over (PARTITION BY status ORDER BY proc_id DESC) as proc_rank_desc from coll;

Result:

Status          proc_id     proc_rank_desc

started                4         0.0

started                3         0.3333333333333333

started                2         0.6666666666666666

started                1         1.0

stopped               69      0.0

stopped               67      0.047619047619047616

stopped               66      0.09523809523809523

stopped               63      0.14285714285714285

stopped               61      0.19047619047619047

stopped               59      0.23809523809523808

stopped               57      0.2857142857142857

stopped               56      0.3333333333333333

stopped               55      0.38095238095238093

stopped               54      0.42857142857142855

stopped               53      0.47619047619047616

stopped               52      0.5238095238095238

stopped               51      0.5714285714285714

stopped               29      0.6190476190476191

stopped               23      0.6666666666666666

stopped               13      0.7142857142857143

stopped               12      0.7619047619047619

stopped               11      0.8095238095238095

stopped               9       0.8571428571428571

stopped               8       0.9047619047619048

stopped               7       0.9523809523809523

stopped               6       1.0

 

Final Query:

select q.status, min(q.proc_id) as proc_id

from (

select * from (select status, proc_id, percent_rank() over (PARTITION BY status ORDER BY proc_id DESC) as proc_rank_desc

from coll

           ) ranked_table

where ranked_table.proc_rank_desc >=0.95) q

group by q.status;

Final Result:

status                    proc_id

started                 1

stopped               6