I was working on a query where I had to convert a complex MySql query which was providing 95th percentile value from a group_concat result.
Following is the hive query to do the same on a simple sample table.
Sample Table:
hive> describe coll;
col_name data_type
proc_id int
status string
Sample Data:
hive> select * from coll;
proc_id status
53 stopped
56 stopped
1 started
2 started
52 stopped
4 started
59 stopped
29 stopped
13 stopped
55 stopped
54 stopped
63 stopped
8 stopped
9 stopped
51 stopped
61 stopped
69 stopped
6 stopped
23 stopped
57 stopped
3 started
11 stopped
7 stopped
66 stopped
12 stopped
67 stopped
How percent_rank() works?
Query: select status, proc_id, percent_rank() over (PARTITION BY status ORDER BY proc_id DESC) as proc_rank_desc from coll;
Result:
Status proc_id proc_rank_desc
started 4 0.0
started 3 0.3333333333333333
started 2 0.6666666666666666
started 1 1.0
stopped 69 0.0
stopped 67 0.047619047619047616
stopped 66 0.09523809523809523
stopped 63 0.14285714285714285
stopped 61 0.19047619047619047
stopped 59 0.23809523809523808
stopped 57 0.2857142857142857
stopped 56 0.3333333333333333
stopped 55 0.38095238095238093
stopped 54 0.42857142857142855
stopped 53 0.47619047619047616
stopped 52 0.5238095238095238
stopped 51 0.5714285714285714
stopped 29 0.6190476190476191
stopped 23 0.6666666666666666
stopped 13 0.7142857142857143
stopped 12 0.7619047619047619
stopped 11 0.8095238095238095
stopped 9 0.8571428571428571
stopped 8 0.9047619047619048
stopped 7 0.9523809523809523
stopped 6 1.0
Final Query:
select q.status, min(q.proc_id) as proc_id
from (
select * from (select status, proc_id, percent_rank() over (PARTITION BY status ORDER BY proc_id DESC) as proc_rank_desc
from coll
) ranked_table
where ranked_table.proc_rank_desc >=0.95) q
group by q.status;
Final Result:
status proc_id
started 1
stopped 6