SQL Select to intersect records by at least one common attribute value?

I have a table with two fields:

+-----+---------+
| id  | feature |
+-----+---------+
| x1  |  f1     |
| x1  |  f5     |
| x2  |  f3     |
| x3  |  f1     |
| x4  |  f5     |
| x4  |  f2     |
| x5  |  f3     |
| x6  |  f4     |
+-----+---------+

Questions:

1) How do I write a SELECT that groups ids into sets sharing the same feature, like this: S1 = {x1, x3}, S2 = {x1, x4}, S3 = {x2, x5}?

2) How do I write a SELECT that returns every set of ids that intersect by at least one feature? In this example the result should be: S5 = {x1, x3, x4} and S6 = {x2, x5}

3) It would also be great to know the query format for Hadoop Hive, which supports a basic subset of SQL.

Answers


The easiest way to do the second is probably a self join, which is feasible if the dataset isn't too large:

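-- Pair up ids that share at least one feature, listing the shared features.
-- "feature" exists in both t1 and t2, so it must be qualified.
SELECT t1.id AS id1, t2.id AS id2, collect_set( t1.feature ) AS features
FROM
  ( SELECT id, feature FROM mytable ) t1
JOIN
  ( SELECT id, feature FROM mytable ) t2
ON
  ( t1.feature = t2.feature )
WHERE
  t1.id < t2.id  -- skip self-pairs and mirrored duplicates
GROUP BY t1.id, t2.id;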

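Be sure to include the WHERE clause: it drops each id paired with itself and keeps only one ordering of every pair, which cuts the output roughly in half. Note that the result is a list of id pairs that share a feature; merging those pairs into the full sets S5 and S6 takes a further step.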


For question 1, group by feature and aggregate the ids with the collect_set or collect_list function.
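
A minimal sketch of what that could look like in Hive, assuming the table is called mytable as in the other answer. The HAVING filter is an addition that is not in the original suggestion; it drops features owned by a single id, since the sets in the question all have at least two members.

```sql
-- One row per feature, with the distinct ids that have that feature.
SELECT feature, collect_set( id ) AS ids
FROM mytable
GROUP BY feature
HAVING count( DISTINCT id ) > 1;  -- keep only features shared by 2+ ids
```

On the sample data this should leave the groups for f1, f3 and f5, which correspond to S1, S3 and S2 in the question. The order of ids inside each collected array is not guaranteed.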

(Protip: One question at a time works best on StackOverflow.)

