How to load data into the same Hive table when files have different numbers of columns

I have a main table (Employee) with 10 columns, and I can load data into it using LOAD DATA INPATH '/file1.txt' INTO TABLE Employee.

My question is how to load into the same table (Employee) when my file, file2.txt, has the same columns except that columns 3 and 5 are missing. If I load the file directly, the values shift left and the last two columns end up as NULL. Instead, columns 3 and 5 should be loaded as NULL.

Suppose I have a table Employee and I want to load both file1.txt and file2.txt into it.

file1.txt
==========
id  name  sal   deptid  state  country
1   aaa   1000  01      TS     india
2   bbb   2000  02      AP     india
3   ccc   3000  03      BGL    india


file2.txt
==========
id  name    deptid  country
1   second  001     US
2   third   002     ENG
3   forth   003     AUS

file2.txt is missing two columns: sal and state.

We need to use the same Employee table. How can we handle this?

Answers


I'm not aware of any way to create a table backed by data files with a non-homogeneous structure. What you can do, however, is define separate tables for the different column configurations and then define a view that queries both.

I think it's easier if I provide an example. I will use two tables of people. Both have a column for name, but one also stores height, while the other stores weight instead:

> create table table1(name string, height int);
> insert into table1 values ('Alice', 178), ('Charlie', 185);

> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);

> create view people as
>     select name, height, NULL as weight from table1
>   union all
>     select name, NULL as height, weight from table2;

> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   | 178    | NULL   |
| Bob     | NULL   | 98     |
| Charlie | 185    | NULL   |
| Denise  | NULL   | 52     |
+---------+--------+--------+

Or, as a closer example to your problem, let's say that one table has name, height and weight, while the other only has name and weight, so height is "missing from the middle":

> create table table1(name string, height int, weight int);
> insert into table1 values ('Alice', 178, 55), ('Charlie', 185, 78);

> create table table2(name string, weight int);
> insert into table2 values ('Bob', 98), ('Denise', 52);

> create view people as
>     select name, height, weight from table1
>   union all
>     select name, NULL as height, weight from table2;

> select * from people order by name;
+---------+--------+--------+
| name    | height | weight |
+---------+--------+--------+
| Alice   | 178    | 55     |
| Bob     | NULL   | 98     |
| Charlie | 185    | 78     |
| Denise  | NULL   | 52     |
+---------+--------+--------+

Be sure to use union all and not just union. Plain union also removes duplicate rows, which makes it much more expensive and would silently drop rows that legitimately repeat.
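
Applied to the Employee example from the question, the same pattern might look like the sketch below. The table and view names (employee_full, employee_partial, employee_all) are hypothetical. The column types, the tab delimiter and the absence of a header row in the files are assumptions, and only the six columns shown in the question's sample are used. The view needs a name different from any existing table, so it cannot simply be called Employee while a table of that name exists.

```sql
-- Hypothetical names; types and delimiter are assumptions, adjust to your data.
-- deptid is kept as STRING so that values like '001' keep their leading zeros.
CREATE TABLE employee_full (
  id INT, name STRING, sal INT, deptid STRING, state STRING, country STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE employee_partial (
  id INT, name STRING, deptid STRING, country STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Each file goes into the table whose layout matches it.
LOAD DATA INPATH '/file1.txt' INTO TABLE employee_full;
LOAD DATA INPATH '/file2.txt' INTO TABLE employee_partial;

-- The view presents both layouts with one column list;
-- the columns missing from file2.txt are filled with typed NULLs.
CREATE VIEW employee_all AS
  SELECT id, name, sal, deptid, state, country
  FROM employee_full
UNION ALL
  SELECT id, name, CAST(NULL AS INT) AS sal, deptid,
         CAST(NULL AS STRING) AS state, country
  FROM employee_partial;
```

Queries then read from employee_all instead of the individual tables. The explicit casts are a precaution so both branches of the union all agree on column types; a bare NULL, as in the examples above, may work as well.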


It seems like there is no way to directly load into specified columns.

As such, this is what you probably need to do:

  1. Use load data inpath to load the file into a (temporary?) table whose columns match the file.
  2. Insert into the relevant columns of the final table by selecting the contents of the previous table.

The situation is very similar to this question which covers the opposite scenario (you only want to load a few columns).
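
The two steps above could be sketched roughly as follows for file2.txt. The staging table name employee_stage is hypothetical, and the column types and tab delimiter are assumptions. The sketch also assumes that Employee has the six columns from the question's sample, in the order id, name, sal, deptid, state, country.

```sql
-- Step 1: a staging table shaped like file2.txt (hypothetical name, assumed types).
CREATE TABLE employee_stage (
  id INT, name STRING, deptid STRING, country STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Without LOCAL, LOAD DATA INPATH moves the HDFS file into the table's directory.
LOAD DATA INPATH '/file2.txt' INTO TABLE employee_stage;

-- Step 2: copy into the final table, supplying NULL for the columns
-- that file2.txt does not contain (sal and state).
INSERT INTO TABLE Employee
SELECT id, name, CAST(NULL AS INT) AS sal, deptid,
       CAST(NULL AS STRING) AS state, country
FROM employee_stage;

-- Optional cleanup; dropping a managed table also deletes its data files.
DROP TABLE employee_stage;
```

The select list has to follow the column order of Employee, since the insert matches columns by position rather than by name.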

