How does Hive still count data after it has been deleted from HDFS?

I have an external table in Hive that points to an HDFS location. By mistake, I ran the job that loads data into HDFS twice.

Even after deleting the duplicate file from HDFS, Hive still reports double the row count (i.e. it includes the rows from the deleted duplicate file).

select count(*) from tbl_name -- returns double the actual count

But:

select count(col_name) from tbl_name -- returns the actual count.

When I queried the same table from Impala after running

INVALIDATE METADATA

I saw only the count of the data actually present in HDFS (no duplicates).

How can Hive return double the count even after the file has been deleted from its physical location (HDFS)? Does it read from statistics?

Answers


Hive is using statistics to compute count(*). You deleted the files manually (not through Hive), which is why the stats are wrong.

The solution is either:

  1. to switch off statistics usage in such cases:

    set hive.compute.query.using.stats=false;

  2. or to re-analyze the table, as you mention in your comment, so that the statistics match the files again:

    analyze table tbl_name partition(a,b,c) compute statistics;
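
To check that stale statistics really are the cause, the two options above can be combined into one session, roughly as sketched below. This is a sketch, not a verified transcript: it reuses the placeholder names `tbl_name` and `a,b,c` from this thread, and it assumes the table is partitioned on those columns. Which statistics `describe formatted` shows, and where, may vary with your Hive version.

```sql
-- Show the metadata Hive has stored for the table; statistics such as
-- numRows, numFiles and totalSize appear among the listed parameters.
describe formatted tbl_name;

-- Make this session scan the files instead of answering from stored stats.
set hive.compute.query.using.stats=false;
select count(*) from tbl_name;   -- should now reflect the files actually in HDFS

-- Recompute the stats so they match the files again
-- (a,b,c are the placeholder partition columns from the answer above).
analyze table tbl_name partition(a,b,c) compute statistics;

-- Re-enable stats-based answers and confirm both counts agree.
set hive.compute.query.using.stats=true;
select count(*) from tbl_name;
```

If the two counts still differ after the analyze step, the mismatch probably has some cause other than stale statistics.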

