Understanding a Hive query plan

I have a query and its associated query plan (see gist) for simulated data.

The table lte_data_tenmillion has 10000000 rows. The subscriber data table has 100000 rows.

For both tables none of the rows have a null value in the subscriber_id column.

I find it difficult to understand why the query plan shows the number of rows scanned (after applying predicate: subscriber_id is not null (type: boolean)) as exactly half the original number of rows.

The same is true of the filter operator for the subscriber table.

Also, the total number of rows of the resulting data, as mentioned under "File Output Operator [FS_20]" is 5500000. However the actual number of rows in the resulting table is 2499723.

I may be misinterpreting the query plan. I would highly appreciate it if someone could clear up the inconsistencies I observe between the query plan and the actual result.



The statistics are stale. Analyze each table using the analyze table <table name> compute statistics; command, then check the plan again. Also add

set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

before the explain command.
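
Put together, the sequence might look like the sketch below. The table name subscriber_data and the join query are assumptions for illustration (the real query is in the gist). The for columns variant is an addition to the command above: column-level statistics such as null counts are what hive.stats.fetch.column.stats reads. Without them, the halved row counts and the 5500000 figure look consistent with Hive's fallback estimates (half the input rows for a filter, and the larger join input times the default hive.stats.join.factor of 1.1), though that depends on the Hive version.

```sql
-- Table-level statistics (row counts, data size).
-- lte_data_tenmillion is named in the question; subscriber_data is an assumed name.
analyze table lte_data_tenmillion compute statistics;
analyze table subscriber_data compute statistics;

-- Column-level statistics (null counts, distinct values).
-- Omitting the column list assumes a Hive version that supports it;
-- otherwise list the columns explicitly, e.g. "for columns subscriber_id".
analyze table lte_data_tenmillion compute statistics for columns;
analyze table subscriber_data compute statistics for columns;

-- Let the planner use the collected statistics.
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

-- Hypothetical join; substitute the query from the gist.
explain
select l.*
from lte_data_tenmillion l
join subscriber_data s
  on l.subscriber_id = s.subscriber_id;
```

Even with fresh statistics, the row counts in the plan remain estimates, so they need not match the actual result of 2499723 rows exactly.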
