Understanding a Hive query plan
I have a query and its associated query plan (see gist) for simulated data.
The table lte_data_tenmillion has 10000000 rows. The table subscriber data has 100000 rows.
For both tables none of the rows have a null value in the subscriber_id column.
I'm finding it difficult to understand why the query plan shows the number of rows scanned (after applying predicate: subscriber_id is not null (type: boolean)) as exactly half the original number of rows.
The same happens with the filter operator for the subscriber table.
Also, the total number of rows in the result, as shown under "File Output Operator [FS_20]", is 5500000. However, the actual number of rows in the resulting table is 2499723.
I might be interpreting the query plan wrongly. I would highly appreciate it if someone could explain the inconsistencies I observe between the query plan and the actual result.
The statistics are not fresh. Analyze each table using the analyze table <table name> compute statistics; command, then check the plan again. Also add
set hive.stats.fetch.column.stats=true; set hive.stats.fetch.partition.stats=true;
before the explain command.
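Put together, the steps above might look like the sketch below. The row counts in an explain plan are optimizer estimates rather than measured counts, so stale or missing statistics can make them differ from the real result. A halving after a simple filter would be consistent with the optimizer falling back on a default selectivity guess when it has no column statistics, though that is an assumption about this particular plan. The table name subscriber_data and the join query are placeholders, since the real query is only in the gist. The FOR COLUMNS statements are an addition not mentioned above, included because hive.stats.fetch.column.stats relies on column-level statistics being present.

```sql
-- Refresh table-level statistics (row counts, data size) for both tables.
ANALYZE TABLE lte_data_tenmillion COMPUTE STATISTICS;
ANALYZE TABLE subscriber_data COMPUTE STATISTICS;      -- placeholder table name

-- Assumption: also gather column-level statistics so the optimizer can
-- estimate the selectivity of predicates such as "subscriber_id is not null".
ANALYZE TABLE lte_data_tenmillion COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE subscriber_data COMPUTE STATISTICS FOR COLUMNS;

-- Let the planner read column and partition statistics from the metastore.
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Placeholder query: substitute the actual query from the gist.
EXPLAIN
SELECT l.*
FROM lte_data_tenmillion l
JOIN subscriber_data s
  ON l.subscriber_id = s.subscriber_id;
```

After this, the estimated row counts in the plan should be compared again with the actual count of 2499723. Some difference may remain, because join cardinality is still an estimate.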