When we have very large number of data records with one or more associated key, then it is termed as Skewed on that key.
The problem with other join starts when the data is skewed.
In the skewed join, the data is scanned and the key with the largest value is detected and instead of processing those keys they are temporarily stored in HDFS and then map- reducer process those skewed keys.
Let’s say we have table A and table B with skewed data “naren” in the joining column. Suppose table B has less row with skewed data then table A.
Then, first of all, we have to scan table B and save all the rows with the key “naren” in memory hash table.
In order to read the table, we use set of the mapper.
- If it has skewed key “naren” it will use a hashed version of B for the join.
- For other keys, send the rows to reducer to perform join. The same reducer will get rows from mapper scanning table B.
- The skewed key in table A is processed by a mapper and perform map Side join and for rest key, they use common join.To perform Skewed join we need to set
To perform Skewed join we need to set parameter hive.optimize.skewjoin to true. This parameter is optional and is equal to 100000 by default.