![]() The other pain point when we are talking about database processing is the ORDER BY and functions that need to sort things(i.e Window functions) and we also have a feature to improve that, which is the SORTKEY, it is similar to the DISTKEY, it will grant that the data will be stored ordered by the chosen field/column. The leader node distributes using round-robin style.Ĭase you don’t define any of that styles the leader node will try distribute it in a optimal way according to the table size. The style is used when we have tables where we won’t use in JOINs and they are highly de-normalized. Examples of candidates tables are that ones storing City or country codes and names for standardization. In general, it is used for small tables(less than 2M rows) and tables that aren’t updated so often, once it are in differentes places(nodes) and update it will have a higher cost because we need to updated all those places and also all those copies will take more storage space. Imagine the following join with those two tables, using the fields we have chosen as key, the row where the values are, for example in the range of 0–100 will be in the same node/slice for both tables, so will not be necessary to ask to another node, avoiding the data movement.Īnother style we have is the ALL, this style will create a copy for the table in each node. ![]() CREATE TABLE person ( _id, name, last_name, address_id ) DISTSTYLE KEY DISTKEY (_id) CREATE TABLE employees ( person_id, salary, jobtitle) DISTSTYLE KEY DISTKEY (person_id) ![]() So we minimize the data movement and improve the parallel processing. See the sample below, where we have two tables and both are distributed based on a key(column), we are saying to Redshift to keep in the same slice the rows where the chosen column has the same values, in that case for both tables (using the fields person._id and employees.person_id) and on the query where we are joining the tables no further distribution will be needed, in other words, the Redshift knows that all the matches will happen into the slice/node so isn’t needed to one node communicate with other to obtain other rows because all the matches are in there. Using the DISTKEY is basically you letting RS know the way it will distribute the data(i.e a table) through the slices, in this case using a column as a KEY and storing the rows where the column has the same values physically in the same place, in the same slice. There are two major bottlenecks in databases processing, JOIN tables and SORT things, for those two we have the DISTKEY and SORTKEY Distribuition Styles
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |