What would be the right steps for horizontal partitioning in Postgresql?
We have an E-commerce portal with a Postgresql 9.1 database. One very important table has at the moment 32 million records. If we want to deliver all items this table would grow to 320 million records, mostly dates. Which would be to heavy.
So we are thinking about horizontal partitioning / sharding. We can divide items in this table into 12 pieces horizontal (1 per month). What would be the best steps and technics to do so? Would horizontal partitioning within the database be good enough or do we have to start thinking about sharding?
While 320 million is not small, it's not really huge either.
It largely depends on the queries you run on the table. If you always include the partition key in your queries then "regular" partitioning would probably work.
An example for this can be found in the PostgreSQL wiki: http://wiki.postgresql.org/wiki/Month_based_partitioning
The manual also explains some of the caveats of partitioning: http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html
If you are thinking about sharding, you might read how Instagram (which is powered by PostgreSQL) has implemented that:
If you have mostly read-queries, another option might be to use streaming replication to setup multiple servers and distribute the read queries by connecting to the hot-standby for read access and connecting to the master for write access. I think pg-pool II can do that (somewhat) automatically. That can be combined with partitioning to further reduce the query runtime.
If you are adventurous and don't have very immediate needs to do so, you might also consider Postgres-XC which promises to support transparent horizontal scaling: http://postgres-xc.sourceforge.net/
There is no final release yet, but it looks like this isn't taking too long
Here is my sample code for partitioning: t_master is a view to be select/insert/update/delete in your application t_1 and t_2 is the underlying tables actually storing the data.
create or replace view t_master(id, col1) as select id, col1 from t_1 union all select id, col1 from t_2 CREATE TABLE t_1 ( id bigint PRIMARY KEY, col1 text ); CREATE TABLE t_2 ( id bigint PRIMARY KEY, col1 text ); CREATE OR REPLACE FUNCTION t_insert_partition_function() returns TRIGGER AS $$ begin raise notice '%s', 'hello'; execute 'insert into t_' || ( mod(NEW.id, 2)+ 1 ) || ' values ( $1, $2 )' USING NEW.id, NEW.col1 ; RETURN NULL; end; $$ LANGUAGE plpgsql; CREATE OR REPLACE FUNCTION t_update_partition_function() returns TRIGGER AS $$ begin raise notice '%s', 'hello'; execute 'update t_' || ( mod(NEW.id, 2)+ 1 ) || ' set id = $1, col1 = $2 where id = $1' USING NEW.id, NEW.col1 ; RETURN NULL; end; $$ LANGUAGE plpgsql; CREATE OR REPLACE FUNCTION t_delete_partition_function() returns TRIGGER AS $$ begin raise notice '%s', 'hello'; execute 'delete from t_' || ( mod(OLD.id, 2)+ 1 ) || ' where id = $1' USING OLD.id; RETURN NULL; end; $$ LANGUAGE plpgsql; CREATE TRIGGER t_insert_partition_trigger instead of INSERT ON t_master FOR each row execute procedure t_insert_partition_function(); CREATE TRIGGER t_update_partition_trigger instead of update ON t_master FOR each row execute procedure t_update_partition_function(); CREATE TRIGGER t_delete_partition_trigger instead of delete ON t_master FOR each row execute procedure t_delete_partition_function();
If you don't mind upgrading to PostgreSQL 9.4, then you could use the pg_shard extension, which lets you transparently shard a PostgreSQL table across many machines. Every shard is stored as a regular PostgreSQL table on another PostgreSQL server and replicated to other servers. It uses hash-partitioning to decide which shard(s) to use for a given query. pg_shard would work well if your queries have a natural partition dimension (e.g., customer ID).
More info: https://github.com/citusdata/pg_shard