10 February 2007 - 3:55Handling large data sets in SQL
For the past week I have been working on an SQL script that was updating a few dozens millions of records in a table. It was very interesting and it was my first occasion at handling data sets so large. I first suggested Tom Kyte’s famous approach to updating very large tables, but given the fact that the team never tried something like this and that the table was so large they decided to play safe and do the job with regular update statements.
What did I learn while working on this?
1) Update in small chunks otherwise your logs will blow-up your partition. You have to find a way to split your data in chunks that you know will not explode your rollback segments.
2) Use indexes very carefully!!!! In my case I needed to build some hash temporary tables in order to hit some indexes, and in order to build those tables efficiently I needed to figure out how to hit other indexes. The path carved thru the data was pretty interesting…
3) Always determine the execution plan before carrying out the query in order to see what indexes you are hitting. A ’set showplan on, set noexec on’ should do the job on Sybase.
4) Know your data. Among other things it will help you find the appropriate indexes. It will also help you choose between different queries, check out this for an example of 2 different queries which return the same result but are tuned for different data sets.
5) This is not really related to SQL, but I thought I should say it: Don’t bother to optimize your data structures in Java unless you really have to. Don’t spend hours debating whether you should use a TreeMap vs a HashMap for a Map that will hold at best a few hundred items, do this when you are handling large data sets. The reason for this is that optimizing a data structure for access is so dependent on data (on how it is dispersed, on what you are querying, on how this data will grow, etc…) that you don’t have enough input for a small data set.
6) Efficient data access is tied to the data itself very strongly.
7) Prepare for this. I have been reading for the past months asKTom, Jonathan Lewis’s blog and Tom Kyte’s blog. These are good resources for starting.
Gotta go!!
Later Edit: While reading this post I started thinking about the duality of data vs functionality. Most developers spend most of their time dealing with functionality, I think they are missing out on a few interesting things. Another thought would be that if you have a problem regarding data the best person to talk about it would probably be a DBA, it probably has seen data more data than you can handle in a wide variety of formats. Just a thought.
No Comments | Tags: Development, Things for me