login about faq

Hi,

I am developing a project that deals with sports data. I am receiving xml feeds every seconds, and these xml will be parsed and save it into the ms sql database. My problem is the database size/records are increasing everyday, almost millions of records are save everyday and in just 2 days the database backup is 1GB. I am concern that it will effect the performance of the web application. There will be a page that it will query the database every seconds.

Please advice.

Thanks, -Ron

asked Dec 19 '11 at 18:58

Ron's gravatar image

Ron
112


Read up on data warehousing, specifically, OLTP vs OLAP.

Essentially, you're storing millions of 'historical' records—not transactional data. The typical OLTP approach and schema: normalized, entity-relationship, while good for quick updates and immediate, up-to-date queries won't fit your use case.

You should be looking at a multi-dimensional, star schema that favors the types of reports that you want to be generating from all that data.

Simplistic example: instead of querying 1 million rows for one day's worth of total sales (then performing an expensive SELECT SUM(...) ... GROUP BY(date) on that, you store one row containing the accumulated sales for each day (the date/time dimension) and query on that.


If your concern is about how manage the (growing) database size—then you really have to choice but to invest in large(r) scale storage and backup systems.

Yes, 1TB drives for consumer desktops are cheap but if you want a highly-available, redundant, high-performance applications then you really should be looking at storage-area networks, RAID arrays, and tape backup.

link

answered Dec 19 '11 at 20:46

Alistair%20A.%20Israel's gravatar image

Alistair A. Israel
3.1k210

edited Dec 19 '11 at 20:49

Would be ok to just use one database for this?

(Dec 20 '11 at 17:10) Ron Ron's gravatar image

One database instance? One database partition/shard? One database server?

It depends. I would guess that you can achieve acceptable performance on write-heavy, sequential, read-only data even with just one database instance, even without sharding.

For high-availability and performance purposes, though, it might help to have two servers acting in a cluster, or in a master-slave failover configuration at least.

(Dec 21 '11 at 11:07) Alistair A. Israel Alistair%20A.%20Israel's gravatar image

One database instance. I have one dedicated windows server 2008 R2 that have SQL Server 2008 and IIS 7 installed. I also have a web application that receives xml feeds from the provider. My web app is responsible in creating xml files based on the feeds (runs every seconds). Then on the background I have a running console application that will process this xml files and save it into the database and it runs forever. Now my database size is 2-3GB for just a few days. My concern is the performance the db query might be slow as the data grows.

(Dec 22 '11 at 20:30) Ron Ron's gravatar image

Do you think db archiving seems to be the solution in order to trim down the historical data? What are the strategies in archiving the database?

(Dec 22 '11 at 20:40) Ron Ron's gravatar image

Check your math. Let's just say your database grows at 2GB per day. That means it should take ~500 days before it fills up 1 TB.

In any case—have you looked at star schema, aggregate tables, like I mentioned? Anyone will experience performance issues querying tables with millions of rows. Sometimes adding indexes will help, but sometimes they end up slowing down inserts/updates. So the trick is not to query millions of rows but find some way to transform, aggregate, and/or cache frequently queried data even if you end up with a denormalized database.

(Dec 23 '11 at 10:18) Alistair A. Israel Alistair%20A.%20Israel's gravatar image

Yes, you should be looking at database archiving, whether or not your data grows at 2GB/day. The only question is how often should you archive—and how long can/should you keep historical data before archiving it?

Another solution is to use two databases or instances, one 'fast' for queries/reports and a larger, 'slow' one to hold historical data.

Finally, with regards to your background console app hogging the memory—frankly, I would run it in its own machine, and paralellize or tune the hell out of it, but that's another question for another day.

(Dec 23 '11 at 10:20) Alistair A. Israel Alistair%20A.%20Israel's gravatar image

You have two main processes.

1 process and save the data on your datastore 2 reporting on the data through the web application

The first and second task you can devote its own server to. Separating the datastore/database to the web and xml processing server.

If you profile the kinds of query that your web app users will mostly do then you can optimize for that. Here Alistairs recommendations of using aggregation/summary tables and the like will probably help your performance.

If you are doing this for the long haul then design your system for that.

(Dec 27 '11 at 15:03) AnGoL AnGoL's gravatar image

Thanks guys

(Dec 28 '11 at 10:17) Ron Ron's gravatar image
showing 5 of 8 show all
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or __italic__
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×19

Asked: Dec 19 '11 at 18:58

Seen: 1,016 times

Last updated: Dec 28 '11 at 10:17