|
Hi, I am developing a project that deals with sports data. I am receiving xml feeds every seconds, and these xml will be parsed and save it into the ms sql database. My problem is the database size/records are increasing everyday, almost millions of records are save everyday and in just 2 days the database backup is 1GB. I am concern that it will effect the performance of the web application. There will be a page that it will query the database every seconds. Please advice. Thanks, -Ron |
|
Read up on data warehousing, specifically, OLTP vs OLAP. Essentially, you're storing millions of 'historical' records—not transactional data. The typical OLTP approach and schema: normalized, entity-relationship, while good for quick updates and immediate, up-to-date queries won't fit your use case. You should be looking at a multi-dimensional, star schema that favors the types of reports that you want to be generating from all that data. Simplistic example: instead of querying 1 million rows for one day's worth of total sales (then performing an expensive If your concern is about how manage the (growing) database size—then you really have to choice but to invest in large(r) scale storage and backup systems. Yes, 1TB drives for consumer desktops are cheap but if you want a highly-available, redundant, high-performance applications then you really should be looking at storage-area networks, RAID arrays, and tape backup. Would be ok to just use one database for this? One database instance? One database partition/shard? One database server? It depends. I would guess that you can achieve acceptable performance on write-heavy, sequential, read-only data even with just one database instance, even without sharding. For high-availability and performance purposes, though, it might help to have two servers acting in a cluster, or in a master-slave failover configuration at least. One database instance. I have one dedicated windows server 2008 R2 that have SQL Server 2008 and IIS 7 installed. I also have a web application that receives xml feeds from the provider. My web app is responsible in creating xml files based on the feeds (runs every seconds). Then on the background I have a running console application that will process this xml files and save it into the database and it runs forever. Now my database size is 2-3GB for just a few days. My concern is the performance the db query might be slow as the data grows. Do you think db archiving seems to be the solution in order to trim down the historical data? What are the strategies in archiving the database? Check your math. Let's just say your database grows at 2GB per day. That means it should take ~500 days before it fills up 1 TB. In any case—have you looked at star schema, aggregate tables, like I mentioned? Anyone will experience performance issues querying tables with millions of rows. Sometimes adding indexes will help, but sometimes they end up slowing down inserts/updates. So the trick is not to query millions of rows but find some way to transform, aggregate, and/or cache frequently queried data even if you end up with a denormalized database. Yes, you should be looking at database archiving, whether or not your data grows at 2GB/day. The only question is how often should you archive—and how long can/should you keep historical data before archiving it? Another solution is to use two databases or instances, one 'fast' for queries/reports and a larger, 'slow' one to hold historical data. Finally, with regards to your background console app hogging the memory—frankly, I would run it in its own machine, and paralellize or tune the hell out of it, but that's another question for another day. You have two main processes. 1 process and save the data on your datastore 2 reporting on the data through the web application The first and second task you can devote its own server to. Separating the datastore/database to the web and xml processing server. If you profile the kinds of query that your web app users will mostly do then you can optimize for that. Here Alistairs recommendations of using aggregation/summary tables and the like will probably help your performance. If you are doing this for the long haul then design your system for that. Thanks guys
showing 5 of 8
show all
|