Loading a very large dataset without losing any information
-
Hi all; I have a large dataset stored in a CSV file (about 40,000 rows and 10,000 columns). I need to load it into a C# Windows application. Any ideas on how to do this? I have tried different code: some can load 40,000 rows x 255 columns, and other code can load 5,000 rows and 10,000 columns. Thanks
losan1985
-
It's probably going to take rolling your own custom class to hold it all. I don't know of anything "off-the-shelf" that will hold 10,000 columns. Frankly, I've never even HEARD of such a wide CSV file being used. It shouldn't be very hard at all to create a List&lt;List&lt;int&gt;&gt;, or whatever your item data type is: basically, a List of Lists of integers.
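A minimal sketch of that approach, streaming the file line by line into a List of Lists (the integer cell type and comma delimiter are assumptions about the data; even so, at 400 million cells this needs a 64-bit process and plenty of RAM):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class CsvLoader
{
    // Reads every row of a CSV into a List<List<int>>.
    // Assumes comma-separated integer cells; adjust the parse for your data.
    public static List<List<int>> Load(string path)
    {
        var rows = new List<List<int>>();
        foreach (var line in File.ReadLines(path)) // streams, no full-file buffer
        {
            var cells = line.Split(',');
            var row = new List<int>(cells.Length);
            foreach (var cell in cells)
                row.Add(int.Parse(cell));
            rows.Add(row);
        }
        return rows;
    }
}
```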
A guide to posting questions on CodeProject[^]
Dave Kreskowiak -
You probably can't load it all at once (at least on a 32-bit OS). 40,000 x 10,000 = 400,000,000 cells, which is 400,000,000 bytes if each cell is 1 byte. If you assume an average of 16 bytes per cell (since you didn't say), that's 6,400,000,000 bytes, or about 6 GB of data, and you only have 2 GB of address space for your application. You can do it on a 64-bit OS, though. That said, I doubt you really need 40,000 x 10,000 cells loaded in memory at once. What is a person going to do with all that data? You might want to consider loading only the portion you need.
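The arithmetic above, spelled out (16 bytes per cell is an assumed average, not a measured figure):

```csharp
using System;

// Back-of-the-envelope memory estimate for a 40,000 x 10,000 grid.
long rows = 40_000, cols = 10_000;
long cells = rows * cols;                     // 400,000,000 cells
long bytes = cells * 16;                      // at 16 bytes/cell: 6,400,000,000 bytes
double gib = bytes / (1024.0 * 1024 * 1024);  // roughly 6 GB
Console.WriteLine($"{cells:N0} cells, about {gib:F1} GB at 16 bytes/cell");
```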
-
What is your actual requirement? You are loading this data for a reason. What is that reason? For instance, are you performing some calculation on certain columns? By breaking down your requirements, we can work out a practical solution.
I was brought up to respect my elders. I don't respect many people nowadays.
CodeStash - Online Snippet Management | My blog | MoXAML PowerToys | Mole 2010 - debugging made easier -
As POH has said, your design has to be wrong for this to be a valid requirement. Go back and look at how the CSV was created: why does it require 10,000 columns (what a ridiculous number)? Can your source break it up into more digestible chunks? Do you need all 10,000 columns? Can you load and process one row at a time? Presumably you want to dump this into some more reasonable format.
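Processing one row at a time could be sketched like this; the Action delegate is a placeholder for whatever per-row work (filtering, aggregating, writing to a database) is actually needed:

```csharp
using System;
using System.IO;

class RowStreamer
{
    // Walks the CSV one line at a time; only the current row is ever
    // held in memory. Returns the number of rows processed.
    public static int ProcessFile(string path, Action<string[]> processRow)
    {
        int count = 0;
        foreach (var line in File.ReadLines(path)) // lazy enumeration
        {
            processRow(line.Split(','));
            count++;
        }
        return count;
    }
}
```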
Never underestimate the power of human stupidity RAH
-
You'll need to build in a sort of paging mechanism that only loads the part that is shown on the screen.
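A paging read might look like this sketch, where only pageSize rows are materialized per call (the page layout is an assumption; a production version would cache line offsets rather than re-skipping from the top of the file on every call):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

class CsvPager
{
    // Loads only the rows for one "page" of the display.
    // Skip/Take re-reads from the start each call; caching the byte
    // offsets of line starts would make repeated paging much faster.
    public static List<string[]> LoadPage(string path, int pageIndex, int pageSize)
    {
        return File.ReadLines(path)
                   .Skip(pageIndex * pageSize)
                   .Take(pageSize)
                   .Select(line => line.Split(','))
                   .ToList();
    }
}
```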
-
As Sledgehammer01 says, that's an unreasonably large amount of data for most purposes. It's 400 million cells, so you're talking about gigabytes of memory, depending on exactly what's in there. What do you want to do with this dataset? You almost certainly want a load-on-demand adapter of some kind, so you can run through the data without actually having it all in memory at once. This library is rather good; I used it in a real application (though not one dealing with massive datasets) without problems.
-
Hi losan, I found your post very interesting because I've never encountered a data set that large. Are you trying to analyze that data? If so, I may be able to help.

I have a product (www.patternscope.com) that finds patterns in extremely large data sets. I think your data set would be good for stress-testing the application, and it fits perfectly with two planned developments: 1. Reading CSV data (currently it only reads databases through ODBC, or flat files), and 2. Making a C#-callable API that you could use in your C# application to handle that much data (e.g. queries, retrieval, and analysis).

My product extracts the patterns that comprise the raw data. These patterns are a fraction of the size of the original raw data, so they fit entirely into memory, even when the raw data is larger than the memory available. The patterns have the same information content as the raw data, so they can be processed (e.g. queries or analysis) many times faster.

If you could give me a copy of your data set, I could give you a free copy of PatternScope (after I adapt it for reading CSV data), which you could use to analyze the data, followed by a DLL you could call from C# for processing the data in your program. What does this data represent?
-
Hi; here is a link to the dataset: "www.dropbox.com/s/een9zlqce4vqqrl/ProjectData3.csv". What I need to do is apply collaborative filtering algorithms to the dataset. The dataset is about tweets: who is going to retweet from another person. Thanks
losan1985
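For what it's worth, user-based collaborative filtering usually starts from a similarity measure between user rows; a minimal cosine-similarity sketch follows (treating each row as a vector of retweet counts is an assumption about this data):

```csharp
using System;

class Similarity
{
    // Cosine similarity between two users' rows: 1.0 means identical
    // direction, 0.0 means no overlap. A common building block for
    // user-based collaborative filtering.
    public static double Cosine(int[] a, int[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```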
-
Thanks. When I've adapted PatternScope for comma-separated values, I'll send you a copy. Collaborative filtering looks interesting.