Please help. Is multimap what I need?
-
Hi -- I've been banging my head trying to solve this problem. Of course, I'm a self-trained pseudo-programmer who just started to use the STL. I have a huge file with 10 million entries that are paired, say:

    1 8
    1 5
    1 3
    2 0
    2 4
    3 0
    3 8
    ...

They are stored as a two-dimensional array of 10 million rows and two columns... you get the idea of the data. I need to find the records in the file whose first entry matches one and only one given tag, and put the second entries in another vector, in the order found. For example, if I need records with the tag 2, I'll create the vector 0 4.

I suppose I need to use a multimap, but I don't know how to read the file and then store the results in a vector. Right now I have this simple code:

    typedef vector<double> DVECTOR;
    DVECTOR FDV;
    int tag = 0;
    double FD;
    int id = 3;
    int idp = id + 1;
    while(tag != idp)
    {
        GetData >> tag >> FD;
        cout << tag << " " << FD << endl;
        if(tag == id) FDV.push_back(FD);
    }

This does what I want by scanning the file from the beginning until it finds the target value and reading all the entries with that value. If the number is close to the first entry, cool, it's fast; but if it is near the end, it takes a long time (big program, repeated many times...).

The question is: would a multimap do the same task better and faster? If so... how do I do it? That is, read the file, store the data in a multimap, search for the tag number of interest, and copy all the values associated with that tag into a vector. If I have to read the whole 10-million-line file to put it in a map, then it's going to be hanging around in memory for further use (several times)... or should I read the file every time I need it? If so, then isn't my naive code more efficient?

By the way, I'm trying this other code using multimap, but it gives me horrible error messages:

    typedef multimap<int, double> DMMAP;
    DMMAP mapfish;
    while(!GetData.eof())
    {
        GetData >> tag >> FDS;
        mapfish.insert(make_pair(tag, FDS));
    }

argggggggggggggggggggggggggg X|

Thank you so much!
Carlos
-
Well, if I understand your scenario correctly, you have two valid alternatives:

- Scan the whole file into a multimap and use this later several times (a sketch follows right after this list).
- Improve your file searching by doing something like a binary search on it. This is a little complex, but doable. Broadly sketched: open the file and set the current offset to the middle of it, skip characters until you find a newline, and then read the entry. If the tag is id-1, keep on reading till you get to your target portion and proceed; if the tag is greater than that, set the offset to 1/4 of the file, else set it to 3/4, and proceed recursively, etc. (a second sketch follows after the estimate below).
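For the first alternative, something along these lines should work. The file name pairs.txt is just a placeholder for yours; note that the loop tests the extraction itself instead of eof() (the eof() test is the usual cause of a duplicated or bogus last entry), and that equal_range does the lookup:

    #include <fstream>
    #include <iostream>
    #include <map>
    #include <utility>
    #include <vector>
    using namespace std;

    typedef multimap<int, double> DMMAP;

    int main()
    {
        ifstream GetData("pairs.txt");   // placeholder file name

        DMMAP mapfish;
        int tag;
        double FD;

        // Read once: the extraction itself is the loop condition,
        // so a failed read at end of file ends the loop cleanly.
        while(GetData >> tag >> FD)
            mapfish.insert(make_pair(tag, FD));

        // As many times as you like afterwards: equal_range gives
        // the span of entries whose key equals id; copy the mapped
        // values into a vector. (Since C++11, a multimap keeps
        // equal keys in insertion order, i.e. file order.)
        int id = 2;
        vector<double> FDV;
        pair<DMMAP::iterator, DMMAP::iterator> r = mapfish.equal_range(id);
        for(DMMAP::iterator it = r.first; it != r.second; ++it)
            FDV.push_back(it->second);

        for(vector<double>::size_type i = 0; i < FDV.size(); ++i)
            cout << FDV[i] << ' ';
        cout << endl;
        return 0;
    }

With your sample data and id = 2, this prints 0 4.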
If you'll be doing the search many times, the multimap solution is probably faster; otherwise you'll get better performance by binary searching the file. We can make an estimate. Call N the number of entries (10 million in your case) and S the number of search operations you're going to perform on the file. Each search takes approximately log2(N) entry readings, so the binary search method reads S*log2(N) file entries in total. With the first method, you read exactly N entries, as you scan the entire file. So the first method is better (to a first approximation) than the second if N < S*log2(N), i.e. if S > N/log2(N). With 10 million entries, N/log2(N) = 10^7/23.25, so the first method wins if you do more than ~430,000 search operations. I guess you won't do that many searches, hence my hunch is that you're probably better off implementing the file binary search stuff. Hope this helps a little.

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo

Want a Boost forum in Code Project? Vote here!
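In case you want to try it, here's a rough sketch of that bisection in C++. It assumes the file is sorted by tag, as your sample is; the function name find_tag and the 64-byte stopping window are arbitrary choices of mine:

    #include <fstream>
    #include <ios>
    #include <string>
    #include <vector>
    using namespace std;

    // Bisect on byte offsets: shrink [lo, hi) so that the first
    // complete entry after lo has a tag smaller than id (or lo is
    // the start of the file), then scan forward collecting matches.
    vector<double> find_tag(const char* filename, int id)
    {
        ifstream in(filename);
        in.seekg(0, ios::end);
        streamoff lo = 0, hi = in.tellg();

        while(hi - lo > 64)              // stop once the window is tiny
        {
            streamoff mid = lo + (hi - lo) / 2;
            in.clear();                  // clear a possible eof state
            in.seekg(mid);
            string partial;
            getline(in, partial);        // skip the partial line at mid

            int tag; double value;
            if(!(in >> tag >> value) || tag >= id)
                hi = mid;                // target starts earlier (or here)
            else
                lo = mid;                // target starts later
        }

        // Linear scan over the small remaining window plus the
        // target block itself.
        in.clear();
        in.seekg(lo);
        if(lo != 0)
        {
            string partial;
            getline(in, partial);
        }

        vector<double> result;
        int tag; double value;
        while(in >> tag >> value && tag <= id)
            if(tag == id)
                result.push_back(value);
        return result;
    }

Each call reads only about log2(N) entries (roughly 23 for 10 million), at the price of reopening and seeking the file on every search.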