Constructing a Tree

llp00na

Dear everyone; I have got a pure C++/ algorithmic question that i would appreciate your input towards. I have a number of files (lets say 100 files) of similar input formats (Strings). To reduce the complexity of the problem I will give you a simplied example:

File 1: File 2: File 3: File4:
A A B A
B E A B
C D D A
D D F E
E F E E

I want to write a program which reads the files and produces a tree, in which the first node is the node from the first line (from above) with the highest number of occurences (so we basically check every first item in each file 'A, A, B, A' in this case. Item 'A' has occured three times so it will represent the root node). Consequently File 3 is ignored. The resulting tree will be something like

                       A (3/4 occurrences = 75%)
          B ( 2/4 occurrences = 50%)             E (1/4 occurrences = 25%)

C (1 occurrence=25%) A (1 occurrence= 25%) D (1 occurrence=25%)
D (2/4 occurrences = 50%) E (1 occurrence = 25%)
E (2/4 occurrences = 50%) F (1 occurrence = 25%)

SO basically the program should go through the files and construct a tree with a % value of each node. every time branching occurs the % will be divided depending on the number of children. My question is, is there any specific algorithm / similar algorithms which i can use to overcome this problem ? and What's the best way to visualise the tree? Your help is very much appreciated Best Regards

llp00na

me 0

Hi, I am a bit confused by your example. It appears that you keep only the highest value for the first line, but keep all values for the following lines. On the following lines the percentage is based on the number of times the string occurs total, not as part of the tree. My reading of your algorithum is: Take the first line from each file. Find the most common 'first line' value. Discard all file that do not have this as the first line value Using all the remaining files, for each line, count how many of each string occurs. This, to me, is not really a tree. If it were a tree you would get: A (3/4 occurrences = 75%) B ( 2/4 occurrences = 50%) E (1/4 occurrences = 25%) C (1 occurrence=25%) A (1 occurrence= 25%) D (1 occurrence=25%) D (1 occurrence=25%) E (1 occurrence= 25%) D (1 occurrence= 25%) E (1 occurrence=25%) E (1 occurrence= 25%) F (1 occurrence= 25%) Could you please clarify what you are after? Hugh

David Crow

llp00na wrote:

C (1 occurrence=25%) A (1 occurrence= 25%) D (1 occurrence=25%) D (2/4 occurrences = 50%) E (1 occurrence = 25%) E (2/4 occurrences = 50%) F (1 occurrence = 25%)

What parent node do these child nodes belong to? It would help if you'd line things up so that they are visually intuitive.

"Approved Workmen Are Not Ashamed" - 2 Timothy 2:15

"Judge not by the eye but by the heart." - Native American Proverb

llp00na

thanx for your reply, The problem is exactly how you understood it. except for your proposed tree,

                  A (3/4 occurrences = 75%)
            B ( 2/4 occurrences = 50%) E (1/4 occurrences = 25%)
     C (1 occurrence=25%) A (1 occurrence= 25%) D (1 occurrence=25%)
     D (1 occurrence=25%) E (1 occurrence= 25%) D (1 occurrence= 25%)
     E (1 occurrence=25%) E (1 occurrence= 25%) F (1 occurrence= 25%)

unfortunately i cant show you the arrows of the tree. the tree you proposed is correct. if you check level 4 and level 5 of the tree you will see 2 nodes labelled D and 2 nodes labelled E. The only difference between my tree and yours is that I have grouped the two similar nodes in one node (and increased the % to 50 instead of 25% for each node). If i would represent my tree i would have two arrows coming from Node C and Node D (level three) towards Node D in level 4. So a node can get have two parents or more. Why did i think a tree would be appropriate? I need to store the data of my files into a data structure for later processing. Please if you do have any other / better suggestion i would be grateful. Regards

llp00na

llp00na

Thanx for your reply; Unfortunately i cant use arrows to show what i really mean. I will try to explain. Level 1: C (1 occurrence=25%) A (1 occurrence= 25%) D (1 occurrence=25%) Level 2: D (2/4 occurrences = 50%) E (1 occurrence = 25%) Level 3: E (2/4 occurrences = 50%) F (1 occurrence = 25%) Node D (in level 2) will have two arrows coming towards it (from Node C and D -level 1-). So the % will accumulate to 50%. Node E is the child of node A. Node E (in level 3) will have two arrows coming from node D and node E (both are being the parent of E in this case). Node F (level 3) is the child of Node D. I hope this is clearer. Regards

llp00na