14 Replies Latest reply on Feb 7, 2011 4:45 PM by ilssac

    Parse Flat data file into a nested structure.

    ilssac Level 5

      This has been driving me crazy all day long.

       

      I have a flat data file that I would like to parse into a nested data structure.

       

      Small sample of the data:

      0 HEAD
      1 SOUR FTW
      2 VERS Family Tree Maker (16.0.350)
      2 NAME Family Tree Maker for Windows
      2 CORP MyFamily.com, Inc.
      3 ADDR 360 W 4800 N
      4 CONT Provo, UT 84604
      3 PHON (801) 705-7000
      0 TRLR
      

       

      If anybody recognizes this, yes that is a small piece of a GEDCOM file.  That is what I am trying to parse.  For anybody who is unfamiliar with this data format.  The first number is the level of a piece of data.  Level 0 are root elements of a data segment.  Level 1 rows relate to the closest preceding level 0 data row.  Level 2 rows relate to the closest preceding Level 1 data row. And so on.

       

      Here is an example of the desired output nesting the various elements to the related parent.

      <cfset foobar = {
       HEAD = {lvl=0,
       SOUR = {lvl=1,data="FTW",
       VERS = {lvl=2,data="Family Tree Maker (16.0.350)"},
       NAME = {lvl=2,data="Family Tree Maker for Windows"},
       CORP = {lvl=2,data="MyFamily.com, Inc.",
       ADDR = {lvl=3,data="360 W 4800 N",
       CONT = {lvl=4,data="Provo, UT 84604"}},
       PHON = {lvl=3,data="(801) 705-7000"}}}},
       TRLR = {lvl=0}
      }>
      
      <cfdump var="#foobar#">
      

       

      I think I am looking at some kind of recursive function to properly nest this data, but I just can not figure out how to do so.

       

      I have this basic function that will output each row of data as a seperate structure key

      <cffunction name="parseFile">
           <cfargument name="file" required="yes">
           <cfargument name="line" required="no" type="string" default="">
           
           <cfscript>
                var returnStruct = structNew();
                var subStruct = structNew();
                var cur_line = "";
                var next_line = "";
                var line_lvl = "";
                var line_key = "";
                var loop = true;
                
                if (len(trim(arguments.line)) EQ 0) {
                     cur_line = fileReadLine(arguments.file);
                }
                else
                {
                     cur_line = arguments.line;
                }
                
                do {
                     if (not FileISEOF(arguments.file)) {
                          next_line = fileReadLine(arguments.file);
                     }
                     else
                     {
                          next_line = "-1";
                          loop = false;
                     }
                     
                     line_lvl = listFirst(cur_line, ' ');
                     cur_line = listRest(cur_line, ' ');
                     line_key = listFirst(cur_line, ' ');
                     cur_line = listRest(cur_line, ' ');
                     
                     returnStruct[line_key] = structNew();
                     returnStruct[line_key]["level"] = line_lvl;
      
                     cur_line = next_line;
                } while (loop);
                
                return returnStruct;
           </cfscript>
      </cffunction>
      
      <cfscript>
           gedcom_file = FileOpen(getDirectoryFromPath(getCurrentTemplatePath()) & "Ian Skinner.GED","read");
           /*gedcom_data = {individuals = structNew(),
                          families = structNew(),
                                               sources = structNew(), 
                                               notes = structNew()};*/
                                               
           gedcom_data = parseFile(gedcom_file);
      </cfscript>
      
      <cfdump var="#gedcom_data#" label="Final Output">
      

       

      I have tried numerous ways to recursive call this function in order to nest the elements.  None of them have produced the expect output in the above hand coded example.  What has got me closest is to recursive call the parseFile() function near the end of the while loop if the level of the next line is greater than the current line level:

       

      if (listFirst(next_line,' ') GT line_lvl) {
           parseFile(arguments.file,next_line);
      }
      


      This works fairly well as long as the next line level is the same as or higher than the previous line level.  But once the next line level is lower, the recursive call will not fall back to the proper parent level.  The current function call just finishes out looping over the file data.  Everything I have tried to provide a proper exit to the recursive function calls when the next line data belong to a previous parent row has just mangled the data horribly.

        • 1. Re: Parse Flat data file into a nested structure.
          cfwild Level 1

          Hi,

           

          I parse using a somewhat different approach, but it works really well when ripping against cfhttp files.  I parsed a couple of pieces of the output you're looking for, you'd just need to follow these and write out the rest of the variables.  If the string is constant, this should be a breeze to implement.  If your start and end points continually change, you could need to calculate lengths into them.  Just my thoughts:

           

          <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

           

          <html>
          <head>
              <title>Untitled</title>
          </head>

           

          <body>

           

          <cfset foo = "
          0 HEAD
          1 SOUR FTW
          2 VERS Family Tree Maker (16.0.350)
          2 NAME Family Tree Maker for Windows
          2 CORP MyFamily.com, Inc.
          3 ADDR 360 W 4800 N
          4 CONT Provo, UT 84604
          3 PHON (801) 705-7000
          0 TRLR" />

           

          <cfset StartText = 'SOUR' />
          <cfset Start = FindNoCase(StartText, foo, 1) />
          <cfif Start gt 0>
          <cfset EndText='2' />
          <cfset Length=Len(StartText) />
          <cfset End = FindNoCase(EndText, foo, Start) />
          <cfset data1 = Mid(foo, Start+Length, End-Start-Length) />
          <cfset data1 = trim(data1) />
          </cfif>

           

          <cfset StartText = 'VERS' />
          <cfset Start = FindNoCase(StartText, foo, 1) />
          <cfif Start gt 0>
          <cfset EndText='2' />
          <cfset Length=Len(StartText) />
          <cfset End = FindNoCase(EndText, foo, Start) />
          <cfset data2 = Mid(foo, Start+Length, End-Start-Length) />
          <cfset data2 = trim(data2) />
          </cfif>

           

          <cfoutput>#data1# - #data2#</cfoutput>

           


          </body>
          </html>

           

           

          <cfwild />

          • 2. Re: Parse Flat data file into a nested structure.
            -==cfSearching==- Level 4

            Since recursion is not my forte, I would use a psuedo stack to keep track of the parents. Each time a lower level line (ie descendant) is found, add it to the stack. When a higher level (sibling/ancestor) is found, just pop items off the stack until you find it's parent.  This is rough, but it seemed to produce the desired nesting.

             

             

            <cfscript>
                gedcom_file = FileOpen(getDirectoryFromPath(getCurrentTemplatePath()) & "sample2.GED","read");
               
                delim   = chr(32);
                root    = {lvl = -1};
                stack   = [];
                arrayPrepend(stack, root);
               
                while (not FileISEOF(gedcom_file)) {
                    line = fileReadLine(gedcom_file);
                   
                    // extract this node's data
                    node      = {};
                    node.lvl  = listFirst(line, delim);
                    line      = listRest(line, delim);
                    key       = listFirst(line, delim);
                    if (listLen(line, delim) gt 1) {
                        node.data = listRest(line, delim);
                    }
                   
                    // get the most recent ancestor from the stack
                    lastNode = stack[1];

             


                    // this is a descendant. add it to the stack and
                    // continue to search for its descendants.
                    if (node.lvl gt lastNode.lvl) {
                        arrayPrepend(stack, node);       
                    }   
                    //otherwise, this is a sibling or ancestor. pop
                    //nodes off the stack until we find its parent
                    else {
                        while (arrayLen(stack) && node.lvl lte lastNode.lvl) {
                            arrayDeleteAt(stack, 1);
                            lastNode = stack[1];
                        }
                    }      
                    // add this node to its parent
                    lastNode[key] = node;
                }   
               
                // clean up
                fileClose(gedcom_file);
                structDelete(root, "lvl");

             

                // display results   

                WriteDump(root);

            </cfscript>

            1 person found this helpful
            • 3. Re: Parse Flat data file into a nested structure.
              Adam Cameron. Level 5

              I like your approach!

               

              Just for the hell of it, I tried another idea: building a JSON string then deserialising it.  I prefer your way though.

               

               

              <!--- cheat with the file read bit, just so as to make this stand-alone --->
              <cfsavecontent variable="sData">
              0 HEAD
              1 SOUR FTW
              2 VERS Family Tree Maker (16.0.350)
              2 NAME Family Tree Maker for Windows
              2 CORP MyFamily.com, Inc.
              3 ADDR 360 W 4800 N
              4 CONT Provo, UT 84604
              3 PHON (801) 705-7000
              0 TRLR
              </cfsavecontent>

               

              <cfset sJson        = "">
              <cfset iPrevLevel    = 0>

               

              <cfloop index="sLine" list="#sData#" delimiters="#chr(13)##chr(10)#">
                  <cfscript>
                      // extract all the bits from each line
                      stElements = reFindNoCase("^(\d+)\s+(\S+)\s*(.*)?$", sLine, 1, TRUE);
                      if (arrayLen(stElements.pos) EQ 4){
                          iLevel    = mid(sLine, stElements.pos[2], stElements.len[2]);
                          sKey    = mid(sLine, stElements.pos[3], stElements.len[3]);
                         
                          if (iLevel EQ iPrevLevel){
                              if (len(sJson)){    // only do this if it's after the first iteration
                                  // close previous struct
                                  sJson = sJson & "},";
                              }
                          }else if (iLevel GT iPrevLevel){
                              // it's a sub-struct, which amounts to being a key within the current struct
                              sJson = sJson & ",";
                          }else{    // implicitly LT
                              // the previous sub-struct(s) has/have finished: close 'em
                              sJson = sJson & "#repeatString('}', iPrevLevel-iLevel+1)#,";
                          }
                          sJson = sJson & '"#sKey#":{"lvl":#iLevel#';

               

                          if (stElements.len[4]){    // only add the data if there's actually something to add
                              sJson = sJson & ',"data":"#mid(sLine, stElements.pos[4], stElements.len[4])#"';
                          }
                      }else{
                          writeOutput("Malformed line skipped: [#sLine#]<br />");
                      }
                      iPrevLevel = iLevel;
                  </cfscript>
              </cfloop>
              <cfset sJson = "{" & sJson & "}}">
              <cfset stResult = deserializeJson(sJson)>
              <cfdump var="#stResult#">

               

               

               

              --

              Adam

               

              (edited it to try to get the code to show up...  I cannot stress how much these forums absolutely BITE).

              1 person found this helpful
              • 4. Re: Parse Flat data file into a nested structure.
                -==cfSearching==- Level 4

                Cool, I did not even think about using json. Very clean and compact.

                 

                (edited it to try to get the code to show up...  I cannot

                stress how much these forums absolutely BITE).

                 

                I almost ranted about it last night after it took three tries to get a stupid code snippet to display properly. Granted I was not functioning on all four cylinders, but ... grrrr. I loathe this software.

                • 5. Re: Parse Flat data file into a nested structure.
                  ilssac Level 5

                  Thanks for the great advice over the weekend everybody.  I was afraid that this problem might have been too esoteric to catch anybody's attention.

                   

                  I like the stack approach as well.  It makes a lot of sense as a good way to keep track of the multiple parents.

                   

                  Unfortunately the posted code (and I have yet to determine why) fails when I add other complex level 0 data elements.  I presume the stack is not being properly maintained when the data reaches a new 0 level element.  I am starting to debug this.  But I thought to throw it up here first, just in case somebody sees the issue right away.

                   

                  Here is a slightly larger example of a gedcom file (privitized).

                  0 HEAD
                  1 SOUR FTW
                  2 VERS Family Tree Maker (16.0.350)
                  2 NAME Family Tree Maker for Windows
                  2 CORP MyFamily.com, Inc.
                  3 ADDR 360 W 4800 N
                  4 CONT Provo, UT 84604
                  3 PHON (801) 705-7000
                  0 @SUBM@ SUBM
                  1 NAME Ian Lee Skinner
                  1 ADDR Apt. xxx
                  2 CONT xxxx xxxxxxxxx xxxxxxxx Lane
                  2 CONT Sacramento, CA  95842
                  2 PHON xxx-xxx-xxxx
                  0 @I001@ INDI
                  1 NAME Ian Lee /SKINNER/
                  2 SOUR @S01891@
                  1 SEX M
                  1 BIRT
                  2 DATE xx xxx 1967
                  2 PLAC xxxxxxxx, xxxxxxx County, MT
                  2 SOUR @S01891@
                  1 DEAT
                  2 SOUR @S01891@
                  1 RESI
                  2 DATE BET xxxx AND xxxx
                  2 PLAC Apt xxx xxxx xxxxxxxxxx xxxxx Lane, Sacramento CA 95842
                  1 RESI
                  2 DATE 1993
                  2 PLAC Grass Valley, California, USA
                  1 RESI
                  2 DATE BET xxxx AND xxxx
                  2 PLAC Apt xxxx xxxx xxxxxxxxx xxxxxxxx Lane, Sacramento, CA 95842
                  1 REFN 1
                  1 FAMS @F001@
                  1 FAMC @F002@
                  0 TRLR
                  
                  • 6. Re: Parse Flat data file into a nested structure.
                    ilssac Level 5

                    Adam's json code works properly with the gedcom data.

                     

                    I guess I will use the json solution unless I get some desire to figure out why the stack is getting confused.

                    • 7. Re: Parse Flat data file into a nested structure.
                      -==cfSearching==- Level 4

                      Adam's json code works properly with the gedcom data.

                       

                      I guess I will use the json solution unless I get some

                      desire to figure out why the stack is getting confused.

                       

                      Oh, duh. I think I see why. Let me run a comparison with Adam's code first.

                      • 8. Re: Parse Flat data file into a nested structure.
                        ilssac Level 5

                        It seems that when the code steps down to the root, the new 0 level element is not put at the top of the stack.

                         

                        I just haven't yet figured out where to tweak to the code to have that happen.

                        • 9. Re: Parse Flat data file into a nested structure.
                          -==cfSearching==- Level 4

                          Yes, that is exactly it. I think the node should always be added to the stack.

                           

                          I just had a deadline handed to me. But this is what I was thinking.

                           

                          ie  ...

                              while (not FileISEOF(gedcom_file)) {

                                  line = fileReadLine(gedcom_file);

                           

                                  // extract this node's data

                                  node         = {};

                                  node.lvl     = listFirst(line, " ");

                                  line         = listRest(line, " ");

                                  key          = listFirst(line, " ");

                                  if (listLen(line, " ") gt 1) {

                                      node.data = listRest(line, " ");

                                  }

                           

                                  // get the most recent ancestor from the stack

                                  lastNode = stack[1];

                                     

                                  // if this is a sibling/ancestor, search for its parent

                                  while (arrayLen(stack) && node.lvl lte lastNode.lvl) {

                                         arrayDeleteAt(stack, 1);

                                      lastNode = stack[1];

                                  }

                                   

                                    // add it to the stack

                                  arrayPrepend(stack, node);       

                           

                                    // add this node to its parent

                                  lastNode[key] = node;

                              }

                          • 10. Re: Parse Flat data file into a nested structure.
                            ilssac Level 5

                            We have a winner!  Man, I just was not seeing any of this.  It could give a long time coder a bit of a complex.

                             

                            Now that I have the basic structure, hopefully I can not modify it to handle the other data elements, and to deal with multiple nodes of the same name.

                             

                            But, I think I have an idea how that would go.

                            • 11. Re: Parse Flat data file into a nested structure.
                              ilssac Level 5

                              Here is my final code for future posterity.

                               

                              In case in future ColdFusion developers who happen to enjoy Genealogy come along and want to parse a gedcom file into a structure.

                               

                              Now, I can figure out if a structure is really the best representation of this data to make the reports I hope to make.

                               

                              <cfinclude template="gedcom-tags.cfm">
                              
                              <cfscript>
                                   gedcom_file = FileOpen(getDirectoryFromPath(getCurrentTemplatePath()) & "Ian Skinner.GED","read");
                                   
                                   delim = chr(32);
                                   gedcom_data  = {lvl = -1};
                                   stack = [];
                                   arrayPrepend(stack,gedcom_data);
                                   
                                   while (not FileIsEOF(gedcom_file)) {
                                        line = fileReadLine(gedcom_file);
                                        
                                        //extract this node's data
                                        node     = {};
                                        node.lvl = val(listFirst(line,delim));
                                        line     = listRest(line,delim);
                                        key      = listFirst(line,delim);
                                              //if key is not a gedcom tag, store the tag name in the node.
                                        if (not structKeyExists(gedcom_tags,key)) {
                                             line = listRest(line,delim);
                                             node.tag = listFirst(line,delim);
                                        }
                                              //if there is data associated with this tag, store it in the node.
                                        if (listLen(line,delim) GT 1) {
                                             node.data = listRest(line,delim);
                                        }
                                              //reteive the most recent node.
                                        lastNode = stack[1];
                                              /*if the current node is not a higher element then the last node,
                                                step through the stack to find the parrent of the current node.*/
                                        while (arrayLen(stack) && node.lvl lte lastNode.lvl) {
                                             arrayDeleteAt(stack,1);
                                             lastNode = stack[1];
                                        }
                                              //add the current node to the top of the stack.
                                        arrayPrepend(stack,node);
                                              //if the current key is alread in the node.
                                        if (structKeyExists(lastNode,key)) {
                                                //append the current node to the parent node array.
                              
                                                      /*if the current node is not an array,
                                                        make it one with the current node
                                                        as the first element in the array*/
                                             if (not isArray(lastNode[key])) {
                                                  firstNode = lastNode[key];
                                                  lastNode[key] = arrayNew(1);
                                                  arrayAppend(lastNode[key],firstNode);
                                             }
                                             arrayAppend(lastNode[key],node);
                                        }
                                        else {
                                              //add the current node to the parent node.
                                          lastNode[key] = node;
                                        }
                                   }
                                   
                                   fileClose(gedcom_file);
                                   structDelete(gedcom_data,"lvl");
                              </cfscript>
                              
                              • 12. Re: Parse Flat data file into a nested structure.
                                Adam Cameron. Level 5

                                Now, I can figure out if a structure is really the best representation of this data to make the reports I hope to make.

                                 

                                If you're publishing these reports to HTML, using XML and XSLT might be a good approach.

                                 

                                It would not take much effort to convert my logic from generating JSON to generate XML instead.

                                 

                                --

                                Adam

                                • 13. Re: Parse Flat data file into a nested structure.
                                  ilssac Level 5

                                  Adam Cameron. wrote:

                                  If you're publishing these reports to HTML, using XML and XSLT might be a good approach.

                                   

                                  Oh my, there is a very interesting idea!  I will have to play with that and see where it takes me!.

                                  • 14. Re: Parse Flat data file into a nested structure.
                                    ilssac Level 5

                                    I started a thread to carry this discussion into the relm of using XSLT on an XML representation of the gedcom file.

                                     

                                    http://forums.adobe.com/thread/788495