• Global community
    • Language:
      • Deutsch
      • English
      • Español
      • Français
      • Português
  • 日本語コミュニティ
    Dedicated community for Japanese speakers
  • 한국 커뮤니티
    Dedicated community for Korean speakers
Exit
0

Parse Flat data file into a nested structure.

Valorous Hero ,
Feb 04, 2011 Feb 04, 2011

Copy link to clipboard

Copied

This has been driving me crazy all day long.

I have a flat data file that I would like to parse into a nested data structure.

Small sample of the data:

0 HEAD
1 SOUR FTW
2 VERS Family Tree Maker (16.0.350)
2 NAME Family Tree Maker for Windows
2 CORP MyFamily.com, Inc.
3 ADDR 360 W 4800 N
4 CONT Provo, UT 84604
3 PHON (801) 705-7000
0 TRLR

If anybody recognizes this, yes that is a small piece of a GEDCOM file.  That is what I am trying to parse.  For anybody who is unfamiliar with this data format.  The first number is the level of a piece of data.  Level 0 are root elements of a data segment.  Level 1 rows relate to the closest preceding level 0 data row.  Level 2 rows relate to the closest preceding Level 1 data row. And so on.

Here is an example of the desired output nesting the various elements to the related parent.

<cfset foobar = {
HEAD = {lvl=0,
SOUR = {lvl=1,data="FTW",
VERS = {lvl=2,data="Family Tree Maker (16.0.350)"},
NAME = {lvl=2,data="Family Tree Maker for Windows"},
CORP = {lvl=2,data="MyFamily.com, Inc.",
ADDR = {lvl=3,data="360 W 4800 N",
CONT = {lvl=4,data="Provo, UT 84604"}},
PHON = {lvl=3,data="(801) 705-7000"}}}},
TRLR = {lvl=0}
}>

<cfdump var="#foobar#">

I think I am looking at some kind of recursive function to properly nest this data, but I just can not figure out how to do so.

I have this basic function that will output each row of data as a seperate structure key

<cffunction name="parseFile">
     <cfargument name="file" required="yes">
     <cfargument name="line" required="no" type="string" default="">
     
     <cfscript>
          var returnStruct = structNew();
          var subStruct = structNew();
          var cur_line = "";
          var next_line = "";
          var line_lvl = "";
          var line_key = "";
          var loop = true;
          
          if (len(trim(arguments.line)) EQ 0) {
               cur_line = fileReadLine(arguments.file);
          }
          else
          {
               cur_line = arguments.line;
          }
          
          do {
               if (not FileISEOF(arguments.file)) {
                    next_line = fileReadLine(arguments.file);
               }
               else
               {
                    next_line = "-1";
                    loop = false;
               }
               
               line_lvl = listFirst(cur_line, ' ');
               cur_line = listRest(cur_line, ' ');
               line_key = listFirst(cur_line, ' ');
               cur_line = listRest(cur_line, ' ');
               
               returnStruct[line_key] = structNew();
               returnStruct[line_key]["level"] = line_lvl;

               cur_line = next_line;
          } while (loop);
          
          return returnStruct;
     </cfscript>
</cffunction>

<cfscript>
     gedcom_file = FileOpen(getDirectoryFromPath(getCurrentTemplatePath()) & "Ian Skinner.GED","read");
     /*gedcom_data = {individuals = structNew(),
                    families = structNew(),
                                         sources = structNew(),
                                         notes = structNew()};*/
                                        
     gedcom_data = parseFile(gedcom_file);
</cfscript>

<cfdump var="#gedcom_data#" label="Final Output">

I have tried numerous ways to recursive call this function in order to nest the elements.  None of them have produced the expect output in the above hand coded example.  What has got me closest is to recursive call the parseFile() function near the end of the while loop if the level of the next line is greater than the current line level:

if (listFirst(next_line,' ') GT line_lvl) {

     parseFile(arguments.file,next_line);

}


This works fairly well as long as the next line level is the same as or higher than the previous line level.  But once the next line level is lower, the recursive call will not fall back to the proper parent level.  The current function call just finishes out looping over the file data.  Everything I have tried to provide a proper exit to the recursive function calls when the next line data belong to a previous parent row has just mangled the data horribly.

TOPICS
Advanced techniques

Views

3.1K

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines

correct answers 1 Correct answer

Valorous Hero , Feb 07, 2011 Feb 07, 2011

Yes, that is exactly it. I think the node should always be added to the stack.

I just had a deadline handed to me. But this is what I was thinking.

ie ...

while (not FileISEOF(gedcom_file)) {

line = fileReadLine(gedcom_file);

// extract this node's data

node = {};

node.lvl = listFirst(line, " ");

line = listRest(line, " ");

key = listFirst(line, " ");

if (listLen(line, " ") gt 1) {

node.data = listRest(li

...

Votes

Translate

Translate
Guest
Feb 04, 2011 Feb 04, 2011

Copy link to clipboard

Copied

Hi,

I parse using a somewhat different approach, but it works really well when ripping against cfhttp files.  I parsed a couple of pieces of the output you're looking for, you'd just need to follow these and write out the rest of the variables.  If the string is constant, this should be a breeze to implement.  If your start and end points continually change, you could need to calculate lengths into them.  Just my thoughts:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
    <title>Untitled</title>
</head>

<body>

<cfset foo = "
0 HEAD
1 SOUR FTW
2 VERS Family Tree Maker (16.0.350)
2 NAME Family Tree Maker for Windows
2 CORP MyFamily.com, Inc.
3 ADDR 360 W 4800 N
4 CONT Provo, UT 84604
3 PHON (801) 705-7000
0 TRLR" />

<cfset StartText = 'SOUR' />
<cfset Start = FindNoCase(StartText, foo, 1) />
<cfif Start gt 0>
<cfset EndText='2' />
<cfset Length=Len(StartText) />
<cfset End = FindNoCase(EndText, foo, Start) />
<cfset data1 = Mid(foo, Start+Length, End-Start-Length) />
<cfset data1 = trim(data1) />
</cfif>

<cfset StartText = 'VERS' />
<cfset Start = FindNoCase(StartText, foo, 1) />
<cfif Start gt 0>
<cfset EndText='2' />
<cfset Length=Len(StartText) />
<cfset End = FindNoCase(EndText, foo, Start) />
<cfset data2 = Mid(foo, Start+Length, End-Start-Length) />
<cfset data2 = trim(data2) />
</cfif>

<cfoutput>#data1# - #data2#</cfoutput>


</body>
</html>

<cfwild />

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 04, 2011 Feb 04, 2011

Copy link to clipboard

Copied

Since recursion is not my forte, I would use a psuedo stack to keep track of the parents. Each time a lower level line (ie descendant) is found, add it to the stack. When a higher level (sibling/ancestor) is found, just pop items off the stack until you find it's parent.  This is rough, but it seemed to produce the desired nesting.

<cfscript>
    gedcom_file = FileOpen(getDirectoryFromPath(getCurrentTemplatePath()) & "sample2.GED","read");
   
    delim   = chr(32);
    root    = {lvl = -1};
    stack   = [];
    arrayPrepend(stack, root);
   
    while (not FileISEOF(gedcom_file)) {
        line = fileReadLine(gedcom_file);
       
        // extract this node's data
        node      = {};
        node.lvl  = listFirst(line, delim);
        line      = listRest(line, delim);
        key       = listFirst(line, delim);
        if (listLen(line, delim) gt 1) {
            node.data = listRest(line, delim);
        }
       
        // get the most recent ancestor from the stack
        lastNode = stack[1];


        // this is a descendant. add it to the stack and
        // continue to search for its descendants.
        if (node.lvl gt lastNode.lvl) {
            arrayPrepend(stack, node);       
        }   
        //otherwise, this is a sibling or ancestor. pop
        //nodes off the stack until we find its parent
        else {
            while (arrayLen(stack) && node.lvl lte lastNode.lvl) {
                arrayDeleteAt(stack, 1);
                lastNode = stack[1];
            }
        }      
        // add this node to its parent
        lastNode[key] = node;
    }   
   
    // clean up
    fileClose(gedcom_file);
    structDelete(root, "lvl");

    // display results   

    WriteDump(root);

</cfscript>

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
LEGEND ,
Feb 05, 2011 Feb 05, 2011

Copy link to clipboard

Copied

I like your approach!

Just for the hell of it, I tried another idea: building a JSON string then deserialising it.  I prefer your way though.

<!--- cheat with the file read bit, just so as to make this stand-alone --->
<cfsavecontent variable="sData">
0 HEAD
1 SOUR FTW
2 VERS Family Tree Maker (16.0.350)
2 NAME Family Tree Maker for Windows
2 CORP MyFamily.com, Inc.
3 ADDR 360 W 4800 N
4 CONT Provo, UT 84604
3 PHON (801) 705-7000
0 TRLR
</cfsavecontent>

<cfset sJson        = "">
<cfset iPrevLevel    = 0>

<cfloop index="sLine" list="#sData#" delimiters="#chr(13)##chr(10)#">
    <cfscript>
        // extract all the bits from each line
        stElements = reFindNoCase("^(\d+)\s+(\S+)\s*(.*)?$", sLine, 1, TRUE);
        if (arrayLen(stElements.pos) EQ 4){
            iLevel    = mid(sLine, stElements.pos[2], stElements.len[2]);
            sKey    = mid(sLine, stElements.pos[3], stElements.len[3]);
           
            if (iLevel EQ iPrevLevel){
                if (len(sJson)){    // only do this if it's after the first iteration
                    // close previous struct
                    sJson = sJson & "},";
                }
            }else if (iLevel GT iPrevLevel){
                // it's a sub-struct, which amounts to being a key within the current struct
                sJson = sJson & ",";
            }else{    // implicitly LT
                // the previous sub-struct(s) has/have finished: close 'em
                sJson = sJson & "#repeatString('}', iPrevLevel-iLevel+1)#,";
            }
            sJson = sJson & '"#sKey#":{"lvl":#iLevel#';

            if (stElements.len[4]){    // only add the data if there's actually something to add
                sJson = sJson & ',"data":"#mid(sLine, stElements.pos[4], stElements.len[4])#"';
            }
        }else{
            writeOutput("Malformed line skipped: [#sLine#]<br />");
        }
        iPrevLevel = iLevel;
    </cfscript>
</cfloop>
<cfset sJson = "{" & sJson & "}}">
<cfset stResult = deserializeJson(sJson)>
<cfdump var="#stResult#">

--

Adam

(edited it to try to get the code to show up...  I cannot stress how much these forums absolutely BITE).

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 05, 2011 Feb 05, 2011

Copy link to clipboard

Copied

Cool, I did not even think about using json. Very clean and compact.

(edited it to try to get the code to show up... I cannot

stress how much these forums absolutely BITE).

I almost ranted about it last night after it took three tries to get a stupid code snippet to display properly. Granted I was not functioning on all four cylinders, but ... grrrr. I loathe this software.

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

Adam's json code works properly with the gedcom data.

I guess I will use the json solution unless I get some desire to figure out why the stack is getting confused.

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

Adam's json code works properly with the gedcom data.

 

I guess I will use the json solution unless I get some

desire to figure out why the stack is getting confused.

Oh, duh. I think I see why. Let me run a comparison with Adam's code first.

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

It seems that when the code steps down to the root, the new 0 level element is not put at the top of the stack.

I just haven't yet figured out where to tweak to the code to have that happen.

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

Yes, that is exactly it. I think the node should always be added to the stack.

I just had a deadline handed to me. But this is what I was thinking.

ie ...

while (not FileISEOF(gedcom_file)) {

line = fileReadLine(gedcom_file);

// extract this node's data

node = {};

node.lvl = listFirst(line, " ");

line = listRest(line, " ");

key = listFirst(line, " ");

if (listLen(line, " ") gt 1) {

node.data = listRest(line, " ");

}

// get the most recent ancestor from the stack

lastNode = stack[1];

// if this is a sibling/ancestor, search for its parent

while (arrayLen(stack) && node.lvl lte lastNode.lvl) {

arrayDeleteAt(stack, 1);

lastNode = stack[1];

}

// add it to the stack

arrayPrepend(stack, node);

// add this node to its parent

lastNode[key] = node;

}

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

We have a winner!  Man, I just was not seeing any of this.  It could give a long time coder a bit of a complex.

Now that I have the basic structure, hopefully I can not modify it to handle the other data elements, and to deal with multiple nodes of the same name.

But, I think I have an idea how that would go.

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

Here is my final code for future posterity.

In case in future ColdFusion developers who happen to enjoy Genealogy come along and want to parse a gedcom file into a structure.

Now, I can figure out if a structure is really the best representation of this data to make the reports I hope to make.

<cfinclude template="gedcom-tags.cfm">

<cfscript>
     gedcom_file = FileOpen(getDirectoryFromPath(getCurrentTemplatePath()) & "Ian Skinner.GED","read");
     
     delim = chr(32);
     gedcom_data  = {lvl = -1};
     stack = [];
     arrayPrepend(stack,gedcom_data);
     
     while (not FileIsEOF(gedcom_file)) {
          line = fileReadLine(gedcom_file);
          
          //extract this node's data
          node     = {};
          node.lvl = val(listFirst(line,delim));
          line     = listRest(line,delim);
          key      = listFirst(line,delim);

                //if key is not a gedcom tag, store the tag name in the node.
          if (not structKeyExists(gedcom_tags,key)) {
               line = listRest(line,delim);
               node.tag = listFirst(line,delim);
          }

                //if there is data associated with this tag, store it in the node.
          if (listLen(line,delim) GT 1) {
               node.data = listRest(line,delim);
          }

                //reteive the most recent node.
          lastNode = stack[1];

                /*if the current node is not a higher element then the last node,

                  step through the stack to find the parrent of the current node.*/
          while (arrayLen(stack) && node.lvl lte lastNode.lvl) {
               arrayDeleteAt(stack,1);
               lastNode = stack[1];
          }

                //add the current node to the top of the stack.
          arrayPrepend(stack,node);

                //if the current key is alread in the node.
          if (structKeyExists(lastNode,key)) {

                  //append the current node to the parent node array.

                        /*if the current node is not an array,

                          make it one with the current node

                          as the first element in the array*/
               if (not isArray(lastNode[key])) {
                    firstNode = lastNode[key];
                    lastNode[key] = arrayNew(1);
                    arrayAppend(lastNode[key],firstNode);
               }

               arrayAppend(lastNode[key],node);
          }
          else {

                //add the current node to the parent node.
            lastNode[key] = node;
          }
     }
     
     fileClose(gedcom_file);
     structDelete(gedcom_data,"lvl");
</cfscript>

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
LEGEND ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

Now, I can figure out if a structure is really the best representation of this data to make the reports I hope to make.

If you're publishing these reports to HTML, using XML and XSLT might be a good approach.

It would not take much effort to convert my logic from generating JSON to generate XML instead.

--

Adam

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

Adam Cameron. wrote:

If you're publishing these reports to HTML, using XML and XSLT might be a good approach.

Oh my, there is a very interesting idea!  I will have to play with that and see where it takes me!.

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

LATEST

I started a thread to carry this discussion into the relm of using XSLT on an XML representation of the gedcom file.

http://forums.adobe.com/thread/788495

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Valorous Hero ,
Feb 07, 2011 Feb 07, 2011

Copy link to clipboard

Copied

Thanks for the great advice over the weekend everybody.  I was afraid that this problem might have been too esoteric to catch anybody's attention.

I like the stack approach as well.  It makes a lot of sense as a good way to keep track of the multiple parents.

Unfortunately the posted code (and I have yet to determine why) fails when I add other complex level 0 data elements.  I presume the stack is not being properly maintained when the data reaches a new 0 level element.  I am starting to debug this.  But I thought to throw it up here first, just in case somebody sees the issue right away.

Here is a slightly larger example of a gedcom file (privitized).

0 HEAD
1 SOUR FTW
2 VERS Family Tree Maker (16.0.350)
2 NAME Family Tree Maker for Windows
2 CORP MyFamily.com, Inc.
3 ADDR 360 W 4800 N
4 CONT Provo, UT 84604
3 PHON (801) 705-7000
0 @SUBM@ SUBM
1 NAME Ian Lee Skinner
1 ADDR Apt. xxx
2 CONT xxxx xxxxxxxxx xxxxxxxx Lane
2 CONT Sacramento, CA  95842
2 PHON xxx-xxx-xxxx
0 @I001@ INDI
1 NAME Ian Lee /SKINNER/
2 SOUR @S01891@
1 SEX M
1 BIRT
2 DATE xx xxx 1967
2 PLAC xxxxxxxx, xxxxxxx County, MT
2 SOUR @S01891@
1 DEAT
2 SOUR @S01891@
1 RESI
2 DATE BET xxxx AND xxxx
2 PLAC Apt xxx xxxx xxxxxxxxxx xxxxx Lane, Sacramento CA 95842
1 RESI
2 DATE 1993
2 PLAC Grass Valley, California, USA
1 RESI
2 DATE BET xxxx AND xxxx
2 PLAC Apt xxxx xxxx xxxxxxxxx xxxxxxxx Lane, Sacramento, CA 95842
1 REFN 1
1 FAMS @F001@
1 FAMC @F002@
0 TRLR

Votes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Resources
Documentation