5 Replies Latest reply on Nov 20, 2008 7:34 AM by Applied CD

    Performance issues when working with huge lists

    Applied CD Level 1
      I’ve got a script that reads a large CSV spreadsheet and parses the data into a list of the form [[A1,B1,C1], [A2,B2,C2], [A3,B3,C3]] and a second list of the form [#A1:B1, #A2:B2, #A3:B3], etc. The actual spreadsheet is about 10 columns x 10,000 rows. Reading the file into a string goes fast enough; the parsing starts off fast but slows to a crawl after about 500 rows (I put the row count on the stage to check progress). Does anyone know if the getaProp, addProp, and append methods are sensitive to the size of the list?

      A sample of one of the parsing loops is below. I’m aware all interactivity will stop as this is executed. This script is strictly for internal use, it crunches the numbers in two spreadsheets and merges the results to a new CSV file. The program is intended to run overnight and the new file harvested in the morning.
        • 1. Re: Performance issues when working with huge lists
          Level 7
          > Does anyone know if the getaProp, addProp, and
          > append methods are sensitive to the size of the list?

          Is this a trick question? Sure they are. All of them.
          AddProp and append are quite fast (the list object preallocates memory as needed, so it scales well), so I doubt that they are the cause of the problem.
          GetAProp searches the list item by item, so if you are looking for the last item, or for an item that is not in the list, the more items there are, the slower the command gets.
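
          For example, here's a rough timing sketch (the handler and variable names are made up; run it from the message window) that shows the lookup cost growing with the size of the property list:

            -- rough sketch: time getaProp against a property list of a given size
            on timeGetaProp listSize
              myList = [:]
              repeat with i = 1 to listSize
                myList.addProp(symbol("k" & i), i)
              end repeat
              startMs = the milliseconds
              repeat with i = 1 to 1000
                -- worst case: look up a key that is not in the list
                dummy = getaProp(myList, #missingKey)
              end repeat
              put listSize && "props:" && (the milliseconds - startMs) && "ms for 1000 lookups"
            end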

          Didn't go through all your code, but I noticed
          - this: repeat with rowCount = 2 to file2string.line.count
          Big no-no! Counting the lines of a string is a very slow operation to be re-evaluated on every pass through a loop; compute it once before the loop (see the sketch after these notes).
          - and this: myFile2data.append(myLineData)
          Operations that grow a structure piece by piece force memory reallocation, and repeated string appends in particular are very slow. If you conclude that such an operation is the bottleneck, consider using a preallocated buffer (create a big string in advance) and then assign into it:
          myData.char[currentOffset..(currentOffset + newStr.length - 1)] = newStr
          This can make code run hundreds of times faster than repeated appends.
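
          Putting both ideas together, a minimal sketch (the buffer size and all variable names here are made up; size the buffer larger than the final output you expect):

            -- hoist the slow line count out of the loop header
            myLineCount = file2string.line.count

            -- build a big string buffer once, by doubling
            myBuffer = " "
            repeat while myBuffer.length < 1000000   -- hypothetical size
              myBuffer = myBuffer & myBuffer
            end repeat

            -- poke each output line into the buffer instead of appending
            currentOffset = 1
            repeat with rowCount = 2 to myLineCount
              newStr = file2string.line[rowCount] & RETURN   -- stands in for the real per-row output
              myBuffer.char[currentOffset..(currentOffset + newStr.length - 1)] = newStr
              currentOffset = currentOffset + newStr.length
            end repeat
            finalOutput = myBuffer.char[1..(currentOffset - 1)]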


          Applied CD wrote:
          > writeLine("File 2 Data Parsing" & RETURN)
          > myOrderColumn = myHeaders2.getOne("OrderNum")
          > myChargesColumn = myHeaders2.getOne("Cost")
          > myFile2data = []
          > mergedFedExCharges = [:]
          > repeat with rowCount = 2 to file2string.line.count
          >   myLineData = []
          >   repeat with i = 1 to file2string.line[rowCount].item.count
          >     myItem = file2string.line[rowCount].item[i]
          >     if i = 1 then
          >       myItem = chars(myItem, 2, myItem.length)
          >     end if
          >     myLineData.append(myItem)
          >   end repeat
          >   if myLineData.count = myHeaders2.count then
          >     myFile2data.append(myLineData)
          >     myOrderSymbol = symbol("s" & myLineData[myOrderColumn])
          >     myCurrentValue = getaProp(mergedFedExCharges, myOrderSymbol)
          >     if voidP(myCurrentValue) then
          >       mergedFedExCharges.addProp(myOrderSymbol, 0.00)
          >     end if
          >     mergedFedExCharges[myOrderSymbol] = mergedFedExCharges[myOrderSymbol] + value(myLineData[myChargesColumn])
          >     writeUpdate(myLineData[1])
          >   else
          >     writeError("file 2 : " & string(myLineData) & RETURN)
          >   end if
          > end repeat
          • 2. Re: Performance issues when working with huge lists
            Applied CD Level 1
            Lol, I figured they’d be sensitive, I guess I was really wondering “how” sensitive. My real fear was that getaProp implicitly searches the list from the start until a match is found every time it’s called, and from what you’re describing I guess it does. I eliminated the repetitive line and item counting and got a speed boost, but it’s still slow. As for the append statements, I’m not sure how I can substitute a string operation. The append statements are being used to build a list representation of the original spreadsheet: in the example myFile2data.append(myLineData), myLineData is a linear list representing a single spreadsheet row, with every cell treated as a string element of the list. myLineData[] is then appended to myFile2data[], so, for example, cell C5 can be accessed with the construct myFile2data[5][3]
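
            As a tiny made-up example of that shape (the values are hypothetical):

              myFile2data = []
              myFile2data.append(["1001", "19.95", "NJ"])   -- one spreadsheet row
              myFile2data.append(["1002", "4.50", "CA"])    -- the next row
              put myFile2data[2][3]
              -- "CA"
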
            • 3. Re: Performance issues when working with huge lists
              Level 7
              I also noticed you are creating symbols on the fly... How many might there be? Because if we are talking about a large number of them, then that's what is causing the issue.
              Symbols are kept in an internal list (after some hashing is applied to the symbol's string). So the more symbols there are, the slower the lookup, and therefore the slower ALL of Director's commands that use symbols. Past a certain number of symbols, a symbol lookup becomes slower than a string comparison.
              It's quite technical to explain, but in simple words, from what I can tell, Director's internal symbol-creation code wasn't built with creating symbols on the fly in mind, so it doesn't scale well. And though the number of symbols created does not affect compiled scripts, it does greatly affect compilation time and the string-to-symbol conversion done by the symbol() command.
              To test whether this is the case in your script, try benchmarking the symbol("somestring") command once before you run your code, and once after you have run it and the symbols have been created.
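
              Something along these lines (a rough sketch, and the handler name is made up) run once before the parse and once after it:

                -- time how long it takes to turn the same string into a symbol;
                -- run before and after the ~6000 order-number symbols exist
                on benchSymbol
                  startMs = the milliseconds
                  repeat with i = 1 to 10000
                    dummy = symbol("sBenchmarkProbe")
                  end repeat
                  put "10000 symbol() calls took" && (the milliseconds - startMs) && "ms"
                end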


              • 4. Re: Performance issues when working with huge lists
                Level 7

                If you want it to work faster, you could look into using Perl, or, for less
                of a learning curve, Visual Studio Express editions are free and you'd have
                a choice of language. Processing time could be reduced by a factor
                substantially better than ten (maybe down to seconds).

                Anyway, rather than writing to the file inside the loop, accumulate the output in a string and write it to the file once at the end, and don't write more than a few lines to the message window.
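
                For instance (a sketch assuming the standard FileIO Xtra; the handler name and output path are hypothetical):

                  -- write the accumulated output in one shot instead of line by line
                  on writeMergedFile outputString
                    myFile = new(xtra "FileIO")
                    myFile.createFile("merged.csv")    -- hypothetical output path
                    myFile.openFile("merged.csv", 2)   -- mode 2 = write only
                    myFile.writeString(outputString)
                    myFile.closeFile()
                    myFile = VOID
                  end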

                Andrew


                • 5. Re: Performance issues when working with huge lists
                  Applied CD Level 1
                  Thanks everyone. The program ran overnight and produced the finished file before I got in, so now the speed issue is more academic than practical; I don’t think it will ever be fast enough to run in real time and wait for the result.

                  I agree another language would process the data faster (I’m rusty at Visual Basic but could probably get it done); the problem is that this little utility will only be run a few times and then be trashed or forgotten, so it would be tough to justify the time to work in an unfamiliar language. What I’m really doing is helping someone reconcile a huge database table from their online order management system with another huge table generated by FedEx for their shipping costs. The consolidated table required aggregating certain values in a way that seemed impossible in SQL but was simple to do programmatically. Turns out their offer of “free” shipping on orders over $200 is actually costing them money, lots of money. Once they build the shipping cost back into the price structure, this program is done.

                  As for on-the-fly symbol generation, I haven’t run a benchmark test yet, but you’re right, I create a ton of them (about 6000) on the fly. I actually thought it was an efficient solution: I had one column with Order Number and another column with Price. Every item in the shopping cart gets its own row, so there were about 6000 unique Order Numbers repeated over 13,000 line items. I needed the total Price for each unique Order Number, so I converted the Order Number to a symbol and created a property list of the form [#OrderNum:PriceTotal]. This avoided explicit loops to discover whether the current Order Number was already in the list or was a new one: simply use getaProp(#OrderNum); if it’s void, the order number is new and gets added to the list; if it’s not void, just add the current price to the property value.
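
                  In code, the pattern boils down to something like this (a simplified sketch; the parallel lists orderNums and prices are made-up names for the parsed columns):

                    mergedCharges = [:]
                    repeat with i = 1 to orderNums.count
                      myOrderSymbol = symbol("s" & orderNums[i])   -- prefix keeps the symbol name legal
                      if voidP(getaProp(mergedCharges, myOrderSymbol)) then
                        mergedCharges.addProp(myOrderSymbol, 0.00)
                      end if
                      mergedCharges[myOrderSymbol] = mergedCharges[myOrderSymbol] + value(prices[i])
                    end repeat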