14 Replies Latest reply on Jul 23, 2010 6:04 PM by Lichtzeichenanlage

    How to read files larger than 100mb

    Lichtzeichenanlage Level 1

      Hi,

       

      neither LrFileUtils.readFile nor io:read can read a file larger than 100 MB in one chunk. Any idea how to read it in one piece?

       

      rgds - wilko

        • 1. Re: How to read files larger than 100mb
          Vladimir Vinogradsky

          Wilko,

          I think trying to read this much in one chunk is a bad idea. Could you explain why you even need to do this?

          • 2. Re: How to read files larger than 100mb
            Lichtzeichenanlage Level 1

            To be honest - 100 MB is a small value. I want to calculate checksums from e.g. PSD and TIFF files. Of course reading files in smaller chunks is possible, but this is just a workaround.

             

            rgds - wilko

            • 3. Re: How to read files larger than 100mb
              johnrellis Most Valuable Participant

              In terms of performance, you'll most likely do best reading in chunks much smaller than 100 MB, e.g. 1 MB or even smaller.  You might see severely degraded performance trying to read an entire 500 MB file into memory.

               

              In terms of programming convenience, of course, it's nice to read an entire file all at once into a string.
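[Editor's note: a minimal sketch of the chunked approach in plain Lua, using only the standard io library. The function name, chunk size, and callback signature are illustrative; inside a Lightroom plugin you would run this in a task and yield periodically.]

```lua
-- Read a file in fixed-size chunks, passing each chunk to a callback.
-- Returns the total number of bytes read, or nil plus an error message.
local function readInChunks( path, chunkSize, onChunk )
    local f, err = io.open( path, 'rb' ) -- io.open returns nil + message on failure
    if not f then return nil, err end
    local total = 0
    while true do
        local chunk = f:read( chunkSize )
        if not chunk then break end -- nil means end of file
        total = total + #chunk
        onChunk( chunk )
    end
    f:close()
    return total
end
```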

              • 4. Re: How to read files larger than 100mb
                areohbee Level 5

                Here's something I wrote recently to copy a big file - could be adapted just for reading:

                 

local __copyBigFile = function( sourcePath, destPath, progressScope )

    local fileSize = LrFileUtils.fileAttributes( sourcePath ).fileSize

    -- local blkSize = 32768 -- typical cluster size on large system or primary data drive.
    local blkSize = 10000000 -- 10 MB at a time - Lua is fine with big chunks.
    local nBlks = math.ceil( fileSize / blkSize )

    -- Note: io.open does not raise on failure; it returns nil plus a message,
    -- so the handles must be nil-checked (pcall alone would not catch this).
    local s, sErr = io.open( sourcePath, 'rb' )
    if not s then return false, sErr end
    local t, tErr = io.open( destPath, 'wb' )
    if not t then
        s:close()
        return false, tErr
    end

    local g, b, x
    local done = false
    local m = 'unknown error'
    local i = 0
    repeat -- forever - until break
        g, b = pcall( s.read, s, blkSize )
        if not g then
            m = b
            break
        end
        if b then
            g, x = pcall( t.write, t, b )
            if not g then
                m = x
                break
            end
            i = i + 1
            if progressScope then
                progressScope:setPortionComplete( i, nBlks )
            end
            LrTasks.yield()
        else -- nil from read means end of file
            g, x = pcall( t.flush, t ) -- close also flushes, but I feel more comfortable pre-flushing and checking -
                -- that way I know if any error is due to writing or closing after written / flushed.
            if not g then
                m = x
                break
            end
            m = '' -- completed sans incident.
            done = true
            break
        end
    until false
    pcall( s.close, s )
    pcall( t.close, t )
    if done then
        return true
    else
        return false, m
    end
end

                • 5. Re: How to read files larger than 100mb
                  ChuckTribolet Level 2

                  I spent several years in hard-drive and disk-subsystem performance at IBM.  Until you run out of memory, big is better.  Especially if you want to read a whole image in to manipulate it, in which case you'd better have the memory.  I have one 600ish MB image (uncompressed TIFF).

                   

                   

                  Chuck

                  • 6. Re: How to read files larger than 100mb
                    johnrellis Most Valuable Participant

                    Until you run out of memory,  big is better. 

                    I strongly suspect that isn't true of Lua in Lightroom, though running actual tests would be the best way to determine that.  Allocating huge contiguous chunks of private memory can strain both the OS's virtual memory and the allocator of the garbage collector.  Here's what Programming in Lua, 2nd Ed by Roberto Ierusalimschy says:

                     

                    Usually, in Lua, it is faster to read a file as a whole than to read it line by line. However, sometimes we must face a big file (say, tens or hundreds megabytes) for which it is not reasonable to read it all at once. If you want to handle such big files with maximum performance, the fastest way is to read them in reasonably large chunks (e.g., 8 Kbytes each).

                     

                     

                     

                    • 7. Re: How to read files larger than 100mb
                      ChuckTribolet Level 2

                      In this day of 64-bit OSs and 4G+ machines, 8K is electron-microscopic.

                       

                      And as I said, if you are going to do image manipulation, you probably want the image in contiguous storage anyway, so you might as well read it in all at once.

                       

                      And little tiny block sizes are REALLY bad if the data is on a network drive, because there will only be one little block in flight at a time.  If you use a big block size, the network file system (CIFS/SMB, NFS, etc.) will have multiple packets in flight at a time and it will go a lot faster.  That's less of a problem with local disks, because the disks will do speculative read-ahead and the data will already be in the drive cache.  I recently finished an analysis of a large product install process that worked reasonably well on local disks, but abysmally over the network, even with a Gigabit Ethernet connection on the same subnet and a bad-boy file server.  It turns out they were reading small blocks to get large files (and sometimes reading the same file more than once).

                      • 8. Re: How to read files larger than 100mb
                        areohbee Level 5

                        I tested the above code with block sizes of 8K, 32K (my cluster size) and 10 MB, and they were all about the same (local disk). I used the "thousand-one, thousand-two, ..., finger-in-the-air" method for comparing the differences - bottom line: not much difference that I could tell.

                         

                        PS - with the smaller block sizes I only yielded every 1000th time through the loop.

                         

                        Granted, the above code only tested local-disk/Lua-code transfer speed, since I wasn't holding the whole file in memory - that's a separate matter.

                         

                        The proof is in the pudding...

                         

                        Rob

                        • 9. Re: How to read files larger than 100mb
                          johnrellis Most Valuable Participant

                          The original poster was interested in computing checksums of large files, which doesn't require the entire file in memory.

                           

                          Out of curiosity, I timed reading a very large file with various chunk sizes, ranging from 8K to 128M.  This confirms Rob's quickie timings -- there is little difference in performance using chunk sizes from 8K through 2M.  But larger than 2M, performance starts degrading seriously, as I suspected:

                           

                          Chunk size   Seconds   Ratio
                          8K           93        1.00
                          32K          90        0.97
                          128K         95        1.02
                          512K         91        0.98
                          2M           98        1.05
                          8M           103       1.11
                          32M          119       1.28
                          128M         176       1.89

                           

                          These times are an average of 3 runs, each run reading a 6 GB file, ensuring that Windows Vista 64 (with 6 GB of memory) wouldn't be able to cache the file in memory.

                           

                          What I suspect is going on: At the OS level, Windows is reading from the disk into its cache in a uniformly large chunk size, regardless of the size passed to Lua's file:read().  But at the larger chunk sizes, the program incurs higher overhead allocating strings of the given chunk size, most likely because Lua memory allocation is optimized for small objects.
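[Editor's note: a harness along these lines can reproduce such measurements. This is a plain-Lua sketch, outside Lightroom; the file path and chunk sizes are placeholders, and os.clock measures CPU time, so wall-clock results may differ.]

```lua
-- Time reading an entire file at a given chunk size.
-- Returns elapsed CPU seconds as reported by os.clock.
local function timeChunkedRead( path, chunkSize )
    local f = assert( io.open( path, 'rb' ) )
    local t0 = os.clock()
    repeat
        local chunk = f:read( chunkSize ) -- nil at end of file
    until not chunk
    local elapsed = os.clock() - t0
    f:close()
    return elapsed
end

-- Example usage (hypothetical test file):
-- for _, size in ipairs{ 8*1024, 32*1024, 2*1024*1024 } do
--     print( size, timeChunkedRead( 'big.tif', size ) )
-- end
```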

                          • 10. Re: How to read files larger than 100mb
                            areohbee Level 5

                            Thank you John.

                             

                            Very useful information.

                             

                            I don't know enough about Lua to comment on the performance hit at the largest block sizes, but I think you are right that modern OSes tend to be very smart and cache-y, so most reads at the Lua level are coming from the cache and not the disk. Certainly that's true for local disk access.

                             

                            Next tests: string allocation performance benchmarks? - local-disk versus network files?

                             

                            Rob

                            • 11. Re: How to read files larger than 100mb
                              Lichtzeichenanlage Level 1

                              Hi,

                               

                              please excuse that I was away so long but I had / have a few family problems. I haven't done much on my code, but I have some results.

                               

                              johnrellis wrote:

                               

                              The original poster was interested in computing checksums of large files, which doesn't require the entire file in memory.

                               

                               

                              Hmmm - my original question was how to read more than 100 MB and how to get rid of this Lua limitation. I would prefer to calculate just one checksum and not dozens.

                               

                              However, currently my code reads a chunk of the file, renders a checksum, yields, and repeats those steps until the file is read completely. The code is not optimized for tiny chunks (string concatenations, fewer Task.yields and so on). Here are my results:

                               

                              http://www.diestrenges.de/share/Misc/results.pdf

                              • 12. Re: How to read files larger than 100mb
                                areohbee Level 5

                                So it seems you've got your answer, right? - read in chunks and accumulate checksum as you go.

                                 

                                Regarding the spinoff issue - if you did want to hold it all in memory, say if it were an image that you wanted to manipulate - anybody know if there's a limit on string size?

                                 

                                -R

                                • 13. Re: How to read files larger than 100mb
                                  Lichtzeichenanlage Level 1

                                  Hi,

                                   

                                  areohbee wrote:

                                   

                                  So it seems you've got your answer, right? - read in chunks and accumulate checksum as you go.

                                   

                                   

                                  Unfortunately not at all. Reading in chunks is what I did all along (without concatenation of strings). We did a lot of performance testing (which I hate, because I have done it so often in the past) with no real result.

                                   

                                  Currently I will go with chunks and lots of checksums, and do the optimization somehow later.

                                  • 14. Re: How to read files larger than 100mb
                                    areohbee Level 5

                                    I'm confused.

                                     

                                    Can't you read in chunks but maintain only one checksum?

                                     

                                    Rob
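[Editor's note: a sketch of what "read in chunks but maintain only one checksum" can look like in plain Lua. The Adler-32-style sum below is only a stand-in for whatever hash the plugin actually computes; any checksum with an incremental update step works the same way.]

```lua
local MOD = 65521 -- largest prime below 2^16, as used by Adler-32

-- Fold one chunk into the running checksum state.
local function adlerUpdate( state, chunk )
    local a, b = state.a, state.b
    for i = 1, #chunk do
        a = ( a + chunk:byte( i ) ) % MOD
        b = ( b + a ) % MOD
    end
    state.a, state.b = a, b
end

-- Checksum a file of any size while holding only one chunk
-- and one small checksum state in memory at a time.
local function checksumFile( path, chunkSize )
    local f, err = io.open( path, 'rb' )
    if not f then return nil, err end
    local state = { a = 1, b = 0 }
    while true do
        local chunk = f:read( chunkSize )
        if not chunk then break end -- end of file
        adlerUpdate( state, chunk )
    end
    f:close()
    return state.b * 65536 + state.a -- combine halves into one number
end
```

The result is independent of the chunk size, so tiny and huge block sizes produce the same single checksum.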