Creating large collections via the SDK
John R. Ellis Oct 13, 2013 2:55 PM
I'm having significant performance problems creating large collections via the SDK -- it's much slower than creating collections in the user interface. For example, using the SDK to create a collection of 15K photos is 8 times slower than via the user interface (218 versus 27 seconds)!
Does anyone have any relevant experience making large collections? Am I missing something?
Here's what I've learned so far. Using a fresh test catalog of 25K photos, I first measured the simplest approach to creating a collection:
    catalog:withWriteAccessDo (no timeout params)
        catalog:createCollection
        collection:addPhotos
This is fine for collections with fewer than 1K photos, but at 2K photos and larger, it really starts slowing down dramatically:
It takes almost two minutes to make a collection of 8K photos and ten minutes for a collection of 25K photos!
Next, I tried adding photos in chunks of, say, 128 photos, one chunk per transaction:
    catalog:withWriteAccessDo (no timeout params)
        catalog:createCollection
    for each chunk of 128 photos
        catalog:withWriteAccessDo (no timeout params)
            collection:addPhotos (128 photos)
I measured various chunk sizes from 1 to 2048, and there's not much difference in total time with sizes between 64 and 1024.
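For anyone who wants to reproduce the chunked approach, here is a minimal Lua sketch of it. The helper names (splitIntoChunks, createCollectionInChunks), the action-name strings, and the canReturnPrior value of true are my own choices for illustration, not anything taken from the SDK or from the measurements above. The catalog is passed in as a parameter, so the function has to be called from inside an async task, as withWriteAccessDo requires.

```lua
-- Hypothetical helper: split an array into consecutive slices of at most
-- chunkSize items each. The last slice may be shorter.
local function splitIntoChunks(items, chunkSize)
  local chunks = {}
  for first = 1, #items, chunkSize do
    local chunk = {}
    for i = first, math.min(first + chunkSize - 1, #items) do
      chunk[#chunk + 1] = items[i]
    end
    chunks[#chunks + 1] = chunk
  end
  return chunks
end

-- Hypothetical helper: create the collection in one write transaction, then
-- add the photos one chunk per write transaction (no timeout params).
-- `catalog` is expected to be an LrCatalog and `photos` an array of LrPhoto.
local function createCollectionInChunks(catalog, name, photos, chunkSize)
  local collection
  catalog:withWriteAccessDo("Create collection", function()
    -- Third argument (canReturnPrior) is an assumption: reuse an existing
    -- collection of the same name instead of failing.
    collection = catalog:createCollection(name, nil, true)
  end)
  for _, chunk in ipairs(splitIntoChunks(photos, chunkSize)) do
    catalog:withWriteAccessDo("Add photos to collection", function()
      collection:addPhotos(chunk)
    end)
  end
  return collection
end
```

In a plug-in this would typically be called from within LrTasks.startAsyncTask, with the catalog obtained from LrApplication.activeCatalog() and a chunk size such as 128.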
But even with chunking, the SDK is much slower than the UI at creating collections:
In general, the larger the collection, the slower it is to create in the SDK - creating a collection of 10K photos is 5 times slower, and creating a collection of 15K photos is 8 times slower! Here's a plot showing the ratio of the SDK time to the UI time versus the size of the collection:
It's very suspicious that the slowdown ratio grows linearly with the size of the collection. If the UI time is itself roughly linear in the number of photos, this suggests the SDK is using an inappropriate n-squared algorithm where the UI uses a linear one.
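As a rough sanity check on that reasoning, the sketch below uses only the two ratios quoted in this post (5 times at 10K photos, 8 times at 15K). If the ratio were exactly proportional to collection size, dividing it by the size would give the same constant for every sample. The function name ratioPerThousand is made up for illustration, and the step from a linear ratio to quadratic SDK time assumes the UI time is about linear in the number of photos, which is not measured here.

```lua
-- Figures quoted in this post: collection size and (SDK time / UI time).
local reported = {
  { photos = 10000, ratio = 5 },
  { photos = 15000, ratio = 8 },
}

-- Hypothetical helper: slowdown ratio per 1K photos. If ratio(n) = k * n,
-- this value is the same constant k for every sample.
local function ratioPerThousand(sample)
  return sample.ratio / (sample.photos / 1000)
end

for _, sample in ipairs(reported) do
  print(string.format("%d photos: %.2f per 1K photos",
    sample.photos, ratioPerThousand(sample)))
end
```

The two values come out close to each other (about 0.50 and 0.53), which is consistent with a ratio that grows roughly linearly, though two points are far too few to prove it.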
I wonder if the difference between the UI and SDK methods is how the SDK handles the undo stack? I'd guess the underlying SQL operations on the catalog are identical and not the cause of the difference, but who knows.
These measurements were all done on LR 5.2 on Windows 7 64-bit, with 8 GB of memory and a 7200 RPM disk. LR 4.4 behaves very similarly.
PS: I've been exploring the use of collections to represent search results in Any Filter. Given LR's bias towards collections instead of filters, many users feel more comfortable accessing the search results via collections. And collections would make it easy for a user to "go back" to a previous set of search results.


