Hi
I have a simple CSV with six columns: Name, Email, Custom1, Custom2, Custom3, Custom4.
The CSV has 120,000 rows.
When I insert the CSV into a MySQL DB, it starts off perfectly: about 2,000 records in 3 minutes.
Thereafter it slows down alarmingly, to about 1 record every second or two.
I really need to speed this process up.
My code is fairly simple... See below:
Please can someone assist me with this as it is causing me endless headaches!
Best regards
Delon
<cffile action="upload" filefield="CSVFILE" destination="C:\inetpub\vhosts\abc.com\httpdocs\login\csvfiles\">
<cffile action="READ" file="#CSVFILE#" variable="thefile">
<cfset processed = 0>
<cfloop list="#thefile#" delimiters="#chr(10)#" index="el">
    <!--- check that there is no comma in name --->
    <cfif listlen(el) GTE 2 AND findnocase('@', listgetat(el, 2)) NEQ 0>
        <!--- check that we don't have this email already --->
        <cfquery name="CheckDup" datasource="#application.db#">
            SELECT name
            FROM tblrecipients
            WHERE <!--- name = '#trim(listgetat(el, 1))#' AND --->
                  email = '#trim(listgetat(el, 2))#'
              AND ListID = #ListID#
              AND userid = #Session.userid#
        </cfquery>
        <cfif CheckDup.RecordCount EQ 0>
            <cfquery datasource="#application.db#" timeout="9999999">
                INSERT INTO tblrecipients (name, email, listid, userid, COLUMNA, COLUMNB, COLUMNC, COLUMND)
                VALUES ('#trim(listgetat(el, 1))#', '#trim(listgetat(el, 2))#', #ListID#, #Session.userid#,
                    <cftry>'#trim(listgetat(el, 3))#'<cfcatch type="any">''</cfcatch></cftry>,
                    <cftry>'#trim(listgetat(el, 4))#'<cfcatch type="any">''</cfcatch></cftry>,
                    <cftry>'#trim(listgetat(el, 5))#'<cfcatch type="any">''</cfcatch></cftry>,
                    <cftry>'#trim(listgetat(el, 6))#'<cfcatch type="any">''</cfcatch></cftry>
                )
            </cfquery>
            <cfset processed = processed + 1>
        </cfif>
    </cfif>
</cfloop>
ColdFusion is not the best tool for bulk loading data like this. Most database management tools worthy of the name have bulk-loading features that can load data directly from CSV files.
If you must use ColdFusion, you may be running into an issue where CF can't store the entire data file in memory, causing the server to do lots of memory swapping. With a large file like that, I would try to create a process that reads one line at a time and loads that line.
First: read the file line by line, rather than the whole thing in one fell swoop.
Second: parameterise your queries, rather than hard-coding values into your SQL string.
Third: if you're only using CheckDup for its recordCount, why not just return the count itself (eg count(name)) rather than returning a recordset and then counting it?
Fourth: investigate bulk-loading this via the DB, omitting CF from the mix completely, or perhaps only involving it to prep the bulk-load file.
(those are in the order they occurred to me, not in the order of importance... it's probably 4, 1, 2, 3).
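A rough sketch of points 1-3 combined, reusing the table and column names from the code above (this is illustrative only: it assumes CF8+ for file= on cfloop, and theFilePath is a hypothetical variable standing in for wherever the upload step saved the CSV):

```cfml
<!--- Sketch only: stream the file line by line and parameterise the dupe check.
      theFilePath is assumed to hold the path of the uploaded CSV. --->
<cfloop file="#theFilePath#" index="el">
    <cfif listLen(el) GTE 2 AND find("@", listGetAt(el, 2))>
        <cfquery name="checkDup" datasource="#application.db#">
            SELECT COUNT(name) AS dupes
            FROM tblrecipients
            WHERE email  = <cfqueryparam cfsqltype="cf_sql_varchar" value="#trim(listGetAt(el, 2))#">
              AND ListID = <cfqueryparam cfsqltype="cf_sql_integer" value="#ListID#">
              AND userid = <cfqueryparam cfsqltype="cf_sql_integer" value="#Session.userid#">
        </cfquery>
        <cfif checkDup.dupes EQ 0>
            <!--- parameterised INSERT, as per the original, goes here --->
        </cfif>
    </cfif>
</cfloop>
```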
--
Adam
A couple of things jump out at me in the CF code. As Adam mentioned, you should read the file line by line. You're effectively
doing that anyway with the delimiters= in the CFLOOP, so just remove the CFFILE to read the file, and change the CFLOOP to read the file instead.
All those CFTRYs in the INSERT statement can't be helping either - just append 4 empty items to the line before the INSERT, and you won't need the CFTRYs at all.
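One caveat on that padding idea: CFML list functions skip empty elements by default, so appending bare commas won't change what listGetAt sees. A sketch of the same idea via an array instead (listToArray's third argument, available since CF8, keeps empty fields):

```cfml
<!--- Convert the line to an array, keeping empty fields, then pad out to
      6 items so positions 3-6 always exist and the CFTRYs can go. --->
<cfset cols = listToArray(el, ",", true)>
<cfloop from="#arrayLen(cols) + 1#" to="6" index="i">
    <cfset arrayAppend(cols, "")>
</cfloop>
```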
Are there lots of email dupes? If the number of distinct emails isn't too great, try loading them into a query first and then doing a QofQ to look for the dupes, instead of going all the way back to the database each time. If you know that the same emails occur a number of times, then at least put a CachedWithin= in the database query to save some of that overhead.
But by far the best bet is Adam's other suggestion - get this out of CF and into the database's ETL if possible. If you're using MS SQL, this is a trivial SSIS task. I used to get CSV data files that were being processed by CF, which over time became huge files. CF was taking hours to process, but when we made it an SSIS task, it only took seconds to complete.
Hi Reed
Thank you for your response.
Whilst I may think I know a lot about CF and MySQL, I actually don’t know nearly as much as I need to.
Could I ask you to explain how to use SSIS?
From the time the CSV is uploaded, what do I do?
Do you have a working example for me to see? I really would be most grateful.
Best regards
Delon
I googled "mysql bulk load", and this was the first match: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html
It seems to explain it fairly succinctly (I sound vague because I've never done this myself, I just know it's possible). It looks pretty straightforward?
It at least gives you a stepping off point to read up on "LOAD DATA" (http://dev.mysql.com/doc/refman/5.1/en/load-data.html), which seems to be the relevant command to do what you need to do...
--
Adam
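For reference, a minimal LOAD DATA sketch along those lines. The file name, the SET values, and the terminator options here are all illustrative placeholders, not tested against this schema; check the MySQL docs for LOCAL, permissions, and escaping before relying on it:

```sql
-- Illustrative only: load the uploaded CSV straight into the table from the
-- original post. The ListID/userid values are placeholders.
LOAD DATA INFILE 'C:/inetpub/vhosts/abc.com/httpdocs/login/csvfiles/upload.csv'
INTO TABLE tblrecipients
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(name, email, COLUMNA, COLUMNB, COLUMNC, COLUMND)
SET ListID = 1, userid = 1;
```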
For what you are trying to do, you can pretty much just run through the Import wizard and then save the project. You can then rerun it whenever you need to import the file. Below is a link to a good tutorial on the process. Once you've done it a couple of times it starts to make sense. There is no "source code" in the programming-language sense for me to copy/paste into this posting for you to look at: it is all done in BIDS (aka Visual Studio), and then SQL does what in programming we would call a "compile" and executes it. Check out the link; it has a lot of screenshots of the BIDS screens you need to go through. I think it will get you rolling:
If this data import is a periodic task that you need to perform, you might want to use CF to handle getting the new file into the folder that your SSIS job will be looking in, and then use the SQL scheduler to have that job run on a periodic basis. For example, I have a CF job that monitors inboxes via CFEXCHANGE, detaches email attachment files, and throws them in a folder that the SSIS task looks in and processes. There are lots of ways to skin that cat. Theoretically, there is a way to actually execute an SSIS job from CF, but I've never been able to get all the moving parts working (it involves a combination of calling an SP, or using CFEXECUTE to run the SQL command-line interface to invoke the SSIS job).
See, the problem is, it is a bulk mailing app which is available to all of my clients.
So each client has their own mail lists etc.
How would I automate this on upload of the CSV?
Ideally what I would need is:
1. user uploads CSV to temp folder
2. Sql imports data then using the SSIS method you mention.
Please advise.
Regards
Delon
If you can learn how your MySQL database does bulk file loading, it is often pretty easy to move a file into the location the database management system needs and to fire off the bulk-loading process from a ColdFusion template.
Thanks, Ian.
Thanks to Reed as well.
Reed, I will try your method using cfloop first and see if it works. If I don't come right, I will try finding a MySQL bulk-upload method.
Thanks.
I will revert back with progress.
Regards
Delon
The CSV has 120 000 rows.
Others have given far more comprehensive answers. But for what it is worth, I would also recommend looking into option #4 (ie using MySQL's bulk-insert tool instead). All things considered, it is almost certainly far more efficient than running what appears to be 240K separate queries. If I am not mistaken ..
I'm not sure I totally agree with you on this. First of all, he is trying to automate things, so any use of the MySQL bulk insert tool would have to be something that can work in an automated/scheduled environment. Secondly, I think you underestimate ColdFusion's database performance. Here are 2 datapoints from jobs that I have that run every night, loading data from previously downloaded files into a MS SQL Server 2005 database, one insert at a time, running simultaneously along with 4-6 other jobs also loading data into that same database:
Job 1 2.7 million records in 3.5 hours
Job 2 1.7 million records in 2.5 hours
I'm not arguing that it is not more efficient to do the work totally within the database - that is certainly true. My jobs would probably finish in a couple of minutes, and one of my "spare time projects" is to go in that direction. What I am saying is that the total time for one of his jobs to process the CSV file shouldn't be all that bad, and that there might be a significant amount of time and effort required to get the pure-database approach done, so if he needs to get this working soon, the pure-CF approach really isn't that bad.
significant amount of time and effort
I do not think I understand where you are coming from there... The syntax of bulk loading commands like LOAD DATA or BULK INSERT is quite simple, and the commands can be used from within a cfquery. So there would be no difference in the automation process than with the existing code. I do not use MySQL as frequently as other databases, but a quick test of 120K lines took about three (3) seconds and four (4) lines of code for me. That is not intended as sarcasm. I am just not sure I understand the perceived complexity here ..
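For illustration, those "four lines" might look something like this from CF. This is a sketch, not the poster's actual test: the path is a placeholder for whatever the upload step produced, plain LOAD DATA INFILE reads from the database server's filesystem (so LOCAL or a shared path may be needed), and the dupe check would still have to be handled separately, eg via a staging table:

```cfml
<!--- Sketch: kick off the bulk load as a single query. The file path is
      hypothetical; dupe-checking is not handled here. --->
<cfquery datasource="#application.db#">
    LOAD DATA INFILE 'C:/inetpub/vhosts/abc.com/httpdocs/login/csvfiles/upload.csv'
    INTO TABLE tblrecipients
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    (name, email, COLUMNA, COLUMNB, COLUMNC, COLUMND)
</cfquery>
```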
IMO, it is not about what you *can* do with CF, but what is the right tool for the task. As you already said, most database bulk loaders do a far superior job at loading raw data. Both in time and overall performance. Running 240K (or however many) separate queries creates a LOT of extra traffic/overhead as all of the commands are passed back and forth individually, and processed on each side. Not something you generally need just to bulk load data. That is one of the reasons bulk loaders outperform the looping method, by a landslide. At least when it comes to large amounts of data. You just kick off a single command and the loader returns a single result when finished.
Now I would not say bulk loading is *always* the best option. But it is usually the better overall approach when dealing with large imports of raw data. That said, there are considerations like file format and permissions. Bulk-loading tools can also be unforgiving when it comes to malformed files. However, you can always scrub the data first. Then let the database do the heavy lifting. Given that this is what bulk loaders were specifically designed to do, and the extreme performance gains possible, it is always worth investigating IMO.
It would be great if you could show me, using the code I gave you, how to do BULK INSERTS?
Thank you
Sorry, I just noticed Adam's post and realized you said "MySQL" and not "MSSQL"! SSIS is part of MS SQL. I'm almost positive that one of the php-based MySQL management tools does simple ETL, but if you're trying to automate all of this, I'm not sure what's available for free in the MySQL world. You might get more bang for your time by trying a couple of the CF ideas that Adam and I gave earlier in this post, to see if you can make your CF execution time more reasonable. From what you described, it really sounds like you are running into a performance problem because of memory usage, and doing the line-by-line file processing instead of reading it all into memory should fix that easily with just a couple of edits to your CF file.
Ok this is where my bit of inexperience as a developer comes in...
If I look at my code it looks like I am doing the upload line by line.
By using the code above, how would I change it to insert line by line?
Thanks again.
In LiveDocs you'll see examples under CFLOOP FILE= that will show you how to change the CFLOOP line of code. Just do that, remove the CFFILE ACTION=READ, and you should be all set. Right now what is happening is that you are reading the entire file into a variable and then your CFLOOP is processing that variable line by line. That's where the memory is being eaten up - by having the entire file in the variable. If you change the CFLOOP to FILE=, then CF will only read one line of the file at a time, each time through the loop, and so there is never much data sitting in memory. Check out the LiveDocs examples, they are pretty simple. http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/index.html
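In other words, the change is roughly this (a sketch: theFilePath is a hypothetical variable standing in for wherever the upload landed, and file= on cfloop requires CF8+):

```cfml
<!--- Before: whole file read into memory, then looped over --->
<cffile action="READ" file="#theFilePath#" variable="thefile">
<cfloop list="#thefile#" delimiters="#chr(10)#" index="el">
    <!--- ...row processing... --->
</cfloop>

<!--- After: cfloop streams the file one line at a time --->
<cfloop file="#theFilePath#" index="el">
    <!--- ...same row processing... --->
</cfloop>
```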
You are doing the UPLOAD line by line, but you are reading the entire CSV file into memory at once. That is a likely source of your problem.
If you have a relatively recent version of CF, you can use <cfloop file="/a/path/to/yourFile.csv" ...> to read the CSV file line by line while doing your upload line by line.