Copy link to clipboard
Copied
Okay – here’s one for the guys smarter than me – IE: everyone. An interesting query.
(CF8, MySQL5)
I’m building an internal stock library for a large firm. Easy job – and its always the easy ones that knock you, ain’t it?
Basically, its very very simple, I have three tables.
Table 1 is a list of images (ImageTbl) – this has a lot of data on it, but the only important thing from ImageTbl is the unique identifier, RowID. It looks something like this
Rowid | Photo |
---|---|
1 | Gold.jpg |
2 | iron.jpg |
3 | train.jpg |
4 | foo.jpg |
5 | bar.jpg |
…Etc. for hundreds of images.
Table 2 is a list of features (FeatureTbl) that can be applied to any image, its literally just a RowID and a Feature, IE
Rowid | Feature |
---|---|
1 | Mining |
2 | Coal |
3 | Rail Transport |
4 | Iron Ore |
5 | Road Transport |
6 | Refined Product |
7 | Gold Ore |
etc. for about 50-odd features. Every image that is uploaded to the system is then given AT LEAST one of these categories, but it can be as many as 9 or 10.
This is where the fun starts…. So, maybe I’ve been given the wrong advice, but I’ve been told not to store strings, but to “normalise” the database, which means I’ve got a third table which has an image number and a feature number: (ImageFeaturesTbl)
ImageNo FeatureNo 1 1 1 3 1 5 2 1 2 7 3 5 etc 100's of records
(This is how I was told to do this, I can probably change the system if I really need to)
Now they want to have a “search” function. Gimme all images that feature “coal”. Got the features list as a checklist on a form, check the feature you want, hit submit – voila! Easy – run a query on ImageFeaturesTbl and Bob is your auntie.
Select imageNo from ImageFeaturesTbl where feature=#FORM.FeatureNo#
Works a treat! Sits out all the images that has the "Coal" featreNo. But now, what if there are more than 2 categories? What if they want all images that has “Coal” or “gold” on it – that one I also figured out, the “OR” query. I can do a loop and search for images on ImageFeaturesTbl that has either one or the other. So this is not the problem either.
The fun really starts with the “and” query – select images that has both “coal” and “rail transport”… and this is where I get stuck. I’ve literally worked till 4:00 AM this morning – and it’s a weekend; and up again at 7 trying to solve this puzzle. I’m miserable!
Basically I’m wanting to get from ImageFeaturesTbl a list of image numbers that contains ALL the selected features. It could be one feature (always at least one) but it could be Nine or 10. But it has to have ALL the featurees. So the system has to give me all the images that shows coal being transported on rail in summer at the harbour, for instance.
The database and tables exist, but how the hell do I write that select query….
Thanks - I'll email the winning answer a crate of beer.
Copy link to clipboard
Copied
Well I was at a beer festival last night, so I'm perhaps not in the best position to offer perfect advice, but the promise of virtual beer is simply too appealing to pass up.
What you could do (and I'm pretty sure MySQL supports this) is an EXISTS clause. Something like this:
SELECT
I.Rowid,
I.Img
FROM
ImageTbl I
WHERE EXISTS (
SELECT 1
FROM
ImageFeaturesTbl IF
WHERE
IF.ImageNo = I.RowId
AND
IF.FeatureNo = <cfqueryparam cfsqltpe="cf_sql_numeric" value="#SearchFeature#" /> )
That'll pull you back any rows for your primary example there you want one feature. For multiple features, you can simply do a CFLOOP with as many AND EXISTS () clauses in there for the other feature IDs you want.
That way you will pull back only rows matching all the clauses you wanted.
Did that make sense? My vision's still a little fuzzy.
Copy link to clipboard
Copied
I'm not a MySQL person, but can't you simply join the tables and use an IN predicate?
Something like:
select distinct photo
from ImageTbl
inner join ImageFeatureTbl on ImageFeatureTbl.ImageNo = ImageTbl.RowID
where Feature in ('#form.FeatureNo#')
You would want to make sure you have indexes setup to properly handle the join as well
as do a bit of work to make sure your FeatureNo list variable is setup properly (you probably
don't want to use the form variable directly).
Good Luck
--
Copy link to clipboard
Copied
I'm not a MySQL person, but can't you simply join the tables and use an IN predicate?
Something like:
select distinct photo
from ImageTbl
inner join ImageFeatureTbl on ImageFeatureTbl.ImageNo = ImageTbl.RowID
where Feature in ('#form.FeatureNo#')
That would get photos which had feature x OR feature y. What the OP is after is photos that have feature x AND feature y.
I'm racking my brains as to whether I think think of something better than Owain's beer-soaked approach, but haven't come up with anything yet. Other than going "mmmmm.... beer..."
--
Adam
Copy link to clipboard
Copied
select somefields
from table1 t1 join table3 t3 on t1.rowID = t3.rowId
where FeatureNo in (select rowId from table2 where lower (feature) like '%coal%')
and FeatureNo in (select rowId from table2 where lower (feature) like '%gold%')
Copy link to clipboard
Copied
select somefields
from table1 t1 join table3 t3 on t1.rowID = t3.rowId
where FeatureNo in (select rowId from table2 where lower (feature) like '%coal%')
and FeatureNo in (select rowId from table2 where lower (feature) like '%gold%')
I think the OP implied they were having some kind of structured search, so there's no need to search the text fields, you can just use the criteria primary keys.
Secondly I'd always go for an EXISTS clause over an IN, as the latter has to evaluate the entire subquery and return its dataset before it can evaluate the outer query. The whole point of an EXISTS is that as soon as it finds a value, it exits the clause and doesn't do any further processing.
You'd need to do some testing, but I'd be willing to get an IN would be much less efficient.
This is my sober reply.
Copy link to clipboard
Copied
"Exists" and "in" are logically different. Basing your choice on efficiency could get you the wrong answer faster.
Our different interpretations of the OP are neither here nor there.
Copy link to clipboard
Copied
I retract my earlier post about "in" and "exists" being logically different. I also followed Owain's invitation to test the two approaches for speed. Here are two logically equivalent queries using redbrick software.
select registration_number
from admit_fact af
where date = yesterday
and admission
and registration_number
in (select registration_number from visit_fact vf
where registration_date = yesterday);
** STATISTICS ** (500) Time = 00:00:00.09 cp time, 00:00:00.11 time, Logical IO count=928
** INFORMATION ** (256) 24 rows returned.
select registration_number
from admit_fact af
where date = yesterday
and admission
and exists
(select 1 from visit_fact vf
where registration_date = yesterday
and vf.registration_number = af.registration_number);
** STATISTICS ** (500) Time = 00:00:00.28 cp time, 00:00:00.29 time, Logical IO count=6112
** INFORMATION ** (256) 24 rows returned.
Using "in" is faster.
Copy link to clipboard
Copied
Simply cannot argue with those figures! Just goes to show the importance of testing your queries before putting them into production, something I do more and more these days.
Out of interest, what platform was that on? And was that on a completely non-indexed table?
Copy link to clipboard
Copied
Simply cannot argue with those figures!
Of course you can. For one thing, it flies in the face of logic. As you said: EXISTS resolves as soon as it matches, whereas IN needs to keep going to build the entire recordset. So - all things being equal - it's going to be faster. It's doing less work.
I would question Dan's testing.
How many rows were resolved by the IN subquery? I wager "not many". As the number of results the subquery returns decreases, the difference between the two will diminish.
How many test runs did Dan make? Looks like "one each". That's no test. Who knows what else the CPU was up to at that moment? The testing should be done over a number of iterations. With each iteration being parameterised differently, so the DB is not just spewing a cached result out as fast as it can.
Was each test result based on the first run of that SQL? Or the nth? Because if it's the first, then the query compile time is going to have a fair impact on the results.
Are the appropriate relationships and indexes in place?
I don't see anything here that represents a reasonable test of the hypothesis, and the conclusions - like I said - fly in the face of common sense, so I question the experiment.
So, yeah, one can argue with those figures. Easily.
--
Adam
Copy link to clipboard
Copied
In the first example, the subquery returned 24 rows. Here it is again where the subquery returns 4.1 million rows:
select registration_number
from admit_fact af
where date = yesterday
and admission
and registration_number
in (select registration_number from visit_fact vf
);
** STATISTICS ** (500) Time = 00:00:51.06 cp time, 00:00:56.07 time, Logical IO count=136900
** INFORMATION ** (256) 24 rows returned.
select registration_number
from admit_fact af
where date = yesterday
and admission
and exists
(select 1 from visit_fact vf
where vf.registration_number = af.registration_number
);
** STATISTICS ** (500) Time = 00:00:00.34 cp time, 00:00:00.43 time, Logical IO count=6380
** INFORMATION ** (256) 24 rows returned.
In this case "exists" is faster.
Copy link to clipboard
Copied
Well as I say it demonstrates the need for testing if nothing else! I've done some testing with my own data here, unfortunately (?) Oracle's cost-based-optimiser is pretty clever and often notices you doing something stupid and changes the way it'll execute anyway so there's very little difference showing up between the two for me.
I don't have any MySQL databases to try with.
Still, passed some time
Copy link to clipboard
Copied
As stated, the database is redbrick (now an IBM product). Three indexed fields were used, admit_fact.date, admit_fact.event_code (not visible in the query), and visit_fact.registration_date.
Copy link to clipboard
Copied
Shoo guys, you really have got me thinking here… the joins do not work – but I’m not that familiar with the joins… Adam Cameron is right, I want an AND not an OR – the or I’ve got sorted.
So far, Owain North’s 1st answer makes the most sense, but I can’t get it to work - something about incorrect syntax.
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'IF WHERE ImageNo = I.RowId AND IF.featureno = 3 )’
Owain – help, please…
Owain is also correct that I need a Rowid, not a text search.
The problem I have with the “like” query is that it picks up 5, 15 and 25 when I only want 5.
This thing is going to turn my remaining hairs grey…. I do however appreciate all your help, and this just showed me I know a lot less about SQL queries than I thought I did…
Copy link to clipboard
Copied
syntax to use near 'IF WHERE ImageNo = I.RowId AND IF.featureno = 3 )’
I have barely skimmed the thread. But if that is not a typo, your problem is almost certainly using IF as a table alias (usually a key word). Try using an abbreviation that is less likely to be a reserved word.
Copy link to clipboard
Copied
As pointed out, it looks like you're using IF as a table alias, which is a reserved word. Seeing as you have a table called ImageFeatures, I'd hazard a guess that's exactly what you're doing
Copy link to clipboard
Copied
I have found a solution. Its not elegant, its not pretty, but it fulfils the most important criteria: happy client.
Ta-dah!
So I have two queries, one to just get an initial result set, and one looped.
Yes its ugly, but it works!