How do I select single entry based on mutiple values?

Report · Jul 15, 2011

Okay – here’s one for the guys smarter than me – IE: everyone. An interesting query.

(CF8, MySQL5)

I’m building an internal stock library for a large firm. Easy job – and its always the easy ones that knock you, ain’t it?

Basically, its very very simple, I have three tables.

Table 1 is a list of images (ImageTbl) – this has a lot of data on it, but the only important thing from ImageTbl is the unique identifier, RowID. It looks something like this

Rowid	Photo
1	Gold.jpg
2	iron.jpg
3	train.jpg
4	foo.jpg
5	bar.jpg

…Etc. for hundreds of images.

Table 2 is a list of features (FeatureTbl) that can be applied to any image, its literally just a RowID and a Feature, IE

Rowid	Feature
1	Mining
2	Coal
3	Rail Transport
4	Iron Ore
5	Road Transport
6	Refined Product
7	Gold Ore

etc. for about 50-odd features. Every image that is uploaded to the system is then given AT LEAST one of these categories, but it can be as many as 9 or 10.

This is where the fun starts…. So, maybe I’ve been given the wrong advice, but I’ve been told not to store strings, but to “normalise” the database, which means I’ve got a third table which has an image number and a feature number: (ImageFeaturesTbl)

ImageNo	FeatureNo
1	1
1	3
1	5
2	1
2	7
3	5
etc	100's of records

(This is how I was told to do this, I can probably change the system if I really need to)

Now they want to have a “search” function. Gimme all images that feature “coal”. Got the features list as a checklist on a form, check the feature you want, hit submit – voila! Easy – run a query on ImageFeaturesTbl and Bob is your auntie.

Select imageNo from ImageFeaturesTbl where feature=#FORM.FeatureNo#

Works a treat! Sits out all the images that has the "Coal" featreNo. But now, what if there are more than 2 categories? What if they want all images that has “Coal” or “gold” on it – that one I also figured out, the “OR” query. I can do a loop and search for images on ImageFeaturesTbl that has either one or the other. So this is not the problem either.

The fun really starts with the “and” query – select images that has both “coal” and “rail transport”… and this is where I get stuck. I’ve literally worked till 4:00 AM this morning – and it’s a weekend; and up again at 7 trying to solve this puzzle. I’m miserable!

Basically I’m wanting to get from ImageFeaturesTbl a list of image numbers that contains ALL the selected features. It could be one feature (always at least one) but it could be Nine or 10. But it has to have ALL the featurees. So the system has to give me all the images that shows coal being transported on rail in summer at the harbour, for instance.

The database and tables exist, but how the hell do I write that select query….

Thanks - I'll email the winning answer a crate of beer.

Report · Jul 16, 2011

Well I was at a beer festival last night, so I'm perhaps not in the best position to offer perfect advice, but the promise of virtual beer is simply too appealing to pass up.

What you could do (and I'm pretty sure MySQL supports this) is an EXISTS clause. Something like this:

SELECT

I.Rowid,

I.Img

FROM

ImageTbl I

WHERE EXISTS (

SELECT 1

FROM

ImageFeaturesTbl IF

WHERE

IF.ImageNo = I.RowId

AND

IF.FeatureNo = <cfqueryparam cfsqltpe="cf_sql_numeric" value="#SearchFeature#" /> )

That'll pull you back any rows for your primary example there you want one feature. For multiple features, you can simply do a CFLOOP with as many AND EXISTS () clauses in there for the other feature IDs you want.

That way you will pull back only rows matching all the clauses you wanted.

Did that make sense? My vision's still a little fuzzy.

Report · Jul 16, 2011

I'm not a MySQL person, but can't you simply join the tables and use an IN predicate?

Something like:

select distinct photo

from ImageTbl

inner join ImageFeatureTbl on ImageFeatureTbl.ImageNo = ImageTbl.RowID

where Feature in ('#form.FeatureNo#')

You would want to make sure you have indexes setup to properly handle the join as well

as do a bit of work to make sure your FeatureNo list variable is setup properly (you probably

don't want to use the form variable directly).

Good Luck

--

Report · Jul 16, 2011

I'm not a MySQL person, but can't you simply join the tables and use an IN predicate?
Something like:
select distinct photo
from ImageTbl
inner join ImageFeatureTbl on ImageFeatureTbl.ImageNo = ImageTbl.RowID
where Feature in ('#form.FeatureNo#')

That would get photos which had feature x OR feature y. What the OP is after is photos that have feature x AND feature y.

I'm racking my brains as to whether I think think of something better than Owain's beer-soaked approach, but haven't come up with anything yet. Other than going "mmmmm.... beer..."

--

Adam

Report · Jul 16, 2011

select somefields

from table1 t1 join table3 t3 on t1.rowID = t3.rowId

where FeatureNo in (select rowId from table2 where lower (feature) like '%coal%')

and FeatureNo in (select rowId from table2 where lower (feature) like '%gold%')

Report · Jul 18, 2011

select somefields
from table1 t1 join table3 t3 on t1.rowID = t3.rowId
where FeatureNo in (select rowId from table2 where lower (feature) like '%coal%')
and FeatureNo in (select rowId from table2 where lower (feature) like '%gold%')

I think the OP implied they were having some kind of structured search, so there's no need to search the text fields, you can just use the criteria primary keys.

Secondly I'd always go for an EXISTS clause over an IN, as the latter has to evaluate the entire subquery and return its dataset before it can evaluate the outer query. The whole point of an EXISTS is that as soon as it finds a value, it exits the clause and doesn't do any further processing.

You'd need to do some testing, but I'd be willing to get an IN would be much less efficient.

This is my sober reply.

Report · Jul 18, 2011

"Exists" and "in" are logically different. Basing your choice on efficiency could get you the wrong answer faster.

Our different interpretations of the OP are neither here nor there.

Report · Jul 18, 2011

I retract my earlier post about "in" and "exists" being logically different. I also followed Owain's invitation to test the two approaches for speed. Here are two logically equivalent queries using redbrick software.

select registration_number

from admit_fact af

where date = yesterday

and admission

and registration_number

in (select registration_number from visit_fact vf

where registration_date = yesterday);

** STATISTICS ** (500) Time = 00:00:00.09 cp time, 00:00:00.11 time, Logical IO count=928
** INFORMATION ** (256) 24 rows returned.

select registration_number

from admit_fact af

where date = yesterday

and admission

and exists

(select 1 from visit_fact vf

where registration_date = yesterday

and vf.registration_number = af.registration_number);

** STATISTICS ** (500) Time = 00:00:00.28 cp time, 00:00:00.29 time, Logical IO count=6112
** INFORMATION ** (256) 24 rows returned.

Using "in" is faster.

Report · Jul 18, 2011

Simply cannot argue with those figures! Just goes to show the importance of testing your queries before putting them into production, something I do more and more these days.

Out of interest, what platform was that on? And was that on a completely non-indexed table?

Report · Jul 18, 2011

Simply cannot argue with those figures!

Of course you can. For one thing, it flies in the face of logic. As you said: EXISTS resolves as soon as it matches, whereas IN needs to keep going to build the entire recordset. So - all things being equal - it's going to be faster. It's doing less work.

I would question Dan's testing.

How many rows were resolved by the IN subquery? I wager "not many". As the number of results the subquery returns decreases, the difference between the two will diminish.

How many test runs did Dan make? Looks like "one each". That's no test. Who knows what else the CPU was up to at that moment? The testing should be done over a number of iterations. With each iteration being parameterised differently, so the DB is not just spewing a cached result out as fast as it can.

Was each test result based on the first run of that SQL? Or the nth? Because if it's the first, then the query compile time is going to have a fair impact on the results.

Are the appropriate relationships and indexes in place?

I don't see anything here that represents a reasonable test of the hypothesis, and the conclusions - like I said - fly in the face of common sense, so I question the experiment.

So, yeah, one can argue with those figures. Easily.

--

Adam

Report · Jul 18, 2011

In the first example, the subquery returned 24 rows. Here it is again where the subquery returns 4.1 million rows:

select registration_number

from admit_fact af

where date = yesterday

and admission

and registration_number

in (select registration_number from visit_fact vf

);

** STATISTICS ** (500) Time = 00:00:51.06 cp time, 00:00:56.07 time, Logical IO count=136900
** INFORMATION ** (256) 24 rows returned.

select registration_number

from admit_fact af

where date = yesterday

and admission

and exists

(select 1 from visit_fact vf

where vf.registration_number = af.registration_number

);

** STATISTICS ** (500) Time = 00:00:00.34 cp time, 00:00:00.43 time, Logical IO count=6380
** INFORMATION ** (256) 24 rows returned.

In this case "exists" is faster.

Report · Jul 18, 2011

Well as I say it demonstrates the need for testing if nothing else! I've done some testing with my own data here, unfortunately (?) Oracle's cost-based-optimiser is pretty clever and often notices you doing something stupid and changes the way it'll execute anyway so there's very little difference showing up between the two for me.

I don't have any MySQL databases to try with.

Still, passed some time

Report · Jul 18, 2011

As stated, the database is redbrick (now an IBM product). Three indexed fields were used, admit_fact.date, admit_fact.event_code (not visible in the query), and visit_fact.registration_date.

Report · Jul 19, 2011

Shoo guys, you really have got me thinking here… the joins do not work – but I’m not that familiar with the joins… Adam Cameron is right, I want an AND not an OR – the or I’ve got sorted.

So far, Owain North’s 1^st answer makes the most sense, but I can’t get it to work - something about incorrect syntax.

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'IF WHERE ImageNo = I.RowId AND IF.featureno = 3 )’

Owain – help, please…

Owain is also correct that I need a Rowid, not a text search.

The problem I have with the “like” query is that it picks up 5, 15 and 25 when I only want 5.

This thing is going to turn my remaining hairs grey…. I do however appreciate all your help, and this just showed me I know a lot less about SQL queries than I thought I did…

Report · Jul 19, 2011

syntax to use near 'IF WHERE ImageNo = I.RowId AND IF.featureno = 3 )’

I have barely skimmed the thread. But if that is not a typo, your problem is almost certainly using IF as a table alias (usually a key word). Try using an abbreviation that is less likely to be a reserved word.

Report · Jul 19, 2011

As pointed out, it looks like you're using IF as a table alias, which is a reserved word. Seeing as you have a table called ImageFeatures, I'd hazard a guess that's exactly what you're doing

Report · Jul 20, 2011

I have found a solution. Its not elegant, its not pretty, but it fulfils the most important criteria: happy client.

This is what I did:
Have the search criteria (index numbers) as a list.
Run a query looking for all images with the FIRST criteria, output result set as list.
Run a loop from SECOND criteria to “n” containing a second query looking for images in the list and output values to new list.

Ta-dah!

So I have two queries, one to just get an initial result set, and one looped.

Yes its ugly, but it works!

Adobe Community

How do I select single entry based on mutiple values?