How can we help you today?
Query issues - how to form this query so VistaDB 5.8.7 likes it
Oops - copy-past error. The second query should have the MAX(it.FamilyID) as FamilyID like the 3rd one does. Is it possible to edit posts?
This is the query that I really want, but I was going for the #temp table because even SQL Server sometimes does this badly...
select -- a bunch of columns, but for simplicity we'll go with... * from Item it join Batch b on b.FileID=it.FileID and b.LineNumber=it.BatchLineNumber join [File] f on f.FileID=it.FileID where it.FamilyID in ( select FamilyID from Item it group by FamilyID having count(*) > 1 ) ;
It looks like this will work - avoiding SELECT ... INTO and having an extra column in the temp table.
create table #ids ( FamilyID bigint not null primary key, [Count] int not null ); insert #ids(FamilyID, [Count]) select FamilyID, count(*) as [Count] from Item group by FamilyID having count(*) > 1 ; select -- a bunch of columns... * from #ids id join Item it on it.FamilyID=id.FamilyID join Batch b on b.FileID=it.FileID and b.LineNumber=it.BatchLineNumber join [File] f on f.FileID=it.FileID order by id.FamilyID ; drop table #ids;
That leaves just one question (which I can probably answer experimentally): Will VistaDB reify the entire result set and then sort it, or will it realize that an index scan of #ids will produce the results in the correct order. Note that all join targets are SARGable in SQL server terms - they're indexes or primary keys on the target tables.
Is there any way to see something like a query plan?
If you run this in Data Builder there's an Execution Log tab that shows information about how the query is running. I believe it indicates whether it's doing a Live query or not (a Live query is when it can output the rows directly from the underlying query without having to spool and sort them separately).
Thanks, Kendall. I haven't seen anything in the log about Live query, but I realized a simple test - I added a "top 10" to the query. If the result is fast, it's being streamed as it runs. If not, it's doing an offline sort. The result IS fast, so it appears that there's no sort.
That's a clever experiment, but it might be misleading. The presence of a small TOP limit can affect how VistaDB chooses to run the query. If the TOP limit is fairly small, but the likely result set would otherwise be large, it figures that trying to run a live query walking an index that matches the ORDER BY will be cut off by the TOP limit in less time than accumulating the entire query's results and sorting them to find the TOP few. But, without the TOP limit (or if the limit is large enough) it may run the same query as a non-live query and accumulate the entire results and then sort them--because it turns out that walking an index is much slower for the engine than walking the natural table storage (in order of original insertion), so if it isn't going to be cut off early it's actually faster as a non-live query.
But, you can do something similar that might be more accurate. If you use ExecuteReader() in your app or test code you can try measuring how long it takes to return the reader and how long it takes for rows after the first one. If returning the reader is very slow and subsequent rows (after the first, which is found before returning the reader and should always be fairly fast to return once you have the reader) are then very fast, it's probably a non-live query and computed all the results up front. If returning the reader is faster than the time (in total) reading several subsequent rows then it's probably a live query and is finding the rows as it goes. (This assumes it's a large enough query which takes seconds, or longer, to compute; otherwise, it's harder to tell externally.)
Your first point about GROUP BY and HAVING is correct. We'll try to get that fixed, hopefully in a 6.x version. You may have also found a bug in SELECT … INTO with aggregate functions (or specifically with a GROUP BY and/or HAVING clause). It might be validating the result columns incorrectly in that case. But, your first two examples appear to be the same; am I missing a difference in them or did you perhaps paste the wrong query for one of them?
If you like the structure and performance of your second-to-last query from your original post, you can try wrapping that as a sub-query to select out just the first column (which could also apply a WHERE clause on the count rather than use a having clause in the sub-query--I'm not sure which way will perform better, but the HAVING clause approach will presumably use less memory internally). So something like:
select sq.FamilyID
--into #ids
from ( select it.FamilyID, count(*) as [Count]
from Item it
group by it.FamilyID
having count(*) > 1 ) as sq --Or if that fails, try: having [Count] > 1
or
select sq.FamilyID
--into #ids
from ( select it.FamilyID, count(*) as [Count]
from Item it
group by it.FamilyID ) as sq
where sq.[Count] > 1
Either of these should also get around the bug in SELECT … INTO if you add INTO to the outer query. (I haven't tried them, but I think those would work--assuming you first drop the #ids table if it already exists.)
Rob - thanks for the comments.
Regarding TOP: Yes, quite so - as is also the case with SQL Server. It appears that this does run as a live query, but I haven't actually run it over enough rows to be certain.
Regarding second example: Yes, copy-paste error. The second query should have been selecting MAX(FamilyID) but I pasted the wrong one.
Regarding suggestions: I'll check those out if the construct I have now proves to perform inadequately on a larger dataset. For now, my last version appears to be OK.
Regarding group-by/having queries with no aggregates in the results and select/into oddities: should I file bug reports for these items?
You can file a support ticket on these issues, especially if you can include an example database against which your example queries can be run to demonstrate the different cases.
If the database is small enough (less than 20MB) it can be included as an attachment. Otherwise, you can provide a link for us to download it or use the instructions here to send it to our TransferBigFiles drop-box. But, for this case, a smaller database would probably be better, anyway, since it's mostly schema- and syntax-dependent rather than data-dependent. It would help to have enough data for the queries to be meaningful when they work, especially if it's also small enough to validate that the results are correct.
Carl Daniel
Assume a table Item with a column FamilyID bigint not null.
Here's the query I wrote:
VistaDB complains "HAVING clause cannot be used if result does not have aggregate functions". This is untrue - it should be aggregate functions and/or grouping keys.
So I tried:
VistaDB complains: "Each result column of a SELECT ... INTO must have a unique alias name from a column name or an explicit alias. FamilyID"
Um, what? There's only one column, and it has a unique name.
So I tried:
No error, but the query also produces no results. There are ~180,000 rows in Item with about 40,000 distinct values of FamilyID.
So I tried:
Ah ha! Results! But I only want the IDs, not the counts. I guess I can keep the extra column, but it's wasting memory.
Feeling encouraged, I tried:
But no we're back to the earlier error about unique column aliases.
Questions:
1. Does VistaDB not support temp tables?
2. Is there any way to write the query so that I don't need to add an unneeded aggregate column to the results to get VistaDB to process it?