ALL, DISTINCT, DISTINCTROW, TOP Predicates




Specifies records selected with SQL queries.

Syntax

SELECT [ALL | DISTINCT | DISTINCTROW | [TOP n [PERCENT]]] FROM table

A SELECT statement containing these predicates has the following parts:

Part

Description

ALL

Assumed if you include none of the predicates. The Microsoft Access database engine selects all of the records that meet the conditions in the SQL statement. The following two examples are equivalent and return all records from the Employees table:

SELECT ALL * FROM Employees ORDER BY EmployeeID;

SELECT * FROM Employees ORDER BY EmployeeID;

DISTINCT

Omits records that contain duplicate data in the selected fields. To be included in the query results, the values for each field listed in the SELECT statement must be unique. For example, several employees listed in an Employees table may have the same last name. If two records contain Atakan in the LastName field, the following SQL statement returns a single record that contains Atakan:

SELECT DISTINCT LastName FROM Employees;

If you don't use DISTINCT, this query returns both Atakan records.

If the SELECT clause contains more than one field, the combination of values from all fields must be unique for a given record to be included in the results.

The output of a query that uses DISTINCT is not updatable and does not reflect subsequent changes made by other users.
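As a quick sanity check, the DISTINCT behavior described above can be reproduced outside Access as well; this sketch uses Python's built-in sqlite3 with an invented Employees table (the names are hypothetical):

```python
# Minimal sketch of DISTINCT on one field vs. several fields (sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER, FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(1, "Ada", "Smith"), (2, "Ben", "Smith"), (3, "Cem", "Jones")],
)

# Two employees share the last name Smith, but DISTINCT collapses them:
rows = conn.execute("SELECT DISTINCT LastName FROM Employees ORDER BY LastName").fetchall()
print(rows)  # [('Jones',), ('Smith',)]

# With two fields, the *combination* must be unique, so all three rows survive:
rows = conn.execute("SELECT DISTINCT FirstName, LastName FROM Employees").fetchall()
print(len(rows))  # 3
```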

DISTINCTROW

Omits data based on entire duplicate records, not just duplicate fields. For example, you could create a query that joins the Customers and Orders tables on the CustomerID field. The Customers table contains no duplicate CustomerID fields, but the Orders table does, because each customer can have many orders. The following SQL statement shows how you can use DISTINCTROW to produce a list of companies that have at least one order, but without any details about those orders:

SELECT DISTINCTROW CompanyName
FROM Customers INNER JOIN Orders
ON Customers.CustomerID = Orders.CustomerID
ORDER BY CompanyName;

If you omit DISTINCTROW, this query produces multiple rows for each company that has more than one order.

DISTINCTROW has an effect only when you select fields from some, but not all, of the tables used in the query. DISTINCTROW is ignored if your query includes only one table, or if you output fields from all tables.

TOP n [PERCENT]

Returns a certain number of records that fall at the top or the bottom of a range specified by an ORDER BY clause. Suppose you want the names of the top 25 students from the class of 2003:

SELECT TOP 25 FirstName, LastName
FROM Students
WHERE GraduationYear = 2003
ORDER BY GradePointAverage DESC;

If you don't include the ORDER BY clause, the query returns an arbitrary set of 25 records from the Students table that satisfy the WHERE clause.

The TOP predicate doesn't choose between equal values. In the preceding example, if the twenty-fifth and twenty-sixth highest grade point averages are the same, the query returns 26 records.

You can also use the PERCENT reserved word to return a certain percentage of records that fall at the top or the bottom of a range specified by an ORDER BY clause. Suppose that, instead of the top 25 students, you want the bottom 10 percent of the class:

SELECT TOP 10 PERCENT FirstName, LastName
FROM Students
WHERE GraduationYear = 2003
ORDER BY GradePointAverage ASC;

The ASC predicate specifies a return of bottom values. The value that follows TOP must be an unsigned integer.

Using TOP doesn't affect whether or not the query is updatable.
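TOP n [PERCENT] is Access/SQL Server syntax; most other engines express the same idea with ORDER BY ... LIMIT. A sketch with invented student data in sqlite3, noting one behavioral difference on ties:

```python
# LIMIT as the portable counterpart of TOP (sqlite3, invented data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (FirstName TEXT, GraduationYear INTEGER, GradePointAverage REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
    ("Ann", 2003, 3.9), ("Bob", 2003, 3.5), ("Cat", 2003, 3.5), ("Dan", 2002, 4.0),
])

# Rough equivalent of SELECT TOP 2 ... ORDER BY GradePointAverage DESC:
top2 = conn.execute(
    "SELECT FirstName FROM Students "
    "WHERE GraduationYear = 2003 "
    "ORDER BY GradePointAverage DESC LIMIT 2"
).fetchall()
print(top2)  # Ann plus one of the two students tied at 3.5

# Unlike Access's TOP, LIMIT never returns extra rows on ties:
# Bob and Cat both have 3.5, yet only two rows come back in total.
assert len(top2) == 2
```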

table

The name of the table from which records are retrieved.

support.office.com


Using SQL DISTINCT

The DISTINCT keyword takes a single instance of each value found in the specified field of a table. In other words, it keeps one copy of each repeating value and, together with the non-repeating values, builds a result set. You might ask what this is good for, or where it can be used.

For example, imagine a table in which your members are stored, and suppose you want to find out how many members you have in İstanbul. You could put a combo box in your application and type the province names into it one by one. But that wastes time, and since it would also list provinces with no registered members, the combo box would show a long list when opened. Instead, you can use DISTINCT to reduce the values in the table's city field to single occurrences and load them into the combo box. You can then select a city and see how many members you have there, and you can safely conclude that you have no members in any city that does not appear in the list.

DISTINCT Syntax

SELECT DISTINCT alan_adi1, alan_adi2 FROM tablo_adi

DISTINCT is applied automatically to the fields written after it. In other words, when DISTINCT is to be applied to more than one field, it is not written in front of each field separately. Note also that DISTINCT cannot be used on its own; it must be used with a SELECT statement.

The point to watch in a multi-field DISTINCT is that the values in the listed fields are treated as a whole, and uniqueness is evaluated across that combination of fields (see Example 2 and Example 3).

Example table: as an example, suppose we have a table named Personel, shown below.

id  Adi_soyadi         Sehir     Bolum                   Meslek_Kodu
1   Salih ESKİOĞLU     İstanbul  Bilgi İşlem Sorumlusu   1234567
2   Ayhan ÇETİNKAYA    Kocaeli   İdari İşler Yöneticisi  2345678
3   Serkan ÖZGÜREL     Erzincan  Muhasebe                3456789
4   İlhan ÖZLÜ         İstanbul  Bilgi İşlem Sorumlusu   2345678

Example 1:

SELECT DISTINCT Sehir FROM Personel 

This statement retrieves each value found in the Sehir field of the table exactly once.

Output:

Sehir
İstanbul
Kocaeli
Erzincan

Example 2:

SELECT DISTINCT Sehir, Bolum FROM Personel

In this example, the fields holding the city and department information are selected from the Personel table. The point to note here is that the two fields are evaluated as if they were a single field.

Output:

Sehir     Bolum
İstanbul  Bilgi İşlem Sorumlusu
Kocaeli   İdari İşler Yöneticisi
Erzincan  Muhasebe

Note that the last row of our table was not returned, because when we treat the Sehir and Bolum fields as a single field, the first row produces the value "İstanbul Bilgi İşlem Sorumlusu", and the record in the last row yields the same value. For that reason it was not included.

Example 3:

SELECT DISTINCT Sehir, Bolum, Meslek_Kodu FROM Personel

In this example, the city, department, and occupation-code fields are selected from the Personel table. This time the three fields are evaluated as if they were a single field.

Output:

Sehir     Bolum                   Meslek_Kodu
İstanbul  Bilgi İşlem Sorumlusu   1234567
Kocaeli   İdari İşler Yöneticisi  2345678
Erzincan  Muhasebe                3456789
İstanbul  Bilgi İşlem Sorumlusu   2345678

In this example, all of the records were returned, because when we treat the Sehir, Bolum, and Meslek_Kodu fields as a single field, the first row, for instance, produces the value "İstanbul Bilgi İşlem Sorumlusu 1234567". Even though İstanbul appears twice in the Sehir field, both records are listed, because the values in their Meslek_Kodu fields differ.

When we run the following statement against the same table: SELECT DISTINCT Bolum, Meslek_Kodu FROM Personel

Output:

Bolum                   Meslek_Kodu
Bilgi İşlem Sorumlusu   1234567
İdari İşler Yöneticisi  2345678
Muhasebe                3456789
Bilgi İşlem Sorumlusu   2345678

As you can see, "Bilgi İşlem Sorumlusu" appears twice. Following the same logic of treating the fields as one combined field, the "Bilgi İşlem Sorumlusu" records carry the values 1234567 and 2345678 in the Meslek_Kodu field, so these two rows are not duplicates of each other.
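The examples above can be replayed with Python's built-in sqlite3; the row counts follow the article's reasoning:

```python
# Reproducing the Personel examples (table and field names from the article).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personel (id INTEGER, Sehir TEXT, Bolum TEXT, Meslek_Kodu TEXT)")
conn.executemany("INSERT INTO Personel VALUES (?, ?, ?, ?)", [
    (1, "İstanbul", "Bilgi İşlem Sorumlusu", "1234567"),
    (2, "Kocaeli", "İdari İşler Yöneticisi", "2345678"),
    (3, "Erzincan", "Muhasebe", "3456789"),
    (4, "İstanbul", "Bilgi İşlem Sorumlusu", "2345678"),
])

# Example 2: (Sehir, Bolum) pairs are deduplicated as a unit -> 3 rows
print(len(conn.execute("SELECT DISTINCT Sehir, Bolum FROM Personel").fetchall()))  # 3

# Example 3: adding Meslek_Kodu makes every combination unique again -> 4 rows
print(len(conn.execute("SELECT DISTINCT Sehir, Bolum, Meslek_Kodu FROM Personel").fetchall()))  # 4
```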

www.sqlkodlari.com

How do the SQL DISTINCT and TOP SELECT Clauses Work Together to Generate Results?


This article is inspired by a series of questions that one of my readers, Nan, recently sent me regarding DISTINCT, TOP, and ORDER BY.

All the examples for this lesson are based on Microsoft SQL Server Management Studio and the AdventureWorks2012 database.  You can get started using these free tools using my Guide Getting Started Using SQL Server.

How Do the SQL TOP and DISTINCT SELECT Modifiers Work Together to Produce Results?

Nan’s Original Question

Here is the question that Nan originally sent me:

I’m a bit confused about SELECT DISTINCT and SELECT.  For example,

SELECT DISTINCT TOP 10 FirstName, LastName FROM Person.Person ORDER BY LastName

Is this looking at distinct first names?  Distinct combined first and last names?  How do we distinguish between the columns used for the distinct evaluation and columns we just want to show in the output?

What about

Select Distinct TOP 10 LastName, FirstName + ' ' + LastName AS FullName FROM Person.Person ORDER BY LastName

I thought everyone would like to know the answer, so I created a blog post.

DISTINCT and TOP – Which is First?

Let’s look at the first statement, whose purpose is to return a unique list of first and last names.

SELECT DISTINCT TOP 10 FirstName, LastName FROM Person.Person ORDER BY LastName;

TOP 10 will return the first ten items from the ordered set, and DISTINCT will remove any duplicates.  The question is which happens first?

  • Is the table sorted by LastName and the top ten items taken, and then duplicate names removed?
  • Or are the duplicates removed, and then the items sorted and the top ten items displayed?

Before we answer this question, keep in mind that DISTINCT operates on all columns and expressions in the SELECT clause.  So in this case the statement returns distinct rows for FirstName and LastName.

Unfortunately, there is no direct way to use DISTINCT on one set of fields and display others.  Once you add columns to the SELECT statement, they come under the influence of the DISTINCT operator.  I say direct, as you could get a distinct list and then use an INNER JOIN to pull in other columns.  There are dangers to doing that, though, as the join may reintroduce duplicates.

Adding a TOP clause to DISTINCT is interesting.  I wasn’t sure what would happen, but I did some experimenting with the AdventureWorks database and found that the order of processing goes something like this:

  1. Select the DISTINCT values from the table and order them.
  2. Select the TOP x rows from the results of step 1 and display them.

If you want to try this yourself, start with

SELECT FirstName, LastName FROM Person.Person ORDER BY LastName

And notice the results.  Keep track of “Kim Abercrombie.”  Notice how there are three entries for her name.

Results sorted by LastName

Now run

SELECT DISTINCT FirstName, LastName FROM Person.Person ORDER BY LastName

And you’ll see that “Kim Abercrombie” is shown only once.

Unique list ordered by LastName

Then run

SELECT DISTINCT TOP 10 FirstName, LastName FROM Person.Person ORDER BY LastName

And you’ll see it returns the first 10 unique first and last names as sorted by LastName.

First 10 unique rows ordered by LastName

If you’re wondering which happens first, the DISTINCT or TOP 10 operations, then compare the results from the last two queries.

Notice that the “DISTINCT TOP 10” query includes the first 10 rows from the “DISTINCT” query.

From this we know a DISTINCT list is created first, and then the TOP 10 items are returned.
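The DISTINCT-before-the-row-limit order can be checked on other engines too; in this sqlite3 sketch (with invented rows), LIMIT likewise applies after DISTINCT and ORDER BY:

```python
# LIMIT takes the first n rows *of the deduplicated, sorted* result (sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (FirstName TEXT, LastName TEXT)")
conn.executemany("INSERT INTO Person VALUES (?, ?)", [
    ("Kim", "Abercrombie"), ("Kim", "Abercrombie"), ("Kim", "Abercrombie"),
    ("Sam", "Abolrous"), ("Hazem", "Abolrous"),
])

rows = conn.execute(
    "SELECT DISTINCT FirstName, LastName FROM Person ORDER BY LastName LIMIT 2"
).fetchall()
# The three duplicate Abercrombie rows collapse to one *before* the limit,
# so the second slot goes to an Abolrous row rather than another duplicate.
print(rows)
```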

Query plan showing order of execution

You can also confirm this by showing the query plan.  To do so, select Query -> Include Actual Query Plan from the menu before executing the query.

The “Stream Aggregate” icon is for the DISTINCT operation and “Top” for the TOP 10 one.

It may seem somewhat counterintuitive to see DISTINCT listed first within the SELECT statement.  Just keep in mind SQL isn’t necessarily processed in the order a human would read it from left to right.

DISTINCT and TOP with SELECT list Expressions

The second portion of Nan’s question related to how expressions are treated with the DISTINCT operator.

Expressions are treated the same as columns regarding DISTINCT and TOP.  Let’s start with a SELECT statement to get the first name as well as the full name, which we create by appending LastName to FirstName.

Also keep in mind that, when using DISTINCT, the ORDER BY items must appear in the select list.  Given this, I have to modify the statement presented in the original question:

SELECT DISTINCT FirstName,
       FirstName + ' ' + LastName AS FullName
FROM Person.Person
ORDER BY LastName

This won’t run, since LastName isn’t in the SELECT list.  Yes, it is part of an expression in the select list, but it’s not there on its own.  It is, however, valid to order by FullName.

We’ll use this ordering in the examples below.

The statement

SELECT FirstName,
       FirstName + ' ' + LastName AS FullName
FROM Person.Person
ORDER BY FirstName + ' ' + LastName

returns 19,972 rows.  When we add DISTINCT

SELECT DISTINCT FirstName,
       FirstName + ' ' + LastName AS FullName
FROM Person.Person
ORDER BY FirstName + ' ' + LastName

then 19,516 rows are returned.  Finally, adding TOP 10 returns the first 10 distinct name combinations.

SELECT DISTINCT TOP 10 FirstName,
       FirstName + ' ' + LastName AS FullName
FROM Person.Person
ORDER BY FirstName + ' ' + LastName

Try running these queries on the AdventureWorks database and you’ll see for yourself that the behavior is the same as what we found when working exclusively with columns.
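The same expression behavior can be checked in sqlite3 with invented data; note that SQLite spells string concatenation `||` where SQL Server uses `+`:

```python
# DISTINCT over a column plus a concatenation expression (sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (FirstName TEXT, LastName TEXT)")
conn.executemany("INSERT INTO Person VALUES (?, ?)", [
    ("Kim", "Abercrombie"), ("Kim", "Abercrombie"), ("Kim", "Walton"),
])

# The duplicate (FirstName, FullName) pair collapses; ordering by the
# FullName alias is valid because it appears in the select list.
rows = conn.execute(
    "SELECT DISTINCT FirstName, FirstName || ' ' || LastName AS FullName "
    "FROM Person ORDER BY FullName"
).fetchall()
print(rows)  # [('Kim', 'Kim Abercrombie'), ('Kim', 'Kim Walton')]
```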

www.essentialsql.com

Finding Distinct Counts | Elasticsearch: The Definitive Guide [2.x]

Finding Distinct Counts

The first approximate aggregation provided by Elasticsearch is the cardinality metric. This provides the cardinality of a field, also called a distinct or unique count. You may be familiar with the SQL version:

SELECT COUNT(DISTINCT color) FROM cars

Distinct counts are a common operation, and answer many fundamental business questions:

  • How many unique visitors have come to my website?
  • How many unique cars have we sold?
  • How many distinct users purchased a product each month?

We can use the cardinality metric to determine the number of car colors being sold at our dealership:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
    "distinct_colors" : {
      "cardinality" : {
        "field" : "color"
      }
    }
  }
}

This returns a minimal response showing that we have sold three different-colored cars:

... "aggregations": { "distinct_colors": { "value": 3 } } ...

We can make our example more useful: how many colors were sold each month? For that metric, we just nest the cardinality metric under a date_histogram:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
    "months" : {
      "date_histogram": {
        "field": "sold",
        "interval": "month"
      },
      "aggs": {
        "distinct_colors" : {
          "cardinality" : {
            "field" : "color"
          }
        }
      }
    }
  }
}

Understanding the Trade-offs

As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm. HLL works by hashing your input and using the bits from the hash to make probabilistic estimations on the cardinality.

You don’t need to understand the technical details (although if you’re interested, the paper is a great read!), but you should be aware of the properties of the algorithm:

  • Configurable precision, which controls memory usage (more precise == more memory).
  • Excellent accuracy on low-cardinality sets.
  • Fixed memory usage. Whether there are thousands or billions of unique values, memory usage depends on only the configured precision.

To configure the precision, you must specify the precision_threshold parameter. This threshold defines the point under which cardinalities are expected to be very close to accurate. Consider this example:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
    "distinct_colors" : {
      "cardinality" : {
        "field" : "color",
        "precision_threshold" : 100
      }
    }
  }
}

precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000.

This example will ensure that fields with 100 or fewer distinct values will be extremely accurate. Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric.

For a given threshold, the HLL data-structure will use about precision_threshold * 8 bytes of memory. So you must balance how much memory you are willing to sacrifice for additional accuracy.

Practically speaking, a threshold of 100 maintains an error under 5% even when counting millions of unique values.

If you want a distinct count, you usually want to query your entire dataset (or nearly all of it). Any operation on all your data needs to execute quickly, for obvious reasons. HyperLogLog is very fast already—it simply hashes your data and does some bit-twiddling.
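The hash-and-bit-twiddle idea can be illustrated with a toy single-register estimator in Python. This is not the HLL++ algorithm Elasticsearch uses, just the underlying intuition: the rarer the leading-zero pattern seen in the hashes, the more distinct values were probably present.

```python
# Toy Flajolet-Martin-style estimator: hash every value, track the maximum
# run of leading zero bits, and estimate the cardinality as 2 ** max_zeros.
import hashlib

def estimate_distinct(values):
    max_zeros = 0
    for v in values:
        # 64-bit deterministic hash of the value
        h = int.from_bytes(hashlib.blake2b(v.encode(), digest_size=8).digest(), "big")
        max_zeros = max(max_zeros, 64 - h.bit_length())
    return 2 ** max_zeros

# 100,000 rows but only 1,000 distinct values:
values = [f"user-{i % 1000}" for i in range(100_000)]
print(estimate_distinct(values))
```

A single register has huge variance (the estimate is always a power of two and can easily be off by several times); HLL tames this by averaging many registers, which is what the `precision_threshold` memory budget pays for.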

But if speed is important to you, we can optimize it a little bit further. Since HLL simply needs the hash of the field, we can precompute that hash at index time. When the query executes, we can skip the hash computation and load the value directly out of fielddata.

Precomputing hashes is useful only on very large and/or high-cardinality fields. Calculating the hash on these fields is non-negligible at query time.

However, numeric fields hash very quickly, and storing the original numeric often requires the same (or less) memory. This is also true on low-cardinality string fields; there are internal optimizations that guarantee that hashes are calculated only once per unique value.

Basically, precomputing hashes is not guaranteed to make all fields faster — only those that have high cardinality and/or large strings. And remember, precomputing simply shifts the cost to index time. You still pay the price; you just choose when to pay it.

To do this, we need to add a new multifield to our data. We’ll delete our index, add a new mapping that includes the hashed field, and then reindex:

DELETE /cars/

PUT /cars/
{
  "mappings": {
    "transactions": {
      "properties": {
        "color": {
          "type": "string",
          "fields": {
            "hash": {
              "type": "murmur3"
            }
          }
        }
      }
    }
  }
}

POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

This multifield is of type murmur3, which is a hashing function.

Now when we run an aggregation, we use the color.hash field instead of the color field:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
    "distinct_colors" : {
      "cardinality" : {
        "field" : "color.hash"
      }
    }
  }
}

Notice that we specify the hashed multifield, rather than the original.

Now the cardinality metric will load the values (the precomputed hashes) from "color.hash" and use those in place of dynamically hashing the original value.

The savings per document is small, but if hashing each field adds 10 nanoseconds and your aggregation touches 100 million documents, that adds 1 second per query. If you find yourself using cardinality across many documents, perform some profiling to see if precomputing hashes makes sense for your deployment.

www.elastic.co

Use Subqueries to Count Distinct 50X Faster

NB: These techniques are universal, but for syntax we chose Postgres. Thanks to the inimitable pgAdminIII for the Explain graphics.

So Useful, Yet So Slow

Count distinct is the bane of SQL analysts, so it was an obvious choice for our first blog post.

First things first: If you have a huge dataset and can tolerate some imprecision, a probabilistic counter like HyperLogLog can be your best bet. (We’ll return to HyperLogLog in a future blog post.) But for a quick, precise answer, some simple subqueries can save you a lot of time.

Let’s start with a simple query we run all the time: Which dashboards do most users visit?

select
  dashboards.name,
  count(distinct time_on_site_logs.user_id)
from time_on_site_logs
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by name
order by count desc

In Periscope, this would give you a graph like this:

For starters, let’s assume the handy indices on user_id and dashboard_id are in place, and there are lots more log lines than dashboards and users.

On just 10 million rows, this query takes 48 seconds. To understand why, let’s consult our handy SQL explain:

It’s slow because the database is iterating over all the logs and all the dashboards, then joining them, then sorting them, all before getting down to the real work of grouping and aggregating.

Aggregate, Then Join

Anything after the group-and-aggregate is going to be a lot cheaper because the data size is much smaller. Since we don’t need dashboards.name in the group-and-aggregate, we can have the database do the aggregation first, before the join:

select
  dashboards.name,
  log_counts.ct
from dashboards
join (
  select
    dashboard_id,
    count(distinct user_id) as ct
  from time_on_site_logs
  group by dashboard_id
) as log_counts on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc

This query runs in 20 seconds, a 2.4X improvement! Once again, our trusty explain will show us why:

As promised, our group-and-aggregate comes before the join. And, as a bonus, we can take advantage of the index on the time_on_site_logs table.

First, Reduce The Data Set

We can do better. By doing the group-and-aggregate over the whole logs table, we made our database process a lot of data unnecessarily. Count distinct builds a hash set for each group — in this case, each dashboard_id — to keep track of which values have been seen in which buckets.

Instead of doing all that work, we can compute the distincts in advance, which only needs one hash set. Then we do a simple aggregation over all of them.

select
  dashboards.name,
  log_counts.ct
from dashboards
join (
  select
    distinct_logs.dashboard_id,
    count(1) as ct
  from (
    select distinct dashboard_id, user_id
    from time_on_site_logs
  ) as distinct_logs
  group by distinct_logs.dashboard_id
) as log_counts on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc

We’ve taken the inner count-distinct-and-group and broken it up into two pieces. The inner piece computes distinct (dashboard_id, user_id) pairs. The second piece runs a simple, speedy group-and-count over them. As always, the join is last.
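The rewrite can be sanity-checked on any engine; here is a sketch with sqlite3 and a tiny invented log table, confirming the naive query and the subquery form agree:

```python
# Naive count-distinct-with-join vs. the distinct-subquery rewrite (sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dashboards (id INTEGER, name TEXT);
CREATE TABLE time_on_site_logs (dashboard_id INTEGER, user_id INTEGER);
INSERT INTO dashboards VALUES (1, 'revenue'), (2, 'signups');
INSERT INTO time_on_site_logs VALUES (1, 10), (1, 10), (1, 11), (2, 10), (2, 10);
""")

naive = conn.execute("""
    SELECT d.name, COUNT(DISTINCT l.user_id) AS ct
    FROM time_on_site_logs l JOIN dashboards d ON l.dashboard_id = d.id
    GROUP BY d.name ORDER BY ct DESC
""").fetchall()

rewritten = conn.execute("""
    SELECT d.name, log_counts.ct
    FROM dashboards d
    JOIN (
        SELECT dashboard_id, COUNT(1) AS ct
        FROM (SELECT DISTINCT dashboard_id, user_id FROM time_on_site_logs)
        GROUP BY dashboard_id
    ) AS log_counts ON log_counts.dashboard_id = d.id
    ORDER BY log_counts.ct DESC
""").fetchall()

print(naive)  # [('revenue', 2), ('signups', 1)]
assert naive == rewritten
```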

And now for the big reveal: This sucker takes 0.7 seconds! That’s a 28X speedup over the previous query, and a 68X speedup over the original query.

As always, data size and shape matters a lot. These examples benefit a lot from a relatively low cardinality. There are a small number of distinct (user_id, dashboard_id) pairs compared to the total amount of data. The more unique pairs there are — the more data rows are unique snowflakes that must be grouped and counted — the less free lunch there will be.

Next time count distinct is taking all day, try a few subqueries to lighten the load.

Who Are You Guys, Anyway?

We make Periscope, a tool that makes SQL data analysis really fast. We’ll be using this space to share the algorithms and techniques we’ve baked into our product.

You can sign up on our homepage to be notified as we take on new customers.

Update: More Data!

See our follow-up post, Count Distinct Compared on Top 4 SQL Databases to see how these improvements hold up on MySQL, Oracle and SQL Server.

Even More Updates: Count Distinct Even Faster with HyperLogLog!

See our more recent follow-up, HyperLogLog in Pure SQL to learn how to speed up count distinct even further with probabilistic counting and parallelism!

www.periscopedata.com

