I am trying to build a full-text search for address autocompletion with Django (v2.1) and Postgres (9.5), but the performance is currently not suitable for an autocomplete, and I don't understand the logic behind the performance results I am getting. For information, the table is quite large, with 14 million rows.
My model:
from django.db import models
from postgres_copy import CopyManager
from django.contrib.postgres.indexes import GinIndex

class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)

    objects = CopyManager()

    class Meta:
        indexes = [
            GinIndex(fields=['number', 'street', 'unit', 'city', 'region', 'postcode'], name='search_idx')
        ]
I created a small test to check performance depending on the number of words in the search:
search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode')
searchtext1 = "north"
searchtext2 = "north bondi"
searchtext3 = "north bondi blair"
searchtext4 = "north bondi blair street 2026"
print('Test1: 1 word')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext1)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
#print(AddressesAustralia.objects.annotate(search=search_vector).explain(verbose=True))
print('Test2: 2 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext2)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
print('Test3: 3 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext3)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
print('Test4: 5 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext4)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')
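(Side note: the four test blocks above repeat the same timing pattern; a small hypothetical helper like this would keep the benchmark shorter — `timed` is my own name, not part of any library:)

```python
import time
from datetime import timedelta

def timed(label, fn):
    # Run fn(), print the label and the elapsed time, and return fn's result.
    start = time.time()
    result = fn()
    print(label)
    print(timedelta(seconds=time.time() - start))
    return result

# Usage with the queryset from the tests above (names assumed in scope):
# timed('Test1: 1 word',
#       lambda: AddressesAustralia.objects
#               .annotate(search=search_vector)
#               .filter(search=searchtext1)[:10])
```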
I get the following results, which look very fast:
Test1: 1 word
0:00:00.001841
Test2: 2 words
0:00:00.001422
Test3: 3 words
0:00:00.001574
Test4: 5 words
0:00:00.001360
However, if I uncomment the print(len(result)) lines, I get the following results:
Test1: 1 word
10
0:00:00.046392
Test2: 2 words
10
0:00:06.544732
Test3: 3 words
10
0:01:12.367157
Test4: 5 words
10
0:01:17.786596
This is obviously not suitable for an autocomplete feature.
Can someone explain why the execution time is so much longer when performing an operation on the queryset result? The database retrieval always seems fast, but going through the results takes time, which makes no sense to me since I limit the results to 10, so the returned queryset is always the same size.
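From what I understand so far, Django querysets are lazy: annotate(), filter() and slicing only build the SQL, and the query is only executed once the results are actually consumed, e.g. by len(). A minimal pure-Python sketch of that behaviour (LazyQuery is a made-up stand-in to illustrate the idea, not Django code):

```python
import time

class LazyQuery:
    """Hypothetical stand-in for a Django QuerySet: building and
    slicing it is free; work happens only on evaluation."""
    def __init__(self, cost_seconds):
        self.cost_seconds = cost_seconds
        self.executed = False

    def __getitem__(self, item):
        # Slicing only records a LIMIT; no query runs yet.
        return self

    def __len__(self):
        # len() forces evaluation, i.e. the expensive database work.
        self.executed = True
        time.sleep(self.cost_seconds)
        return 10

result = LazyQuery(0.01)[:10]   # instant: nothing has been executed yet
assert not result.executed
n = len(result)                 # here the (slow) "query" actually runs
assert result.executed and n == 10
```

This would explain why the timings without len() are all around a millisecond: they only measure query construction, not execution.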
Also, although I created a GIN index, this index does not seem to be used. It looks like it was created correctly:
=# \d public_data_au_addresses
                               Table "public.public_data_au_addresses"
   Column    |           Type           | Collation | Nullable |                       Default
-------------+--------------------------+-----------+----------+------------------------------------------------------
 id          | integer                  |           | not null | nextval('public_data_au_addresses_id_seq'::regclass)
 date_update | timestamp with time zone |           |          |
 longitude   | numeric(9,6)             |           |          |
 latitude    | numeric(9,6)             |           |          |
 number      | character varying(16)    |           |          |
 street      | character varying(60)    |           |          |
 unit        | character varying(50)    |           |          |
 city        | character varying(50)    |           |          |
 district    | character varying(10)    |           |          |
 region      | character varying(5)     |           |          |
 postcode    | character varying(5)     |           |          |
 addr_id     | character varying(20)    |           | not null |
 addr_hash   | character varying(20)    |           | not null |
Indexes:
"public_data_au_addresses_pkey" PRIMARY KEY, btree (id)
"public_data_au_addresses_addr_hash_key" UNIQUE CONSTRAINT, btree (addr_hash)
"public_data_au_addresses_addr_id_key" UNIQUE CONSTRAINT, btree (addr_id)
"public_data_au_addresses_addr_hash_e8c67a89_like" btree (addr_hash varchar_pattern_ops)
"public_data_au_addresses_addr_id_9ee00c76_like" btree (addr_id varchar_pattern_ops)
"search_idx" gin (number, street, unit, city, region, postcode)
When I run the explain() method on the query, I get:
Test1: 1 word
Limit (cost=0.00..1110.60 rows=10 width=140)
-> Seq Scan on public_data_au_addresses (cost=0.00..8081472.41 rows=72767 width=140)
Filter: (to_tsvector((((((((((((COALESCE(number, ''::character varying))::text || ' '::text) || (COALESCE(street, ''::character varying))::text) || ' '::text) || (COALESCE(unit, ''::character varying))::text) || ' '::text) || (COALESCE(city, ''::character varying))::text) || ' '::text) || (COALESCE(region, ''::character varying))::text) || ' '::text) || (COALESCE(postcode, ''::character varying))::text)) @@ plainto_tsquery('north'::text))
So it still shows a sequential scan instead of using an index scan. Does anyone know how to fix or debug this?
Is a GIN index even still effective with so many fields to search on, anyway?
And finally, does anyone know how to improve the code to further improve performance?
Thanks!
Regards
Update
I tried creating a search vector as suggested by Paolo below, but the search still seems to be sequential and does not leverage the GIN index.
class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector(
            'number', 'street', 'unit', 'city', 'region', 'postcode',
            config='english'
        ))
class AddressesAustralia(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]
Then I updated the search_vector field using the update command:
AddressesAustralia.objects.update_search_vector()
Then I ran a query to test it, using the same search vector:
class Command(BaseCommand):

    def handle(self, *args, **options):
        search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english')
        searchtext1 = "north"

        print('Test1: 1 word')
        start_time = time.time()
        result = AddressesAustralia.objects.filter(search_vector=searchtext1)[:10].explain(verbose=True)
        print(len(result))
        print(result)
        time_exec = str(timedelta(seconds=time.time() - start_time))
        print(time_exec)
I get the following results, still showing a sequential scan:
Test1: 1 word
532
Limit (cost=0.00..120.89 rows=10 width=235)
Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
-> Seq Scan on public.public_data_au_addressesaustralia (cost=0.00..5061078.91 rows=418651 width=235)
Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
Filter: (public_data_au_addressesaustralia.search_vector @@ plainto_tsquery('north'::text))
0:00:00.075262
I also tried:
But the result is still the same. Any idea what I am doing wrong or how to troubleshoot further?
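One debugging step I plan to try (a standard Postgres technique, not something I have confirmed on this table yet) is to disable sequential scans for the session and re-run the EXPLAIN; if the planner then switches to a Bitmap Index Scan on search_vector_idx, the index is usable and the seq scan above was just the planner's cost-based choice:

```sql
-- Session-only setting: steer the planner away from sequential scans.
SET enable_seqscan = off;

EXPLAIN SELECT *
FROM public_data_au_addressesaustralia
WHERE search_vector @@ plainto_tsquery('north')
LIMIT 10;
```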