Log all SQL queries in Django.

As described in this part, Django can log all queries when DEBUG is True.

DEBUG is a global setting; turning it on also affects many other parts of the framework, so it is hard to enable in a production environment.

So how do we get all the SQL queries we want in production? MySQL or a database middleware can log them, but because of all kinds of limitations it is often better to log them ourselves.

Here are some methods to log queries without activating the DEBUG option.

Logging Methods

When we set DEBUG to True, there are two places to get the queries.
One is the log file, where the output looks like this:

DEBUG 2020-02-21 10:35:12,214 33337 [utils.py:91] (0.036) SELECT @@SQL_AUTO_IS_NULL; args=None
DEBUG 2020-02-21 10:35:12,251 33337 [utils.py:91] (0.036) SELECT VERSION(); args=None
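To get that output, route the 'django.db.backends' logger to a handler at DEBUG level. A minimal sketch of such a LOGGING setting (the handler and format here are only an example, not the exact configuration used above):

# settings.py
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        # SQL is only sent to this logger while queries are being logged,
        # i.e. DEBUG=True or force_debug_cursor is set (see below).
        'django.db.backends': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}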

The other is the database connection's 'queries' property (code1, doc1). It stores all queries executed during a request and is cleared automatically here. So we can add a middleware that traverses 'queries' and logs every SQL statement.
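For example, a minimal sketch of such a middleware (the logger name, class name, and log format are just placeholders):

import logging

from django.db import connection

logger = logging.getLogger(__name__)

class QueryLogMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        # Each entry in connection.queries is a dict with 'sql' and 'time' keys.
        for query in connection.queries:
            logger.info("(%s) %s", query['time'], query['sql'])
        return response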

Both of these are enabled by setting DEBUG to True.

The following part discusses how to enable the database logging function without setting DEBUG.

Custom Methods

Enable the database logging function.

Reading the code, we find that the SQL logging/recording functions are controlled by the 'queries_logged' property, which is determined by settings.DEBUG and self.force_debug_cursor. The latter is hard-coded: there is no function argument or settings option to set it, not even in the subclasses, i.e. the concrete database backends.

# https://github.com/django/django/blob/3.0.3/django/db/backends/base/base.py#L150

@property
def queries_logged(self):
    return self.force_debug_cursor or settings.DEBUG

Can we set self.force_debug_cursor to True ourselves? Yes, we can.

  1. self.force_debug_cursor is a field on the BaseDatabaseWrapper object, not on the class. BaseDatabaseWrapper is the base class of every backend's DatabaseWrapper, which is the database connection class we use in code.
  2. Every thread has its own DatabaseWrapper (database connection) object; it is not a singleton. Here is the related code:
class ConnectionHandler:

    ...

    def __getitem__(self, alias):
        if hasattr(self._connections, alias):
            return getattr(self._connections, alias)

        self.ensure_defaults(alias)
        self.prepare_test_settings(alias)
        db = self.databases[alias]
        backend = load_backend(db['ENGINE'])
        conn = backend.DatabaseWrapper(db, alias)
        setattr(self._connections, alias, conn)
        return conn

ConnectionHandler is the class of django.db.connections, and self._connections is a thread-local variable. So we can set force_debug_cursor to True in a middleware, for each thread's connection object, like this:

from django.db import connections

class SimpleMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Enable database logging.
        connections['default'].force_debug_cursor = True
        response = self.get_response(request)
        return response
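Then register the middleware in the settings (the dotted path below is just a placeholder for wherever the class lives):

MIDDLEWARE = [
    # ... other middleware ...
    'common.middleware.SimpleMiddleware',  # hypothetical module path
]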

Or, if there are other thread-scoped initialization points, we can do it there as well. If you know of one, please tell me.

Write a custom database backend.

Since there is no built-in way to set force_debug_cursor, how about adding one ourselves?

Yes, we can. Write a custom database backend:

#  $ tree custom_backend/
# custom_backend/
# ├── base.py
# └── __init__.py

# file: custom_backend/base.py
from django.db.backends.mysql import base

class DatabaseWrapper(base.DatabaseWrapper):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.force_debug_cursor = True

In the settings, use this module as the engine to access the MySQL server:

DATABASES = {
    'default': {
        'ENGINE': 'common.custom_backend',
        'NAME': 'DATABASE-NAME',
        'USER': 'ACCOUNT',
        ...
    }
}

The same approach works with other database backends.

Hook an existing backend.

With the previous two methods, we enable Django's own SQL logging into log files and the connection.queries variable. This method, instead of enabling that machinery, logs the queries ourselves.

From the documentation and code, we can see that internally the backend uses a cursor object to execute queries. code1 code2 doc1 doc2

We can also see that Django implements the logging function through different wrapper classes for the cursor:

# https://github.com/django/django/blob/3.0.3/django/db/backends/base/base.py#L224

def _prepare_cursor(self, cursor):
    """
    Validate the connection is usable and perform database cursor wrapping.
    """
    self.validate_thread_sharing()
    if self.queries_logged:
        wrapped_cursor = self.make_debug_cursor(cursor)
    else:
        wrapped_cursor = self.make_cursor(cursor)
    return wrapped_cursor

So we can implement our own cursor wrapper class to record anything we want. The next question is how to make Django use our custom cursor wrapper class.

One way is to implement a custom database backend and override the cursor creation function, like the MySQL backend does:

# https://github.com/django/django/blob/3.0.3/django/db/backends/mysql/base.py#L252

@async_unsafe
def create_cursor(self, name=None):
    cursor = self.connection.cursor()
    return CursorWrapper(cursor)
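For illustration, a rough sketch of a custom backend with its own logging cursor wrapper (the class names, logger, and timing format are my own, not taken from Django):

# file: custom_backend/base.py (hypothetical sketch)
import logging
import time

from django.db.backends.mysql import base

logger = logging.getLogger(__name__)

class LoggingCursorWrapper:
    def __init__(self, cursor):
        self.cursor = cursor

    def execute(self, sql, params=None):
        start = time.monotonic()
        try:
            return self.cursor.execute(sql, params)
        finally:
            logger.info("(%.3f) %s; args=%s",
                        time.monotonic() - start, sql, params)

    def __getattr__(self, attr):
        # Delegate everything else (executemany, fetchone, close, ...)
        # to the real cursor.
        return getattr(self.cursor, attr)

class DatabaseWrapper(base.DatabaseWrapper):
    def create_cursor(self, name=None):
        cursor = super().create_cursor(name)
        return LoggingCursorWrapper(cursor)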

Another way is to hook the django.db.backends.utils.CursorWrapper class at process scope. Here is an example from django-extensions: link
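A minimal sketch of that kind of process-wide hook, assuming we only care about execute() (executemany() can be patched the same way); the logger and patch location are placeholders:

import logging

import django.db.backends.utils as db_utils

logger = logging.getLogger(__name__)

_orig_execute = db_utils.CursorWrapper.execute

def _logging_execute(self, sql, params=None):
    logger.info("SQL: %s; args=%s", sql, params)
    return _orig_execute(self, sql, params)

# Patch once at process startup, e.g. in AppConfig.ready().
db_utils.CursorWrapper.execute = _logging_execute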

Keep environment variables with sudo command.

We have a complicated shell script that needs root permission to finish some operations. The script may be executed by anyone, but we don't want to leak the root password, and we don't want everyone to be able to do anything root can do.
So we add this script to /etc/sudoers with NOPASSWD.
Ref:
How do I run specific sudo commands without a password?

But the script loses some environment variables when executed via sudo. We can use -E on the command line to keep the current session's environment variables, or configure the Defaults !env_reset option in /etc/sudoers.
But the PYTHONPATH environment variable is still missing. We can use Defaults env_keep+=PYTHONPATH to keep PYTHONPATH explicitly.
Ref:
Avoid using env_reset in sudoers file

Merely commenting out Defaults env_reset is useless, because resetting the environment is the default behavior.
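Putting it together, the relevant /etc/sudoers entries look roughly like this (the user name and script path below are placeholders):

# Edit with visudo.
Defaults env_keep += "PYTHONPATH"
ops ALL=(root) NOPASSWD: /usr/local/bin/maintenance.sh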

We still want to know why sudo keeps many custom environment variables but not PYTHONPATH.
After reading the code, it turns out PYTHONPATH is on sudo's list of "bad" variables. Even when we explicitly tell sudo to keep the environment, these bad variables are still removed, unless we list them explicitly with Defaults env_keep+=PYTHONPATH.

/*
* Merge another environment with our private copy.
* Only overwrite an existing variable if it is not
* being preserved from the user's environment.
* Returns true on success or false on failure.
*/
bool
env_merge(char * const envp[])
{
    ...

    for (ep = envp; *ep != NULL; ep++) {
        /* XXX - avoid checking value here, should only check name */
        bool overwrite = def_env_reset ? !env_should_keep(*ep) : env_should_delete(*ep);

        ...
}

In env_should_keep(), sudo checks whether the variable is in its bad-variables list.

The following is the list of these bad variables:

/*
* Default table of "bad" variables to remove from the environment.
* XXX - how to omit TERMCAP if it starts with '/'?
*/
static const char *initial_badenv_table[] = {
"IFS",
"CDPATH",
"LOCALDOMAIN",
"RES_OPTIONS",
"HOSTALIASES",
"NLSPATH",
"PATH_LOCALE",
"LD_*",
"_RLD*",
#ifdef __hpux
"SHLIB_PATH",
#endif /* __hpux */
#ifdef _AIX
"LDR_*",
"LIBPATH",
"AUTHSTATE",
#endif
#ifdef __APPLE__
"DYLD_*",
#endif
#ifdef HAVE_KERB5
"KRB5_CONFIG*",
"KRB5_KTNAME",
#endif /* HAVE_KERB5 */
#ifdef HAVE_SECURID
"VAR_ACE",
"USR_ACE",
"DLC_ACE",
#endif /* HAVE_SECURID */
"TERMINFO", /* terminfo, exclusive path to terminfo files */
"TERMINFO_DIRS", /* terminfo, path(s) to terminfo files */
"TERMPATH", /* termcap, path(s) to termcap files */
"TERMCAP", /* XXX - only if it starts with '/' */
"ENV", /* ksh, file to source before script runs */
"BASH_ENV", /* bash, file to source before script runs */
"PS4", /* bash, prefix for lines in xtrace mode */
"GLOBIGNORE", /* bash, globbing patterns to ignore */
"BASHOPTS", /* bash, initial "shopt -s" options */
"SHELLOPTS", /* bash, initial "set -o" options */
"JAVA_TOOL_OPTIONS", /* java, extra command line options */
"PERLIO_DEBUG ", /* perl, debugging output file */
"PERLLIB", /* perl, search path for modules/includes */
"PERL5LIB", /* perl 5, search path for modules/includes */
"PERL5OPT", /* perl 5, extra command line options */
"PERL5DB", /* perl 5, command used to load debugger */
"FPATH", /* ksh, search path for functions */
"NULLCMD", /* zsh, command for null file redirection */
"READNULLCMD", /* zsh, command for null file redirection */
"ZDOTDIR", /* zsh, search path for dot files */
"TMPPREFIX", /* zsh, prefix for temporary files */
"PYTHONHOME", /* python, module search path */
"PYTHONPATH", /* python, search path */
"PYTHONINSPECT", /* python, allow inspection */
"PYTHONUSERBASE", /* python, per user site-packages directory */
"RUBYLIB", /* ruby, library load path */
"RUBYOPT", /* ruby, extra command line options */
"*=()*", /* bash functions */
NULL
};

Do Pagination With Table/Database Sharding

I won't go into why we shard tables and databases here; instead, let's talk about how to do paginated queries in that scenario.

Existing schemes

The article 业界难题-“跨库分页”的四种方案 ("An industry-wide hard problem: four schemes for cross-database pagination") proposes four schemes. Let me first analyze them one by one:

Scheme 1: the global-view method

This method returns absolutely accurate results, but as the article says, its drawback is that performance degrades as you page deeper and deeper.

Scheme 2: business compromise, forbid jumping to arbitrary pages

This scheme is also accurate, and every query has stable performance. The drawback is that users cannot jump to an arbitrary page.
But consider the user's motivation for jumping to an arbitrary page: can we satisfy that need through query conditions instead? For example, the displayed items are sorted by time in ascending order, the items currently shown are from 2019/08/01, and the user wants to see the data of 2019/08/10; they estimate it is roughly on page 10 and want to jump directly to page 10. Couldn't we meet this need by providing a date-range filter?

Overall this pagination style is the best; combined with some query features it can cover all kinds of use cases.

Scheme 3: business compromise, allow fuzzy data

If the business side can accept it, this method is fine. But pay close attention to its precondition: for every non-partition-key attribute, the data distribution and its statistical properties must be the same across all shards.
For example, suppose the business sorts items by creation time and the data is split into two tables A and B: table A gets 10 new records per minute while table B gets only 2 per minute. Then the larger the page number, the more obvious the difference in creation time between the items returned from the two tables. If a user wants to jump to some page to find the records written on a certain day, how are they supposed to estimate which page number to jump to?

Here "fuzzy" means: the sort order is not entirely accurate, but no items are missing and none are duplicated!

Scheme 4: the ultimate weapon, the secondary-query method

Conclusion first: this scheme is not precise!

See the original article for the concrete steps. Here is an example showing why the result is inaccurate.
Suppose there are two tables A and B with a column named priority. Table A has 10,000 rows with priority values in [1, 10000]; table B has 10,000 rows with priority values in [10001, 20000]; assume priority is unique.
Now we want to sort by priority and paginate with 10 records per page. Following the steps in the original article, let's work out the elements of page 1000.

Step 1:
select * from A(B) order by priority offset 5000 limit 5
Priority range from A: [5001, 5005]
Priority range from B: [15001, 15005]

Steps 2 and 3:
select * from A where priority between 5001 and 5005
select * from B where priority between 5001 and 15005

The problem here: the query on table B returns a huge amount of data, 5005 rows!

Steps 4 and 5:
priority == 5001 has a global offset of 5000. The two result sets returned above are then merged and sorted, and the records at global offset 10000 come out as [14996, 15005]. But looking at the data set, the page should actually be [10001, 10010].
The reason is that only the result sets from steps 2 and 3 are merged and sorted, ignoring the influence of all the other rows.
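A quick Python simulation of the steps above reproduces these numbers (a toy check with in-memory lists, not real tables):

# Table A holds priority 1..10000, table B holds 10001..20000.
A = list(range(1, 10001))
B = list(range(10001, 20001))

# Step 1: offset 5000 limit 5 on each shard.
part_a = A[5000:5005]      # [5001 .. 5005]
part_b = B[5000:5005]      # [15001 .. 15005]

# Steps 2-3: re-query each shard between the global minimum (5001)
# and that shard's own maximum.
rows_a = [p for p in A if 5001 <= p <= 5005]       # 5 rows
rows_b = [p for p in B if 5001 <= p <= 15005]      # 5005 rows

# Steps 4-5: priority 5001 has global offset 5000, so take the rows
# at global offset 10000, limit 10, from the merged result.
merged = sorted(rows_a + rows_b)
page = merged[10000 - 5000 : 10000 - 5000 + 10]
print(page)   # [14996 .. 15005], not the correct [10001 .. 10010]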

Another question: is the data set returned for each page contiguous? Will data be missed?
The answer: no, it is not contiguous; data will be missed and duplicates will appear.
Think about it by contradiction: priority is unique, and there are 20,000 records with values in [1, 20000].
With 10 items per page there are 2,000 pages in total. Under this scheme, the priority range of page 1000 is [14996, 15005].
Then pages 1 through 999 would have to display the priority range [1, 14995], which is more records than those pages can hold, so some items must be missing.
Pages 1001 through 2000 would have to display the priority range [15006, 20000], which is fewer records than those pages can hold, so some items must be duplicated.

Our scheme

Our project currently runs into a similar problem (again, I won't explain why we shard). One extra issue is that after sharding, the data distribution across tables is uneven; we also shard across databases, and some of those databases have particularly poor network conditions, with low bandwidth and high latency.

Our idea is based on counting sort. As an example, again assume an integer column named priority.

Step 1: group and count the records by floor(priority / VALUE_LENGTH), which gives a result like the following:

priority range                        count in table A    count in table B
[0*VALUE_LENGTH, 1*VALUE_LENGTH)      10                  20
[1*VALUE_LENGTH, 2*VALUE_LENGTH)      4                   0
[2*VALUE_LENGTH, 3*VALUE_LENGTH)      14                  11
...                                   ...                 ...

Step 2: from the per-bucket counts in step 1, we can tell which range of priority values the elements [offset, offset+limit) fall into. Then run select id, priority from A(B) where priority between RANGE_BEGIN and RANGE_END against each table, merge and sort the result sets, and pick the elements inside the page (a sketch follows below).
For example, with the counts in the table above and 10 records per page, to query page 4 we know the elements of offset 30 limit 10 have priority values in [1*VALUE_LENGTH, 3*VALUE_LENGTH); we then query each table for records in that range.

Step 3: fetch the complete original rows from each table by id.

Each table needs three queries in total. The result of the step-1 query can be cached. If the range length is small enough, the per-table result sets in step 2 will not be large either, so we can fetch the complete rows directly and skip the step-3 lookup by id.
Overall, this avoids both repeated query round trips and oversized result sets.
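A minimal Python sketch of step 2 under these assumptions (VALUE_LENGTH, the aggregated bucket counts from step 1, and the query_shard() helper returning (id, priority) pairs are all hypothetical names, not from our real code):

# Toy sketch of the counting-sort style pagination described above.
VALUE_LENGTH = 1000

def page_value_range(bucket_counts, offset, limit):
    """Return [range_begin, range_end) of priority values that covers
    global rows [offset, offset + limit)."""
    seen = 0
    begin_bucket = end_bucket = None
    for bucket in sorted(bucket_counts):
        if begin_bucket is None and seen + bucket_counts[bucket] > offset:
            begin_bucket = bucket
        seen += bucket_counts[bucket]
        if seen >= offset + limit:
            end_bucket = bucket
            break
    if end_bucket is None:          # the page runs past the last bucket
        end_bucket = max(bucket_counts)
    return begin_bucket * VALUE_LENGTH, (end_bucket + 1) * VALUE_LENGTH

def fetch_page(shards, bucket_counts, offset, limit, query_shard):
    begin, end = page_value_range(bucket_counts, offset, limit)
    rows = []
    for shard in shards:                      # step 2: one range query per shard
        rows.extend(query_shard(shard, begin, end))
    rows.sort(key=lambda row: row[1])         # merge-sort by priority
    # Rows below `begin` are exactly the counts of the lower buckets.
    skipped = sum(count for bucket, count in bucket_counts.items()
                  if (bucket + 1) * VALUE_LENGTH <= begin)
    return rows[offset - skipped : offset - skipped + limit]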

This approach is also quite similar to the architecture used in search engines.

There are also some other schemes, for example: shard the data across tables/databases, but keep the index of all the data in a single table; at query time, get the record ids from the index table first, then fetch the corresponding rows from each database/table.

Why does squid print 'TCP_REFRESH_UNMODIFIED_ABORTED'?

We have a storage system that doesn't provide an HTTP API to access its content, so I wrote a proxy service using Django to map HTTP requests to the storage system's protocol. The proxy builds the HTTP headers, reads content from the storage system, and writes it back to the client.
For some reasons I also need a cache layer in front of the proxy service, and chose squid for that role.

I insert a 'Cache-Control' header into the response; when the representation expires, squid refreshes it. When the proxy receives a request, it checks the 'If-Modified-Since' header and returns a 304 status code if the file hasn't changed.

When the representation has expired in squid and a client wants to read the data, squid makes a request to the proxy (the origin server) and gets a 304. But the client gets an empty body! And squid prints a TCP_REFRESH_UNMODIFIED_ABORTED code in its log. Yet the next request gets the complete data, until the representation expires again!

According to squid's manual and code, TCP_REFRESH_UNMODIFIED_ABORTED can be split into two parts: TCP_REFRESH_UNMODIFIED and _ABORTED. TCP_REFRESH_UNMODIFIED means the cached representation is still valid; _ABORTED means writing it back to the client failed. The transmission fails because squid does not write out the full data of the representation.

The proxy service is written with the Django framework and returns a django.http.HttpResponseNotModified object when the file hasn't changed. When Django sends a response, it always adds a 'Content-Length' header with the length of the content, which in this case is ZERO. When squid gets the 304 response, it knows the local representation is valid and sends the first 'Content-Length' bytes of it to the client, which is zero bytes. When the send finishes, squid finds that the number of bytes sent doesn't match the length of the local representation, so it appends '_ABORTED' to the status.
The Django code that adds the 'Content-Length' header is here.

According to rfc7232#section-4.1, we should not send 'Content-Length' here and don't need it at all. But adding the 'Content-Length' header is Django's common mechanism, with no option to turn it off. For now, to solve this problem, we set 'Content-Length' ourselves to the representation's actual length.
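A rough sketch of that workaround in the view, assuming a hypothetical storage client exposing mtime(), size(), and read() (the date comparison is simplified to string equality):

from django.http import HttpResponse, HttpResponseNotModified

def serve_file(request, path, storage):
    # `storage` is a hypothetical client for the backend storage system.
    if_modified_since = request.headers.get('If-Modified-Since')
    if if_modified_since and if_modified_since == storage.mtime(path):
        response = HttpResponseNotModified()
        # Advertise the real size of the representation instead of
        # Django's default Content-Length of 0, so squid's length
        # check passes when it serves the cached body.
        response['Content-Length'] = str(storage.size(path))
        return response

    response = HttpResponse(storage.read(path))
    response['Last-Modified'] = storage.mtime(path)
    response['Cache-Control'] = 'max-age=3600'  # example value
    return response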

Why uploading large file using Requests is very slow?

Recently I wrote a Python script that uploads files using the Requests library. The script needs about 7 minutes to upload a file of about 3GB, while curl takes less than a minute.
According to the server-side log, the time between the server accepting the socket and closing it is less than a minute even when we use the Python script.

So where does the Requests library spend its time?
After reading Requests's code, I found that the library reads the full file content into memory and encodes it in the multipart/form-data MIME format.
The function _encode_files processes file-upload requests. At line:159 it reads the full content of the file, then calls the encode_multipart_formdata() function to encode the data at line:169.
Only after this can the library calculate the length of the request body and add the 'Content-Length' header.

So why is curl's file upload so fast?
An HTTP client needs to send the 'Content-Length' header before sending the body, so it must calculate the body size first. When uploading a file, loading the whole file into memory, encoding the body, and then measuring its size is the straightforward solution.

So is the time difference just caused by the performance difference between Python and C?

After reading curl's code, I found that language performance is not the only reason.

When curl builds a request to upload a file, it builds a mimepart structure to describe the file. It does not start reading the file content until it begins to send the HTTP body; while sending the body, curl calls the structure's read callback to pull the data. The related code is here.

And when curl calculates the HTTP body size before sending the body, it only needs the file size: the length of the encoded content is unrelated to the file content itself. The related code is here.

In the end, I refactored my script with pycurl, and its performance is the same as the curl command.
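For reference, a minimal pycurl sketch of such an upload (the URL, form field name, and file path are placeholders, not my real script):

import pycurl

c = pycurl.Curl()
c.setopt(c.URL, 'http://example.com/upload')
# libcurl streams the file from disk instead of loading it into memory,
# and computes Content-Length from the file size alone.
c.setopt(c.HTTPPOST, [
    ('file', (c.FORM_FILE, '/path/to/large.file')),
])
c.perform()
c.close()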