使用一组字符而不是一个字符的序列对齐算法

2024-03-21

Summary:

我从一些有关对齐算法的细节开始，最后我提出了我的问题。如果您了解对齐算法，请从头开始。

考虑我们有两个字符串，例如：

ACCGAATCGA
ACCGGTATTAAC

有一些算法，例如：史密斯-沃特曼 https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm Or 尼德曼-温施 https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm，对齐这两个序列并创建一个矩阵。查看以下部分的结果：

Smith-Waterman Matrix
§   §   A   C   C   G   A   A   T   C   G   A   
§   0   0   0   0   0   0   0   0   0   0   0   
A   0   4   0   0   0   4   4   0   0   0   4   
C   0   0   13  9   4   0   4   3   9   4   0   
C   0   0   9   22  17  12  7   3   12  7   4   
G   0   0   4   17  28  23  18  13  8   18  13  
G   0   0   0   12  23  28  23  18  13  14  18  
T   0   0   0   7   18  23  28  28  23  18  14  
A   0   4   0   2   13  22  27  28  28  23  22  
T   0   0   3   0   8   17  22  32  27  26  23  
T   0   0   0   2   3   12  17  27  31  26  26  
A   0   4   0   0   2   7   16  22  27  31  30  
A   0   4   4   0   0   6   11  17  22  27  35  
C   0   0   13  13  8   3   6   12  26  22  30  

Optimal Alignments
A   C   C   G   A   -   A   T   C   G   A   
A   C   C   G   G   A   A   T   T   A   A

问题：

我的问题很简单，但也许答案并不像看起来那么容易。我想将一组字符用作单个字符，例如：[A0][C0][A1][B1]。但在这些算法中，我们必须使用单个字符。我们怎样才能做到这一点？

附：考虑我们有这个序列：#read #write #add #write。然后我将其转换为类似的内容：#read to A .... #write to B.... #add to C。然后我的序列变为：ABCB。但我有很多不同的词开头#。而且 ASCII 表不足以全部转换。然后我需要更多的角色。唯一的方法是使用类似的东西[A0] ... [Z9]对于每个单词。或者使用数字。

P.S：Smith-Waterman 的一些示例代码存在于此link http://www.vivid-abstractions.net/logical/programming/smith-waterman-algorithm-program-and-source-code/

PS：有另一个帖子 https://stackoverflow.com/questions/29847642/sequence-alignment-with-one-to-many-characters他们想要类似的东西，但我想要的是不同的。在这个问题中，我们有一组以 a 开头的字符[并以]。并且不需要使用像这样的语义ee等于i.

我适应了这个Python实现 https://github.com/alevchuk/pairwise-alignment-in-pythonSmith-Waterman 和 Needleman-Wunsch 算法（GPL 版本 3 许可）支持具有多个字符组的序列：

#This software is a free software. Thus, it is licensed under GNU General Public License.
#Python implementation to Smith-Waterman Algorithm for Homework 1 of Bioinformatics class.
#Forrest Bao, Sept. 26 <http://fsbao.net> <forrest.bao aT gmail.com>

# zeros() was origianlly from NumPy.
# This version is implemented by alevchuk 2011-04-10
def zeros(shape):
    retval = []
    for x in range(shape[0]):
        retval.append([])
        for y in range(shape[1]):
            retval[-1].append(0)
    return retval

match_award      = 10
mismatch_penalty = -5
gap_penalty      = -5 # both for opening and extanding
gap = '----' # should be as long as your group of characters
space = '    ' # should be as long as your group of characters

def match_score(alpha, beta):
    if alpha == beta:
        return match_award
    elif alpha == gap or beta == gap:
        return gap_penalty
    else:
        return mismatch_penalty

def finalize(align1, align2):
    align1 = align1[::-1]    #reverse sequence 1
    align2 = align2[::-1]    #reverse sequence 2

    i,j = 0,0

    #calcuate identity, score and aligned sequeces
    symbol = []
    found = 0
    score = 0
    identity = 0
    for i in range(0,len(align1)):
        # if two AAs are the same, then output the letter
        if align1[i] == align2[i]:                
            symbol.append(align1[i])
            identity = identity + 1
            score += match_score(align1[i], align2[i])

        # if they are not identical and none of them is gap
        elif align1[i] != align2[i] and align1[i] != gap and align2[i] != gap:
            score += match_score(align1[i], align2[i])
            symbol.append(space)
            found = 0

        #if one of them is a gap, output a space
        elif align1[i] == gap or align2[i] == gap:
            symbol.append(space)
            score += gap_penalty

    identity = float(identity) / len(align1) * 100

    print 'Identity =', "%3.3f" % identity, 'percent'
    print 'Score =', score
    print ''.join(align1)
    # print ''.join(symbol)
    print ''.join(align2)


def needle(seq1, seq2):
    m, n = len(seq1), len(seq2)  # length of two sequences

    # Generate DP table and traceback path pointer matrix
    score = zeros((m+1, n+1))      # the DP table

    # Calculate DP table
    for i in range(0, m + 1):
        score[i][0] = gap_penalty * i
    for j in range(0, n + 1):
        score[0][j] = gap_penalty * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = score[i - 1][j - 1] + match_score(seq1[i-1], seq2[j-1])
            delete = score[i - 1][j] + gap_penalty
            insert = score[i][j - 1] + gap_penalty
            score[i][j] = max(match, delete, insert)

    # Traceback and compute the alignment 
    align1, align2 = [], []
    i,j = m,n # start from the bottom right cell
    while i > 0 and j > 0: # end toching the top or the left edge
        score_current = score[i][j]
        score_diagonal = score[i-1][j-1]
        score_up = score[i][j-1]
        score_left = score[i-1][j]

        if score_current == score_diagonal + match_score(seq1[i-1], seq2[j-1]):
            align1.append(seq1[i-1])
            align2.append(seq2[j-1])
            i -= 1
            j -= 1
        elif score_current == score_left + gap_penalty:
            align1.append(seq1[i-1])
            align2.append(gap)
            i -= 1
        elif score_current == score_up + gap_penalty:
            align1.append(gap)
            align2.append(seq2[j-1])
            j -= 1

    # Finish tracing up to the top left cell
    while i > 0:
        align1.append(seq1[i-1])
        align2.append(gap)
        i -= 1
    while j > 0:
        align1.append(gap)
        align2.append(seq2[j-1])
        j -= 1

    finalize(align1, align2)

def water(seq1, seq2):
    m, n = len(seq1), len(seq2)  # length of two sequences

    # Generate DP table and traceback path pointer matrix
    score = zeros((m+1, n+1))      # the DP table
    pointer = zeros((m+1, n+1))    # to store the traceback path

    max_score = 0        # initial maximum score in DP table
    # Calculate DP table and mark pointers
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score_diagonal = score[i-1][j-1] + match_score(seq1[i-1], seq2[j-1])
            score_up = score[i][j-1] + gap_penalty
            score_left = score[i-1][j] + gap_penalty
            score[i][j] = max(0,score_left, score_up, score_diagonal)
            if score[i][j] == 0:
                pointer[i][j] = 0 # 0 means end of the path
            if score[i][j] == score_left:
                pointer[i][j] = 1 # 1 means trace up
            if score[i][j] == score_up:
                pointer[i][j] = 2 # 2 means trace left
            if score[i][j] == score_diagonal:
                pointer[i][j] = 3 # 3 means trace diagonal
            if score[i][j] >= max_score:
                max_i = i
                max_j = j
                max_score = score[i][j];

    align1, align2 = [], []    # initial sequences

    i,j = max_i,max_j    # indices of path starting point

    #traceback, follow pointers
    while pointer[i][j] != 0:
        if pointer[i][j] == 3:
            align1.append(seq1[i-1])
            align2.append(seq2[j-1])
            i -= 1
            j -= 1
        elif pointer[i][j] == 2:
            align1.append(gap)
            align2.append(seq2[j-1])
            j -= 1
        elif pointer[i][j] == 1:
            align1.append(seq1[i-1])
            align2.append(gap)
            i -= 1

    finalize(align1, align2)

如果我们使用以下输入运行它：

seq1 = ['[A0]', '[C0]', '[A1]', '[B1]']
seq2 = ['[A0]', '[A1]', '[B1]', '[C1]']

print "Needleman-Wunsch"
needle(seq1, seq2)
print
print "Smith-Waterman"
water(seq1, seq2)

我们得到这个输出：

Needleman-Wunsch
Identity = 60.000 percent
Score = 20
[A0][C0][A1][B1]----
[A0]----[A1][B1][C1]

Smith-Waterman
Identity = 75.000 percent
Score = 25
[A0][C0][A1][B1]
[A0]----[A1][B1]

我所做的具体修改参见：这个 GitHub 存储库 https://github.com/BioGeek/pairwise-alignment-in-python/commits/master/alignment.py.

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

使用一组字符而不是一个字符的序列对齐算法的相关文章

WPF DataGrid 多选

我读过几篇关于这个主题的文章但很多都是来自 VS 或框架的早期版本我想做的是从 dataGrid 中选择多行并将这些行返回到绑定的可观察集合中我尝试创建一个属性类型并将其添加到可观察集合中它适用于单个记录但代码永远不会触发多个
在模板类中声明模板友元类时出现编译器错误

我一直在尝试实现我自己的链表类以用于教学目的我在迭代器声明中指定了 List 类作为友元但它似乎无法编译这些是我使用过的 3 个类的接口 Node h define null Node
没有特殊字符的密码验证器

我是 RegEx 的新手已经进行了大量搜索但没有找到任何具体内容我正在编写一个验证密码字符串的正则表达式可接受的字符串必须至少具有 4 种字符类型中的 3 种数字小写字母大写字母特殊字符我对包含有一个想法也就是说如果这
在一个数据访问层中处理多个连接字符串

我有一个有趣的困境我目前有一个数据访问层它必须与多个域一起使用并且每个域都有多个数据库存储库具体取决于所调用的存储过程目前我只需使用 SWITCH 语句来确定应用程序正在运行的计算机并从 Web config 返回适当的连接字
如何从 Visual Studio 将视图导航到其控制器？

问题是解决方案资源管理器上有 29 个项目而且项目同时具有 ASP NET MVC 和 ASP NET Web 表单结构在MVC部分中 Controller文件夹中有大约100个子文件夹每个文件夹至少有3 4个控制器视图完全位于不同
如何从本机 C(++) DLL 调用 .NET (C#) 代码？

我有一个 C app exe 和一个 C my dll my dll NET 项目链接到本机 C DLL mynat dll 外部 C DLL 接口并且从 C 调用 C DLL 可以正常工作通过使用 DllImport mynat dl
如何在 C++ 中标记字符串？

Java有一个方便的分割方法 String str The quick brown fox String results str split 在 C 中是否有一种简单的方法可以做到这一点 The 增强分词器 http www boost o
WPF 数据绑定到复合类模式？

我是第一次尝试 WPF 并且正在努力解决如何将控件绑定到使用其他对象的组合构建的类例如如果我有一个由两个单独的类组成的类 Comp 为了清楚起见请注意省略的各种元素 class One int first int second cla
在 Unity 中实现 Fur with Shells 技术

我正在尝试在 Unity 中实现皮毛贝壳技术 http developer download nvidia com SDK 10 5 direct3d Source Fur doc FurShellsAndFins pdf Fins 技术被
如何获取 EF 中与组合（键/值）列表匹配的记录？

我有一个数据库表其中包含每个用户年份组合的记录如何使用 EF 和用户 ID 年份组合列表从数据库获取数据组合示例 UserId Year 1 2015 1 2016 1 2018 12 2016 12 2019 3 2015 91
两个静态变量同名（两个不同的文件），并在任何其他文件中 extern 其中一个

在一个文件中将变量声明为 static 并在另一个文件中进行 extern 声明我认为这会在链接时出现错误因为 extern 变量不会在任何对象中看到因为在其他文件中声明的变量带有限定符 static 但不知何故链接器瑞萨没有显
为什么 C# 2.0 之后没有 ISO 或 ECMA 标准化？

我已经开始学习 C 并正在寻找标准规范但发现大于 2 0 的 C 版本并未由 ISO 或 ECMA 标准化或者是我从 Wikipedia 收集到的这有什么原因吗因为编写审查验证发布处理反馈修订重新发布等复杂的规范文档需要
C# xml序列化必填字段

我需要将一些字段标记为需要写入 XML 文件但没有成功我有一个包含约 30 个属性的配置类这就是为什么我不能像这样封装所有属性 public string SomeProp get return someProp set if som
C 函数 time() 如何处理秒的小数部分？

The time 函数将返回自 1970 年以来的秒数我想知道它如何对返回的秒数进行舍入例如对于100 4s 它会返回100还是101 有明确的定义吗 ISO C标准没有说太多它只说time 回报该实现对当前日历时间的最佳近似结
编译时展开 for 循环内的模板参数？

维基百科 here http en wikipedia org wiki Template metaprogramming Compile time code optimization 给出了 for 循环的编译时展开我想知道我们是否可以
对于某些 PDF 文件，LoadIFilter() 返回 -2147467259

我正在尝试使用 Adob e IFilter 搜索 PDF 文件我的代码是用 C 编写的我使用 p invoke 来获取 IFilter 的实例 DllImport query dll SetLastError true CharSet
C++ 中的参考文献

我偶尔会在 StackOverflow 上看到代码询问一些涉及函数的重载歧义例如 void foo int param 我的问题是为什么会出现这种情况或者更确切地说你什么时候会有对参考的参考这与普通的旧参考有何不同我从未在现
MySQL Connector C/C API - 使用特殊字符进行查询

我是一个 C 程序我有一个接受域名参数的函数 void db domains query char name 使用 mysql query 我测试数据库中是否存在域名如果不是这种情况我插入新域名 char query 400 spri
类型或命名空间“MyNamespace”不存在等

我有通常的类型或命名空间名称不存在错误除了我引用了程序集 using 语句没有显示为不正确并且我引用的类是公共的事实上我在不同的解决方案中引用并使用相同的程序集来执行相同的操作并且效果很好顺便说一句这是VS2010 有人有什么
使用 WGL 创建现代 OpenGL 上下文？

我正在尝试使用 Windows 函数创建 OpenGL 上下文现代版本基本上代码就是创建窗口类注册班级创建一个窗口 choose PIXELFORMATDESCRIPTOR并设置它创建旧版 OpenGL 上下文使上下文成为当前

随机推荐

是否可以在没有 sqlite 函数的情况下将 SQLite 数据库与 PHP 一起使用？

我的 PHP 安装没有 SQLite Functionality 作为基本安装因此没有 sqlite 函数可用是否有一个 PHP 库 PHP 代码可以访问 SQLite 数据库而无需在 PHP 中安装任何插件我无法更改服务器配置
如何计算 (a*b)%c 形式的模数？

如何计算 a b c 形式的模数我想计算两个 int 数字相乘的模数它们几乎处于溢出阶段这里 c 也是 int a b c a c b c c
是否对使用“OR”的 SQL SERVER 表达式的所有部分进行求值？

Given WHERE Id Is NULL OR Id Table Id 如果 Id 为 null 表达式的计算结果为 true 第二部分 Id Table Id 是否仍然被考虑或者如果第一部分是 c 中的情况则表达式计算结果为 t
你能在 shell 脚本中生成一个进程吗？

我试图让我的 bin sh shell 脚本启动另一个应用程序而不暂停执行也就是说我正在寻找一种在后台启动它并让我的 shell 脚本继续执行的方法我希望它能像这样工作 start daemon start success launc
如何使用NamedTemporaryFile（什么时候关闭？）

我正在尝试编写一系列写入临时文件的函数然后对写入的文件进行处理我试图了解该文件是如何处理的我想做的摘要是 def create function inputs create temp file write some contents
JS 私有方法不会在每次构造函数调用时重新定义

如何创建一个每次调用构造函数时都未定义的 Javascript 私有方法据我所知在OOP JS中私有方法是在类的构造方法中定义的方法每次实例化一个新的对象时都会调用我在想也许是一个函数声明即function name
在 Swift 中使用 obj-c typedef

我有一个 typedef 如下 typedef NSString VMVideoCategoryType extern VMVideoCategoryType const VMVideoCategoryType MusicVideo ext
组合两个 Runnable 对象

举例来说我有一个名为 RunnableA 的 Runnable 它可以执行某些操作我还有一个名为 RunnableB 的 Runnable 它可以执行其他操作有没有办法可以将这两个 Runnable 组合起来以便它们在同一个线程中运
Ironpython：函数在 CPython 中工作，IronPython 中神秘的空指针异常

我正在尝试做一些看起来非常简单的事情并且属于标准 python 的范围以下函数接受集合的集合并返回两个或多个集合中包含的所有项目为此虽然集合的集合不为空但它只是从集合中弹出一个集合将其与其余集合相交并更新落在这些交集之一中的
在 React 中测试 mapbox-gl 时如何修复“window.URL.createObjectURL 不是函数”？

我正在使用 Mapbox material ui 和自定义样式测试 React 组件我使用 Jest Enzyme 进行测试我有问题 window URL createObjectURL 不是函数我读过类似的问题 github com
如何自定义引导侧边栏/sidenav？

I need to make use of Twitter Bootstrap Sidebar for creating a menu in my web application Highlighted in red To create a
隐藏键盘ios [重复]

这个问题在这里已经有答案了我有一些文本输入每当我触摸背景时我都可以隐藏键盘但只有当我输入第一个文本框名称 textField1 时现在这段代码应该很简单但我似乎无法理解它 IBAction backgroundTouched i
如何在Python中将词云保存为.png？

我正在尝试基于字符串创建词云然后将其导入到报告文档中我正在使用 python docx matplotlib 和词云这是我的一个简短的总结 from wordcloud import WordCloud import matplotl
获取 Flask 中的当前用户 ID

我对 Python 还很陌生老实说我对一般编程也很陌生我目前正在制定一种待办事项列表我需要它将待办事项放入适当的课程中所有这些都与教育内容相关所以问题很简单我将其作为 Flask 驱动的路线 app route add co
方案单词列表 eq?

我有一个问题我需要查找列表是否等于第二个列表例如 set eq 1 2 3 1 2 3 gt t set eq 1 2 3 2 3 4 gt f 这些例子在我的程序中是正确的但这个例子不是 set eq quote quote one
如何使用GridView从服务器获取JSON数据--Flutter

我参考过食谱 https flutter dev docs cookbook networking fetch data https flutter dev docs cookbook networking fetch data 示例代码是
VueJS 自定义 Props 验证功能

我是 VueJS 的新手所以我一直在关注他们的官方指南 https v2 vuejs org v2 guide components html Prop Validation 我能够触发前 5 个属性验证器但我似乎无法触发最后一个示例
在 Firefox 中读取多行内容可编辑文本

让我们读取一个 contenteditable 元素 span This is editable br Yes it is span 就在您在文本末尾手动添加两个空格之后 I get textContent gt This is edita
C++ 中的“new”运算符何时调用构造函数

自从我开始学习 C 以来我一直读到 new 运算符在返回指向分配内存的指针之前调用对象的构造函数因此出于好奇我检查了 new 的源代码并在以下位置找到了以下内容 GLIBCXX WEAK DEFINITION void opera
使用一组字符而不是一个字符的序列对齐算法

Summary 我从一些有关对齐算法的细节开始最后我提出了我的问题如果您了解对齐算法请从头开始考虑我们有两个字符串例如 ACCGAATCGA ACCGGTATTAAC 有一些算法例如史密斯沃特曼 https en wikip

使用一组字符而不是一个字符的序列对齐算法

使用一组字符而不是一个字符的序列对齐算法 的相关文章

随机推荐

热门标签

使用一组字符而不是一个字符的序列对齐算法的相关文章