I'm writing a Clojure implementation of this coding challenge http://biostar.stackexchange.com/questions/1759/code-golf-mean-length-of-fasta-sequences, which asks for the average length of the sequence records in a Fasta-format file:
>1
GATCGA
GTC
>2
GCA
>3
AAAAA
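As a quick sanity check of what the challenge expects for the sample above, here is a minimal sketch. The names sample-lengths and sample-mean are only for illustration. Record 1 spans two lines, so its length is 6 + 3 = 9.

```clojure
;; Lengths of records 1, 2 and 3 in the sample (GATCGA + GTC = 9 bases).
(def sample-lengths [9 3 5])

;; Mean record length; Clojure keeps integer division exact as a ratio.
(def sample-mean (/ (reduce + sample-lengths) (count sample-lengths)))
```

Here sample-mean evaluates to the ratio 17/3; wrap it in (double ...) for a decimal value.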
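For more background, see this related StackOverflow post about an Erlang solution: https://stackoverflow.com/questions/3296855/average-length-of-the-sequences-in-a-fasta-file-can-you-improve-this-erlang-co
My beginner's attempt in Clojure uses lazy-seq to try to read the file one record at a time, so that it scales to large files. However, it is fairly memory hungry and slow, so I suspect it isn't implemented optimally. Here is one solution that uses the BioJava http://biojava.org library to abstract away the parsing of the records: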
(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  [seq-iter]
  (lazy-seq
    (when (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  [in-file seq-type]
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))
And an equivalent approach without external libraries:
(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths
  "Retrieve lengths of sequences in the file using line lengths"
  [lines cur-length]
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (if (nil? cur-line)
        [cur-length]
        (if (= \> (first cur-line))
          (cons cur-length (seq-lengths remain-lines 0))
          (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ;; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))
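The current implementation takes 44 seconds on a large file, compared with 7 seconds for a Python implementation. Can you offer any suggestions for speeding up the code and making it more intuitive? Is my use of lazy-seq actually parsing the file record by record, as intended?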
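It probably doesn't matter, but average is holding on to the head of the seq of lengths: coll is walked by reduce and then again by count, so the fully realized seq stays in memory.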
The following is a completely untested, but lazier, way to do what I think you want.
(use 'clojure.java.io) ; since 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(defn fasta-avg [f]
  (->> (reader f)
       line-seq
       (filter #(not (.startsWith % ">")))
       (map #(.length %))
       lazy-avg))
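One thing to watch: the snippet above averages over sequence lines rather than over records, so for the sample input in the question it yields 17/4 instead of 17/3. Below is a minimal sketch of a per-record variant in the same lazy style. The names header?, record-lengths, mean and fasta-mean-length are made up for illustration, and it assumes every header line is followed by at least one sequence line and that lines carry no trailing whitespace.

```clojure
(require '[clojure.java.io :as io])

(defn header?
  "True when a FASTA line starts a new record."
  [^String line]                       ; type hint avoids a reflective call
  (.startsWith line ">"))

(defn record-lengths
  "Lazy seq of per-record sequence lengths from a seq of FASTA lines.
  Assumes each header is followed by at least one sequence line."
  [lines]
  (->> lines
       (partition-by header?)          ; alternating runs: headers, sequence lines
       (remove #(header? (first %)))   ; drop the header runs
       (map (fn [chunk] (reduce + (map count chunk))))))

(defn mean
  "Mean of a seq of numbers in a single pass; nil for an empty seq."
  [xs]
  (let [[sum n] (reduce (fn [[s c] x] [(+ s x) (inc c)]) [0 0] xs)]
    (when (pos? n) (/ sum n))))

(defn fasta-mean-length
  "Mean record length of the FASTA file at path."
  [path]
  ;; reduce inside mean is eager, so the reader is still open while it runs
  (with-open [rdr (io/reader path)]
    (mean (record-lengths (line-seq rdr)))))
```

Usage would look like (fasta-mean-length "seqs.fa"), where seqs.fa is a hypothetical file name. Note that partition-by keeps all lines of one record in memory at a time, which should be fine unless a single record is very large.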