دانلود مقاله ISI انگلیسی شماره 79062
ترجمه فارسی عنوان مقاله

# الگوریتم های سریع برای پیدا کردن نمایندگی حداقل تکرار رشته ها و درختان.

عنوان انگلیسی
Fast algorithms for finding a minimum repetition representation of strings and trees ☆
کد مقاله سال انتشار تعداد صفحات مقاله انگلیسی
79062 2013 20 صفحه PDF
منبع

Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)

Journal : Discrete Applied Mathematics, Volume 161, Issues 10–11, July 2013, Pages 1556–1575

ترجمه کلمات کلیدی
تکرار پشت سر هم - الگوریتم رشته؛ دستور برچسب شده درختان
کلمات کلیدی انگلیسی
Tandem repeat; String algorithm; Labeled ordered trees
پیش نمایش مقاله

#### چکیده انگلیسی

A string with many repetitions can be represented compactly by replacing hh-fold contiguous repetitions of a string rr with (r)h(r)h. We present a compact representation, which we call a repetition representation (of a string) or RRS, by which a set of disjoint or nested tandem arrays can be compacted. In this paper, we study the problem of finding a minimum RRS   or MRRS, where the size of an RRS is defined by the sum of the length of component letters and the description length of the component repetitions (⋅)h(⋅)h which is defined by wR(h)wR(h) using a repetition weight function wRwR. We develop two dynamic programming-based algorithms to solve this problem: CMR, which works for any type of wRwR, and CMR-C, which is faster but can be applied to a constant wRwR only. CMR-C is an O(n2logn)O(n2logn)-time O(nlogn)O(nlogn)-space algorithm, which is more efficient in both time and space than CMR by a ((logn)/n)((logn)/n)-factor, where nn is the length of the given string. The problem of finding an MRRS for a string can be extended to that of finding a minimum repetition representation (of a tree) or MRRT for a given labeled ordered tree. For this problem, we present two algorithms, CMRT and CMRT-C, by using CMR and CMR-C, respectively, as a subroutine. As well as the theoretical analysis, we confirmed the efficiency of the proposed algorithms by experiments, which consist of the following three parts: First we demonstrated that CMR-C and CMRT-C are fast enough for large-scale data by using synthetic strings and trees, respectively. The size of an MRRS for a given string can be a measure of how compactly the string can be represented, meaning how well the string is structurally organized. This is also true of trees. To check such ability of MRRS-size, second we measured the size of an MRRS for chromosomes of nine different species. We found that all the chromosomes of the same species have a similar compression rate when realized by an MRRS. Run length encoding (RLE) was also shown to have species-specific compression rate, but species were separated more clearly by MRRS than by RLE. Third we examined the size of an MRRT for web pages of world-leading companies by using the tag trees, showing a consistency between the compression rate by an MRRT and visual web page structures.