Skip to content

[Rule] ClosestSubstring to ILP #1035

@zhangjieqingithub

Description

@zhangjieqingithub

Source

ClosestSubstring

Source model issue: #1033

Target

ILP/i32

Motivation

This rule gives ClosestSubstring immediate solver connectivity through the existing integer-linear-programming hub.

The formulation chooses both the center string and one window from each input string, then minimizes a radius variable. It uses window-choice indicators to activate exactly the Hamming-distance constraint for the selected window of each string.

Reference

Ming Li, Bin Ma, and Lusheng Wang, "On the closest string and substring problems," Journal of the ACM 49(2):157-171, 2002. https://doi.org/10.1145/506147.506150

Reduction Algorithm

Let the source instance have:

  • alphabet size q,
  • input strings s_1, ..., s_n,
  • substring length ell,
  • W_i = |s_i| - ell + 1 possible windows in string s_i.

Construct an ILP/i32 instance as follows.

  1. For every center position r in {0,...,ell-1} and alphabet symbol a in {0,...,q-1}, create an integer variable x_{r,a}.
    It means the center has symbol a at position r.

  2. For every input string s_i and window start p in {0,...,W_i-1}, create an integer variable y_{i,p}.
    It means window p is selected from string s_i.

  3. Create one nonnegative integer radius variable R.

  4. For each center position r, add the assignment constraint:
    sum_a x_{r,a} = 1.

  5. For each input string s_i, add the window-choice constraint:
    sum_p y_{i,p} = 1.

  6. For every input string s_i and every window start p, add the conditional radius constraint:
    R + sum_{r=0}^{ell-1} x_{r, s_i[p+r]} - ell * y_{i,p} >= 0.

    If y_{i,p} = 1, this becomes:
    R + number_of_matches(c, s_i[p..p+ell)) >= ell,
    equivalently R >= d_H(c, s_i[p..p+ell)).

    If y_{i,p} = 0, the constraint is automatically satisfied because R >= 0 and the match count is nonnegative.

  7. Minimize R.

  8. Recover the source solution by reading:

    • the center symbol at each position r from the unique a with x_{r,a} = 1,
    • the selected window in each string s_i from the unique p with y_{i,p} = 1.

Correctness Argument

The center assignment constraints encode exactly one center string c in Sigma^ell. The window-choice constraints encode exactly one length-ell substring from each input string.

For a selected window p in string s_i, the expression sum_r x_{r, s_i[p+r]} counts the number of positions where the center agrees with that window. Thus ell - sum_r x_{r, s_i[p+r]} is exactly the Hamming distance between the center and the selected window.

The conditional radius constraint is active exactly for the selected window of each input string, and it forces R to be at least that selected Hamming distance. For unselected windows, the constraint is redundant.

Therefore minimizing R minimizes the maximum Hamming distance between the center and one chosen window from each input string. This is exactly the ClosestSubstring objective.

Size Overhead

Let:

  • q = alphabet_size,
  • ell = substring_length,
  • n = num_strings,
  • W = total_num_windows = sum_i (|s_i| - ell + 1).
Target metric Formula
num_vars q * ell + W + 1
num_constraints ell + n + W

Validation Method

Use closed-loop validation on small instances:

  • enumerate all center strings and all window choices directly,
  • solve the ILP instance,
  • extract the center string and selected windows,
  • verify that the extracted solution has the same optimum radius.

Include examples where:

  • the optimum radius is nonzero,
  • the best window is not the first window in at least one string,
  • no common exact substring exists,
  • multiple window choices tie for the same center.

Example

Use the source instance from #1033:

  • alphabet {0,1},
  • substring length ell = 3,
  • strings 00011, 10100, 11001.

The center 010 is encoded by:

  • x_{0,0} = 1,
  • x_{1,1} = 1,
  • x_{2,0} = 1.

Choose windows:

  • from 00011, choose start 0, window 000,
  • from 10100, choose start 1, window 010,
  • from 11001, choose start 0, window 110.

For string 00011 and selected start 0, the active radius constraint is:
R + x_{0,0} + x_{1,0} + x_{2,0} - 3*y_{1,0} >= 0.

Substituting the chosen center and y_{1,0}=1 gives:
R + 1 + 0 + 1 - 3 >= 0,
so R >= 1.

For string 10100 and selected start 1, the window is exactly 010, so the active constraint allows R >= 0.

For string 11001 and selected start 0, the active constraint forces R >= 1.

Thus R = 1 is feasible. Since the three sets of length-3 windows have empty intersection, radius 0 is impossible. Therefore the ILP optimum is 1, matching the ClosestSubstring optimum.

BibTeX

@article{LiMaWang2002ClosestStringSubstring,
  author  = {Ming Li and Bin Ma and Lusheng Wang},
  title   = {On the Closest String and Substring Problems},
  journal = {Journal of the ACM},
  volume  = {49},
  number  = {2},
  pages   = {157--171},
  year    = {2002},
  doi     = {10.1145/506147.506150},
  url     = {https://doi.org/10.1145/506147.506150}
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    ruleA new reduction rule to be added.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions