Conversation
src/instructlab/sdg/default_flows.py
Outdated
Should this be self.num_iters?
No, this is a vllm-specific parameter - we are replacing the IterBlock with this. This parameter essentially controls the number of return sequences.
No, these parameters get passed to the openai client directly as kwargs. See openai's n parameter: https://platform.openai.com/docs/api-reference/completions/create#completions-create-n. In theory, openai's n does the same thing we were doing with num_iters.
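For illustration, a minimal sketch of what n does with the openai client (the base_url, api_key, and model name are placeholders for a local OpenAI-compatible server):

from openai import OpenAI

# Placeholder connection details for a local OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# n=3 asks the service for three completions of the same prompt
# in a single request.
response = client.completions.create(
    model="my-model",  # placeholder model name
    prompt="Generate a question about the solar system.",
    n=3,
    max_tokens=100,
)

# One choice comes back per requested sequence.
for choice in response.choices:
    print(choice.text)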
Running ilab data generate with this branch gives:
File "/home/markmc/sdg/src/instructlab/sdg/generate_data.py", line 257, in generate_data
sdg_knowledge, sdg_freeform_skill, sdg_grounded_skill = _sdg_init(
^^^^^^^^^^
File "/home/markmc/sdg/src/instructlab/sdg/generate_data.py", line 144, in _sdg_init
[
File "/home/markmc/sdg/src/instructlab/sdg/generate_data.py", line 146, in <listcomp>
flow_type(
TypeError: Flow.__init__() takes from 4 to 5 positional arguments but 6 were given
(after I fix the iterblock import error)
What should the num_instructions_to_generate parameter to generate_data() be used for?
@shivchander or @oindrillac any thoughts here? As far as I know, the generate_data.py file still contains older sdg code, right?
Yeah, it seems like generate_data.py has been updated to use the newer pipeline structure. So I guess I can go ahead and fix the num_iter in the generate_data() API call. Just want to get confirmation from @oindrillac / @aakankshaduggal
I think the point is that num_iters was a constructor arg to the Flow class before, and it's not clear why that's changed. Is the intention to change this from being parameterized to always being 1? If it's always 1 then there's no need to specify n at all. If it stays as a param to the Flow class then it might be best to rename it, since num_iters isn't really correct... it should be something more like num_parallel_outputs_per_input.
Alright, after discussing with @shivchander, we think that the client's num_instructions_to_generate can be captured by openai's n parameter. Currently, num_samples is hardcoded in default_flows.py, which overwrites the num_instructions_to_generate/num_iters passed from generate_data(). This is fixed in the latest commit.
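For illustration, the shape of that fix (a sketch with hypothetical names; the actual flow definitions live in default_flows.py):

# Before: the flow definition pinned the count, silently overriding the
# value passed from generate_data(), e.g. gen_kwargs = {"n": 10}.
# After: the caller's value is threaded through instead.
def build_gen_kwargs(num_instructions_to_generate):
    # num_instructions_to_generate comes from the generate_data() call
    return {"n": num_instructions_to_generate}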
Does that n parameter imply that the server supports batching? llama-cpp does not -- either way, we just need to make sure this works with that backend too and not just vllm.
src/instructlab/sdg/llmblock.py
Outdated
I find this super confusing
What is the "n" in n_samples supposed to stand for? It reads to me like "number of samples", suggesting an integer
And the nested list comprehension is ... too clever
Suggest:
extended_samples = []
for item in samples:
    extended_samples.extend([item] * num_parallel_samples)
Yeah, sorry I couldn't come up with a better name. I agree extended_samples is better.
Some context for this change:
Openai's n parameter determines how many responses to generate for each input sample. This change fixes the zip functionality we use just below this code.
If n is higher than 1, we end up with something like this:
>>> input = ["s1", "s2"]
>>> output = ["o11", "o12", "o13", "o21", "o22", "o23"] # n==3
>>> c = zip(input, output)
>>> for item in c:
...     print(item)
...
('s1', 'o11')
('s2', 'o12')
Instead, we want
>>> n_samples = [item for item in input for i in range(3)]
>>> n_samples
['s1', 's1', 's1', 's2', 's2', 's2']
>>> c = zip(n_samples, output)
>>>
>>> for item in c:
...     print(item)
...
('s1', 'o11')
('s1', 'o12')
('s1', 'o13')
('s2', 'o21')
('s2', 'o22')
('s2', 'o23')
IMO just some code comments would help with this.
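For instance, a commented version of the pairing logic might read like this (a sketch using the names suggested above, not the actual llmblock.py code):

# The service returns num_parallel_samples outputs per input, flattened
# in order: [o11, o12, ..., o21, o22, ...]. Repeat each input the same
# number of times so the two lists line up one-to-one.
extended_samples = []
for item in samples:
    extended_samples.extend([item] * num_parallel_samples)

# extended_samples[i] is now the input that produced outputs[i].
pairs = zip(extended_samples, outputs)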
The commit log could be much improved. The logical changes I see, with the explanations I'd expect:
Ah, I see these three are from #53
src/instructlab/sdg/pipeline.py
Outdated
e2e is failing with:
File "/home/runner/work/sdg/sdg/venv/lib/python3.11/site-packages/instructlab/sdg/pipeline.py", line 6, in <module>
from .iterblock import IterBlock
ModuleNotFoundError: No module named 'instructlab.sdg.iterblock'
see above warnings
E0401: Unable to import 'instructlab.sdg.iterblock' (import-error)
W0611: Unused IterBlock imported from iterblock (unused-import)
Ah yeah, the pipeline file changes didn't get pushed, sigh!
Force-pushed from 07f876a to 6bf23ce
@npalaska you should rebase this on the latest main branch or merge latest main into this branch; right now the diff/PR includes unrelated changes.
Does the llama-cpp backend support this parameter? See the instructlab changelog:
i.e. the llama-cpp backend will need to be supported
I don't think so, the best thing here would be to wrap the client in this case - the generate method can be overridden to just make sequential calls in a loop and return the equivalent response when n > 1.
Right, so something like this in IterBlock: (I don't actually know the best way to detect whether the client supports this parameter though)
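A rough sketch of that fallback, with the support check left as the open question (the names here are illustrative, not the actual IterBlock code):

def generate_n(client, prompt, n, **gen_kwargs):
    # If the backend honors n (e.g. vllm), one request is enough.
    # How best to detect support is the open question above.
    if client_supports_n(client):  # hypothetical check
        response = client.completions.create(prompt=prompt, n=n, **gen_kwargs)
        return [choice.text for choice in response.choices]
    # Otherwise (e.g. llama-cpp), make n sequential calls in a loop and
    # return the equivalent list of outputs.
    outputs = []
    for _ in range(n):
        response = client.completions.create(prompt=prompt, **gen_kwargs)
        outputs.append(response.choices[0].text)
    return outputs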
This pull request has merge conflicts that must be resolved before it can be merged.
This is a step towards converting it to the final format (messages) required for training.
Signed-off-by: Oindrilla Chatterjee <oc@bu.edu>
Co-authored-by: Aakanksha Duggal <aduggal@redhat.com>
Co-authored-by: Shiv <shivchander.s30@gmail.com>
Changed the prompt templates and alignment with expected outputs. Conducted stress testing across various leaf nodes to ensure accuracy and relevance.
Signed-off-by: Oindrilla Chatterjee <oc@bu.edu>
Co-authored-by: Aakanksha Duggal <aduggal@redhat.com>
Co-authored-by: Shiv <shivchander.s30@gmail.com>
Force-pushed from 261cbbe to 0c90804
Ruff check is failing:
It may only support a subset of parameters; we can't assume it supports n.
This PR should not merge without this testing. Since
but also note:
Just a little more investigation is required to confirm that llama-cpp is handling this correctly
Hmm, maybe this isn't so promising - it looks like we're creating 1 sample per seed example? There are 5 seed examples for the freeform skills test
This pull request has merge conflicts that must be resolved before it can be merged.
@markmc yes, exactly, except that again "num_iters" might not be the best name, since the fact that they are iters in one of the cases is an implementation detail. Also, we could have a wrapper for the client that encapsulates this and thus isn't needed in the block impls (and wouldn't need to use an IterBlock).
@markmc Yeah, I tested with a model served by llama.cpp, and it seems like the openai client supports it in a way that doesn't break the request, but the server ignores the n parameter. I guess the next thing would be to find out the best way to know whether the server supports the n parameter.
Yep. The most straightforward thing is to pass down a
I'd prefer something more like
Do you mean adding a new parameter?
No, take it from the existing
@markmc So we are doing it in a different way, where we determine whether the backend server supports batching by making a test request; we don't need to pass any server configs. @njhill has opened a new PR #77 with these changes on top of the changes from this PR, and I will close this one.
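A sketch of what such a test request might look like (illustrative only; the real implementation is in #77):

def server_supports_batched(client, model_id):
    # Request two completions of a throwaway prompt: a backend that
    # honors n returns two choices, while one that ignores n (like
    # llama-cpp here) returns a single choice.
    response = client.completions.create(
        model=model_id, prompt="test", n=2, max_tokens=1
    )
    return len(response.choices) == 2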
Replaced by #105
Update actionlint to v1.7.0
This PR is rebased on top of #53
Removes IterBlock and replaces it with LLMBlock. IterBlock's behavior can be specified in the LLMBlock's gen_kwargs using openai's n parameter. The n parameter determines the number of output choices generated by the LLM service per sample in the batch.
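For example, setting n in the block's gen_kwargs requests multiple output choices per input (the configuration shape here is illustrative):

gen_kwargs = {
    # Replaces IterBlock's num_iters: request 5 output choices from the
    # LLM service for every input sample in the batch.
    "n": 5,
}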