
fix(metadata): encode multi-word Scopus queries #5

Open
WilmerGaspar wants to merge 5 commits into slimeslab:main from WilmerGaspar:fix/multi-word-keyword-handling

@WilmerGaspar


This PR improves Scopus metadata URL construction for multi-word query terms.

Changes:

  • Encodes query and special_query using urllib.parse.quote.
  • Converts spaces into AND before URL encoding to make multi-word searches more explicit.
  • Keeps the change focused on the metadata query-construction path.
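The encoding step listed above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name `build_scopus_query` is hypothetical, and joining words with `split()`/`join()` is one assumed way to turn spaces into AND. Only `urllib.parse.quote` comes from the description.

```python
from urllib.parse import quote


def build_scopus_query(keyword: str) -> str:
    """Join the words of a keyword with AND, then percent-encode the result.

    Hypothetical helper for illustration; the PR's real code may differ.
    """
    # split() with no arguments collapses repeated whitespace,
    # so "a   b" still yields ["a", "b"].
    words = keyword.split()
    explicit_query = " AND ".join(words)
    # safe="" percent-encodes every reserved character, including "/".
    return quote(explicit_query, safe="")


print(build_scopus_query("piezoelectric coefficient"))
# piezoelectric%20AND%20coefficient
```

A single-word keyword passes through unchanged, so existing single-word searches should behave as before under this sketch.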

Testing:

  • Not run locally. This change was prepared through the GitHub web editor.

Related issue:

Encode main and special Scopus query terms after converting spaces to AND for multi-word keyword handling.
@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.

@aritraroy24

Member

Hi @WilmerGaspar, thanks for working on the issue!

I recommend extending the change so that a multi-word main_keyword is handled throughout the entire workflow, not just in the Scopus query. As we discussed in #4, nothing breaks today apart from a less accurate Scopus search. However, replacing spaces with _ as soon as the keyword is received gives us a uniform approach to file/table naming, which is cleaner from a scripting perspective. We can then convert _ back to a space wherever the original form is needed:

  • Scopus search
  • article collection regex matching
  • data extraction query

Could you extend the PR to cover these cases too, so the fix handles spaces across the whole workflow and not just in the Scopus search?

@WilmerGaspar

Author

Thanks, Aritra. I extended the PR to cover the workflow-wide space handling more consistently.

The update now:

  • Normalizes main_property_keyword into an underscore-safe internal form for file/table/path naming.
  • Converts _ back to spaces before Scopus query construction.
  • Uses the readable search keyword form in the data extraction query.
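The two keyword forms described in this list can be sketched as a pair of small helpers. The helper names, the `results` directory, and the `_metadata.csv` suffix are assumptions made for illustration and are not taken from the repository.

```python
from pathlib import Path


def to_internal_keyword(keyword: str) -> str:
    """Underscore-safe form, intended for file, table, and path names."""
    return keyword.strip().replace(" ", "_")


def to_search_keyword(internal_keyword: str) -> str:
    """Readable form, intended for search and extraction queries."""
    return internal_keyword.replace("_", " ")


internal = to_internal_keyword("band gap")
# Hypothetical output location, only to show where the internal form is used.
csv_path = Path("results") / f"{internal}_metadata.csv"

print(internal)                     # band_gap
print(to_search_keyword(internal))  # band gap
print(csv_path.name)                # band_gap_metadata.csv
```

One consequence of this design is that the round trip is lossy for keywords that already contain an underscore: `band_gap energy` becomes `band_gap_energy` internally and comes back as `band gap energy`.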

I also reviewed the article processor path. From what I saw, self.keyword is mainly used for paths, CSV/database naming, and related outputs, while the article text matching relies on property_keywords. So I kept the matching logic unchanged and focused the normalization on the main keyword flow.

Please let me know if you’d prefer any part handled differently.

@aritraroy24

Member

Hi @WilmerGaspar, sorry for the delay in replying. Got stuck with some other work.

Yeah. You were right about the property_keywords. So, we don't need to modify anything in the regex matching.

However, the tests are failing due to wrong indentation in the scripts. Maybe due to the GitHub web editor, both scripts now have indentation issues in the following functions:

@WilmerGaspar

Author

No problem at all, Aritra. I completely understand. Thanks for taking the time to review it.

It’s a pleasure to help with the project. I corrected the indentation issues in comproscanner.py and fetch_metadata.py and pushed the update to the PR.

@aritraroy24

Member

Hi @WilmerGaspar,

The indentation error is still there for the __init__ method in comproscanner.py:

Current code:

class ComProScanner:
        def __init__(self, main_property_keyword: str = None):
        if main_property_keyword is None:
            raise ValueErrorHandler(
                "Please provide a main property keyword to proceed."
            )

        self.main_property_keyword = main_property_keyword.replace(" ", "_")
        self.main_property_search_keyword = self.main_property_keyword.replace("_", " ")

Should be:

class ComProScanner:
    def __init__(self, main_property_keyword: str = None):
        if main_property_keyword is None:
            raise ValueErrorHandler(
                "Please provide a main property keyword to proceed."
            )

        self.main_property_keyword = main_property_keyword.replace(" ", "_")
        self.main_property_search_keyword = self.main_property_keyword.replace("_", " ")

@WilmerGaspar

Author

Thanks again, Aritra. I also corrected the indentation in _construct_url inside fetch_metadata.py and pushed the update. The workflow is now waiting for approval to run again.

@aritraroy24

Member

Hi @WilmerGaspar, the test is failing again with an AttributeError: "'FetchMetadata' object has no attribute '_construct_url'".

Reason: _construct_url is now indented inside __init__, so it is defined as a local function there instead of as a method of the FetchMetadata class.
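The failure mode can be reproduced in isolation. The sketch below uses two hypothetical stand-in classes (`NestedHelper` and `MethodHelper`) and a placeholder URL rather than the real FetchMetadata code; the only difference between them is the indentation of `_construct_url`.

```python
class NestedHelper:
    def __init__(self):
        # Indented one level too deep: this defines a local function inside
        # __init__, which is discarded as soon as __init__ returns.
        def _construct_url(self):
            return "https://example.invalid/search"


class MethodHelper:
    def __init__(self):
        pass

    # Indented at class level: this is a regular method, visible on instances.
    def _construct_url(self):
        return "https://example.invalid/search"


try:
    NestedHelper()._construct_url()
except AttributeError as error:
    print(error)  # 'NestedHelper' object has no attribute '_construct_url'

print(MethodHelper()._construct_url())  # https://example.invalid/search
```

Dedenting the `def` by one level, so that it lines up with `__init__`, is what turns the local function back into a method.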
