-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explorer revamp #428
base: master
Are you sure you want to change the base?
Explorer revamp #428
Conversation
…hodsinitiative/4cat into explorer-improvements # Conflicts: # common/lib/config_definition.py
…rer settings page
…s functionality, start Instagram template
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked over the backend code and only noticed one real issue (in an edge case). I ran this version and tested out the Explorer on a number of datasets (instagram, custom, telegram, tumblr, tiktok, youtube, reddit). It looks good! Sort works well. Reddit was missing the "subject" field (it's probably the only dataset that uses subject anymore). Telegram has an issue which I will post separately.
I tested saving annotations and writing them to datasets. This worked for me (and broke one with my edge case 😬; see comment). I did notice that the new fields show up in the Dataset preview view, but the values saved to the database do not show up in preview. The values do show up after you have run "write annotations".
Changing deactivating/activating settings seem to work fine. There is an explorerflask
settings group that could probably be merged with the Explorer group.
If you want to merge now, I would deactivate Telegram as a default (till addressed) and consider how to address my comment re: field names for annotations.
datasources/ninegag/search_9gag.py
Outdated
@@ -8,6 +8,7 @@ | |||
|
|||
from backend.lib.search import Search | |||
from common.lib.item_mapping import MappedItem | |||
from common.lib.helpers import UserInput |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're importing UserInput in a few datasources unnecessarily. Probably an oversight and otherwise has no effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, we should do some cleanup..
@@ -101,7 +102,7 @@ def process(self): | |||
|
|||
# Write to top dataset | |||
for label, values in new_data.items(): | |||
self.add_field_to_parent("annotation_" + label, values, which_parent=self.source_dataset, update_existing=True) | |||
self.add_field_to_parent(label, values, which_parent=self.source_dataset, update_existing=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The add_field_to_parent
function does not check for existing fields. If a User creates a field called "username" they will overwrite an existing field with the same name. If I recall, I could not figure out how to check that an existing column had the name because the add_field_to_parent function needs to be able to update existing annotation fields. This is just a bit dangerous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested on a dataset by creating a field called "author", adding some values, and writing to dataset. I was able to overwrite the original "author" field (which in my case was actually a dictionary of author related data which caused map item to break). I recommend reverting this change. We could even add 4CAT_annotation_
or something so that it would be virtually impossible for raw data to contain that fieldname.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed an oversight for now, though I would like to have the option for annotation fields to have a 'clean' name; long names are quickly unreadable in spreadsheet software. This can be resolved by initially checking whether an annotation field key already exists in the dataset columns or, when annotated datasets are filtered and create a new dataset, if it is not a field registered in the annotations
table for the parent dataset.
This is a sort-of edge case for now, but I'll try to resolve this next week!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this with a back-end and front-end check
And this is the issue with a Telegram dataset I ran into:
Looks like perhaps the emojis are killing the template. Telegram, I think, is the only datasource using them. |
…hodsinitiative/4cat into explorer-improvements
…orer-improvements
# Conflicts: # setup.py # webtool/__init__.py # webtool/lib/template_filters.py # webtool/templates/components/result-result-row.html
# Conflicts: # common/lib/dataset.py # webtool/views/api_explorer.py
@stijn-uva this should be mergeable! Perhaps we want the OpenAI processor on a different branch for now since I'd like it to have some more features. But it works in its current state so also doesn't hurt to have it on master.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got an error while trying to save annotations, so I couldn't test everything:
ERROR:webtool:Exception on /explorer/save_annotations/a565f95306bcf5182adc023aa82a8d59 [POST]
Traceback (most recent call last):
File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 2190, in wsgi_app
response = self.full_dispatch_request()
File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1486, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1484, in full_dispatch_request
rv = self.dispatch_request()
File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1469, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask_limiter/extension.py", line 544, in __inner
return obj(*a, **k)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask_login/utils.py", line 290, in decorated_view
return current_app.ensure_sync(func)(*args, **kwargs)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/lib/helpers.py", line 332, in decorated_view
return func(*args, **kwargs)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/lib/helpers.py", line 332, in decorated_view
return func(*args, **kwargs)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/views/views_explorer.py", line 214, in explorer_save_annotations
annotations_saved = dataset.save_annotations(annotations, overwrite=True)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/dataset.py", line 1787, in save_annotations
annotation = Annotation(data=annotation_data, db=self.db)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/annotation.py", line 148, in __init__
self.write_to_db()
File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/annotation.py", line 223, in write_to_db
return self.db.upsert("annotations", data=db_data, constraints=["label", "dataset", "item_id"])
File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/database.py", line 270, in upsert
cursor.execute(query, replacements)
File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/psycopg2/extras.py", line 236, in execute
return super().execute(query, vars)
psycopg2.errors.InvalidColumnReference: there is no unique or exclusion constraint matching the ON CONFLICT specification
Some other things I noted:
- I kind of liked the always-in-view toolbar in the old annotations layout, so I could scroll down the page and then open annotations, instead of having to scroll back up to do so
- The 'save annotations' button saves so quickly that it's sometimes hard to know if the annotations were actually saved. Maybe a 'Annotations saved!' tooltip-like notice that fades away after a second would be useful
- Maybe annotation fields should be shown by default after saving/editing fields? You would usually want to see them anyway. I could even see a case for not hiding them, ever, and just making them always visible (since annotation is one of the main purposes of the explorer view anyway)
# Instagram posts also allow 'Collabs' with up to one co-author | ||
coauthor = {"coauthor": "", "coauthor_fullname": "", "coauthor_id": ""} | ||
if node.get("coauthor_producers"): | ||
coauthor_node = node["coauthor_producers"][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instagram posts can have multiple co-authors, could we e.g. make this a comma-separated list? Else we give the impression that there is less data than there really is
except ValueError: | ||
self.dataset.finish_with_error("Max tokens must be a number") | ||
|
||
self.dataset.delete_parameter("api_key") # sensitive, delete after use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to set "sensitive": True
for this option, then this is taken care of internally
annotations = [] | ||
|
||
# Get annotation IDs first | ||
if item_id: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the Annotation
constructor can also take a data
dictionary to avoid having to re-query the database - could that be used here instead of first querying only the id
and then again the full annotation data when the object is instantiated on line 1681?
|
||
# We're saving the new annotation fields as-is. | ||
# Ordering of fields is preserved this way. | ||
self.db.execute("UPDATE datasets SET annotation_fields = %s WHERE key = %s;", (json.dumps(new_fields), self.key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using self.db.update
or relying on DataSet.__setattr__
i.e. just set self.annotation_fields
and let the class handle the rest.
# Base URLs after which tags and @-mentions follow, per platform | ||
base_urls = { | ||
"twitter": { | ||
"hashtag": "https://twitter.com/hashtag/", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x.com? Seems to be the canonical domain now
This PR revamps how the Explorer works and looks. It specifically does the following:
OPTION_DATASOURCES_TABLE
user input that creates a table with dynamic columns for each enabled dataset. Input fields per row can be text, dropdown, and checkboxstatic/css/explorer/
) and Jinja2 templates (inwebtool/templates/explorer/datasource-templates/
) instead of CSS and JSON files in the data source folders that need to be verified and parsed.iterate_items
.Note that some unused code is still present for future updates with respect to 4CAT scrapers and database-accessible data sources generally.