Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! #452

Merged
stijn-uva merged 14 commits into master from dataset_export_import
Oct 1, 2024

Conversation

dale-wahl
Member

  • Processor that saves metadata, results files, and logs into a zip file.
    • Automatically expires after 1 day (as these datasets are essentially just copies of the other data)
  • Updates to import_4cat datasource to accept ZIP files in addition to 4CAT results URLs
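
For readers unfamiliar with the idea, here is a minimal sketch of the export/import round trip described above, using only the Python standard library. It is not the code in this PR: the function names (`export_dataset_zip`, `import_dataset_zip`), the `metadata.json` member name, and the flat archive layout are all assumptions made for illustration, and the real processor and `import_4cat` datasource may organise the archive differently.

```python
import json
import zipfile
from pathlib import Path


def export_dataset_zip(metadata: dict, result_file: Path, log_file: Path, zip_path: Path) -> Path:
    """Bundle metadata, a results file, and a log file into one ZIP archive.

    Hypothetical helper: names and archive layout are illustrative only.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Metadata is stored as JSON so the importing side can rebuild the dataset record
        archive.writestr("metadata.json", json.dumps(metadata))
        archive.write(result_file, arcname=result_file.name)
        # A dataset may not have a log; only add it if it exists
        if log_file.exists():
            archive.write(log_file, arcname=log_file.name)
    return zip_path


def import_dataset_zip(zip_path: Path, target_dir: Path) -> dict:
    """Extract an exported archive into target_dir and return its metadata."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        metadata = json.loads(archive.read("metadata.json"))
        for name in archive.namelist():
            if name == "metadata.json":
                continue
            # Keep only the base name so a crafted archive cannot write outside target_dir
            safe_name = Path(name).name
            (target_dir / safe_name).write_bytes(archive.read(name))
    return metadata
```

Storing the metadata as a separate JSON member keeps the results file untouched, so the importing instance can recreate the dataset record first and then attach the files to it.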

@dale-wahl dale-wahl requested a review from stijn-uva September 6, 2024 14:04
dale-wahl and others added 9 commits September 6, 2024 16:09
commit 3f2a62a
Author: Carsten Schnober <[email protected]>
Date:   Wed Sep 18 18:18:29 2024 +0200

    Update Gensim to >=4.3.3, <4.4.0 (#450)

    * Update Gensim to >=4.3.3, <4.4.0

    * update nltk as well

    ---------

    Co-authored-by: Dale Wahl <[email protected]>
    Co-authored-by: Sal Hagen <[email protected]>

commit fee2c8c
Merge: 3d94b66 f8e93ed
Author: sal-phd-desktop <[email protected]>
Date:   Wed Sep 18 18:11:19 2024 +0200

    Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat

commit 3d94b66
Author: sal-phd-desktop <[email protected]>
Date:   Wed Sep 18 18:11:04 2024 +0200

    FINALLY remove 'News' from the front page, replace with 4CAT BlueSky updates and potential information about the specific server (to be set on config page)

commit f8e93ed
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 15:11:21 2024 +0200

    Simple extensions page in Control Panel

commit b5be128
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:08:13 2024 +0200

    Remove 'docs' directory

commit 1e2010a
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:07:38 2024 +0200

    Forgot TikTok and Douyin

commit c757dd5
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:01:31 2024 +0200

    Say 'zeeschuimer' instead of 'extension' to avoid confusion with 4CAT extensions

commit ee7f434
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 14:00:40 2024 +0200

    RIP Parler data source

commit 11300f2
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 11:21:37 2024 +0200

    Tuplestring

commit 5472652
Author: Stijn Peeters <[email protected]>
Date:   Wed Sep 18 11:15:29 2024 +0200

    Pass user obj instead of str to ConfigWrapper in Processor

commit b21866d
Author: Stijn Peeters <[email protected]>
Date:   Tue Sep 17 17:45:01 2024 +0200

    Ensure request-aware config reader in user object when using config wrapper

commit bbe79e4
Author: Sal Hagen <[email protected]>
Date:   Tue Sep 17 15:12:46 2024 +0200

    Fix extension path walk for Windows

commit d6064be
Author: Stijn Peeters <[email protected]>
Date:   Mon Sep 16 14:50:45 2024 +0200

    Allow tags that have no users

    Use case: tag-based frontend differentiation using X-4CAT-Config-Via-Proxy

commit b542ded
Author: Stijn Peeters <[email protected]>
Date:   Mon Sep 16 14:13:14 2024 +0200

    Trailing slash in query results list

commit a4bddae
Author: Dale Wahl <[email protected]>
Date:   Mon Sep 16 13:57:23 2024 +0200

    4CAT Extension - easy(ier) adding of new datasources/processors that can be maintained separately from 4CAT base code (#451)

    * domain only

    * fix reference

    * try and collect links with selenium

    * update column_filter to find multiple matches

    * fix up the normal url_scraper datasource

    * ensure all selenium links are strings for join

    * change output of url_scraper to ndjson with map_items

    * missed key/index change

    * update web archive to use json and map to 4CAT

    * fix no text found

    * and none on scraped_links

    * check key first

    * fix up web_archive error reporting

    * handle None type for error

    * record web archive "bad request"

    * add wait after redirect movement

    * increase waittime for redirects

    * add processor for trackers

    * dict to list for addition

    * allow both newline and comma separated links

    * attempt to scrape iframes as separate pages

    * Fixes for selenium scraper to work with config database

    * installation of packages, geckodriver, and firefox if selenium enabled

    * update install instructions

    * fix merge error

    * fix dropped function

    * have to be kidding me

    * add note; setup requires docker... need to think about IF this will ever
    be installed without Docker

    * separate selenium class into wrapper and Search class so wrapper can be
    used in processors!

    * add screenshots; add firefox extension support

    * update selenium definitions

    * regex for extracting urls from strings

    * screenshots processor; extract urls from text and takes screenshots

    * Allow producing zip files from data sources

    * import time

    * pick better default

    * test screenshot datasource

    * validate all params

    * fix enable extension

    * haha break out of while loop

    * count my items

    * whoops, len() is important here

    * must be getting tired...

    * remove redundant logging

    * Eager loading for screenshots, viewport options, etc

    * Woops, wrong folder

    * Fix label shortening

    * Just 'queue' instead of 'search queue'

    * Yeah, make it headless

    * README -> DESCRIPTION

    * h1 -> h2

    * Actually just have no header

    * Use proper filename for downloaded files

    * Configure whether to offer pseudonymisation etc

    * Tweak descriptions

    * fix log missing data

    * add columns to post_topic_matrix

    * fix breadcrumb bug

    * Add top topics column

    * Fix selenium config install parameter (Docker uses this/manual would
    need to run install_selenium, well, manually)

    * this processor is slow; i thought it was broken long before it updated!

    * refactor detect_trackers as conversion processor not filter

    * add geckodriver executable to docker install

    * Auto-configure webdrivers if available in PATH

    * update screenshots to act as image-downloader and benefit from processors

    * fix is_compatible_with

    * Delete helper-scripts/migrate/migrate-1.30-1.31.py

    * fix embeddings is_compatible_with

    * fix up UI options for hashing and private

    * abstract was moved to lib

    * various fixes to selenium based datasources

    * processors not compatible with image datasets

    * update firefox extension handling

    * screenshots datasource fix get_options

    * rename screenshots processor to be detected as image dataset

    * add monthly and weekly frequencies to wayback machine datasource

    * wayback ds: fix fail if all attempts do not realize results; addion frequency options to options; add daily

    * add scroll down page to allow lazy loading for entire page screenshots

    * screenshots: adjust pause time so it can be used to force a wait for images to load

    I have not successfully come up with or found a way to wait for all images to load; document.readyState == 'complete' does not function in this way on certain sites including the wayback machine

    * hash URLs to create filenames

    * remove log

    * add setting to toggle display advanced options

    * add progress bars

    * web archive fix query validation

    * count subpages in progress

    * remove overwritten function

    * move http response to own column

    * special filenames

    * add timestamps to all screenshots

    * restart selenium on failure

    * new build have selenium

    * process urls after start (keep original query parameters)

    * undo default firefox

    * quick max

    * rename SeleniumScraper to SeleniumSearch

    todo: build SeleniumProcessor!

    * max number screenshots configurable

    * method to get url with error handling

    * use get_with_error_handling

    * d'oh, screenshot processor needs to quit selenium

    * update log to contain URL

    * Update scrolling to use Page down key if necessary

    * improve logs

    * update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way.

    Also, could I get categories from the metadata? That's... ugh.

    * no category, no processor

    * str errors

    * screenshots: dismiss alerts when checking ready state is complete

    * set screenshot timeout to 30 seconds

    * update gensim package

    * screenshots: move processor interrupt into attempts loop

    * if alert disappears before we can dismiss it...

    * selenium specific logger

    * do not switch window when no alert found on dismiss

    * extract wait for page to load to selenium class

    * improve descriptions of screenshot options

    * remove unused line

    * treat timeouts differently from other errors

    these are more likely due to an issue with the website in question

    * debug if requested

    * increase pause time

    * restart browser w/ PID

    * increase max_workers for selenium

    this is by individual worker class not for all selenium classes... so you can really crank them out if desired

    * quick fix restart by pid

    * avoid bad urls

    * missing bracket & attempt to fix-missing dependencies in Docker install

    * Allow dynamic form options in processors

    * Allow 'requires' on data source options as well

    * Handle list values with requires

    * basic processor for apple store; setup checks for additional requirements

    * fix is_4cat_class

    * show preview when no map_item

    * add google store datasource

    * Docker setup.py use extensions

    * Wider support for file upload in processors

    * Log file uploads in DMI service manager

    * add map_item methods and record more data per item

    need additional item data as map_item is staticmethod

    * update from master; merge conflicts

    * fix docker build context (ignore data files)

    * fix option requirements

    * apple store fix: list still tries to get query

    * apple & google stores fix up item mapping

    * missed merge error

    * minor fix

    * remove unused import

    * fix datasources w/ files frontend error

    * fix error w/ datasources having file option

    * better way to name docker volumes

    * update two other docker compose files

    * fix docker-compose ymls

    * minor bug: fix and add warning; fix no results fail

    * update apple field names to better match interface

    * update google store fieldnames and order

    * sneak in jinja logger if needed

    * fix fourcat.js handling checkboxes for dynamic settings

    * add new endpoint for app details to apple store

    * apple_store map new beta app data

    * add default lang/country

    * not all apps have advisories

    * revert so button works

    * add chart positions to beta map items

    * basic scheduler

    To-do
    - fix up and add options to scheduler view (e.g. delete/change)
    - add scheduler view to navigator
    - tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view)
    - more testing...

    * update scheduler view, add functions to update job interval

    * revert .env

    * working scheduler!

    * basic scheduler view w/ datasets

    * fix postgres tag

    * update job status in scheduled_jobs table

    * fix timestamp; end_date needed for last run check; add dataset label

    * improve scheduler view

    * remove dataset from scheduled_jobs table on delete

    * scheduler view order by last creation

    * scheduler views: separate scheduler list from scheduled dataset list

    * additional update from master fixes

    * apple_store map_items fix missing locales

    * add back depth for pagination

    * correct route

    * modify pagination to accept args

    * pagination fun

    * pagination: i hate testing on live servers...

    * ok ok need the pagination route

    * pagination: add route_args

    * fix up scheduler header

    * improve app store descriptions

    * add azure store

    * fix azure links

    * azure_store: add category search

    * azure fix type of config update timestamp

    OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly

    * basic aws store

    * check if selenium available; get correct app_id

    * aws: implement pagination

    * add logging; wait for elements to load after next page; attempts to rework filter option collection

    * apple_store: handle invalid param error

    * fix filter_options

    * aws: fix filter option collection!

    * more merge

    * move new datasources and processors to extensions and modify setup.py and module loader to use the new locations

    * migrate.py to run extension "fourcat_install.py" files

    * formatting

    * remove extensions; add gitignore

    * excise scheduler merge

    * some additional cleanup from app_studies branch

    * allow nested datasources folders; ignore files in extensions main folder

    * allow extension install scripts to run pip if migrate.py has not

    * Remove unused URL functions we could use ural for

    * Take care of git commit hash tracking for extension processors

    * Get rid of unused path.versionfile config setting

    * Add extensions README

    * Squashed commit of the following:

    commit cd356f7
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 17:36:18 2024 +0200

        UI setting for 4CAT install ad in login

    commit 0945d8c
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 17:32:55 2024 +0200

        UI setting for anonymisation controls

        Todo: make per-datasource

    commit 1a2562c
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 15:53:27 2024 +0200

        Debug panel for HTTP headers in control panel

    commit 203314e
    Author: Stijn Peeters <[email protected]>
    Date:   Sat Sep 14 15:53:17 2024 +0200

        Preview for HTML datasets

    commit 48c20c2
    Author: Desktop Sal <[email protected]>
    Date:   Wed Sep 11 13:54:23 2024 +0200

        Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

    commit 657ffd7
    Author: Dale Wahl <[email protected]>
    Date:   Fri Sep 6 16:29:19 2024 +0200

        fix nltk where it matters

    commit 2ef5c80
    Author: Stijn Peeters <[email protected]>
    Date:   Tue Sep 3 12:05:14 2024 +0200

        Actually check progress in text annotator

    commit 693960f
    Author: Stijn Peeters <[email protected]>
    Date:   Mon Sep 2 18:03:18 2024 +0200

        Add processor for stormtrooper DMI service

    commit 6ae964a
    Author: Stijn Peeters <[email protected]>
    Date:   Fri Aug 30 17:31:37 2024 +0200

        Fix reference to old stopwords list in neologisms preset

    * Fix Github links for extensions

    * Fix commit detection in extensions

    * Fix extension detection in module loader

    * Follow symlinks when loading extensions

    Probably not uncommon to have a checked out repo somewhere to then symlink into the extensions dir

    * Make queue message on create page more generic

    * Markdown in datasource option tooltips

    * Remove Spacy model from requirements

    * Add software_source to database SQL

    ---------

    Co-authored-by: Stijn Peeters <[email protected]>
    Co-authored-by: Stijn Peeters <[email protected]>

commit cd356f7
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:36:18 2024 +0200

    UI setting for 4CAT install ad in login

commit 0945d8c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:32:55 2024 +0200

    UI setting for anonymisation controls

    Todo: make per-datasource

commit 1a2562c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:27 2024 +0200

    Debug panel for HTTP headers in control panel

commit 203314e
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:17 2024 +0200

    Preview for HTML datasets

commit 48c20c2
Author: Desktop Sal <[email protected]>
Date:   Wed Sep 11 13:54:23 2024 +0200

    Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

commit 657ffd7
Author: Dale Wahl <[email protected]>
Date:   Fri Sep 6 16:29:19 2024 +0200

    fix nltk where it matters
not sure pycharm's merge is super awesome...
@stijn-uva stijn-uva merged commit a224dd9 into master Oct 1, 2024
1 check passed
@dale-wahl dale-wahl deleted the dataset_export_import branch October 1, 2024 14:11
2 participants