-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! #452
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Member
dale-wahl
commented
Sep 6, 2024
- Processor that saves metadata, results files, and logs into a zip file.
- Automatically expires after 1 day (as these datasets are essentially just copies of the other data)
- Updates to import_4cat datasource to accept ZIP files in addition to 4CAT results URLs
had some weird lost datasets when debugging this
commit 3f2a62a Author: Carsten Schnober <[email protected]> Date: Wed Sep 18 18:18:29 2024 +0200 Update Gensim to >=4.3.3, <4.4.0 (#450) * Update Gensim to >=4.3.3, <4.4.0 * update nltk as well --------- Co-authored-by: Dale Wahl <[email protected]> Co-authored-by: Sal Hagen <[email protected]> commit fee2c8c Merge: 3d94b66 f8e93ed Author: sal-phd-desktop <[email protected]> Date: Wed Sep 18 18:11:19 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 3d94b66 Author: sal-phd-desktop <[email protected]> Date: Wed Sep 18 18:11:04 2024 +0200 FINALLY remove 'News' from the front page, replace with 4CAT BlueSky updates and potential information about the specific server (to be set on config page) commit f8e93ed Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 15:11:21 2024 +0200 Simple extensions page in Control Panel commit b5be128 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:08:13 2024 +0200 Remove 'docs' directory commit 1e2010a Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:07:38 2024 +0200 Forgot TikTok and Douyin commit c757dd5 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:01:31 2024 +0200 Say 'zeeschuimer' instead of 'extension' to avoid confusion with 4CAT extensions commit ee7f434 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:00:40 2024 +0200 RIP Parler data source commit 11300f2 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 11:21:37 2024 +0200 Tuplestring commit 5472652 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 11:15:29 2024 +0200 Pass user obj instead of str to ConfigWrapper in Processor commit b21866d Author: Stijn Peeters <[email protected]> Date: Tue Sep 17 17:45:01 2024 +0200 Ensure request-aware config reader in user object when using config wrapper commit bbe79e4 Author: Sal Hagen <[email protected]> Date: Tue Sep 17 15:12:46 2024 +0200 Fix extension path walk for Windows commit d6064be Author: Stijn Peeters <[email protected]> Date: Mon Sep 16 14:50:45 2024 +0200 Allow tags that have no users Use case: tag-based frontend differentiation using X-4CAT-Config-Via-Proxy commit b542ded Author: Stijn Peeters <[email protected]> Date: Mon Sep 16 14:13:14 2024 +0200 Trailing slash in query results list commit a4bddae Author: Dale Wahl <[email protected]> Date: Mon Sep 16 13:57:23 2024 +0200 4CAT Extension - easy(ier) adding of new datasources/processors that can be mainted seperately from 4CAT base code (#451) * domain only * fix reference * try and collect links with selenium * update column_filter to find multiple matches * fix up the normal url_scraper datasource * ensure all selenium links are strings for join * change output of url_scraper to ndjson with map_items * missed key/index change * update web archive to use json and map to 4CAT * fix no text found * and none on scraped_links * check key first * fix up web_archive error reporting * handle None type for error * record web archive "bad request" * add wait after redirect movement * increase waittime for redirects * add processor for trackers * dict to list for addition * allow both newline and comma seperated links * attempt to scrape iframes as seperate pages * Fixes for selenium scraper to work with config database * installation of packages, geckodriver, and firefox if selenium enabled * update install instructions * fix merge error * fix dropped function * have to be kidding me * add note; setup requires docker... need to think about IF this will ever be installed without Docker * seperate selenium class into wrapper and Search class so wrapper can be used in processors! * add screenshots; add firefox extension support * update selenium definitions * regex for extracting urls from strings * screenshots processor; extract urls from text and takes screenshots * Allow producing zip files from data sources * import time * pick better default * test screenshot datasource * validate all params * fix enable extension * haha break out of while loop * count my items * whoops, len() is important here * must be getting tired... * remove redundant logging * Eager loading for screenshots, viewport options, etc * Woops, wrong folder * Fix label shortening * Just 'queue' instead of 'search queue' * Yeah, make it headless * README -> DESCRIPTION * h1 -> h2 * Actually just have no header * Use proper filename for downloaded files * Configure whether to offer pseudonymisation etc * Tweak descriptions * fix log missing data * add columns to post_topic_matrix * fix breadcrumb bug * Add top topics column * Fix selenium config install parameter (Docker uses this/manual would need to run install_selenium, well, manually) * this processor is slow; i thought it was broken long before it updated! * refactor detect_trackers as conversion processor not filter * add geckodriver executable to docker install * Auto-configure webdrivers if available in PATH * update screenshots to act as image-downloader and benefit from processors * fix is_compatible_with * Delete helper-scripts/migrate/migrate-1.30-1.31.py * fix embeddings is_compatible_with * fix up UI options for hashing and private * abstract was moved to lib * various fixes to selenium based datasources * processors not compatible with image datasets * update firefox extension handling * screenshots datasource fix get_options * rename screenshots processor to be detected as image dataset * add monthly and weekly frequencies to wayback machine datasource * wayback ds: fix fail if all attempts do not realize results; addion frequency options to options; add daily * add scroll down page to allow lazy loading for entire page screenshots * screenshots: adjust pause time so it can be used to force a wait for images to load I have not successfully come up with or found a way to wait for all images to load; document.readyState == 'complete' does not function in this way on certain sites including the wayback machine * hash URLs to create filenames * remove log * add setting to toggle display advanced options * add progress bars * web archive fix query validation * count subpages in progress * remove overwritten function * move http response to own column * special filenames * add timestamps to all screenshots * restart selenium on failure * new build have selenium * process urls after start (keep original query parameters) * undo default firefox * quick max * rename SeleniumScraper to SeleniumSearch todo: build SeleniumProcessor! * max number screenshots configurable * method to get url with error handling * use get_with_error_handling * d'oh, screenshot processor needs to quit selenium * update log to contain URL * Update scrolling to use Page down key if necessary * improve logs * update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way. Also, could I get categories from the metadata? That's... ugh. * no category, no processor * str errors * screenshots: dismiss alerts when checking ready state is complete * set screenshot timeout to 30 seconds * update gensim package * screenshots: move processor interrupt into attempts loop * if alert disappears before we can dismiss it... * selenium specific logger * do not switch window when no alert found on dismiss * extract wait for page to load to selenium class * improve descriptions of screenshot options * remove unused line * treat timeouts differently from other errors these are more likely due to an issue with the website in question * debug if requested * increase pause time * restart browser w/ PID * increase max_workers for selenium this is by individual worker class not for all selenium classes... so you can really crank them out if desired * quick fix restart by pid * avoid bad urls * missing bracket & attempt to fix-missing dependencies in Docker install * Allow dynamic form options in processors * Allow 'requires' on data source options as well * Handle list values with requires * basic processor for apple store; setup checks for additional requirements * fix is_4cat_class * show preview when no map_item * add google store datasource * Docker setup.py use extensions * Wider support for file upload in processors * Log file uploads in DMI service manager * add map_item methods and record more data per item need additional item data as map_item is staticmethod * update from master; merge conflicts * fix docker build context (ignore data files) * fix option requirements * apple store fix: list still tries to get query * apple & google stores fix up item mapping * missed merge error * minor fix * remove unused import * fix datasources w/ files frontend error * fix error w/ datasources having file option * better way to name docker volumes * update two other docker compose files * fix docker-compose ymls * minor bug: fix and add warning; fix no results fail * update apple field names to better match interface * update google store fieldnames and order * sneak in jinja logger if needed * fix fourcat.js handling checkboxes for dynamic settings * add new endpoint for app details to apple store * apple_store map new beta app data * add default lang/country * not all apps have advisories * revert so button works * add chart positions to beta map items * basic scheduler To-do - fix up and add options to scheduler view (e.g. delete/change) - add scheduler view to navigator - tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view) - more testing... * update scheduler view, add functions to update job interval * revert .env * working scheduler! * basic scheduler view w/ datasets * fix postgres tag * update job status in scheduled_jobs table * fix timestamp; end_date needed for last run check; add dataset label * improve scheduler view * remove dataset from scheduled_jobs table on delete * scheduler view order by last creation * scheduler views: separate scheduler list from scheduled dataset list * additional update from master fixes * apple_store map_items fix missing locales * add back depth for pagination * correct route * modify pagination to accept args * pagination fun * pagination: i hate testing on live servers... * ok ok need the pagination route * pagination: add route_args * fix up scheduler header * improve app store descriptions * add azure store * fix azure links * azure_store: add category search * azure fix type of config update timestamp OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly * basic aws store * check if selenium available; get correct app_id * aws: implement pagination * add logging; wait for elements to load after next page; attempts to rework filter option collection * apple_store: handle invalid param error * fix filter_options * aws: fix filter option collection! * more merge * move new datasources and processors to extensions and modify setup.py and module loader to use the new locations * migrate.py to run extension "fourcat_install.py" files * formatting * remove extensions; add gitignore * excise scheduler merge * some additional cleanup from app_studies branch * allow nested datasources folders; ignore files in extensions main folder * allow extension install scripts to run pip if migrate.py has not * Remove unused URL functions we could use ural for * Take care of git commit hash tracking for extension processors * Get rid of unused path.versionfile config setting * Add extensions README * Squashed commit of the following: commit cd356f7 Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:36:18 2024 +0200 UI setting for 4CAT install ad in login commit 0945d8c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:32:55 2024 +0200 UI setting for anonymisation controls Todo: make per-datasource commit 1a2562c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:27 2024 +0200 Debug panel for HTTP headers in control panel commit 203314e Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:17 2024 +0200 Preview for HTML datasets commit 48c20c2 Author: Desktop Sal <[email protected]> Date: Wed Sep 11 13:54:23 2024 +0200 Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies commit 657ffd7 Author: Dale Wahl <[email protected]> Date: Fri Sep 6 16:29:19 2024 +0200 fix nltk where it matters commit 2ef5c80 Author: Stijn Peeters <[email protected]> Date: Tue Sep 3 12:05:14 2024 +0200 Actually check progress in text annotator commit 693960f Author: Stijn Peeters <[email protected]> Date: Mon Sep 2 18:03:18 2024 +0200 Add processor for stormtrooper DMI service commit 6ae964a Author: Stijn Peeters <[email protected]> Date: Fri Aug 30 17:31:37 2024 +0200 Fix reference to old stopwords list in neologisms preset * Fix Github links for extensions * Fix commit detection in extensions * Fix extension detection in module loader * Follow symlinks when loading extensions Probably not uncommon to have a checked out repo somewhere to then symlink into the extensions dir * Make queue message on create page more generic * Markdown in datasource option tooltips * Remove Spacy model from requirements * Add software_source to database SQL --------- Co-authored-by: Stijn Peeters <[email protected]> Co-authored-by: Stijn Peeters <[email protected]> commit cd356f7 Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:36:18 2024 +0200 UI setting for 4CAT install ad in login commit 0945d8c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:32:55 2024 +0200 UI setting for anonymisation controls Todo: make per-datasource commit 1a2562c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:27 2024 +0200 Debug panel for HTTP headers in control panel commit 203314e Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:17 2024 +0200 Preview for HTML datasets commit 48c20c2 Author: Desktop Sal <[email protected]> Date: Wed Sep 11 13:54:23 2024 +0200 Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies commit 657ffd7 Author: Dale Wahl <[email protected]> Date: Fri Sep 6 16:29:19 2024 +0200 fix nltk where it matters
not sure pycharm's merge is super awesome...
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.