Regex improvement and split into separate package (#132)
- Rewrite of the regex model, with a new interface. 
- Regex model is now its own separate package.
- Also add a new legacy system to deprecate old distributions.
qubixes authored Sep 18, 2023
1 parent 8284dbd commit bf7a79a
Showing 20 changed files with 1,224 additions and 1,660 deletions.
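
Both notebooks touched by this commit change the cabin pattern from `[ABCDEF]\d{2,3}` to `[ABCDEF][0-9]{2,3}`. The commit message does not say why. One difference worth knowing: in Python's standard `re` module, `\d` in a `str` pattern also matches non-ASCII Unicode decimal digits, while `[0-9]` matches ASCII digits only. The sketch below illustrates that with `re` alone. It does not use the regex package introduced by this commit, and whether that package treats `\d` the same way is not shown in this diff. The helper `matches` and the sample strings are illustrative assumptions.

```python
import re

# The two spellings of the cabin pattern that appear in the notebooks.
OLD_PATTERN = r"[ABCDEF]\d{2,3}"
NEW_PATTERN = r"[ABCDEF][0-9]{2,3}"


def matches(pattern: str, value: str) -> bool:
    """Return True when the whole string matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, value) is not None


# Illustrative samples; the last one uses the Arabic-Indic digits U+0661 and U+0662.
samples = ["A12", "F123", "G12", "A1", "B1234", "C\u0661\u0662"]

# Map each sample to (matches old pattern, matches new pattern).
results = {s: (matches(OLD_PATTERN, s), matches(NEW_PATTERN, s)) for s in samples}

for sample, (old_ok, new_ok) in results.items():
    print(f"{sample!r}: old={old_ok} new={new_ok}")
```

For plain ASCII cabin codes the two patterns agree; they differ only on the last sample, which the `\d` form accepts and the `[0-9]` form rejects.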
37 changes: 0 additions & 37 deletions docs/source/api/metasynth.distribution.regex.rst

This file was deleted.

17 changes: 9 additions & 8 deletions docs/source/api/metasynth.distribution.rst
@@ -1,14 +1,6 @@
 metasynth.distribution package
 ==============================
 
-Subpackages
------------
-
-.. toctree::
-:maxdepth: 4
-
-metasynth.distribution.regex
-
 Submodules
 ----------
 
@@ -60,6 +52,15 @@ metasynth.distribution.faker module
 :undoc-members:
 :show-inheritance:
 
+metasynth.distribution.regex module
+-----------------------------------
+
+.. automodule:: metasynth.distribution.regex
+:members:
+:undoc-members:
+:show-inheritance:
+
+
 Module contents
 ---------------
 
62 changes: 31 additions & 31 deletions examples/advanced_tutorial.ipynb
@@ -12,13 +12,13 @@
 },
 {
 "cell_type": "markdown",
-"source": [
-"## Step 0: Install the metasynth package and import required packages"
-],
+"id": "f5c6597b",
 "metadata": {
 "collapsed": false
 },
-"id": "f5c6597b"
+"source": [
+"## Step 0: Install the metasynth package and import required packages"
+]
 },
 {
 "cell_type": "code",
@@ -105,22 +105,22 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"outputs": [],
-"source": [
-"df.describe()"
-],
+"id": "c72c2acb55fca193",
 "metadata": {
 "collapsed": false
 },
-"id": "c72c2acb55fca193"
+"outputs": [],
+"source": [
+"df.describe()"
+]
 },
 {
 "cell_type": "markdown",
-"source": [],
+"id": "5df3f1e974e84da4",
 "metadata": {
 "collapsed": false
 },
-"id": "5df3f1e974e84da4"
+"source": []
 },
 {
 "cell_type": "markdown",
@@ -162,26 +162,26 @@
 },
 {
 "cell_type": "markdown",
-"source": [
-"Alternatively, we can preview the MetaFrame as it would be output to a file"
-],
+"id": "aeb3edae1eedf4b8",
 "metadata": {
 "collapsed": false
 },
-"id": "aeb3edae1eedf4b8"
+"source": [
+"Alternatively, we can preview the MetaFrame as it would be output to a file"
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"id": "cbb6f59f1d439189",
+"metadata": {
+"collapsed": false
+},
 "outputs": [],
 "source": [
 "json_preview = repr(mf)\n",
 "print(json_preview)"
-],
-"metadata": {
-"collapsed": false
-},
-"id": "cbb6f59f1d439189"
+]
 },
 {
 "cell_type": "markdown",
@@ -226,25 +226,25 @@
 {
 "cell_type": "code",
 "execution_count": null,
+"id": "c5eac7eeb3326f03",
+"metadata": {
+"collapsed": false
+},
 "outputs": [],
 "source": [
 "#load previously exported MetaFrame (.json) file\n",
 "mf = MetaFrame.from_json(file_path)"
-],
-"metadata": {
-"collapsed": false
-},
-"id": "c5eac7eeb3326f03"
+]
 },
 {
 "cell_type": "markdown",
-"source": [
-"Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
-],
+"id": "85201666a67a73fd",
 "metadata": {
 "collapsed": false
 },
-"id": "85201666a67a73fd"
+"source": [
+"Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
+]
 },
 {
 "cell_type": "code",
@@ -423,7 +423,7 @@
 "# To create a regex distribution, you need a list of tuples, where each tuple is an element.\n",
 "# The first part of the tuple is a string representation of the regex, while the second is the proportion of the\n",
 "# time the regex element is used.\n",
-"cabin_distribution = RegexDistribution(r\"[ABCDEF]\\d{2,3}\") # Add the r so that it becomes a literal string.\n",
+"cabin_distribution = RegexDistribution(r\"[ABCDEF][0-9]{2,3}\") # Add the r so that it becomes a literal string.\n",
 "# just for completeness: data generated from this distribution will always match the regex [ABCDEF]?(\\d{2,3})?\n",
 "\n",
 "var_spec = {\n",
@@ -519,7 +519,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.10"
+"version": "3.11.1"
 },
 "vscode": {
 "interpreter": {
58 changes: 33 additions & 25 deletions examples/getting_started.ipynb
@@ -12,14 +12,14 @@
 },
 {
 "cell_type": "markdown",
-"source": [
-"## Step 0: Install the metasynth package and import required packages\n",
-"First, install the metasynth package in your session:"
-],
+"id": "f5c6597b",
 "metadata": {
 "collapsed": false
 },
-"id": "f5c6597b"
+"source": [
+"## Step 0: Install the metasynth package and import required packages\n",
+"First, install the metasynth package in your session:"
+]
 },
 {
 "cell_type": "code",
@@ -152,26 +152,26 @@
 },
 {
 "cell_type": "markdown",
-"source": [
-"Alternatively, we can preview the MetaFrame as it would be output to a file"
-],
+"id": "572128238aa26d67",
 "metadata": {
 "collapsed": false
 },
-"id": "572128238aa26d67"
+"source": [
+"Alternatively, we can preview the MetaFrame as it would be output to a file"
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"id": "42470906a0d25575",
+"metadata": {
+"collapsed": false
+},
 "outputs": [],
 "source": [
 "json_preview = repr(mf)\n",
 "print(json_preview)"
-],
-"metadata": {
-"collapsed": false
-},
-"id": "42470906a0d25575"
+]
 },
 {
 "cell_type": "markdown",
@@ -226,26 +226,26 @@
 },
 {
 "cell_type": "markdown",
-"source": [
-"Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
-],
+"id": "48bc803ba4e76a1c",
 "metadata": {
 "collapsed": false
 },
-"id": "48bc803ba4e76a1c"
+"source": [
+"Once a MetaFrame is loaded, synthetic data can be generated from it. The `synthesize` method takes the number of rows to be generated as parameter and returns a DataFrame with the synthetic data."
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"id": "4ccf451c",
+"metadata": {
+"collapsed": false
+},
 "outputs": [],
 "source": [
 "# generate synthetic data\n",
 "mf.synthesize(5)"
-],
-"metadata": {
-"collapsed": false
-},
-"id": "4ccf451c"
+]
 },
 {
 "cell_type": "markdown",
@@ -275,7 +275,7 @@
 " # Manually set a distribution for age \n",
 " \"Age\": {\"distribution\": DiscreteUniformDistribution(20, 40)},\n",
 " # Manually set a regex distribution for cabin\n",
-" \"Cabin\": {\"distribution\": RegexDistribution(r\"[ABCDEF]\\d{2,3}\")}\n",
+" \"Cabin\": {\"distribution\": RegexDistribution(r\"[ABCDEF][0-9]{2,3}\")}\n",
 "}\n",
 "\n",
 "# create the high-quality metadata\n",
@@ -313,6 +313,14 @@
 "source": [
 "syn_df.describe()"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "f1e2244c",
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
@@ -331,7 +339,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.10"
+"version": "3.11.1"
 },
 "vscode": {
 "interpreter": {