Arguments and Usage

This page is generated from the source code (config.py) and provides an overview of lbsntransform command line arguments.

Usage


usage: lbsntransform [-h] [--version] [-o ORIGIN] [--dry-run] [-l]
                     [--file_type FILE_TYPE] [--input_path_url INPUT_PATH_URL]
                     [--is_stacked_json] [--is_line_separated_json]
                     [--dbpassword_hllworker DBPASSWORD_HLLWORKER]
                     [--dbuser_hllworker DBUSER_HLLWORKER]
                     [--dbserveraddress_hllworker DBSERVERADDRESS_HLLWORKER]
                     [--dbname_hllworker DBNAME_HLLWORKER]
                     [-p DBPASSWORD_OUTPUT] [-u DBUSER_OUTPUT]
                     [-a DBSERVERADDRESS_OUTPUT] [-n DBNAME_OUTPUT]
                     [--dbformat_output DBFORMAT_OUTPUT]
                     [--dbpassword_input DBPASSWORD_INPUT]
                     [--dbuser_input DBUSER_INPUT]
                     [--dbserveraddress_input DBSERVERADDRESS_INPUT]
                     [--dbname_input DBNAME_INPUT]
                     [--dbformat_input DBFORMAT_INPUT] [-t TRANSFERLIMIT]
                     [--transfer_count TRANSFER_COUNT]
                     [--commit_volume COMMIT_VOLUME]
                     [--records_tofetch RECORDS_TOFETCH]
                     [--disable_transfer_reactions]
                     [--disable_reaction_post_referencing]
                     [--ignore_non_geotagged]
                     [--startwith_db_rownumber STARTWITH_DB_ROWNUMBER]
                     [--endwith_db_rownumber ENDWITH_DB_ROWNUMBER]
                     [--debug_mode DEBUG_MODE]
                     [--geocode_locations GEOCODE_LOCATIONS]
                     [--ignore_input_source_list IGNORE_INPUT_SOURCE_LIST]
                     [--mappings_path MAPPINGS_PATH]
                     [--input_lbsn_type INPUT_LBSN_TYPE]
                     [--map_full_relations] [--csv_output]
                     [--csv_allow_linebreaks] [--csv_delimiter CSV_DELIMITER]
                     [--use_csv_dictreader] [--recursive_load]
                     [--skip_until_file SKIP_UNTIL_FILE]
                     [--skip_until_record SKIP_UNTIL_RECORD] [--zip_records]
                     [--min_geoaccuracy MIN_GEOACCURACY]
                     [--include_lbsn_objects INCLUDE_LBSN_OBJECTS]
                     [--include_lbsn_bases INCLUDE_LBSN_BASES]
                     [--override_lbsn_query_schema OVERRIDE_LBSN_QUERY_SCHEMA]
                     [--hmac_key HMAC_KEY]

Arguments

Quick reference table

The quick reference table contains truncated short summaries of descriptions. Jump to individual arguments in the navigation submenu on the left side.

Short Long Default Description
-h --help show this help message
--version show program's version
-o --origin 0 Input source type (id)
--dry-run Perform a trial run
-l --file_input This flag enables file
--file_type json Specify filetype
--input_path_url 01_Input Path to input folder.
--is_stacked_json Input is stacked json.
--is_line_separated_json Json is line separated
--dbpassword_hllworker None Password for hllworker
--dbuser_hllworker postgres Username for hllworker
--dbserveraddress_hllworker None IP for hllworker db
--dbname_hllworker None DB name for hllworker
-p --dbpassword_output None Password for out-db
-u --dbuser_output postgres Username for out-db.
-a --dbserveraddress_output None IP for output db,
-n --dbname_output None DB name for output db
--dbformat_output lbsn Format of the out-db.
--dbpassword_input None Password for input-db
--dbuser_input postgres Username for input-db.
--dbserveraddress_input None IP for input-db,
--dbname_input None DB name for input-db,
--dbformat_input json Format of the input-db
-t --transferlimit None Abort after x records.
--transfer_count 50000 Transfer batch limit x
--commit_volume None After x commit_volume,
--records_tofetch 10000 Fetch x records /batch
--disable_transfer_reactions Disable reactions.
--disable_reaction_post_referencing Disable reactions-refs
--ignore_non_geotagged Ignore non-geotagged.
--startwith_db_rownumber None Start with db row x.
--endwith_db_rownumber None End with db row x.
--debug_mode None Enable debug mode.
--geocode_locations None Path to loc-geocodes.
--ignore_input_source_list None Path to input ignore.
--mappings_path None Path mappings folder.
--input_lbsn_type None Input sub-type
--map_full_relations Map full relations.
--csv_output Store to local CSV.
--csv_allow_linebreaks Disable linebreak-rem.
--csv_delimiter , CSV delimiter.
--use_csv_dictreader Use csv.DictReader.
--recursive_load Recursive local sub di
--skip_until_file None Skip until file x.
--skip_until_record None Skip until record x.
--zip_records Zip records parallel.
--min_geoaccuracy None Min geoaccuracy to use
--include_lbsn_objects None lbsn objects to proces
--include_lbsn_bases None lbsn bases to update
--override_lbsn_query_schema None Override schema and ta
--hmac_key None Override db hmac key

-h, --help

show this help message and exit

--version

show program's version number and exit

-o, --origin

(Default: 0)

Input source type (id).

  • Defaults to 0: LBSN

Other possible values:

  • 1 - Instagram
  • 2 - Flickr
  • 21 - Flickr YFCC100M
  • 3 - Twitter

--dry-run

Perform a trial run
with no changes made to database/output

-l, --file_input

This flag enables file input

(instead of reading data from a database).

  • To specify which files to process, see parameter --input_path_url.
  • To specify file types, e.g. whether to process data from json or csv, or from URLs,
    see --file_type

--file_type

(Default: json)

Specify filetype

(json, csv etc.)

  • only applies if --file_input is used.

--input_path_url

(Default: 01_Input)

Path to input folder.

  • If not provided, subfolder ./01_Input/ will be used.
  • You can also provide a web URL, starting with http(s).
  • URLs will be accessed using requests.get(url, stream=True).
  • To separate multiple urls, use semicolon (;). In this case, see also --zip_records.

--is_stacked_json

Input is stacked json.

  • The typical form of json is [{json1},{json2}]
  • If --is_stacked_json is set, it will process stacked jsons in the form of {json1}{json2} (no comma)
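The stacked form can be parsed with Python's standard json module; this is only a sketch of the format, not lbsntransform's actual reader:

```python
import json

def read_stacked_json(text):
    """Yield objects from concatenated JSON ({json1}{json2}, no commas)."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # raw_decode() returns the object plus the index where it ended,
        # which lets us resume parsing at the next stacked object.
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end
        while idx < len(text) and text[idx].isspace():
            idx += 1

records = list(read_stacked_json('{"a": 1}{"a": 2}'))
```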

--is_line_separated_json

Json is line separated

  • The typical form is [{json1},{json2}]
  • If --is_line_separated_json is set, it will process stacked jsons in the form of {json1} {json2} (with linebreak)
  • Unix style linebreaks (LF) are expected across platforms
  • Windows users: use (e.g.) Notepad++ to convert from Windows style linebreaks (CRLF)
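Line-separated JSON is simply one object per line; a minimal Python sketch (not lbsntransform's actual reader):

```python
import json

text = '{"a": 1}\n{"a": 2}\n'
# One json.loads() call per non-empty line.
records = [json.loads(line) for line in text.splitlines() if line.strip()]
```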

--dbpassword_hllworker

(Default: None)

Password for hllworker db

  • If reading data into hlldb, all HLL Worker parameters must be supplied by default.
  • You can substitute hlldb parameters here.
  • In this case, lbsntransform will use hlldb to convert and union hll sets and to store output results.
  • Currently, this re-use of hlldb requires supplying the same set of parameters twice.
  • For separation of concerns, it is recommended to use a separate HLL Worker database

--dbuser_hllworker

(Default: postgres)

Username for hllworker db.

--dbserveraddress_hllworker

(Default: None)

IP for hllworker db

  • e.g. 111.11.11.11
  • Optionally add the port to use, e.g. 111.11.11.11:5432.
  • 5432 is the default port

--dbname_hllworker

(Default: None)

DB name for hllworker db

  • e.g. hllworkerdb

-p, --dbpassword_output

(Default: None)

Password for out-db

(postgres raw/hll db)

-u, --dbuser_output

(Default: postgres)

Username for out-db.

-a, --dbserveraddress_output

(Default: None)

IP for output db,

  • e.g. 111.11.11.11
  • Optionally add port to use, e.g. 111.11.11.11:5432.
  • 5432 is the default port

-n, --dbname_output

(Default: None)

DB name for output db

  • e.g. rawdb or hlldb

--dbformat_output

(Default: lbsn)

Format of the out-db.

  • Either hll or lbsn.
  • This setting affects how data is stored, either in anonymized and aggregate form (hll), or in the lbsn raw structure (lbsn).

--dbpassword_input

(Default: None)

Password for input-db

--dbuser_input

(Default: postgres)

Username for input-db.

--dbserveraddress_input

(Default: None)

IP for input-db,

  • e.g. 111.11.11.11
  • Optionally add port to use, e.g. 111.11.11.11:5432.
  • 5432 is the default port

--dbname_input

(Default: None)

DB name for input-db,

  • e.g. rawdb

--dbformat_input

(Default: json)

Format of the input-db.

  • Either lbsn or json
  • If lbsn is used, the native lbsn raw input mapping (0) will be used
  • If json is used, a custom mapping must be provided for mapping the database JSON records to the lbsn structure. See input mappings

-t, --transferlimit

(Default: None)

Abort after x records.

  • This can be used to limit the number of records that will be processed.
  • e.g. --transferlimit 10000 will process the first 10000 input records
  • Defaults to None (= process all)
  • Note that one input record can map to many output records. This number applies to the number of input records, not the output count.

--transfer_count

(Default: 50000)

Transfer batch limit x.

  • Defines after how many parsed records the results will be transferred to the DB.
  • Defaults to 50000
  • If you have a slow server but a fast machine, larger values can improve speed, because the duplicate check happens in Python rather than in Postgres (coalesce);
  • however, larger values require more local memory. If you have a fast server but a slow machine, try whether a smaller batch size (e.g. --transfer_count 5000) improves speed.

Note

Use --transferlimit to limit the total number of records transferred. --transfer_count instead defines the batch count that is used to transfer data incrementally.

--commit_volume

(Default: None)

After every x records (commit_volume), changes (transactions) will be written to the output database (a Postgres COMMIT).

Note that updated entries in the output database are only written from the WAL buffer after a commit.

  • Default for rawdb: 10000
  • Default for hlldb: 100000

Warning

If you have concurrent writes to the DB (e.g. multiple lbsntransform processes) and if you see transaction deadlocks, reduce the commit_volume.

--records_tofetch

(Default: 10000)

Fetch x records per batch.

  • If retrieving data from a db (lbsn), limit the number of records to fetch at once.
  • Defaults to 10000

--disable_transfer_reactions

Disable reactions.

  • If set, processing of lbsn reactions will be skipped;
  • only original posts are transferred.
  • This is useful to reduce processing time and data footprint for some service data, e.g. for Twitter, where a large number of reactions contains little original content.

--disable_reaction_post_referencing

Disable reactions-refs.

Enable this option in args to prevent empty posts being stored due to Foreign-Key-Exists Requirement.
Possible parameters:

  • 0 = Save Original Tweets of Retweets as posts;
  • 1 = do not store Original Tweets of Retweets;
  • 2 = !Not implemented: Store Original Tweets of Retweets as post_reactions

--ignore_non_geotagged

Ignore non-geotagged.

If set, posts that are not geotagged are ignored during processing.

--startwith_db_rownumber

(Default: None)

Start with db row x.

If transferring from a database (input), this flag can be used to resume processing, e.g. if a transfer has been aborted.

  • Provide a number (row-id) to start processing from live db.
  • If input db type is lbsn, this is the primary key, without the origin_id, (e.g. the post_guid, place_guid etc.).
  • This flag will only work if processing a single lbsn object (e.g. --include_lbsn_objects "post").

Example:

--startwith_db_rownumber "123456789"
will lead to the first batch-query from the DB looking like this:

SELECT * FROM topical."post"
WHERE post_guid > '123456789'
ORDER BY post_guid ASC
LIMIT 10000;

--endwith_db_rownumber

(Default: None)

End with db row x.

Provide a number (row-id) to end processing from live db

--debug_mode

(Default: None)

Enable debug mode.

--geocode_locations

(Default: None)

Path to loc-geocodes.

  • Provide the path to a CSV file with location geocodes.
  • The CSV header must be: lat, lng, name.
  • This can be used in mappings to assign coordinates (lat, lng) to locations that are only provided as text.
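The expected file layout can be sketched with Python's csv module; the place names and coordinates below are hypothetical, and the real flag expects a file path rather than an in-memory string:

```python
import csv
import io

# Hypothetical geocode file contents with the required header lat, lng, name.
csv_text = "lat,lng,name\n51.03,13.73,Dresden\n48.14,11.58,Munich\n"

# Build a name -> (lat, lng) lookup, as a mapping might use it.
geocodes = {
    row["name"]: (float(row["lat"]), float(row["lng"]))
    for row in csv.DictReader(io.StringIO(csv_text))
}
```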

--ignore_input_source_list

(Default: None)

Path to input ignore.

Provide a path to a list of input_source types that will be ignored (e.g. to ignore certain bots etc.)

--mappings_path

(Default: None)

Path mappings folder.

Provide a path to a custom folder that contains one or more input mapping modules (*.py).

--input_lbsn_type

(Default: None)

Input sub-type

  • e.g. post, profile, friendslist, followerslist etc.
  • This can be used to select an appropriate mapping procedure in a single mapping module.

--map_full_relations

Map full relations.

Set to true to map full relations, e.g. many-to-many relationships, such as user_follows, user_friend, or user_mentions etc. are mapped in a separate table. Defaults to False.

--csv_output

Store to local CSV.

If set, all submitted values will be stored to a local CSV instead. Note that this type of output is currently not available.

--csv_allow_linebreaks

Disable linebreak-rem.

If set, in-text linebreaks (\r or \n) in output CSVs will not be removed

--csv_delimiter

(Default: ,)

CSV delimiter.

  • Provide the CSV delimiter to be used.
  • Default is comma (,).
  • Note: to pass a tab character, use shell variable substitution ($'\t')
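The substitution itself happens in the shell, not in lbsntransform; a short Bash illustration (assuming a Bash-compatible shell):

```shell
# $'\t' is Bash ANSI-C quoting: the shell expands it to a literal tab
# before the value ever reaches the program.
delim=$'\t'
# Passing --csv_delimiter "$delim" therefore hands the tool one real tab.
printf '%s' "$delim" | wc -c    # counts a single byte
```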

--use_csv_dictreader

Use csv.DictReader.

By default, CSVs will be read line by line,
using the standard csv.reader().

This will enable csv.DictReader(),
which allows to access CSV fields by name in mappings.

A CSV with a header is required for this setting to work.

Note that csv.DictReader() may be slower than the default csv.reader().
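The difference between the two readers can be sketched as follows; the column names are hypothetical, not a documented lbsntransform schema:

```python
import csv
import io

csv_text = "post_guid,post_body\n42,hello\n"

# Default behaviour: csv.reader() yields lists, fields accessed by position.
rows = list(csv.reader(io.StringIO(csv_text)))
body_by_index = rows[1][1]

# With --use_csv_dictreader: csv.DictReader() uses the header row,
# so fields can be accessed by name in mappings.
dict_rows = list(csv.DictReader(io.StringIO(csv_text)))
body_by_name = dict_rows[0]["post_body"]
```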

--recursive_load

Recursive local sub dirs.

If set, process input directories recursively (default depth: 2)

--skip_until_file

(Default: None)

Skip until file x.

If local input, skip all files until file with name x appears (default: start immediately)

--skip_until_record

(Default: None)

Skip until record x.

If local input, skip all records until record x (default: start with first)

--zip_records

Zip records parallel.

  • Use this flag to zip records of multiple input files
  • e.g. List1[A,B,C], List2[1,2,3] will be combined (zipped) on read to List[A1,B2,C3]
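The zipping behaviour can be pictured with Python's built-in zip(); this is only an illustration of the record order, not lbsntransform's actual file reader:

```python
# Two hypothetical input files, already read into lists of records.
list1 = ["A", "B", "C"]
list2 = [1, 2, 3]

# zip() pairs records positionally: A with 1, B with 2, C with 3 ...
pairs = list(zip(list1, list2))

# ... interleaving both sources into one combined stream.
combined = [record for pair in pairs for record in pair]
```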

--min_geoaccuracy

(Default: None)

Min geoaccuracy to use

Set to latlng, place, or city to limit processing of records based on minimum geoaccuracy (default: no limit)

--include_lbsn_objects

(Default: None)

lbsn objects to process

If processing from lbsn db (rawdb), provide a comma separated list of lbsn objects to include.
May contain:

  • origin
  • country
  • city
  • place
  • user_groups
  • user
  • post
  • post_reaction

Notes:

  • Excluded objects will not be queried, but empty objects may be created due to referenced foreign key relationships.
  • Defaults to origin,post

--include_lbsn_bases

(Default: None)

lbsn bases to update

If the target output type is hll, provide a comma separated list of lbsn bases to include/update/store to.

Currently supported:

  • hashtag
  • emoji
  • term
  • _hashtag_latlng
  • _term_latlng
  • _emoji_latlng
  • _month_hashtag
  • _month_hashtag_latlng
  • _month_latlng
  • monthofyear
  • month
  • dayofmonth
  • dayofweek
  • hourofday
  • year
  • date
  • timestamp
  • country
  • region
  • city
  • place
  • latlng
  • community

Bases not included will be skipped. Per default, no bases will be considered.

Example:

--include_lbsn_bases hashtag,place,date,community

This will update entries in the Postgres hlldb tables social.community, topical.hashtag, spatial.place, and temporal.date: non-existing entries will be created and existing ones will be updated (a hll union).

See the structure definition in SQL here for a full list of hlldb table structures.

This argument may only be supplied once.

--override_lbsn_query_schema

(Default: None)

Override schema and table name

This can be used to redirect lbsn queries on the given object from input db to a specific schema/table such as a materialized view.

This can be useful (e.g.) to limit processing of input data to a specific query.

Format is lbsn_type,schema.table.

Example:

--override_lbsn_query_schema post,mviews.mypostquery

Argument can be used multiple times.

--hmac_key

(Default: None)

Override db hmac key

The hmac key that is used for cryptographic hashing during creation of HLL sets. This overrides the key that is set in the HLL Worker database.

Remember to re-use the same hmac key for any consecutive update of HLL sets.

The crypt.salt variable can also be set (temporarily or permanently) in the hll worker database itself.
Example:

ALTER DATABASE hllworkerdb SET crypt.salt = 'CRYPTSALT';
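Why key re-use matters can be illustrated with Python's hmac module; this is only an analogy, since the actual hashing is performed inside the HLL Worker database via crypt.salt:

```python
import hashlib
import hmac

def pseudonymize(record_id: str, key: str) -> str:
    # HMAC-SHA256 digest of a record id; stable for a fixed key.
    return hmac.new(key.encode(), record_id.encode(), hashlib.sha256).hexdigest()

# The same key always yields the same pseudonym, so repeated updates
# count each record only once ...
same = pseudonymize("user:123", "CRYPTSALT") == pseudonymize("user:123", "CRYPTSALT")

# ... while a changed key yields a different pseudonym, so unioning
# HLL sets built with different keys would double-count records.
different = pseudonymize("user:123", "CRYPTSALT") != pseudonymize("user:123", "other-key")
```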

Further information is available in the YFCC HLL tutorial.