Arguments and Usage¶
This page is generated from the source code (config.py) and provides an overview of lbsntransform command line arguments.
Usage¶
usage: lbsntransform [-h] [--version] [-o ORIGIN] [--dry-run] [-l]
                     [--file_type FILE_TYPE] [--input_path_url INPUT_PATH_URL]
                     [--is_stacked_json] [--is_line_separated_json]
                     [--dbpassword_hllworker DBPASSWORD_HLLWORKER]
                     [--dbuser_hllworker DBUSER_HLLWORKER]
                     [--dbserveraddress_hllworker DBSERVERADDRESS_HLLWORKER]
                     [--dbname_hllworker DBNAME_HLLWORKER]
                     [-p DBPASSWORD_OUTPUT] [-u DBUSER_OUTPUT]
                     [-a DBSERVERADDRESS_OUTPUT] [-n DBNAME_OUTPUT]
                     [--dbformat_output DBFORMAT_OUTPUT]
                     [--dbpassword_input DBPASSWORD_INPUT]
                     [--dbuser_input DBUSER_INPUT]
                     [--dbserveraddress_input DBSERVERADDRESS_INPUT]
                     [--dbname_input DBNAME_INPUT]
                     [--dbformat_input DBFORMAT_INPUT] [-t TRANSFERLIMIT]
                     [--transfer_count TRANSFER_COUNT]
                     [--commit_volume COMMIT_VOLUME]
                     [--records_tofetch RECORDS_TOFETCH]
                     [--disable_transfer_reactions]
                     [--disable_reaction_post_referencing]
                     [--ignore_non_geotagged]
                     [--startwith_db_rownumber STARTWITH_DB_ROWNUMBER]
                     [--endwith_db_rownumber ENDWITH_DB_ROWNUMBER]
                     [--debug_mode DEBUG_MODE]
                     [--geocode_locations GEOCODE_LOCATIONS]
                     [--ignore_input_source_list IGNORE_INPUT_SOURCE_LIST]
                     [--mappings_path MAPPINGS_PATH]
                     [--input_lbsn_type INPUT_LBSN_TYPE]
                     [--map_full_relations] [--csv_output]
                     [--csv_allow_linebreaks] [--csv_delimiter CSV_DELIMITER]
                     [--use_csv_dictreader] [--recursive_load]
                     [--skip_until_file SKIP_UNTIL_FILE]
                     [--skip_until_record SKIP_UNTIL_RECORD] [--zip_records]
                     [--min_geoaccuracy MIN_GEOACCURACY]
                     [--include_lbsn_objects INCLUDE_LBSN_OBJECTS]
                     [--include_lbsn_bases INCLUDE_LBSN_BASES]
                     [--override_lbsn_query_schema OVERRIDE_LBSN_QUERY_SCHEMA]
                     [--hmac_key HMAC_KEY]
Arguments¶
Quick reference table¶
The quick reference table contains truncated short summaries of descriptions. Jump to individual arguments in the navigation submenu on the left side.
| Short | Long | Default | Description |
|---|---|---|---|
| -h | --help | | show this help message |
| | --version | | show program's version |
| -o | --origin | 0 | Input source type (id) |
| | --dry-run | | Perform a trial run |
| -l | --file_input | | This flag enables file |
| | --file_type | json | Specify filetype |
| | --input_path_url | 01_Input | Path to input folder. |
| | --is_stacked_json | | Input is stacked json. |
| | --is_line_separated_json | | Json is line separated |
| | --dbpassword_hllworker | None | Password for hllworker |
| | --dbuser_hllworker | postgres | Username for hllworker |
| | --dbserveraddress_hllworker | None | IP for hllworker db |
| | --dbname_hllworker | None | DB name for hllworker |
| -p | --dbpassword_output | None | Password for out-db |
| -u | --dbuser_output | postgres | Username for out-db. |
| -a | --dbserveraddress_output | None | IP for output db |
| -n | --dbname_output | None | DB name for output db |
| | --dbformat_output | lbsn | Format of the out-db. |
| | --dbpassword_input | None | Password for input-db |
| | --dbuser_input | postgres | Username for input-db. |
| | --dbserveraddress_input | None | IP for input-db |
| | --dbname_input | None | DB name for input-db |
| | --dbformat_input | json | Format of the input-db |
| -t | --transferlimit | None | Abort after x records. |
| | --transfer_count | 50000 | Transfer batch limit x |
| | --commit_volume | None | Commit after x changes |
| | --records_tofetch | 10000 | Fetch x records/batch |
| | --disable_transfer_reactions | | Disable reactions. |
| | --disable_reaction_post_referencing | | Disable reactions-refs |
| | --ignore_non_geotagged | | Ignore non-geotagged. |
| | --startwith_db_rownumber | None | Start with db row x. |
| | --endwith_db_rownumber | None | End with db row x. |
| | --debug_mode | None | Enable debug mode. |
| | --geocode_locations | None | Path to loc-geocodes. |
| | --ignore_input_source_list | None | Path to input ignore. |
| | --mappings_path | None | Path to mappings folder |
| | --input_lbsn_type | None | Input sub-type |
| | --map_full_relations | | Map full relations. |
| | --csv_output | | Store to local CSV. |
| | --csv_allow_linebreaks | | Disable linebreak-rem. |
| | --csv_delimiter | , | CSV delimiter. |
| | --use_csv_dictreader | | Use csv.DictReader. |
| | --recursive_load | | Recursive local sub dirs |
| | --skip_until_file | None | Skip until file x. |
| | --skip_until_record | None | Skip until record x. |
| | --zip_records | | Zip records in parallel |
| | --min_geoaccuracy | None | Min geoaccuracy to use |
| | --include_lbsn_objects | None | lbsn objects to process |
| | --include_lbsn_bases | None | lbsn bases to update |
| | --override_lbsn_query_schema | None | Override schema and table |
| | --hmac_key | None | Override db hmac key |
-h, --help¶
show this help message and exit
--version¶
show program's version number and exit
-o, --origin¶
(Default: 0)
Input source type (id).
- Defaults to `0`: LBSN
Other possible values:
- `1`: Instagram
- `2`: Flickr
- `21`: Flickr YFCC100M
- `3`: Twitter
--dry-run¶
Perform a trial run 
with no changes made to database/output
-l, --file_input¶
This flag enables file input
(instead of reading data from a database).
- To specify which files to process, see parameter `--input_path_url`.
- To specify file types, e.g. whether to process data from `json` or `csv`, or from URLs, see `--file_type`.
--file_type¶
(Default: json)
Specify filetype
(`json`, `csv` etc.)
- only applies if `--file_input` is used.
--input_path_url¶
(Default: 01_Input)
Path to input folder.
- If not provided, subfolder `./01_Input/` will be used.
- You can also provide a web-url, starting with `http(s)`.
- URLs will be accessed using `requests.get(url, stream=True)`.
- To separate multiple urls, use a semicolon (`;`). In this case, see also `--zip_records`.
--is_stacked_json¶
Input is stacked json.
- The typical form of json is `[{json1},{json2}]`
- If `--is_stacked_json` is set, it will process stacked jsons in the form of `{json1}{json2}` (no comma)
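A minimal sketch of reading the stacked format with Python's standard library (this is only an illustration of the format, not lbsntransform's internal parser):

```python
import json

def parse_stacked_json(text):
    """Parse concatenated JSON objects such as '{json1}{json2}'.

    json.JSONDecoder.raw_decode() returns one object plus the index
    where parsing stopped, which lets us walk through the stream.
    """
    decoder = json.JSONDecoder()
    idx, objects = 0, []
    end = len(text)
    while idx < end:
        obj, idx = decoder.raw_decode(text, idx)
        objects.append(obj)
        # skip any whitespace between objects
        while idx < end and text[idx].isspace():
            idx += 1
    return objects
```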
--is_line_separated_json¶
Json is line separated
- The typical form is `[{json1},{json2}]`
- If `--is_line_separated_json` is set, it will process stacked jsons in the form of `{json1} {json2}` (with linebreak)
- Unix style linebreaks (LF) will be used across platforms
- Windows users: use (e.g.) Notepad++ to convert from Windows style linebreaks (CRLF)
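The line-separated format can be sketched as follows; opening files in text mode lets Python's universal newlines absorb both LF and CRLF (illustrative only, not the tool's implementation):

```python
import io
import json

def read_line_separated_json(fobj):
    """Yield one JSON object per non-empty line.

    str.strip() also drops a trailing '\r' left over from CRLF input.
    """
    for line in fobj:
        line = line.strip()
        if line:
            yield json.loads(line)
```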
--dbpassword_hllworker¶
(Default: None)
Password for hllworker db
- If reading data into hlldb, all HLL Worker parameters must be supplied by default.
- You can substitute hlldb parameters here.
- In this case, lbsntransform will use hlldb to convert and union hll sets and to store output results.
- Currently, this re-use of hlldb requires supplying the same set of parameters twice.
- For separation of concerns, it is recommended to use a separate HLL Worker database
--dbuser_hllworker¶
(Default: postgres)
Username for hllworker db.
--dbserveraddress_hllworker¶
(Default: None)
IP for hllworker db
- e.g. 111.11.11.11
- Optionally add the port to use, e.g. `111.11.11.11:5432`.
- `5432` is the default port.
--dbname_hllworker¶
(Default: None)
DB name for hllworker db
- e.g. hllworkerdb
-p, --dbpassword_output¶
(Default: None)
Password for out-db
(postgres raw/hll db)
-u, --dbuser_output¶
(Default: postgres)
Username for out-db.
-a, --dbserveraddress_output¶
(Default: None)
IP for output db,
- e.g. 111.11.11.11
- Optionally add the port to use, e.g. `111.11.11.11:5432`.
- `5432` is the default port.
-n, --dbname_output¶
(Default: None)
DB name for output db
- e.g. `rawdb` or `hlldb`
--dbformat_output¶
(Default: lbsn)
Format of the out-db.
- Either `hll` or `lbsn`.
- This setting affects how data is stored: either in anonymized and aggregate form (`hll`), or in the lbsn raw structure (`lbsn`).
--dbpassword_input¶
(Default: None)
Password for input-db
--dbuser_input¶
(Default: postgres)
Username for input-db.
--dbserveraddress_input¶
(Default: None)
IP for input-db,
- e.g. 111.11.11.11
- Optionally add the port to use, e.g. `111.11.11.11:5432`.
- `5432` is the default port.
--dbname_input¶
(Default: None)
DB name for input-db,
- e.g.: rawdb
--dbformat_input¶
(Default: json)
Format of the input-db.
- Either `lbsn` or `json`
- If `lbsn` is used, the native lbsn raw input mapping (0) will be used
- If `json` is used, a custom mapping must be provided for mapping database json to the lbsn structure. See input mappings.
-t, --transferlimit¶
(Default: None)
Abort after x records.
- This can be used to limit the number of records that will be processed.
- e.g. `--transferlimit 10000` will process the first 10000 input records
- Defaults to None (= process all)
- Note that one input record can map to many output records. This number applies to the number of input records, not the output count.
--transfer_count¶
(Default: 50000)
Transfer batch limit x.
- Defines after how many parsed records the results will be transferred to the DB.
- Defaults to 50000
- If you have a slow server but a fast machine, larger values improve speed, because the duplicate check happens in Python and not in Postgres coalesce.
- However, larger values require more local memory. If you have a fast server but a slow machine, try whether a smaller batch `--transfer_count` (e.g. 5000) improves speed.
Note
Use `--transferlimit` to limit the total number of records transferred. `--transfer_count` instead defines the batch size used to transfer data incrementally.
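The batching idea behind `--transfer_count` can be sketched conceptually (this is not lbsntransform's actual implementation, just an illustration of collecting records into fixed-size batches before each transfer):

```python
from itertools import islice

def batched(records, batch_size=50000):
    """Yield lists of up to batch_size records from an input stream,
    mirroring how --transfer_count groups parsed records before
    each transfer to the database."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```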
--commit_volume¶
(Default: None)
After x commit_volume, changes (transactions) will be written to the output database (a Postgres COMMIT).
Note that updated entries in the output database are only written from the WAL buffer after a commit.
- Default for rawdb: 10000
- Default for hlldb: 100000
Warning
If you have concurrent writes to the DB (e.g. multiple lbsntransform processes) and if you see transaction deadlocks, reduce the commit_volume.
--records_tofetch¶
(Default: 10000)
Fetch x records per batch.
- If retrieving data from a db (`lbsn`), limit the number of records to fetch at once.
- Defaults to 10000
--disable_transfer_reactions¶
Disable reactions.
- If set, processing of lbsn reactions will be skipped;
- only original posts are transferred.
- This is useful for reducing processing and data footprint for some service data, e.g. for Twitter, where a large number of reactions contain little original content.
--disable_reaction_post_referencing¶
Disable reactions-refs.
Enable this option to prevent empty posts being stored due to the Foreign-Key-Exists requirement.
Possible parameters:
- `0` = Save Original Tweets of Retweets as `posts`;
- `1` = do not store Original Tweets of Retweets;
- `2` = !Not implemented: Store Original Tweets of Retweets as `post_reactions`
--ignore_non_geotagged¶
Ignore non-geotagged.
If set, posts that are not geotagged are ignored during processing.
--startwith_db_rownumber¶
(Default: None)
Start with db row x.
If transferring from a database (input), this flag can be used to resume processing, e.g. if a transfer has been aborted.
- Provide a number (row-id) to start processing from the live db.
- If the input db type is `lbsn`, this is the primary key without the `origin_id` (e.g. the `post_guid`, `place_guid` etc.).
- This flag will only work if processing a single lbsn object (e.g. `--include_lbsn_objects "post"`).
Example:
--startwith_db_rownumber "123456789"
will lead to the first batch-query from the DB looking like this:  
SELECT * FROM topical."post"
WHERE post_guid > '123456789'
ORDER BY post_guid ASC
LIMIT 10000;
--endwith_db_rownumber¶
(Default: None)
End with db row x.
Provide a number (row-id) at which to end processing from the live db.
--debug_mode¶
(Default: None)
Enable debug mode.
--geocode_locations¶
(Default: None)
Path to loc-geocodes.
- Provide the path to a CSV file with location geocodes.
- The CSV header must be: `lat, lng, name`.
- This can be used in mappings to assign coordinates (lat, lng) to locations provided as text.
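A minimal sketch of what such a lookup might look like, assuming the header layout above (the function name and dict shape are illustrative, not part of lbsntransform's API):

```python
import csv
import io

def load_geocodes(fobj):
    """Read a geocode CSV with header lat, lng, name into a lookup
    dict {name: (lat, lng)}; skipinitialspace tolerates spaces
    after the delimiter."""
    reader = csv.DictReader(fobj, skipinitialspace=True)
    return {
        row["name"].strip(): (float(row["lat"]), float(row["lng"]))
        for row in reader
    }
```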
--ignore_input_source_list¶
(Default: None)
Path to input ignore.
Provide a path to a list of input_source types that will be ignored (e.g. to ignore certain bots etc.)
--mappings_path¶
(Default: None)
Path mappings folder.
Provide a path to a custom folder that contains one or more input mapping modules (*.py).   
- Have a look at the two sample mappings in the resources folder.
- See how to define custom input mappings in the docs
--input_lbsn_type¶
(Default: None)
Input sub-type
- e.g. `post`, `profile`, `friendslist`, `followerslist` etc.
- This can be used to select an appropriate mapping procedure in a single mapping module.
--map_full_relations¶
Map full relations.
Set to true to map full relations: many-to-many relationships, such as user_follows, user_friend, or user_mentions, are mapped in a separate table. Defaults to False.
--csv_output¶
Store to local CSV.
If set, will store all submit values to local CSV instead. Currently, this type of output is not available.
--csv_allow_linebreaks¶
Disable linebreak-rem.
If set, will not remove in-text linebreaks (`\r` or `\n`) in output CSVs.
--csv_delimiter¶
(Default: ,)
CSV delimiter.
- Provide the CSV delimiter to be used.
- Default is comma (`,`).
- Note: to pass a tab on the shell, use ANSI-C quoting (`$'\t'`).
--use_csv_dictreader¶
Use csv.DictReader.
By default, CSVs will be read line by line,
using the standard csv.reader().  
This flag enables `csv.DictReader()`,
which allows accessing CSV fields by name in mappings.
A CSV with a header is required for this setting to work.
Note that csv.DictReader() may be slower than the default csv.reader().
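The difference can be sketched as follows (the column names `post_guid` and `post_body` are hypothetical, chosen only for illustration):

```python
import csv
import io

SAMPLE = "post_guid,post_body\n123,hello world\n"  # hypothetical columns

# Default behaviour: csv.reader() gives positional access per line
rows = list(csv.reader(io.StringIO(SAMPLE)))
assert rows[1][1] == "hello world"

# With --use_csv_dictreader: csv.DictReader() uses the header row
# to give access by field name instead of position
records = list(csv.DictReader(io.StringIO(SAMPLE)))
assert records[0]["post_body"] == "hello world"
```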
--recursive_load¶
Recursive local sub dirs.
If set, process input directories recursively (default depth: 2)
--skip_until_file¶
(Default: None)
Skip until file x.
If local input, skip all files until file with name x appears (default: start immediately)
--skip_until_record¶
(Default: None)
Skip until record x.
If local input, skip all records until record x (default: start with first)
--zip_records¶
Zip records parallel.
- Use this flag to zip records of multiple input files
- e.g. `List1[A,B,C]`, `List2[1,2,3]` will be combined (zipped) on read to `List[A1,B2,C3]`
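The zipping idea can be sketched with the standard library; how lbsntransform pairs the zipped records internally is not shown here:

```python
from itertools import chain

file1 = ["A", "B", "C"]
file2 = [1, 2, 3]

# zip() pairs records positionally across the two inputs;
# chaining the pairs yields one interleaved stream, as in List[A1,B2,C3]
combined = list(chain.from_iterable(zip(file1, file2)))
```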
--min_geoaccuracy¶
(Default: None)
Min geoaccuracy to use
Set to `latlng`, `place`, or `city` to limit processing of records based on minimum geoaccuracy (default: no limit).
--include_lbsn_objects¶
(Default: None)
lbsn objects to process
If processing from lbsn db (rawdb), provide a comma separated list of lbsn objects to include. 
May contain:  
- origin
- country
- city
- place
- user_groups
- user
- post
- post_reaction
Notes:
- Excluded objects will not be queried, but empty objects may be created due to referenced foreign key relationships.
- Defaults to origin,post
--include_lbsn_bases¶
(Default: None)
lbsn bases to update
If the target output type is hll, provide a comma separated list of lbsn bases to include/update/store to.   
Currently supported:
- hashtag
- emoji
- term
- _hashtag_latlng
- _term_latlng
- _emoji_latlng
- _month_hashtag
- _month_hashtag_latlng
- _month_latlng
- monthofyear
- month
- dayofmonth
- dayofweek
- hourofday
- year
- date
- timestamp
- country
- region
- city
- place
- latlng
- community
Bases not included will be skipped. Per default, no bases will be considered.
Example:
--include_lbsn_bases hashtag,place,date,community
This will update entries in the Postgres hlldb tables social.community, topical.hashtag, spatial.place, and temporal.date; non-existing entries will be created, existing ones will be updated (a hll_union).
See the structure definition in SQL here for a full list of hlldb table structures.
This argument may only be used once.
--override_lbsn_query_schema¶
(Default: None)
Override schema and table name
This can be used to redirect lbsn queries on the given object from input db to a specific schema/table such as a materialized view.
This can be useful (e.g.) to limit processing of input data to a specific query.
Format is lbsn_type,schema.table.  
Example:
--override_lbsn_query_schema post,mviews.mypostquery
Argument can be used multiple times.
--hmac_key¶
(Default: None)
Override db hmac key
The hmac key that is used for cryptographic hashing during creation of HLL sets. Override what is set in hllworker database here.
Remember to re-use the same hmac key for any consecutive update of HLL sets.
The crypt.salt variable can also be set (temporarily or permanently) in the hll worker database itself. 
Example:  
ALTER DATABASE hllworkerdb SET crypt.salt = 'CRYPTSALT';
Further information is available in the YFCC HLL tutorial.