Principles and Techniques of Compilers: Lab 1 Lexical Analysis

2023-05-16

Lab 1 Lexical Analysis

1 Goal

You are given:

  1. A public repository of an incomplete project on GitLab. The URL is http://210.45.114.146:880/staff/pl0compiler.
  2. This Guide.

After this project, you should:

  1. Learn to collaborate using Git.
  2. Complete a small lexical analyzer for the PL/0 language together with the other members of your group.
  3. Write a report on your participation and contributions.

2 Contents

2.1 PL/0 language tokens

Tokens in PL/0 are divided into 4 types: identifiers, numbers, symbols and reserved words.

2.1.1 Identifiers

Identifiers are strings that programmers define to represent variable and procedure names. PL/0 itself places no restrictions on identifiers. However, arbitrary names for variables and procedures may lead to unpredictable results. You are REQUIRED to ensure that identifiers accepted by the lexical analysis module contain ONLY LETTERS, DIGITS AND UNDERSCORES (_). Furthermore, a digit MUST NOT be the first character of an identifier.

If this is unclear, refer to the identifier rules of the C programming language.
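
If it helps, the character rules above can be written down directly in C. A minimal sketch; the helper names are only illustrative and are not part of the given project:

#include <ctype.h>

/* A character may start an identifier: a letter or an underscore. */
static int is_id_start(int c) {
    return isalpha(c) || c == '_';
}

/* A character may continue an identifier: a letter, a digit or an underscore. */
static int is_id_char(int c) {
    return isalnum(c) || c == '_';
}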

2.1.2 Numbers

Numbers in PL/0 are ALWAYS integers ranging from -999999999 to 999999999 (so that they can easily be represented in C). However, in the lexical analyzer integers are recognized as unsigned integers ranging from 0 to 999999999. The sign - should be recognized as a symbol (meaning “minus”).
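
A minimal sketch of number scanning under these rules, assuming the characters are already available in a string buffer (the real project may read its input differently):

/* Read a run of digits starting at p. Stores the number of digits in *len and
 * returns the value, or -1 if there are more than 9 digits, i.e. the number
 * falls outside the range 0 to 999999999 and should be reported as an error. */
static long scan_number(const char *p, int *len) {
    long value = 0;
    int i = 0;
    while (p[i] >= '0' && p[i] <= '9' && i < 9) {
        value = value * 10 + (p[i] - '0');
        i++;
    }
    *len = i;
    if (p[i] >= '0' && p[i] <= '9')
        return -1;    /* a 10th digit: out of range */
    return value;
}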

2.1.3 Symbols

The accepted symbols are: + - * / = != < <= > >= ( ) , ; . :=

+ - * / denote the four basic arithmetic operations. Here = does not mean assignment to a variable (assignment is written :=); together with != < <= > >= it is a comparison operator. ( ) are parentheses. The remaining symbols are punctuation.

2.1.4 Reserved Words

The reserved words are: var, const, procedure, begin, end, if, then, do, while, call, odd. They are case sensitive.

2.1.5 Comments

Comments are dropped in the lexical analysis stage. They are not regarded as tokens.

A line comment starts with a double slash (//); everything to its right on the same line is dropped. Read each line from left to right: once // is detected, nothing more on that line needs to be read.

A block comment starts with /* and ends with */. Once /* is detected, everything that follows is part of the comment and should be dropped, until the next */ is read. An unterminated /* should be reported as an error, while an unexpected */ should be recognized as the two tokens * and /. For example, in the following source code the text after the first */ is not part of any comment:

/* var x; */ more words */

In this case, the words more words should not be dropped, and the last two characters */ should be recognized as the tokens * and /.

In the code above, the comment ends at the first */ right after var x;. The second */ near the end of the line is unexpected and should be handled as described above.
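
A possible way to drop comments, sketched over a string buffer; skip_comment is a hypothetical helper, so adapt it to however your analyzer actually reads characters:

/* Called when the current character is '/'. Returns the number of characters
 * belonging to a comment starting at p, 0 if there is no comment here (so '/'
 * is the division symbol), or -1 for an unterminated block comment. */
static int skip_comment(const char *p) {
    if (p[0] == '/' && p[1] == '/') {           /* line comment */
        int i = 2;
        while (p[i] != '\0' && p[i] != '\n')
            i++;                                /* drop the rest of the line */
        return i;
    }
    if (p[0] == '/' && p[1] == '*') {           /* block comment */
        int i = 2;
        while (p[i] != '\0') {
            if (p[i] == '*' && p[i + 1] == '/')
                return i + 2;                   /* ends at the first closing marker */
            i++;
        }
        return -1;                              /* unterminated: report an error */
    }
    return 0;
}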

2.1.6 Splitting

Tokens are recommended to be separated by ' ', '\t' and '\n', but this is not required. Token splitting follows these rules:

  1. Symbols may always stand directly next to other tokens (including other symbols). The longest-match principle is used when several symbol characters appear together (see the sketch after this list).

a+b should be recognized as three tokens: identifier a, symbol + and identifier b.

  2. A number standing directly next to an identifier or a reserved word is not accepted as a token of type “number”:
    2.1. A number cannot appear immediately before a reserved word or an identifier without a separator; this should be reported as an error.
    2.2. Digits may appear inside an identifier together with letters (but not as its first character).
  3. Identifiers and reserved words are regarded as words. Words MUST be separated.
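
The longest-match rule in item 1 can be handled by trying the two-character symbols before the single-character ones. A sketch (match_symbol is illustrative and not part of the provided code):

#include <string.h>

/* Return the length of the symbol starting at p (2 or 1), or 0 if p does not
 * start with a valid symbol. Note that ':' and '!' alone are not symbols. */
static int match_symbol(const char *p) {
    static const char *two_char[] = { ":=", "<=", ">=", "!=" };
    static const char one_char[] = "+-*/=<>(),;.";
    for (size_t i = 0; i < sizeof(two_char) / sizeof(two_char[0]); i++)
        if (p[0] == two_char[i][0] && p[1] == two_char[i][1])
            return 2;                /* two-character symbols are tried first */
    if (p[0] != '\0' && strchr(one_char, p[0]) != NULL)
        return 1;
    return 0;
}

With this, a<=b splits into identifier a, symbol <= and identifier b, rather than < followed by =.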

2.2 PL/0 Compiler Architecture

The PL/0 compiler follows the four phases of compilation: lexical analysis, syntax analysis, semantic analysis and code generation. The four parts are designed as four modules that make up the entire compiler. In this experiment we are only concerned with lexical analysis, so we focus on the lexical analyzer module.

2.2.1 Project files

The project is managed with CMake (minimum version 2.8). The files are arranged as below:

pl0compiler  (root directory)
├── common
│   ├── CMakeLists.txt   (for cmake in the subdirectory)
│   ├── common.c         (implementation of the functions in common.h)
│   └── common.h         (initial structure for the whole compiler, declarations of global variables)
├── examples             (different example PL/0 programs)
├── lex
│   ├── CMakeLists.txt   (for cmake in the subdirectory)
│   ├── pl0_lex.c        (implementation of the functions in pl0_lex.h)
│   └── pl0_lex.h        (lexical analyzer)
├── .gitignore
├── CMakeLists.txt       (for cmake)
├── README.md
├── do_cmake.sh          (shell script to run cmake automatically)
└── test.c               (for testing)

2.2.2 Lexical analyzer workflow

First we dive into pl0_lex.h to see the structure of the analyzer.

typedef struct _tPL0Lex {
    /* Parent return pointer */
    struct _tPL0Compiler * compiler;

    /* For output */
    PL0TokenType last_token_type;
    char last_id[MAX_ID_LEN + 1];
    int last_num;

    /** --------------------------
     * TODO: Your variables here
     */
} PL0Lex;

The first member, compiler, is a pointer that links back to the compiler containing the lexical analyzer. It does not matter much in this project.

The next three variables are very important: they carry the token stream to the syntax analyzer. As you should know, the interface the lexical analyzer exposes to the syntax analyzer is get_token(). The syntax analyzer invokes get_token() to fetch one token from the lexical analyzer. Each token is an identifier, a number, a symbol or a reserved word; that is, a token has a TYPE. So how do we represent a token?

The last_token_type variable holds the type of the last token read. We define 32 token types here; each symbol and each reserved word has its own type.

The last_id variable contains the string for the last identifier read. The maximum length for an identifier is 10 bytes. It is valid ONLY IF the last token read is an identifier.

The last_num variable is similar to last_id. We use a C integer to represent the last number read. Remember that it is valid ONLY IF the last token read is a number.

As mentioned above, the valid range for a number is 0 to 999999999; that is, the length of a number is no more than 9 digits.

Now, when get_token is invoked, these three variables should be set (this is for you to implement). The syntax analyzer recognizes each token by first inspecting the last_token_type variable.

If this variable is set to TOKEN_IDENTIFIER, then the token is an identifier. The last_id variable contains the name for the identifier.

If this variable is set to TOKEN_NUMBER, then the token is a number, whose value is set in last_num.

In all other cases, the token is a symbol or a reserved word. If a reserved word is recognized, last_token_type is set to a value larger than TOKEN_RESWORDS; the syntax analyzer can tell that it is in this case by simply comparing the two values. The same holds for TOKEN_SYMBOL.

Enumerators in a C enum are, by default, assigned ascending natural numbers (0, 1, 2, 3, …), so enumeration values can be compared.
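
The real enumeration lives in pl0_lex.h; the fragment below only illustrates the ordering idea described above, and the positions shown here are assumptions, so check the actual header:

typedef enum {
    TOKEN_NULL,          /* lexical error                            */
    TOKEN_IDENTIFIER,    /* last_id holds the name                   */
    TOKEN_NUMBER,        /* last_num holds the value                 */
    TOKEN_SYMBOL,        /* sentinel: the symbol types follow        */
    /* ... one type per symbol ... */
    TOKEN_RESWORDS,      /* sentinel: the reserved-word types follow */
    TOKEN_VAR,
    /* ... one type per reserved word ... */
} ExampleTokenType;

/* Classification by comparison, as the syntax analyzer does: */
static int is_reserved_word(ExampleTokenType t) { return t > TOKEN_RESWORDS; }
static int is_symbol_type(ExampleTokenType t)   { return t > TOKEN_SYMBOL && t < TOKEN_RESWORDS; }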

Last but not least, if the source code contains lexical errors, the lexical analyzer sets last_token_type to TOKEN_NULL when the error is read.

2.2.3 Implementation of lexical module

The function behind get_token() is PL0Lex_get_token, which accepts one parameter (a pointer to a lexical analyzer). The return value is boolean: TRUE for a successful read, FALSE for failure (in most cases this means the end of the source file; otherwise it is an I/O error).

This function should return TRUE EVEN IF an invalid token is read. When this happens, the invoker learns about the problem by finding that last_token_type is set to TOKEN_NULL.
In later experiments we will handle errors properly. For now, when you find an error in the source code file, just print it to the standard output.

This function is for you to implement. Tip: remember to drop the comments (refer to section 2.1.5 for details).
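
From the invoker’s side, the contract above looks roughly like this (a sketch; read_all_tokens is not part of the project, and the real syntax analyzer of course does more with each token):

#include "pl0_lex.h"

void read_all_tokens(PL0Lex *lex) {
    while (PL0Lex_get_token(lex)) {                /* FALSE: end of file or I/O error */
        if (lex->last_token_type == TOKEN_NULL) {
            /* an invalid token was read; the error is already printed, keep going */
            continue;
        }
        /* here last_token_type / last_id / last_num describe one valid token */
    }
}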

Furthermore, the location of each token should be recorded. A token can be described by its line number and the indices of its first and last characters within the line. For example, in the PL/0 program below:

var a;
begin
	a := -1;
end.

The first token is var, so last_token_type should be set to TOKEN_VAR. Now we need more variables for the location of the token (the line number is 1 and the location within the line is 0 to 2). THIS IS FOR YOU TO IMPLEMENT: add variables to the lexical analyzer structure and make it work!
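
One possible approach, sketched below: keep current line and column counters in the PL0Lex structure (the field and helper names are only suggestions), advance them for every character read, and record the counters when a token starts and ends.

/* Possible extra members for PL0Lex:
 *   int line;        current line number, starting at 1
 *   int col;         current index within the line, starting at 0
 *   int tok_line;    line of the last token
 *   int tok_first;   index of its first character
 *   int tok_last;    index of its last character
 */

/* Advance the counters for one character that has just been read. */
static void advance_location(int c, int *line, int *col) {
    if (c == '\n') {
        (*line)++;     /* the next line ...      */
        *col = 0;      /* ... starts at index 0  */
    } else {
        (*col)++;
    }
}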

2.2.4 Constants list

You are given two lists of constants (declared in pl0_lex.h and defined in pl0_lex.c) which contain the strings of the reserved words and of the symbols. Moreover, the lists follow the same order as the token types in the type enumeration, so the list index is useful when a token is detected as a reserved word or a symbol and last_token_type must be set.
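
A sketch of such a lookup; the list and its length are passed in as parameters here because their actual names are declared in pl0_lex.h and are not repeated in this guide:

#include <string.h>

/* Return the index of word in the reserved-word list, or -1 if it is not a
 * reserved word (and is therefore an identifier). Because the list follows the
 * order of the token-type enumeration, the index directly gives the offset of
 * the corresponding token type. */
static int lookup_reserved(const char *word, const char *const reswords[], int count) {
    for (int i = 0; i < count; i++)
        if (strcmp(word, reswords[i]) == 0)
            return i;
    return -1;
}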

2.2.5 Testing part

A snippet is provided to help you test your implementation of the lexical analyzer (test.c). Dive in and modify it as you like.

The default program does not accept invalid tokens (i.e. when last_token_type is set to TOKEN_NULL). Change it if necessary.

The location of each token is not shown by default; you need to implement this depending on the data structure of your lexical analyzer. The preferred output style is (line number):(first letter index)-(last letter index). For the program given in section 2.2.3, the output should be:

Reserved word:	var		1:0-2
Identifier:		a		1:4-4
Symbol:			;		1:5-5
Reserved word:	begin	2:0-4
...

Alignment is optional but recommended.
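
A small formatting helper in the spirit of the output above (a sketch; the values you print come from whatever location fields you added in section 2.2.3):

#include <stdio.h>

/* Print one token in the style: <kind>  <text>  line:first-last */
static void print_token(const char *kind, const char *text,
                        int line, int first, int last) {
    printf("%-15s\t%-7s\t%d:%d-%d\n", kind, text, line, first, last);
}

For example, print_token("Reserved word:", "var", 1, 0, 2); produces the first line of the sample output.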

2.3 Cooperating with Git

This is the first time all the members of your team cooperate on one project.

2.3.1 LEADER: Creating your group’s fork

Only the leader of each group needs to do this part.

First open the public project URL (http://210.45.114.146:880/staff/pl0compiler) in your browser. Make sure you are logged in before doing anything.

Now click the Fork button on the main panel. You will then be asked to select a namespace; choose your group and wait. When the process is finished, the forked project will appear on the screen.

2.3.2 Branching

It is not recommended to directly commit on the master branch. Use your own branches to develop.

Once the project is forked into your group’s namespace, all members of your team will find it under the “Your projects” section. First you should clone the remote repository onto your host with git clone. The cloning URL can be found on the web pages on the server.

Once you have cloned the repository to your computer, do not make changes on the master branch directly. Create your own branch with git branch and switch to it with git checkout to do your work. This helps when multiple people work in parallel.

2.3.3 Committing

NEVER regard commits as mere saves of your progress. A commit should be a checkpoint of your work, that is, a snapshot of the output of a working stage: for example, when a part of the work is done, or when it is decided to be postponed.

When you decide to commit, make sure the code is clean and readable. There may be bugs, or the code may be incomplete; that is fine. But do not leave the code in a mess when committing, even if you decide to stop working. A rollback may happen in the future, and others (or even you) will be confused by messy source code.

2.3.4 Merging

When the work on your branch is done, you need to merge your branch into the master branch.

If your role in the group is Maintainer or higher, you are able to push directly to the master branch. In this case, merge your branch into the local master branch, then push to the remote repository.

If your role is Developer, you cannot push to the master branch, since that branch is protected by default. You have to open a merge request on the web interface.

3 Instructions for lab 1

  1. (LEADER) Create your group’s fork of the public repository.
  2. Every group needs to read this guide, especially section 2.2.3. The output should meet the requirements in section 2.2.5.
  3. Each member of the group takes a share of the work and completes it.
  4. In the end, the project should contain the full implementation, with no errors or bugs, and produce valid output for any input PL/0 source file.
  5. Create a directory named docs in the repository, then create a subdirectory lab-1. Every member writes a report for this project as a markdown file describing their own work and problems. Each markdown file should be named with the member’s student ID (e.g. PB16XXXXXX.md).

Each problem should be described as follows: what it is and how it happened, the possible solutions, how you worked to solve it, and what you have learnt.

  6. (LEADER) Remember to describe in your own report how your team cooperated on the project.